Developments with manipulating Lyndon factorizations using evolutionary computation methods
String factorization is an important tool for partitioning data for parallel processing and other algorithmic techniques often found in the context of big data applications such as bioinformatics or compression. Duval’s well-known linear algorithm uniquely factors a string over an ordered alphabet into Lyndon words, i.e., patterned strings which are strictly smaller than all of their cyclic rotations. While Duval’s algorithm produces a pre-determined factorization, modern applications motivate the demand for factorizations with specific properties, e.g., those that minimize/maximize the number of factors, or consist of factors with similar lengths. We have considered the problem of finding an alphabet ordering that yields a Lyndon factorization with such properties. For problems with no known heuristics, evolutionary computation methods may be useful, as such, we have produced a flexible evolutionary framework and evaluated it on biological sequence data. For the minimization case, we have proposed a new problem-specific heuristic, Flexi-Duval, and a problem-specific mutation operator for Lyndon factorization. We have shown that for individual amino acid sequences: very long (often single) Lyndon factors can be produced using Flexi-Duval, and in comparison with Duval’s algorithm, a much more balanced length for each Lyndon factor is achievable for specific texts. Further to this, we will consider the balancing problem for a much larger range of input texts; including random sequences, repetitions with some introduced noise, and natural language text corpora. We will consider improvements to the Flexi-Duval algorithm and evaluate it with the broader input texts, as well as considering the problem of maximizing the number of Lyndon factors via evolutionary computation methods.