Summary of Research Achievements |
Striving for improvements in factorization algorithms and text indexing in resource-constrained environments, we gained new insights into both topics. For the first topic (factorization algorithms), we improved the practical computation of the LZ78 parsing in low memory by using algorithmically engineered trie data structures. The main idea was to leverage compact hashing techniques. We also showed that the memory usage can be reduced further if we are allowed to output a variation of the factorization that stores a compressed version of a hash table. We later also studied the computation of lexicographic parsings, which depend on the order of the suffixes of the text. There, we proposed a sparse Phi array that stores enough information to represent the whole suffix array. While restoring the suffix array from the sparse Phi array seems to be inefficient, the storage layout of this small data structure suffices to efficiently compute lexicographic parsings that use lexicographically neighboring suffixes as references. For the second topic (working with sparse or compressed indexes), we reviewed the suffix binary search tree, a balanced search tree maintaining the order of designated suffixes, as a sparse indexing data structure capable of extracting the sparse suffix array and the sparse longest common prefix array. We also devised an indexing data structure built on top of a grammar to accelerate pattern matching by scanning for non-terminals, each covering several to many terminal symbols, instead of single terminal symbols.
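To make the LZ78 part concrete, the following is a minimal sketch of an LZ78 factorization whose trie is stored in a hash table keyed by (parent node, character). It assumes a plain Python dict in place of the compact hash tables used in the project, so it illustrates only the general idea, not the space-efficient implementation. The function names lz78_factorize and lz78_decode are hypothetical and chosen for this illustration.

```python
def lz78_factorize(text):
    """Return the LZ78 factors of text as (referred_factor, new_char) pairs.

    Factor index 0 denotes the empty factor, i.e. the trie root.
    """
    # Hash-based trie: maps (parent_node, char) to the child node id.
    # Assumption: an ordinary dict stands in for a compact hash table.
    trie = {}
    factors = []
    node = 0  # current trie node, starting at the root
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            # The current factor can be extended; descend in the trie.
            node = child
        else:
            # Longest known prefix found; emit a factor and add a trie leaf.
            factors.append((node, ch))
            trie[(node, ch)] = len(factors)
            node = 0
    if node != 0:
        # The text ended inside an already known factor.
        factors.append((node, ""))
    return factors


def lz78_decode(factors):
    """Rebuild the text from the (referred_factor, new_char) pairs."""
    strings = [""]  # strings[0] is the empty factor
    for ref, ch in factors:
        strings.append(strings[ref] + ch)
    return "".join(strings[1:])
```

For example, lz78_factorize("abababab") yields the factors a, b, ab, aba, b, and lz78_decode restores the original text from them.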
|
Strategy for Future Research |
The research raised several questions we want to investigate in the future. For the LZ78 trie computation, we showed how to also compute the LZW factorization, a practical variation of the LZ78 factorization. However, there is a whole family of LZ78-like factorizations, including LZD and LZMW, for which no such space-efficient algorithm exists yet. We ask to what extent our techniques can be generalized to compute these other factorizations. Regarding the proposed sparse Phi array representation, we have left its construction as an open problem. While a naive construction is straightforward, a space-efficient construction seems to come at the cost of running time. Advances in the r-index data structure have led to alternative representations of the Phi array, which seem to be good candidates for studying construction techniques. Finally, for the proposed index on grammar-compressed texts, we wonder whether we can attain a space/time trade-off by using grammars that improve locality sensitivity at the expense of storing more information. Several other open problems related to the efficient construction of useful data structures such as the sparse Phi array led us to propose an extension of this research project, which resulted in a new grant entitled "Constructing Compressed Indexes for Biological Sequences" with grant number JP23H04378.
|