2021 Fiscal Year Annual Research Report
Resource-Constraint Privacy-Aware Data Structures Tackling Problems in Bioinformatics
Publicly Offered Research
Project Area | Creation and Organization of Innovative Algorithmic Foundations for Leading Social Innovations |
Project/Area Number |
21H05847
|
Research Institution | Tokyo Medical and Dental University |
Principal Investigator |
Koeppl Dominik Tokyo Medical and Dental University, M&D Data Science Center, Assistant Professor (50897395)
|
Project Period (FY) |
2021-09-10 – 2023-03-31
|
Keywords | factorization algorithms / LZ78 compression / lexicographic parse / sparse suffix sorting / grammar compression / compressed data / memory-efficiency / hashing |
Outline of Annual Research Achievements |
Aiming to improve factorization algorithms and text indexing in resource-constrained environments, we gained new insights into both topics. For the first topic (factorization algorithms), we improved the practical computation of the LZ78 parsing in low memory by using algorithmically engineered trie data structures. The main idea was to leverage compact hashing techniques. We also showed that the memory usage can be reduced further if we are allowed to output a variation of the factorization that stores a compressed version of a hash table. We later also studied the computation of lexicographic parsings, which depend on the order of the suffixes in the text. There, we proposed a sparse Phi array that stores enough information to represent the whole suffix array. While restoring the suffix array from the sparse Phi array seems to be inefficient, the storage layout of this small data structure suffices to efficiently compute lexicographic parsings that use lexicographically neighboring suffixes as references. For the second topic (working with sparse or compressed indexes), we reviewed the suffix binary search tree, a balanced search tree maintaining the order of designated suffixes, as a sparse indexing data structure capable of extracting the sparse suffix array and the sparse longest common prefix array. We also devised an indexing data structure built on top of a grammar to accelerate pattern matching by scanning for non-terminals that each cover several to many terminal symbols instead of scanning single terminal symbols.
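To illustrate the kind of computation discussed for the first topic, the following is a minimal sketch of an LZ78 factorization whose trie is stored in a hash table. It is an assumption-laden illustration, not the project's implementation: it uses a plain Python dict keyed by (parent node, character) instead of compact hashing, and the names lz78_factorize and lz78_decode are hypothetical.

```python
def lz78_factorize(text):
    """Return the LZ78 factors of text as (referred_factor, new_char) pairs.

    Factor indices start at 1; index 0 denotes the empty factor (trie root).
    new_char is None only if the text ends inside an already known factor.
    """
    trie = {}      # (parent_node_id, char) -> child_node_id; illustrative stand-in for a compact hash table
    factors = []
    node = 0       # current trie node; 0 is the root
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the current factor
        else:
            # New trie node; its id equals the index of the factor created now.
            trie[(node, ch)] = len(factors) + 1
            factors.append((node, ch))
            node = 0                          # restart at the root
    if node != 0:
        factors.append((node, None))          # trailing factor without a new character
    return factors


def lz78_decode(factors):
    """Rebuild the text from the factor list produced by lz78_factorize."""
    phrases = [""]                            # phrases[0] is the empty factor
    for ref, ch in factors:
        phrases.append(phrases[ref] + (ch or ""))
    return "".join(phrases[1:])
```

For example, "abababab" is parsed into the factors a, b, ab, aba, b. Each trie edge is a single hash-table entry, which is the property that hashing-based trie representations exploit to save memory.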
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
We conducted the research for fiscal year 2021 as planned, and were able to complete most of our planned research by the end of the grant period in fiscal year 2022.
|
Strategy for Future Research Activity |
The research raised several questions that we want to investigate in the future. For the LZ78 trie computation, we showed how to also compute the LZW factorization, which is a practical variation of the LZ78 factorization. However, there is a whole family of LZ78-like factorizations, including LZD and LZMW, for which no such space-efficient algorithm exists yet. We ask to what extent we can generalize our techniques to compute other factorizations of this kind. Regarding the proposed sparse Phi array representation, we have left its construction as an open problem. While a naive construction is straightforward, a space-efficient construction seems to come at the cost of running time. Advances in the r-index data structure have led to alternative representations of the Phi array, which seem to be good candidates for studying construction techniques. Finally, for the proposed index on grammar-compressed texts, we wonder whether we can attain a space/time trade-off by using grammars that improve locality-sensitivity at the expense of storing more information. Several other open problems related to the efficient construction of useful data structures such as the sparse Phi array led us to propose an extension of this research project, which resulted in a new grant entitled "Constructing Compressed Indexes for Biological Sequences" with grant number JP23H04378.
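As background for the open problem on the Phi array, the sketch below shows the standard (full) Phi array, defined by Phi[SA[i]] = SA[i-1], together with a naive construction and the recovery of the suffix array from it. This is only an illustration under simplifying assumptions: it does not implement the sparse representation proposed in the project, the suffix array is built by plain sorting, and all function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array: sort the suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def phi_array(sa):
    """Full Phi array: phi[sa[i]] = sa[i - 1].

    The lexicographically smallest suffix has no predecessor, marked by None.
    """
    phi = [None] * len(sa)
    for i in range(1, len(sa)):
        phi[sa[i]] = sa[i - 1]
    return phi


def suffix_array_from_phi(phi, last):
    """Recover the suffix array by following phi from the largest suffix.

    last is the starting position of the lexicographically largest suffix.
    """
    sa = []
    pos = last
    while pos is not None:
        sa.append(pos)
        pos = phi[pos]
    sa.reverse()   # positions were collected from largest to smallest suffix
    return sa
```

For the text "banana", the suffix array is [5, 3, 1, 0, 4, 2] and following Phi from position 2 (the suffix "nana") restores it. The naive construction materializes the whole suffix array first, which is the memory cost that a space-efficient construction would need to avoid.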
|
Research Products
(19 results)