| Project Area | Creation and Organization of Innovative Algorithmic Foundations for Leading Social Innovations |
| Project/Area Number |
23H04378
|
| Research Category |
Grant-in-Aid for Transformative Research Areas (A)
|
| Allocation Type | Single-year Grants |
| Review Section |
Transformative Research Areas, Section (IV)
|
| Research Institution | University of Yamanashi |
Principal Investigator |
Koeppl Dominik University of Yamanashi, Graduate Faculty of Interdisciplinary Research, Specially Appointed Associate Professor (50897395)
|
| Project Period (FY) |
2023-04-01 – 2026-03-31
|
| Project Status |
Granted (Fiscal Year 2025)
|
| Budget Amount |
¥5,200,000 (Direct Cost: ¥4,000,000, Indirect Cost: ¥1,200,000)
Fiscal Year 2024: ¥2,600,000 (Direct Cost: ¥2,000,000, Indirect Cost: ¥600,000)
Fiscal Year 2023: ¥2,600,000 (Direct Cost: ¥2,000,000, Indirect Cost: ¥600,000)
|
| Keywords | Biological Data / Compressed Indexes / Parameterized Pattern / Privacy-Preserving / SAT/ASP-based / FM-index / Pattern Matching / Text Indexing / Burrows-Wheeler / Suffix Arrays / Arithmetic progression / Galois Words / Compression Sensitivity / text indexing / data compression / pattern matching / index construction / string algorithm / resource constraints / matching statistics / compressed indexes / positional BWT / LZ78 factorization / Wheeler DFAs / compressed indices / construction algorithms / r-index / compression algorithms / lossless compression |
| Outline of Research at the Start |
Major breakthroughs in sequencing techniques facilitate the collection of large amounts of biological data. For these data to be of value, we need means to store and analyze them. Here, compressed indices are promising candidates for answering biologically meaningful queries while keeping the data in a manageably small compressed format. Nonetheless, even the construction of these indices is not well studied. In this project, we want to shed light on efficient ways to construct such indices and to use them for the aforementioned queries.
|
| Outline of Annual Research Achievements |
During fiscal year 2024, we worked on several problems in string processing and compressed data structures. First, we extended our conference paper on arithmetically progressed suffix arrays by additionally analyzing the shapes of the Burrows-Wheeler transforms of strings whose suffix arrays are arithmetically progressed. Moreover, we gave applications for Christoffel words, balanced words, and meta strings. Finally, we extended our study from binary and ternary alphabets to general alphabets. Second, we delved into factorizations of words, in particular Galois words, and into indexing data structures that are based on such factorizations. For Galois words, we gave algorithms to determine whether a word is Galois, to uniquely factorize a non-Galois word into Galois words (analogous to the Lyndon factorization), and to find the rotation of a word that is Galois. All algorithms run in linear time and pave the way for indexing data structures based on the Galois factorization. For the bijective Burrows-Wheeler transform (BWT), an indexing data structure based on the Lyndon factorization, we studied its compression sensitivity when editing a single character of the input. Like previous work on the BWT, we obtain a logarithmic multiplicative or a square-root additive change for specific cases. The compression sensitivity formalizes how small changes in the input affect compression, providing bounds and theoretical ways to analyze compression stability. Together, these works deepen our understanding of compressed indexing, pattern matching, and combinatorial properties of strings.
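To illustrate the first topic, the following minimal sketch builds a suffix array naively, tests whether it is arithmetically progressed, and derives the BWT from it. The formalization used here is an assumption made for illustration: a suffix array `sa` of length n is called arithmetically progressed if there is a step k with `sa[i+1] = (sa[i] + k) mod n` for all i (0-based positions). The function names are hypothetical, and the quadratic-time sorting is chosen for brevity; it is not one of the construction algorithms studied in the project.

```python
def suffix_array(text):
    """Naive suffix array: sort the starting positions by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def progression_step(sa):
    """Return k if sa[i+1] == (sa[i] + k) mod n for all i, otherwise None.

    This modular definition is an assumption made for this sketch.
    """
    n = len(sa)
    if n < 2:
        return 0
    k = (sa[1] - sa[0]) % n
    for i in range(1, n - 1):
        if (sa[i + 1] - sa[i]) % n != k:
            return None
    return k


def bwt_from_sa(text, sa):
    """BWT via the suffix array: the character cyclically preceding each suffix.

    For position 0, text[-1] wraps around to the last character.
    """
    return "".join(text[i - 1] for i in sa)


if __name__ == "__main__":
    word = "abaababa"  # a Fibonacci word of length 8
    sa = suffix_array(word)
    print(sa, progression_step(sa), bwt_from_sa(word, sa))
```

For the Fibonacci word `abaababa`, the suffix array is `[7, 2, 5, 0, 3, 6, 1, 4]` with step 3, and the resulting BWT `bbbaaaaa` consists of only two runs, which hints at why the shape of the BWT of such strings is of interest.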
|
| Current Status of Research Progress |
Current Status of Research Progress
2: Research has progressed on the whole more than it was originally planned.
Reason
This research project was initially planned to end in fiscal year 2024; due to a break in fiscal year 2023, it has been extended to end in fiscal year 2025. Within this updated research plan, we conducted the research as planned and can now start the final phase of the project.
|
| Strategy for Future Research Activity |
For FY2025, we want to advance four core goals of this project. First, to reduce the complexity of MAX-SAT encodings for smallest straight-line programs (SLPs) and bidirectional macro schemes (BMSs), two NP-hard but well-known problems in the data compression community, we address key bottlenecks: overlap checking in SLPs and transitivity in BMSs. We will apply recursive partitioning for SLPs and alternative encodings such as leaf elimination for BMSs. For compression, we explore the diversity of Huffman coding trees that produce the same codeword length distribution, starting with enumeration techniques for full binary trees. Next, we extend our parameterized Burrows-Wheeler transform (BWT) index, presented at DCC'24, to support online construction by combining techniques from the extended BWT and parameterized BWTs. Efficient computation of matching statistics, which is crucial in bioinformatics, will also be prioritized. In parallel, for rare-pattern search, we enhance the tau-lambda index (DCC'24) using compressed data structures. We target both improved performance and better usability by replacing the original build process with efficient, compressed alternatives. Finally, in privacy-preserving string algorithms, we build on research that introduces special "hashtag" characters to hide sensitive content. Here, we want to transform these hashtags back into normal characters without reviving sensitive substrings. We conjecture that the problem is solvable in polynomial time and aim to design an algorithm that performs this transformation safely and efficiently.
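As a point of reference for the matching statistics mentioned above, the following is a minimal sketch under the common definition that MS[i] is the length of the longest prefix of `pattern[i:]` that occurs somewhere in `text`. The function name is hypothetical, and the plain substring tests stand in for index queries; this is a naive baseline, not the compressed-index approach targeted by the project.

```python
def matching_statistics(pattern, text):
    """Return MS, where MS[i] is the length of the longest prefix of
    pattern[i:] that occurs as a substring of text.

    Uses the property MS[i] >= MS[i-1] - 1, so each step resumes from the
    previous match length minus one instead of starting from zero.
    """
    ms = []
    length = 0
    for i in range(len(pattern)):
        # Dropping the first character of the previous match keeps a match.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("anab", "banana"))
```

With `text = "banana"` and `pattern = "anab"`, the result is `[3, 2, 1, 1]`: the prefix `ana` occurs in the text, whereas `anab` does not. An index-based solution would replace the substring tests with backward-search or similar queries on a compressed index.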
|