Constructing Compressed Indexes for Biological Sequences

Publicly Offered Research

Project Area	Creation and Organization of Innovative Algorithmic Foundations for Leading Social Innovations
Project/Area Number	23H04378
Research Category	Grant-in-Aid for Transformative Research Areas (A)
Allocation Type	Single-year Grants
Review Section	Transformative Research Areas, Section (IV)
Research Institution	University of Yamanashi
Principal Investigator	Koeppl Dominik University of Yamanashi, Graduate Faculty of Interdisciplinary Research, Specially Appointed Associate Professor (50897395)
Project Period (FY)	2023-04-01 – 2026-03-31
Project Status	Granted (Fiscal Year 2024)
Budget Amount	¥5,200,000 (Direct Cost: ¥4,000,000, Indirect Cost: ¥1,200,000) Fiscal Year 2024: ¥2,600,000 (Direct Cost: ¥2,000,000, Indirect Cost: ¥600,000) Fiscal Year 2023: ¥2,600,000 (Direct Cost: ¥2,000,000, Indirect Cost: ¥600,000)
Keywords	text indexing / data compression / pattern matching / index construction / string algorithm / resource constraints / matching statistics / compressed indexes / positional BWT / LZ78 factorization / Wheeler DFAs / compressed indices / construction algorithms / r-index / compression algorithms / lossless compression
Outline of Research at the Start	Major breakthroughs in sequencing techniques facilitate the collection of large amounts of biological data. For these data to be of value, we need means to store and analyze them. Compressed indices are promising candidates for answering biologically meaningful queries while keeping the data in a manageably small compressed format. Nonetheless, even the construction of such indices is not well studied. In this project, we want to shed light on efficient ways to construct such indices and to use them for the aforementioned queries.
Outline of Annual Research Achievements	During fiscal year 2023, we worked on three variations of the Burrows-Wheeler transform: one built on a grammar, the positional one, and the Wheeler graph. First, by storing the index of our DCC'22 paper in addition to the FM-index, we use both to accelerate counting queries for all pattern lengths, at the expense of more space than the FM-index alone. While the FM-index matches a pattern character by character, we can switch to the DCC'22 index, which matches blocks of pattern characters defined by the grammar. Second, we proposed space-efficient data structures that augment the positional Burrows-Wheeler transform for efficiently finding set-maximal exact matches, and compared them with the baseline approach, which uses the plain divergence array. Third, we worked on matching statistics on Wheeler DFAs, which allow us to efficiently match a pattern against multiple reference genomes represented by a de Bruijn graph. Known results use a plain longest common prefix array, which takes space linear in the number of states. We proposed a space-efficient representation that requires a linear number of bits with logarithmic access time. As an application, we give a matching statistics computation that now admits a time-space trade-off. As a side result, we worked on the substring compression problem for derivatives of the LZ78 factorization, which seem to be practically relevant. Here, we used the suffix tree as an index to quickly compute an LZ78-like factorization of a queried substring range.
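The character-wise matching of the FM-index mentioned above can be illustrated with a minimal Python sketch of a counting query via backward search. This is not the project's implementation: the function names (`bwt`, `build_fm_index`, `count`) are hypothetical, the transform is built naively from sorted rotations, and the occurrence table is stored uncompressed, so the sketch only suits small inputs. It assumes the text ends with a unique sentinel `$` that is smaller than every other character.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(text):
    """Build the C array and a plain (uncompressed) occurrence table."""
    last = bwt(text)
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c] = number of characters in the text that are strictly smaller than c
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in last[:i]
    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ, len(last)


def count(pattern, fm):
    """Count occurrences of pattern, processing it one character at a time from the right."""
    C, occ, n = fm
    lo, hi = 0, n  # half-open range of sorted rotations prefixed by the current pattern suffix
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


fm = build_fm_index("GATTACATTAG$")
print(count("TTA", fm))
```

Each pattern character costs one step of the loop in `count`; the approach described above instead matches whole blocks of pattern characters where possible.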
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: We conducted the research for fiscal year 2023 as planned and can continue with the research for fiscal year 2024 as outlined in the research plan.
Strategy for Future Research Activity	We continue our investigation into processing and managing the vast amounts of data that arise in bioinformatics thanks to the proliferation of low-cost sequencing technology. Having established some theoretical background for compression techniques during the previous fiscal year, and having introduced practical applications of the positional Burrows-Wheeler transform in bioinformatics, we are now delving deeper into the findings we shared at DCC'24 and exploring variations of our problem settings. Our primary focus remains on constructing indexes that use the Burrows-Wheeler transform and data compression techniques in compressed space. Our main target is the efficient indexing of bioinformatics data, which aligns with the goals of our research project. In particular, we aim to publish a journal version of our WABI'23 paper, which introduced an FM-index capable of faster pattern matching by incorporating insights from our DCC'22 paper. In this endeavor, we are replacing grammar compression with prefix-free parsing (PFP). Presently, our implementation relies on a plain Burrows-Wheeler transform (BWT), resulting in a larger memory footprint than a standard FM-index, albeit with quicker query times. Switching to run-length compression and fine-tuning the parameters of PFP should lead to significant improvements in memory utilization. Additionally, we are extending our DCC'24 findings on computing LZ78 derivatives from suffix trees to compressed indexes.
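Why run-length compression of the BWT can reduce memory on repetitive data can be seen in a small Python sketch. It is an illustration under simplifying assumptions, not the planned implementation: the function names are hypothetical, the BWT is computed naively from sorted rotations, and the runs are kept as a plain list of (character, length) pairs rather than in a succinct structure.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse each maximal run of equal characters into a (character, length) pair."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(s)]


def run_length_decode(runs):
    """Inverse of run_length_encode."""
    return "".join(ch * length for ch, length in runs)


# A highly repetitive toy text; '$' is an assumed unique, smallest sentinel.
text = "ACGT" * 8 + "$"
last = bwt(text)
runs = run_length_encode(last)
assert run_length_decode(runs) == last  # the encoding is lossless
print(len(last), len(runs))  # BWT length versus number of runs
```

On repetitive inputs, the number of runs is typically much smaller than the text length, and run-length compressed indexes such as the r-index exploit exactly this gap.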

Report

(1 results)

2023 Annual Research Report

Research Products
(10 results)

All Int'l Joint Research (3 results) Journal Article (5 results) (of which Int'l Joint Research: 5 results, Peer Reviewed: 5 results, Open Access: 2 results) Presentation (1 results) Remarks (1 results)

[Int'l Joint Research] University of Florida (USA)
- Related Report
  2023 Annual Research Report
[Int'l Joint Research] Dalhousie University (Canada)
- Related Report
  2023 Annual Research Report
[Int'l Joint Research] Ca' Foscari University of Venice/Gran Sasso Science Institute/University of Milano Bicocca (Italy)
- Related Report
  2023 Annual Research Report
[Journal Article] Computing LZ78-Derivates with Suffix Trees (2024)
- Author(s)
  Dominik Koeppl
- Journal Title
  Proceedings of DCC
  Volume: - Pages: 133-142
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] mu-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data (2023)
- Author(s)
  Davide Cozzi and Massimiliano Rossi and Simone Rubinacci and Travis Gagie and Dominik Koeppl and Christina Boucher and Paola Bonizzoni
- Journal Title
  Bioinformatics
  Volume: 39 Issue: 9
- DOI
  10.1093/bioinformatics/btad552
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Acceleration of FM-Index Queries Through Prefix-Free Parsing (2023)
- Author(s)
  Aaron Hong and Marco Oliva and Dominik Koeppl and Hideo Bannai and Christina Boucher and Travis Gagie
- Journal Title
  Proceedings of WABI
  Volume: 273
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Space-time Trade-offs for the LCP Array of Wheeler DFAs (2023)
- Author(s)
  Nicola Cotumaccio and Travis Gagie and Dominik Koeppl and Nicola Prezza
- Journal Title
  Proceedings of SPIRE
  Volume: 14240 Pages: 143-156
- DOI
  10.1007/978-3-031-43980-3_12
- ISBN
  9783031439797, 9783031439803
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Data Structures for SMEM-Finding in the PBWT (2023)
- Author(s)
  Paola Bonizzoni and Christina Boucher and Davide Cozzi and Travis Gagie and Dominik Koeppl and Massimiliano Rossi
- Journal Title
  Proceedings of SPIRE
  Volume: 14240 Pages: 89-101
- DOI
  10.1007/978-3-031-43980-3_8
- ISBN
  9783031439797, 9783031439803
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Int'l Joint Research
[Presentation] On Substring Compression for the LZD and LZMW Factorizations (2023)
- Author(s)
  Dominik Koeppl
- Organizer
  Local Proceedings of the 195th Algorithms Research Meeting (アルゴリズム研究会)
- Related Report
  2023 Annual Research Report
[Remarks] personal homepage
- URL
  https://dkppl.de/
- Related Report
  2023 Annual Research Report
