Annual Research Report, Fiscal Year 2023

Constructing Compressed Indexes for Biological Sequences

Publicly Offered Research

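Project Area	Creation and Systematization of Innovative Algorithmic Foundations as a Source of Social Transformation (社会変革の源泉となる革新的アルゴリズム基盤の創出と体系化)
Project/Area Number	23H04378
Research Institution	University of Yamanashi
Principal Investigator	Koeppl Dominik, University of Yamanashi, Graduate Faculty of Interdisciplinary Research, Specially Appointed Associate Professor (50897395)
Project Period (FY)	2023-04-01 – 2026-03-31
Keywords	data compression / text indexing / resource constraints / matching statistics / compressed indexes / positional BWT / LZ78 factorization / Wheeler DFAs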
Outline of Annual Research Achievements	During fiscal year 2023, we worked on three variations of the Burrows-Wheeler transform (BWT): one built on a grammar, the positional BWT, and the Wheeler graph. First, by storing the index from our DCC'22 paper in addition to the FM-index, we used both to accelerate counting queries for all pattern lengths, at the expense of more space than the FM-index alone. While the FM-index matches a pattern character by character, we can switch to the DCC'22 index, which matches blocks of pattern characters defined by the grammar. Second, we proposed space-efficient data structures that augment the positional Burrows-Wheeler transform for efficiently finding set-maximal exact matches, and compared them with the baseline approach, which uses the plain divergence array. Third, we worked on matching statistics on Wheeler DFAs, which allow us to efficiently match a pattern against multiple reference genomes represented by a de Bruijn graph. Known results use a plain longest common prefix array, which takes space linear in the number of states. We proposed a space-efficient representation that requires a linear number of bits with logarithmic access time. As an application, we also showed how to compute matching statistics, which we can now do with a time-space trade-off. As a side result, we worked on the substring compression problem for derivatives of the LZ78 factorization, which seem to be practically relevant. Here, we used the suffix tree as an index to quickly compute an LZ78-like factorization of a queried substring range.
Current Status of Research Progress (Category)	2: Progressing largely as planned. Reason: We conducted the research for fiscal year 2023 as planned and can continue with the research for fiscal year 2024 as laid out in the research plan.
Strategy for Future Research Activity	We continue our investigation into processing and managing the vast amounts of data that arise in bioinformatics owing to the proliferation of low-cost sequencing technology. Having established some theoretical background for compression techniques during the previous fiscal year, and having introduced practical applications of the positional Burrows-Wheeler transform in bioinformatics, we are now delving deeper into the findings we presented at DCC'24 and exploring variations of our problem settings. Our primary focus remains on constructing indexes in compressed space using the Burrows-Wheeler transform (BWT) and data compression techniques. Our main target is the efficient indexing of bioinformatics data, which aligns with the goals of our research project. In particular, we are working toward a journal version of our WABI'23 paper, which introduced an FM-index capable of faster pattern matching by incorporating insights from our DCC'22 paper. In this work, we replace grammar compression with prefix-free parsing (PFP). At present, our implementation relies on a plain BWT, which results in a larger memory footprint than a standard FM-index, albeit with faster query times. Switching to run-length compression and fine-tuning the parameters of PFP should significantly reduce memory usage. Additionally, we are extending our DCC'24 results on computing LZ78 derivatives from suffix trees to compressed indexes.
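
The character-wise counting query of a plain FM-index, and the reason run-length compression may reduce its footprint, can be illustrated with a minimal Python sketch. It is not the implementation from the WABI'23 or DCC'22 papers: the function names (`bwt`, `build_fm_index`, `count`, `num_runs`) are hypothetical, the BWT is built naively by sorting rotations, and the rank tables are stored uncompressed, so it is suitable only for tiny inputs.

```python
def bwt(text: str) -> str:
    """Naive BWT by sorting all rotations (illustration only).

    Assumes `text` ends with a unique sentinel '$' that is smaller
    than every other character."""
    assert text.endswith("$") and text.count("$") == 1
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def build_fm_index(bwt_str: str):
    """Build the two tables used by backward search.

    C[c]      = number of characters in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    Both are plain (uncompressed) tables in this sketch."""
    alphabet = sorted(set(bwt_str))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] * (len(bwt_str) + 1) for c in alphabet}
    for i, ch in enumerate(bwt_str):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ


def count(pattern: str, C, occ, n: int) -> int:
    """Count occurrences of `pattern` by backward search,
    consuming one pattern character per step (right to left)."""
    lo, hi = 0, n  # half-open range of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


def num_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, ch in enumerate(s) if i == 0 or ch != s[i - 1])


if __name__ == "__main__":
    b = bwt("banana$")
    C, occ = build_fm_index(b)
    print(b, count("ana", C, occ, len(b)), num_runs(b))
```

Each loop iteration in `count` handles a single character, which is the step that block-wise matching aims to shorten. `num_runs` reports how many runs a run-length encoding of the BWT would need to store; on repetitive inputs this number is typically much smaller than the text length.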

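Research Products
(10 results)

Int'l Joint Research (3 results), Journal Article (5 results; of which Int'l Joint Research: 5, Peer Reviewed: 5, Open Access: 2), Presentation (1 result), Remarks (1 result)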

[Int'l Joint Research] University of Florida (USA)
- Country Name
  USA
- Counterpart Institution
  University of Florida
[Int'l Joint Research] Dalhousie University (Canada)
- Country Name
  Canada
- Counterpart Institution
  Dalhousie University
[Int'l Joint Research] Ca' Foscari University of Venice/Gran Sasso Science Institute/University of Milano Bicocca (Italy)
- Country Name
  Italy
- Counterpart Institution
  Ca' Foscari University of Venice/Gran Sasso Science Institute/University of Milano Bicocca
[Journal Article] Computing LZ78-Derivates with Suffix Trees (2024)
- Author(s)
  Dominik Koeppl
- Journal Title
  Proceedings of DCC
  Volume: - Pages: 133-142
- Peer Reviewed / Int'l Joint Research
[Journal Article] mu-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data (2023)
- Author(s)
  Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik Koeppl, Christina Boucher, and Paola Bonizzoni
- Journal Title
  Bioinformatics
  Volume: 39 Pages: 74:1-74:20
- DOI
  10.1093/bioinformatics/btad552
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Acceleration of FM-Index Queries Through Prefix-Free Parsing (2023)
- Author(s)
  Aaron Hong, Marco Oliva, Dominik Koeppl, Hideo Bannai, Christina Boucher, and Travis Gagie
- Journal Title
  Proceedings of WABI
  Volume: 273 Pages: 13:1-13:16
- DOI
  10.4230/LIPIcs.WABI.2023.13
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Space-time Trade-offs for the LCP Array of Wheeler DFAs (2023)
- Author(s)
  Nicola Cotumaccio, Travis Gagie, Dominik Koeppl, and Nicola Prezza
- Journal Title
  Proceedings of SPIRE
  Volume: 14240 Pages: 143-156
- DOI
  10.1007/978-3-031-43980-3_12
- Peer Reviewed / Int'l Joint Research
[Journal Article] Data Structures for SMEM-Finding in the PBWT (2023)
- Author(s)
  Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, Dominik Koeppl, and Massimiliano Rossi
- Journal Title
  Proceedings of SPIRE
  Volume: 14240 Pages: 89-101
- DOI
  10.1007/978-3-031-43980-3_8
- Peer Reviewed / Int'l Joint Research
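[Presentation] On Substring Compression for the LZD and LZMW Factorizations (LZD と LZMW 分解の部分文字列圧縮について) (2023)
- Author(s)
  Dominik Koeppl
- Organizer
  Local Proceedings of the 195th Algorithms Research Meeting (アルゴリズム研究会)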
[Remarks] personal homepage
- URL
  https://dkppl.de/
