Constructing Compressed Indexes for Biological Sequences

Publicly Offered Research

Project Area: Creation and Organization of Innovative Algorithmic Foundations for Leading Social Innovations
Project/Area Number 23H04378
Research Category

Grant-in-Aid for Transformative Research Areas (A)

Allocation Type: Single-year Grants
Review Section Transformative Research Areas, Section (IV)
Research Institution: University of Yamanashi

Principal Investigator

Koeppl Dominik  University of Yamanashi, Graduate Faculty of Interdisciplinary Research, Specially Appointed Associate Professor (50897395)

Project Period (FY) 2023-04-01 – 2026-03-31
Project Status Granted (Fiscal Year 2025)
Budget Amount
¥5,200,000 (Direct Cost: ¥4,000,000, Indirect Cost: ¥1,200,000)
Fiscal Year 2024: ¥2,600,000 (Direct Cost: ¥2,000,000, Indirect Cost: ¥600,000)
Fiscal Year 2023: ¥2,600,000 (Direct Cost: ¥2,000,000, Indirect Cost: ¥600,000)
Keywords: Biological Data / Compressed Indexes / Parameterized Pattern / Privacy-Preserving / SAT/ASP-based / FM-index / Pattern Matching / Text Indexing / Burrows-Wheeler / Suffix Arrays / Arithmetic progression / Galois Words / Compression Sensitivity / text indexing / data compression / pattern matching / index construction / string algorithm / resource constraints / matching statistics / compressed indexes / positional BWT / LZ78 factorization / Wheeler DFAs / compressed indices / construction algorithms / r-index / compression algorithms / lossless compression
Outline of Research at the Start

Major breakthroughs in sequencing techniques facilitate the collection of large amounts of biological data. For these data to be of value, we need means to store and analyze them. Compressed indices are promising candidates for answering biologically meaningful queries while keeping the data in a manageably small compressed format. Nonetheless, even the construction of these indices is not well studied. In this project, we want to shed light on efficient ways to construct such indices and to use them for the aforementioned queries.

Outline of Annual Research Achievements

During fiscal year 2024, we worked on several problems in string processing and compressed data structures.

First, we extended our conference paper on arithmetically progressed suffix arrays by additionally analyzing the shapes of the Burrows-Wheeler transforms of strings whose suffix arrays are arithmetically progressed. Moreover, we gave applications to Christoffel words, balanced words, and meta strings. Finally, we extended our study from binary and ternary alphabets to general alphabets.

Second, we delved into factorizations of words, in particular into Galois words, and into indexing data structures that are based on such factorizations. For Galois words, we gave algorithms to determine whether a word is Galois, to factorize a non-Galois word uniquely into Galois words (analogously to the Lyndon factorization), and to find the rotation of a word that is Galois. All algorithms run in linear time and pave the way for indexing data structures based on the Galois factorization.

For the bijective Burrows-Wheeler transform (BWT), an indexing data structure based on the Lyndon factorization, we studied its compression sensitivity when a single character of the input is edited. As in previous work on the BWT, we obtain a logarithmic multiplicative or a square-root additive change for specific cases. The compression sensitivity formalizes how small changes in the input affect compression, providing bounds and theoretical tools for analyzing compression stability.

Together, these works deepen our understanding of compressed indexing, pattern matching, and combinatorial properties of strings.
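To make the notions of Lyndon factorization, bijective BWT, and compression sensitivity more concrete, the following Python sketch may help. It is only an illustration and not the method of the cited papers: the function names (`lyndon_factorization`, `bbwt`, `count_runs`) are hypothetical, the bijective BWT is computed by naively sorting all rotations of the Lyndon factors (far from linear time), and the number of character runs is used here as an assumed stand-in for a compression measure.

```python
from itertools import groupby


def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: split s into a non-increasing sequence of Lyndon words."""
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # The scanned prefix consists of repetitions of a Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def bbwt(s: str) -> str:
    """Naive bijective BWT: sort all rotations of all Lyndon factors by
    comparing their infinite repetitions, then read the last characters."""
    n = len(s)
    rotations = [f[i:] + f[:i]
                 for f in lyndon_factorization(s)
                 for i in range(len(f))]

    def key(rot: str) -> str:
        # A prefix of length 2n of rot^omega suffices to order the rotations.
        reps = -(-2 * n // len(rot))  # ceiling division
        return (rot * reps)[:2 * n]

    rotations.sort(key=key)
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (assumed compression proxy)."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    original = "banana"
    edited = "bananb"  # a single-character substitution (hypothetical example)
    for text in (original, edited):
        transformed = bbwt(text)
        print(text, lyndon_factorization(text), transformed, count_runs(transformed))
```

Comparing the run counts printed for the original and the edited string gives a toy impression of what a sensitivity bound quantifies; the actual results concern worst-case multiplicative and additive changes over all strings and edits.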

Current Status of Research Progress

2: Research has progressed on the whole more than it was originally planned.

Reason

While this research project was initially planned to end in fiscal year 2024, it has been extended to end in fiscal year 2025 due to a break in fiscal year 2023.
Within this updated research plan, we conducted the research as planned and can now start the final phase of the project.

Strategy for Future Research Activity

For FY2025, we want to advance four core goals of this project. First, to reduce the complexity of MAX-SAT encodings for smallest straight-line programs (SLPs) and bidirectional macro schemes (BMSs), two NP-hard but well-known problems in the data compression community, we address key bottlenecks: overlap checking in SLPs and transitivity in BMSs. We will apply recursive partitioning for SLPs and alternative encodings such as leaf elimination for BMSs. For compression, we explore the diversity of Huffman coding trees that produce the same codeword length distribution, starting with enumeration techniques for full binary trees. Next, we extend our parameterized Burrows-Wheeler transform (BWT) index, presented at DCC'24, to support online construction by combining techniques from the extended BWT and parameterized BWTs. Efficient computation of matching statistics, which is crucial in bioinformatics, will also be prioritized. In parallel, for rare-pattern search, we enhance the tau-lambda index (DCC'24) using compressed data structures. We target both improved performance and better usability by replacing the original build process with efficient, compressed alternatives. Finally, in privacy-preserving string algorithms, we build on research that introduces special "hashtag" characters to hide sensitive content. Here, we want to transform these hashtags back into normal characters without reviving sensitive substrings. We conjecture that the problem is solvable in polynomial time and aim to design an algorithm that performs this transformation safely and efficiently.
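As a reminder of what matching statistics are, the sketch below computes them naively in Python. It assumes the common definition in which entry i is the length of the longest prefix of the pattern suffix starting at position i that occurs somewhere in the text. The function name `matching_statistics` is hypothetical, and the repeated substring tests are a deliberately simple substitute for the compressed indexes the project targets.

```python
def matching_statistics(text: str, pattern: str) -> list[int]:
    """For each position i of pattern, return the length of the longest
    prefix of pattern[i:] that occurs as a substring of text (naive version)."""
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs in text.
        while (i + length < len(pattern)
               and pattern[i:i + length + 1] in text):
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    # "nan", "an", "n", and "b" are the longest matches starting at positions 0..3.
    print(matching_statistics("banana", "nanb"))  # [3, 2, 1, 1]
```

An index-based solution would typically replace the substring tests by backward search steps on a BWT-based index, so that the text itself never needs to be scanned.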

Report

(2 results)
  • 2024 Annual Research Report
  • 2023 Annual Research Report
  • Research Products

    (28 results)

Int'l Joint Research (3 results) / Journal Article (13 results; of which Int'l Joint Research: 13 results, Peer Reviewed: 13 results, Open Access: 8 results) / Presentation (10 results; of which Int'l Joint Research: 3 results) / Remarks (2 results)

  • [Int'l Joint Research] University of Florida (USA)

    • Related Report
      2023 Annual Research Report
  • [Int'l Joint Research] Dalhousie University (Canada)

    • Related Report
      2023 Annual Research Report
  • [Int'l Joint Research] Ca' Foscari University of Venice / Gran Sasso Science Institute / University of Milano Bicocca (Italy)

    • Related Report
      2023 Annual Research Report
  • [Journal Article] Compression Sensitivity of the Burrows-Wheeler transform and its Bijective Variant (2025)

    • Author(s)
      Hyodam Jeon and Dominik Koeppl
    • Journal Title

      Mathematics

      Volume: 13(7) Issue: 7 Pages: 1-46

    • DOI

      10.3390/math13071070

    • Related Report
      2024 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Pfp-fm: an accelerated FM-index (2024)

    • Author(s)
      Aaron Hong and Marco Oliva and Dominik Koeppl and Hideo Bannai and Christina Boucher and Travis Gagie
    • Journal Title

      Algorithms for Molecular Biology

      Volume: 19 Issue: 1 Pages: 1-14

    • DOI

      10.1186/s13015-024-00260-8

    • Related Report
      2024 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] On arithmetically progressed suffix arrays and related Burrows-Wheeler transforms (2024)

    • Author(s)
      Jacqueline W. Daykin and Dominik Koeppl and David Kuebel and Florian Stober
    • Journal Title

      Discrete Applied Mathematics

      Volume: 355 Pages: 180-199

    • DOI

      10.1016/j.dam.2024.04.009

    • Related Report
      2024 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Breaking a Barrier in Constructing Compact Indexes for Parameterized Pattern Matching (2024)

    • Author(s)
      Kento Iseri and Tomohiro I and Diptarama Hendrian and Dominik Koeppl and Ryo Yoshinaka and Ayumi Shinohara
    • Journal Title

      Proceedings of ICALP

      Volume: 297

    • Related Report
      2024 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Algorithms for Galois Words: Detection, Factorization, and Rotation (2024)

    • Author(s)
      Diptarama Hendrian and Dominik Koeppl and Ryo Yoshinaka and Ayumi Shinohara
    • Journal Title

      Proceedings of CPM

      Volume: 296

    • Related Report
      2024 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] LZ78 Substring Compression with CDAWGs (2024)

    • Author(s)
      Hiroki Shibata and Dominik Koeppl
    • Journal Title

      Proceedings of SPIRE

      Volume: 14899 Pages: 289-305

    • DOI

      10.1007/978-3-031-72200-4_22

    • ISBN
      9783031721991, 9783031722004
    • Related Report
      2024 Annual Research Report
    • Peer Reviewed / Int'l Joint Research
  • [Journal Article] Bijective BWT based Compression Schemes (2024)

    • Author(s)
      Golnaz Badkobeh and Hideo Bannai and Dominik Koeppl
    • Journal Title

      Proceedings of SPIRE

      Volume: 14899 Pages: 16-25

    • DOI

      10.1007/978-3-031-72200-4_2

    • ISBN
      9783031721991, 9783031722004
    • Related Report
      2024 Annual Research Report
    • Peer Reviewed / Int'l Joint Research
  • [Journal Article] Edit and Alphabet-Ordering Sensitivity of Lex-Parse (2024)

    • Author(s)
      Yuto Nakashima and Dominik Koeppl and Mitsuru Funakoshi and Shunsuke Inenaga and Hideo Bannai
    • Journal Title

      Proceedings of MFCS

      Volume: 306

    • Related Report
      2024 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Computing LZ78-Derivates with Suffix Trees (2024)

    • Author(s)
      Dominik Koeppl
    • Journal Title

      Proceedings of DCC

      Pages: 133-142

    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Int'l Joint Research
  • [Journal Article] mu-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data (2023)

    • Author(s)
      Davide Cozzi and Massimiliano Rossi and Simone Rubinacci and Travis Gagie and Dominik Koeppl and Christina Boucher and Paola Bonizzoni
    • Journal Title

      Bioinformatics

      Volume: 39 Issue: 9

    • DOI

      10.1093/bioinformatics/btad552

    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Acceleration of FM-Index Queries Through Prefix-Free Parsing (2023)

    • Author(s)
      Aaron Hong and Marco Oliva and Dominik Koeppl and Hideo Bannai and Christina Boucher and Travis Gagie
    • Journal Title

      Proceedings of WABI

      Volume: 273

    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Space-time Trade-offs for the LCP Array of Wheeler DFAs (2023)

    • Author(s)
      Nicola Cotumaccio and Travis Gagie and Dominik Koeppl and Nicola Prezza
    • Journal Title

      Proceedings of SPIRE

      Volume: 14240 Pages: 143-156

    • DOI

      10.1007/978-3-031-43980-3_12

    • ISBN
      9783031439797, 9783031439803
    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Int'l Joint Research
  • [Journal Article] Data Structures for SMEM-Finding in the PBWT (2023)

    • Author(s)
      Paola Bonizzoni and Christina Boucher and Davide Cozzi and Travis Gagie and Dominik Koeppl and Massimiliano Rossi
    • Journal Title

      Proceedings of SPIRE

      Volume: 14240 Pages: 89-101

    • DOI

      10.1007/978-3-031-43980-3_8

    • ISBN
      9783031439797, 9783031439803
    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Int'l Joint Research
  • [Presentation] On Solving the Sparse Matrix Compression Problem Greedily (2025)

    • Author(s)
      Dominik Koeppl and Vincent Limouzy and Andrea Marino and Giulia Punzi and Takeaki Uno
    • Organizer
      Local Proceedings of the LA Symposium Winter 2024
    • Related Report
      2024 Annual Research Report
  • [Presentation] On Solving the Sparse Matrix Compression Problem Greedily (2025)

    • Author(s)
      Dominik Koeppl and Vincent Limouzy and Andrea Marino and Giulia Punzi and Takeaki Uno
    • Organizer
      The 29th London Stringology Days and London Algorithmic Workshop
    • Related Report
      2024 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Compression Sensitivity of the Bijective Burrows-Wheeler transform (2025)

    • Author(s)
      Hyodam Jeon and Dominik Koeppl
    • Organizer
      International Workshop on Discrete Mathematics and Algorithms
    • Related Report
      2024 Annual Research Report
    • Int'l Joint Research
  • [Presentation] On the Compression Sensitivity of the Bijective Burrows-Wheeler Transform (2025)

    • Author(s)
      Hyodam Jeon and Dominik Koeppl
    • Organizer
      Local Proceedings of the 202nd Algorithms Research Meeting (アルゴリズム研究会)
    • Related Report
      2024 Annual Research Report
  • [Presentation] NP-Completeness of Sparse Matrix Compression with Logarithmic Width (2024)

    • Author(s)
      Hideo Bannai and Keisuke Goto and 田 峻介 and Dominik Koeppl
    • Organizer
      Local Proceedings of the LA Symposium Summer 2024
    • Related Report
      2024 Annual Research Report
  • [Presentation] LZ78 Substring Compression with CDAWGs (2024)

    • Author(s)
      Hiroki Shibata and Dominik Koeppl
    • Organizer
      Local Proceedings of the LA Symposium Summer 2024
    • Related Report
      2024 Annual Research Report
  • [Presentation] LZ78 Substring Compression in CDAWG-compressed Space (2024)

    • Author(s)
      Hiroki Shibata and Dominik Koeppl
    • Organizer
      Local Proceedings of WAAC
    • Related Report
      2024 Annual Research Report
  • [Presentation] Overcoming boundaries of AI: future prospects (2024)

    • Author(s)
      Dominik Koeppl
    • Organizer
      EU-Japan AI Bridge: Connecting International Researchers and the Japanese AI Start-up Scene
    • Related Report
      2024 Annual Research Report
  • [Presentation] Enumerating full binary trees in polynomial delay (2024)

    • Author(s)
      Yasuko Matsui and Hirotaka Ono and Dominik Koeppl
    • Organizer
      25th International Symposium on Mathematical Programming (ISMP)
    • Related Report
      2024 Annual Research Report
    • Int'l Joint Research
  • [Presentation] On Substring Compression for the LZD and LZMW Factorizations (2023)

    • Author(s)
      Dominik Koeppl
    • Organizer
      Local Proceedings of the 195th Algorithms Research Meeting (アルゴリズム研究会)
    • Related Report
      2023 Annual Research Report
  • [Remarks] Personal Homepage

    • URL

      https://dkppl.de/

    • Related Report
      2024 Annual Research Report
  • [Remarks] Personal Homepage

    • URL

      https://dkppl.de/

    • Related Report
      2023 Annual Research Report

Published: 2023-04-13   Modified: 2025-12-26  
