Indexing Massive Datasets with Algorithmic Engineered Compression Techniques on Modern Computer Architectures

Research Project

Project/Area Number	21K17701
Research Category	Grant-in-Aid for Early-Career Scientists
Allocation Type	Multi-year Fund
Review Section	Basic Section 60010: Theory of informatics-related
Research Institution	University of Yamanashi (2023), Tokyo Medical and Dental University (2021-2022)
Principal Investigator	Koeppl Dominik, University of Yamanashi, Graduate Faculty of Interdisciplinary Research, Specially Appointed Associate Professor (50897395)
Project Period (FY)	2021-04-01 – 2025-03-31
Project Status	Granted (Fiscal Year 2023)
Budget Amount	¥4,680,000 (Direct Cost: ¥3,600,000, Indirect Cost: ¥1,080,000); Fiscal Year 2023: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000); Fiscal Year 2022: ¥2,340,000 (Direct Cost: ¥1,800,000, Indirect Cost: ¥540,000); Fiscal Year 2021: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000)
Keywords	compressed indexes / string subsequences / NP-hard problems / straight line programs / collage systems / block trees / parameterized BWT / pattern matching / data compression / matrix multiplication / matrix compression / subsequences / compact hashing / SIMD instructions / hybrid text indexes / compression techniques / indexing data structures / algorithm engineering / lossless compression / hybrid indexes
Outline of Research at the Start	As ever more massive datasets are generated, there is a growing need to manage and analyze them efficiently. Our idea to meet this need is to leverage compression techniques not only to compress data but also to process it in such a way that specific queries can be executed in reasonable time. We aim for practical and time-efficient compressed data structures that bridge the gap between traditional indexing solutions and compression techniques by embracing modern computer architectures.
Outline of Annual Research Achievements	Following the research plan outlined for fiscal year 2023, we focused on extending string regularities from substrings to subsequences, exploring NP-hard problems on strings, and refining compressed indexing data structures. In the first thematic area, we achieved better space and time bounds for computing the longest Lyndon subsequence than those presented at IWOCA 2022. Furthermore, we showed how to compute the longest bordered and periodic subsequences, using novel tools that compute the longest common subsequences between all prefixes and suffixes of a text. For the longest bordered subsequence, we also established a conditional lower bound matching our quadratic running time. In the second thematic area, we studied common NP-hard problems with strings as inputs by leveraging answer set programming solvers. We also proved the NP-hardness of finding the smallest run-length compressed straight-line program (RLSLP) for unbounded alphabet sizes, adapted this proof to finding the smallest collage system, and devised a MAX-SAT encoding for computing the smallest RLSLP. In the final thematic area, we made advances in construction, practically for block trees and theoretically for the parameterized Burrows-Wheeler transform. For the latter, we also demonstrated that the transform can be adapted to circular pattern matching by changing the encoding.
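The quantity mentioned above, the longest common subsequence (LCS) between every prefix and every suffix of a text, can be illustrated with a naive baseline. The following Python sketch is not the method of the cited work; it is a plain cubic-time dynamic program given only to make the definition concrete. The function names `lcs_against_all_prefixes` and `prefix_suffix_lcs` are hypothetical and chosen for this illustration.

```python
def lcs_against_all_prefixes(a: str, b: str) -> list[int]:
    """Return r with r[i] = LCS length of a[:i] and the whole string b.

    Classic row-by-row LCS dynamic program; only the last cell of each
    row is recorded, since that cell compares a[:i] with all of b.
    """
    prev = [0] * (len(b) + 1)  # DP row for the empty prefix of a
    result = [0]
    for ch in a:
        cur = [0]
        for k, bk in enumerate(b, start=1):
            if ch == bk:
                cur.append(prev[k - 1] + 1)
            else:
                cur.append(max(prev[k], cur[k - 1]))
        prev = cur
        result.append(prev[-1])
    return result


def prefix_suffix_lcs(text: str) -> list[list[int]]:
    """table[i][j] = LCS length of text[:i] and text[j:], for 0 <= i, j <= n.

    Naive baseline (an assumption of this sketch): one quadratic DP run
    per suffix, hence cubic time overall.
    """
    n = len(text)
    table = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        column = lcs_against_all_prefixes(text, text[j:])
        for i in range(n + 1):
            table[i][j] = column[i]
    return table


if __name__ == "__main__":
    table = prefix_suffix_lcs("abab")
    # Prefix "ab" and suffix "ab" share the common subsequence "ab".
    print(table[2][2])
```

For the text "abab", the entry for the prefix of length 2 and the suffix starting at position 2 is 2, because both are "ab". A table of this kind is the sort of information the outline refers to when it links prefix-suffix LCS values to bordered or periodic subsequences; how the published algorithms obtain and use it within their stated bounds is not shown here.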
Current Status of Research Progress	2: On the whole, research has progressed more than originally planned. Reason: We conducted the research for fiscal year 2023 as planned and completed most of our planned research by the end of the grant lifespan in fiscal year 2023.
Strategy for Future Research Activity	As the grant's term ended in fiscal year 2023, we are preparing to apply for a new grant for fiscal year 2025. This research has opened new paths for further exploration in string regularities and compressed indexes, which we intend to pursue in the coming years. While our main attention has been on text indexing data structures for classic pattern matching, extended pattern matching queries remain largely unexplored. We therefore aim to expand upon several concepts discovered during our recent research and combine them with state-of-the-art indexing techniques tailored for classic pattern matching. We anticipate that the resulting indexing methods will find practical applications in scenarios where conventional pattern matching proves too restrictive and more flexible matching criteria are needed.

Report

(3 results)

Research Products
(56 results)

Int'l Joint Research (15 results) Journal Article (28 results) (of which Int'l Joint Research: 28 results, Peer Reviewed: 28 results, Open Access: 15 results) Presentation (10 results) (of which Int'l Joint Research: 1 result) Remarks (3 results)

[Int'l Joint Research] MPI Saarbruecken/Karlsruhe Institute of Technology/University of Muenster (Germany)
- Related Report
  2023 Research-status Report
[Int'l Joint Research] University of Helsinki (Finland)
- Related Report
  2023 Research-status Report
[Int'l Joint Research] Nicolaus Copernicus University in Torun (Poland)
- Related Report
  2023 Research-status Report
[Int'l Joint Research] Dalhousie University (Canada)
- Related Report
  2022 Research-status Report
[Int'l Joint Research] University of A Coruna (Spain)
- Related Report
  2022 Research-status Report
[Int'l Joint Research] University of Chile (Chile)
- Related Report
  2022 Research-status Report
[Int'l Joint Research] Max Planck Institute for Informatics (Germany)
- Related Report
  2022 Research-status Report
[Int'l Joint Research] University of Helsinki (Finland)
- Related Report
  2022 Research-status Report
[Int'l Joint Research]
- Related Report
  2022 Research-status Report
[Int'l Joint Research] Travis Gagie (Canada)
- Related Report
  2021 Research-status Report
[Int'l Joint Research] Nicola Prezza (Italy)
- Related Report
  2021 Research-status Report
[Int'l Joint Research] Gonzalo Navarro (Chile)
- Related Report
  2021 Research-status Report
[Int'l Joint Research] Marcin Piatkowski (Poland)
- Related Report
  2021 Research-status Report
[Int'l Joint Research] Robert W. Irving/Lorna Love (United Kingdom)
- Related Report
  2021 Research-status Report
[Int'l Joint Research]
- Related Report
  2021 Research-status Report
[Journal Article] Computing Longest Lyndon Subsequences and Longest Common Lyndon Subsequences (2024)
- Author(s)
  Hideo Bannai and Tomohiro I and Tomasz Kociumaka and Dominik Koeppl and Simon J. Puglisi
- Journal Title
  
  Algorithmica
  
  Volume: 86 Issue: 3 Pages: 735-756
- DOI
  10.1007/s00453-023-01125-z
- Related Report
  2023 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Extending the Parameterized Burrows-Wheeler Transform (2024)
- Author(s)
  Eric M. Osterkamp and Dominik Koeppl
- Journal Title
  
  Proceedings of DCC
  
  Volume: - Pages: 143-152
- Related Report
  2023 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] On the Hardness of Smallest RLSLPs and Collage Systems (2024)
- Author(s)
  Akiyoshi Kawamoto and Tomohiro I and Dominik Koeppl and Hideo Bannai
- Journal Title
  
  Proceedings of DCC
  
  Volume: - Pages: 243-252
- Related Report
  2023 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Constructing and Indexing the Bijective and Extended Burrows-Wheeler Transform (2024)
- Author(s)
  Hideo Bannai and Juha Kaerkkaeinen and Dominik Koeppl and Marcin Piatkowski
- Journal Title
  
  Inf. Comput.
  
  Volume: 297 Pages: 1-30
- DOI
  10.1016/j.ic.2024.105153
- Related Report
  2023 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Encoding Hard String Problems with Answer Set Programming (2023)
- Author(s)
  Dominik Koeppl
- Journal Title
  
  Proceedings of CPM
  
  Volume: 259
- Related Report
  2023 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Longest bordered and periodic subsequences (2023)
- Author(s)
  Hideo Bannai and Tomohiro I and Dominik Koeppl
- Journal Title
  
  Inf. Process. Lett.
  
  Volume: 182 Pages: 1-6
- DOI
  10.1016/j.ipl.2023.106398
- Related Report
  2023 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Faster Block Tree Construction (2023)
- Author(s)
  Dominik Koeppl and Florian Kurpicz and Daniel Meyer
- Journal Title
  
  Proceedings of ESA
  
  Volume: 274
- Related Report
  2023 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Dynamic Skyline Computation with LSD Trees (2023)
- Author(s)
  Dominik Koeppl
- Journal Title
  
  Analytics
  
  Volume: 2 Issue: 1 Pages: 146-162
- DOI
  10.3390/analytics2010009
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Space-efficient Huffman codes revisited (2023)
- Author(s)
  Szymon Grabowski and Dominik Koeppl
- Journal Title
  
  Information Processing Letters
  
  Volume: 179 Pages: 1-8
- DOI
  10.1016/j.ipl.2022.106274
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Graph Compression for Adjacency-Matrix Multiplication (2022)
- Author(s)
  Alexandre P. Francisco and Travis Gagie and Dominik Koeppl and Susana Ladra and Gonzalo Navarro
- Journal Title
  
  SN Computer Science
  
  Volume: 3 Issue: 3 Pages: 1-8
- DOI
  10.1007/s42979-022-01084-2
- Related Report
  2022 Research-status Report, 2021 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Computing Longest (Common) Lyndon Subsequences (2022)
- Author(s)
  Hideo Bannai, Tomohiro I, Tomasz Kociumaka, Dominik Koeppl, Simon J. Puglisi
- Journal Title
  
  Proc. 33rd International Workshop on Combinatorial Algorithms (IWOCA) 2022
  
  Volume: - Pages: 128-142
- DOI
  10.1007/978-3-031-06678-8_10
- ISBN
  9783031066771, 9783031066788
- Related Report
  2022 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Space-Efficient B Trees via Load-Balancing (2022)
- Author(s)
  Tomohiro I, Dominik Koeppl
- Journal Title
  
  Proc. 33rd International Workshop on Combinatorial Algorithms (IWOCA) 2022
  
  Volume: - Pages: 327-340
- DOI
  10.1007/978-3-031-06678-8_24
- ISBN
  9783031066771, 9783031066788
- Related Report
  2022 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Linking Off-Road Points to Routing Networks (2022)
- Author(s)
  Dominik Koeppl
- Journal Title
  
  Algorithms
  
  Volume: 15 Issue: 5 Pages: 1-15
- DOI
  10.3390/a15050163
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Fast and Simple Compact Hashing via Bucketing (2022)
- Author(s)
  Dominik Koeppl and Simon J. Puglisi and Rajeev Raman
- Journal Title
  
  Algorithmica
  
  Volume: 84 Issue: 9 Pages: 2735-2766
- DOI
  10.1007/s00453-022-00996-y
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Computing the Parameterized Burrows-Wheeler Transform Online (2022)
- Author(s)
  Daiki Hashimoto and Diptarama Hendrian and Dominik Koeppl and Ryo Yoshinaka and Ayumi Shinohara
- Journal Title
  
  Proceedings of SPIRE
  
  Volume: 13617 Pages: 70-85
- DOI
  10.1007/978-3-031-20643-6_6
- ISBN
  9783031206429, 9783031206436
- Related Report
  2022 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Accessing the Suffix Array via $\phi^{-1}$-Forest (2022)
- Author(s)
  Christina Boucher and Dominik Koeppl and Herman Perera and Massimiliano Rossi
- Journal Title
  
  Proceedings of SPIRE
  
  Volume: 13617 Pages: 86-98
- DOI
  10.1007/978-3-031-20643-6_7
- ISBN
  9783031206429, 9783031206436
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Computing NP-hard Repetitiveness Measures via MAX-SAT (2022)
- Author(s)
  Hideo Bannai and Keisuke Goto and Masakazu Ishihata and Shunsuke Kanda and Dominik Koeppl and Takaaki Nishimoto
- Journal Title
  
  Proceedings of ESA
  
  Volume: 244
- Related Report
  2022 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices (2022)
- Author(s)
  Paolo Ferragina and Giovanni Manzini and Travis Gagie and Dominik Koeppl and Gonzalo Navarro and Manuel Striani and Francesco Tosoni
- Journal Title
  
  Proc. VLDB
  
  Volume: 15 Issue: 10 Pages: 2175-2187
- DOI
  10.14778/3547305.3547321
- Related Report
  2022 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Inferring Spatial Distance Rankings with Partial Knowledge on Routing Networks (2022)
- Author(s)
  Koeppl Dominik
- Journal Title
  
  Information
  
  Volume: 13 Issue: 4 Pages: 168-168
- DOI
  10.3390/info13040168
- Related Report
  2021 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Computing Lexicographic Parsings (2022)
- Author(s)
  Koeppl Dominik
- Journal Title
  
  Proc. DCC
  
  Volume: 2022 Pages: 232-241
- DOI
  10.1109/dcc52660.2022.00031
- Related Report
  2021 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] HOLZ: High-Order Entropy Encoding of Lempel-Ziv Factor Distances (2022)
- Author(s)
  Dominik Koeppl and Gonzalo Navarro and Nicola Prezza
- Journal Title
  
  Proc. DCC
  
  Volume: 2022 Pages: 83-92
- DOI
  10.1109/dcc52660.2022.00016
- Related Report
  2021 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns (2022)
- Author(s)
  Jin Jie Deng and Wing-Kai Hon and Dominik Koeppl and Kunihiko Sadakane
- Journal Title
  
  Proc. DCC
  
  Volume: 2022 Pages: 63-72
- DOI
  10.1109/dcc52660.2022.00014
- Related Report
  2021 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] c-trie++: A dynamic trie tailored for fast prefix searches (2021)
- Author(s)
  Kazuya Tsuruta, Dominik Koeppl, Shunsuke Kanda, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
- Journal Title
  
  Information and Computation
  
  Volume: - Pages: 104794-104794
- DOI
  10.1016/j.ic.2021.104794
- Related Report
  2022 Research-status Report, 2021 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Reversed Lempel-Ziv Factorization with Suffix Trees (2021)
- Author(s)
  Koeppl Dominik
- Journal Title
  
  Algorithms
  
  Volume: 14 Issue: 6 Pages: 161-161
- DOI
  10.3390/a14060161
- Related Report
  2021 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] A Separation of $\gamma$ and $b$ via Thue-Morse Words (2021)
- Author(s)
  Bannai Hideo, Funakoshi Mitsuru, I Tomohiro, Koeppl Dominik, Mieno Takuya, Nishimoto Takaaki
- Journal Title
  
  Proceedings of the 28th International Symposium on String Processing and Information Retrieval (SPIRE 2021)
  
  Volume: LNCS 12944 Pages: 167-178
- DOI
  10.1007/978-3-030-86692-1_14
- ISBN
  9783030866914, 9783030866921
- Related Report
  2021 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Grammar Index by Induced Suffix Sorting (2021)
- Author(s)
  Tooru Akagi, Dominik Koeppl, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda
- Journal Title
  
  Proceedings of 28th International Symposium on String Processing and Information Retrieval
  
  Volume: 12944 Pages: 85-99
- DOI
  10.1007/978-3-030-86692-1_8
- ISBN
  9783030866914, 9783030866921
- Related Report
  2021 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Extracting the Sparse Longest Common Prefix Array from the Suffix Binary Search Tree (2021)
- Author(s)
  I Tomohiro, Irving Robert, Koeppl Dominik, Love Lorna
- Journal Title
  
  Proc. SPIRE
  
  Volume: 12944 Pages: 143-150
- DOI
  10.1007/978-3-030-86692-1_12
- ISBN
  9783030866914, 9783030866921
- Related Report
  2021 Research-status Report
- Peer Reviewed / Int'l Joint Research
[Journal Article] Constructing the Bijective and the Extended Burrows-Wheeler Transform in Linear Time (2021)
- Author(s)
  Bannai, Hideo and Kaerkkaeinen, Juha and Koeppl, Dominik and Piatkowski, Marcin
- Journal Title
  
  32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
  
  Volume: 191
- Related Report
  2021 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Presentation] Computing Compression Measures with Answer Set Programming (2024)
- Author(s)
  Dominik Koeppl and 番原睦則
- Organizer
  Local Proceedings of the LA Symposium Winter 2023
- Related Report
  2023 Research-status Report
[Presentation] Extending the Parameterized Burrows-Wheeler Transform (2023)
- Author(s)
  Eric M. Osterkamp and Dominik Koeppl
- Organizer
  Local Proceedings of the Computation Research Meeting (コンピュテーション研究会)
- Related Report
  2023 Research-status Report
[Presentation] Compression Sensitivity of lex-parse (2023)
- Author(s)
  Yuto Nakashima and Dominik Koeppl and Mitsuru Funakoshi and Shunsuke Inenaga
- Organizer
  Local Proceedings of the 195th Algorithms Research Meeting (アルゴリズム研究会)
- Related Report
  2023 Research-status Report
[Presentation] Encoding Hard String Problems with Answer Set Programming (2023)
- Author(s)
  Dominik Koeppl
- Organizer
  Sequences in London
- Related Report
  2023 Research-status Report
- Int'l Joint Research
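[Presentation] Enumerating Smallest String Attractors with ZDDs (2023)
- Author(s)
  藤岡祐太 and 斎藤寿樹 and Dominik Koeppl
- Organizer
  Operations Research Society of Japan, Kyushu Chapter: OR Young Researchers' Exchange Meeting in the Kyushu Area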
- Related Report
  2023 Research-status Report
[Presentation] A Data Structure Mimicking the Suffix Array in the r-Index (2023)
- Author(s)
  Christina Boucher and Dominik Koeppl and Herman Perera and Massimiliano Rossi
- Organizer
  Local Proceedings of the LA Symposium Winter 2022
- Related Report
  2022 Research-status Report
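[Presentation] Ratio of lex-parse Sizes under Different Alphabet Orders (2023)
- Author(s)
  Yuto Nakashima and Dominik Koeppl and Mitsuru Funakoshi and Shunsuke Inenaga
- Organizer
  Local Proceedings of the 191st Algorithms Research Meeting (アルゴリズム研究会)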
- Related Report
  2022 Research-status Report
[Presentation] Computing Variants of LZ77 and the LPF Array Based on Suffix Trees (2022)
- Author(s)
  Dominik Koeppl
- Organizer
  Local Proceedings of the Computation Research Meeting (コンピュテーション研究会)
- Related Report
  2022 Research-status Report
[Presentation] A Code Representing Lempel-Ziv Factor Distances with High-Order Entropy (2022)
- Author(s)
  Dominik Koeppl and Gonzalo Navarro and Nicola Prezza
- Organizer
  Local Proceedings of the 190th Algorithms Research Meeting (アルゴリズム研究会)
- Related Report
  2022 Research-status Report
[Presentation] Space-Efficient Algorithms for Constructing the Lexicographic Parse (2022)
- Author(s)
  Koeppl Dominik
- Organizer
  COMP2021-28
- Related Report
  2021 Research-status Report
[Remarks] personal homepage
- URL
  https://dkppl.de/
- Related Report
  2023 Research-status Report
[Remarks] Personal Homepage
- URL
  https://dkppl.de/
- Related Report
  2022 Research-status Report
[Remarks] personal home page
- URL
  https://dkppl.de/
- Related Report
  2021 Research-status Report
