2021 Fiscal Year Research-status Report

Indexing Massive Datasets with Algorithmic Engineered Compression Techniques on Modern Computer Architectures

Research Project

Project/Area Number	21K17701
Research Institution	Tokyo Medical and Dental University
Principal Investigator	Koeppl Dominik Tokyo Medical and Dental University, M&D Data Science Center, Assistant Professor (50897395)
Project Period (FY)	2021-04-01 – 2024-03-31
Keywords	compression techniques / indexing data structures / matrix compression / algorithm engineering
Outline of Annual Research Achievements	This research lies at the intersection of compression techniques, indexing data structures, and algorithm engineering on modern computer architectures. Regarding compression, we proposed a space-efficient algorithm that computes the reversed LZ-factorization in linear time. This algorithm can be modified to compute the longest previous non-overlapping reverse factor table. Next, we proposed a new representation of Lempel-Ziv 77 factors, which are usually represented by pairs of text offset and length. We replace each text offset with an offset into the list of co-lexicographically sorted prefixes read up to the starting position of the respective factor. Empirically, these offsets tend to be smaller than the text offsets, which reduces the final compressed size when the pairs are coded with a universal coder. We also provided a linear-time construction algorithm for the bijective Burrows-Wheeler transform (BBWT), which can be used for data compression and for compressed text indexes. Regarding compressed text indexes, we presented text indexes built on grammars based on suffix sorting, and showed that this grammar exhibits locality-sensitive properties: a pattern can be found in the text efficiently by constructing the same grammar on the pattern and searching for the non-terminals of the pattern in the grammar tree of the text. Finally, we moved beyond one-dimensional data and proposed a vector-matrix multiplication on adjacency matrices compressed by extracting their bicliques space-efficiently.
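To illustrate why extracting bicliques helps with multiplication, the following is a minimal sketch and not the representation used in the published work. It assumes a hypothetical input format in which the adjacency matrix A is given as a list of bicliques (S, T), meaning every vertex in S has an edge to every vertex in T, plus a list of residual edges not covered by any biclique; the function name and parameters are illustrative only. For each biclique, the sum over T is computed once and reused for every vertex in S, so a biclique costs |S| + |T| additions instead of |S| * |T|.

```python
def multiply_biclique_compressed(n, bicliques, residual_edges, x):
    """Compute y = A x for an n x n 0/1 adjacency matrix A.

    A is given implicitly (hypothetical format, for illustration):
      bicliques      -- list of (sources, targets); every u in sources
                        has an edge to every v in targets
      residual_edges -- list of (u, v) edges not covered by a biclique
    The bicliques and residual edges are assumed to be edge-disjoint.
    """
    y = [0] * n
    for sources, targets in bicliques:
        # Sum the target entries once and share the result among all sources.
        shared = sum(x[v] for v in targets)
        for u in sources:
            y[u] += shared
    for u, v in residual_edges:
        y[u] += x[v]
    return y


if __name__ == "__main__":
    # Vertices 0 and 1 both point to 2, 3, 4; one extra edge 4 -> 0.
    bicliques = [([0, 1], [2, 3, 4])]
    residual = [(4, 0)]
    print(multiply_biclique_compressed(5, bicliques, residual, [1, 2, 3, 4, 5]))
```

The sketch multiplies the matrix from the right by a column vector; multiplying a row vector from the left works symmetrically by swapping the roles of sources and targets.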
Current Status of Research Progress	2: Research has progressed, on the whole, more than originally planned. Reason: Thanks to my research collaborators, I could smoothly advance both our joint research plans and my individual research. The research achievements resulted in several journal articles (MDPI Algorithms, MDPI Information, SN Computer Science, Information and Computation) and international conference papers (SPIRE'21, DCC'22, CPM'21). I am confident that these achievements will serve as a solid foundation for further progress in the new fiscal year.
Strategy for Future Research Activity	Although we have proposed algorithms for constructing the BBWT [Koeppl et al., CPM'20][Bannai et al., CPM'21] and an index built upon the BBWT [Bannai et al., CPM'19], we do not yet know the compression quality of the BBWT, which is expressed by the number of its character runs. Therefore, we want to study the relation between the number of character runs in the BBWT and in the traditional BWT. A first step in this direction would be the study of particular string families. Here, we want to study the special shape of the BWT for strings whose suffix arrays form arithmetic progressions. Regarding compression, we want to study space-efficient ways to perform Huffman-based compression and decompression in constant time per character or codeword. Ideally, we want to find lower bounds on the space and give solutions whose space requirements are close to these bounds. Another compression technique useful for matrix-vector multiplication could be grammar compression. Here, we want to study grammars that support such a multiplication time-efficiently. Additionally, we want to devise a pre-computation step that improves the compressibility of a given input matrix by leveraging the fact that we are allowed to permute its columns and rows. Finally, we want to extend our practical implementation of compact hash tables [Koeppl et al., SEA'20] with SIMD instructions to improve query times. SIMD instructions should help us sustain practical performance if we partition a hash table into relatively large unsorted buckets, on which we perform linear search.
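The comparison of character runs mentioned above can be tried out on small inputs with the following sketch. It is a naive quadratic-time illustration, not the linear-time construction from the cited papers. It assumes the usual definitions: the BWT is taken over the text with an appended sentinel `$` that is smaller than every text character, and the BBWT collects the last characters of all conjugates of the Lyndon factors of the text, sorted by omega-order (comparison of the infinite repetitions). All function names are illustrative.

```python
from functools import cmp_to_key
from itertools import groupby


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for _ in groupby(s))


def bwt(text, sentinel="$"):
    """Naive BWT via sorted rotations; sentinel must be smaller than all text characters."""
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rotations)


def lyndon_factors(s):
    """Lyndon factorization of s with Duval's algorithm."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def omega_cmp(u, v):
    """Compare u^omega with v^omega; prefixes of length |u| * |v| suffice."""
    a, b = u * len(v), v * len(u)
    return (a > b) - (a < b)


def bbwt(text):
    """Naive BBWT: last characters of the omega-sorted conjugates of all Lyndon factors."""
    conjugates = [f[i:] + f[:i] for f in lyndon_factors(text) for i in range(len(f))]
    conjugates.sort(key=cmp_to_key(omega_cmp))
    return "".join(c[-1] for c in conjugates)


if __name__ == "__main__":
    for word in ["banana", "abracadabra"]:
        print(word, count_runs(bwt(word)), count_runs(bbwt(word)))
```

For the word `banana`, the sketch yields the BWT `annb$aa` with 5 runs and the BBWT `annbaa` with 4 runs; the sentinel contributes one run of its own to the BWT.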
Causes of Carryover	The planned research, with its focus on modern computer hardware, makes it necessary to invest in recent computer architectures featuring, among others, SIMD instruction sets such as AVX-512 or graphics card computation. The research funding will also be used for domestic and international research stays, as well as for participation in domestic workshops and international conferences.

Research Products
(19 results)

All Int'l Joint Research (6 results) Journal Article (11 results) (of which Int'l Joint Research: 11 results, Peer Reviewed: 11 results, Open Access: 5 results) Presentation (1 results) Remarks (1 results)

[Int'l Joint Research] Travis Gagie (Canada)
- Country Name
  CANADA
- Counterpart Institution
  Travis Gagie
[Int'l Joint Research] Nicola Prezza (Italy)
- Country Name
  ITALY
- Counterpart Institution
  Nicola Prezza
[Int'l Joint Research] Gonzalo Navarro (Chile)
- Country Name
  CHILE
- Counterpart Institution
  Gonzalo Navarro
[Int'l Joint Research] Marcin Piatkowski (Poland)
- Country Name
  POLAND
- Counterpart Institution
  Marcin Piatkowski
[Int'l Joint Research] Robert W. Irving/Lorna Love (United Kingdom)
- Country Name
  UNITED KINGDOM
- Counterpart Institution
  Robert W. Irving/Lorna Love
[Int'l Joint Research]
- # of Other Countries
  3
[Journal Article] c-trie++: A dynamic trie tailored for fast prefix searches (2022)
- Author(s)
  Kazuya Tsuruta and Dominik Koeppl and Shunsuke Kanda and Yuto Nakashima and Shunsuke Inenaga and Hideo Bannai and Masayuki Takeda
- Journal Title
  
  Inf. Comput.
  
  Volume: 285 Part B Pages: 1-22
- DOI
  10.1016/j.ic.2021.104794
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Graph Compression for Adjacency-Matrix Multiplication (2022)
- Author(s)
  Alexandre P. Francisco and Travis Gagie and Dominik Koeppl and Susana Ladra and Gonzalo Navarro
- Journal Title
  
  SN Computer Science
  
  Volume: 3 Pages: 1-8
- DOI
  10.1007/s42979-022-01084-2
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Inferring Spatial Distance Rankings with Partial Knowledge on Routing Networks (2022)
- Author(s)
  Koeppl Dominik
- Journal Title
  
  Information
  
  Volume: 13 Pages: 168～168
- DOI
  10.3390/info13040168
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Computing Lexicographic Parsings (2022)
- Author(s)
  Koeppl Dominik
- Journal Title
  
  Proc. DCC
  
  Volume: 2022 Pages: 232～241
- DOI
  10.1109/DCC52660.2022.00031
- Peer Reviewed / Int'l Joint Research
[Journal Article] HOLZ: High-Order Entropy Encoding of Lempel-Ziv Factor Distances (2022)
- Author(s)
  Dominik Koeppl and Gonzalo Navarro and Nicola Prezza
- Journal Title
  
  Proc. DCC
  
  Volume: 2022 Pages: 83～92
- DOI
  10.1109/DCC52660.2022.00016
- Peer Reviewed / Int'l Joint Research
[Journal Article] FM-Indexing Grammars Induced by Suffix Sorting for Long Patterns (2022)
- Author(s)
  Jin Jie Deng and Wing-Kai Hon and Dominik Koeppl and Kunihiko Sadakane
- Journal Title
  
  Proc. DCC
  
  Volume: 2022 Pages: 63～72
- DOI
  10.1109/DCC52660.2022.00014
- Peer Reviewed / Int'l Joint Research
[Journal Article] Reversed Lempel-Ziv Factorization with Suffix Trees (2021)
- Author(s)
  Koeppl Dominik
- Journal Title
  
  Algorithms
  
  Volume: 14 Pages: 161～161
- DOI
  10.3390/a14060161
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] A Separation of gamma and b via Thue-Morse Words (2021)
- Author(s)
  Bannai Hideo, Funakoshi Mitsuru, I Tomohiro, Koeppl Dominik, Mieno Takuya, Nishimoto Takaaki
- Journal Title
  
  Proc. SPIRE
  
  Volume: 12944 Pages: 167～178
- DOI
  10.1007/978-3-030-86692-1_14
- Peer Reviewed / Int'l Joint Research
[Journal Article] Grammar Index by Induced Suffix Sorting (2021)
- Author(s)
  Akagi Tooru, Koeppl Dominik, Nakashima Yuto, Inenaga Shunsuke, Bannai Hideo, Takeda Masayuki
- Journal Title
  
  Proc. SPIRE
  
  Volume: 12944 Pages: 85～99
- DOI
  10.1007/978-3-030-86692-1_8
- Peer Reviewed / Int'l Joint Research
[Journal Article] Extracting the Sparse Longest Common Prefix Array from the Suffix Binary Search Tree (2021)
- Author(s)
  I Tomohiro, Irving Robert, Koeppl Dominik, Love Lorna
- Journal Title
  
  Proc. SPIRE
  
  Volume: 12944 Pages: 143～150
- DOI
  10.1007/978-3-030-86692-1_12
- Peer Reviewed / Int'l Joint Research
[Journal Article] Constructing the Bijective and the Extended Burrows-Wheeler Transform in Linear Time (2021)
- Author(s)
  Bannai, Hideo and Kaerkkaeinen, Juha and Koeppl, Dominik and Piatkowski, Marcin
- Journal Title
  
  32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
  
  Volume: 191 Pages: 7:1～7:16
- DOI
  10.4230/LIPIcs.CPM.2021.7
- Peer Reviewed / Open Access / Int'l Joint Research
[Presentation] Space-Efficient Construction Algorithm for Lexicographic Parses (2022)
- Author(s)
  Koeppl Dominik
- Organizer
  COMP2021-28
[Remarks] personal home page
- URL
  https://dkppl.de/
