
2022 Fiscal Year Research-status Report

Indexing Massive Datasets with Algorithmic Engineered Compression Techniques on Modern Computer Architectures

Research Project

Project/Area Number 21K17701
Research Institution Tokyo Medical and Dental University

Principal Investigator

Koeppl Dominik  Tokyo Medical and Dental University, M&D Data Science Center, Assistant Professor (50897395)

Project Period (FY) 2021-04-01 – 2025-03-31
Keywords data compression / matrix multiplication / matrix compression / subsequences / compact hashing / SIMD instructions / compressed indexes / hybrid text indexes
Outline of Annual Research Achievements

Following the research plan for the fiscal year 2022, our main objectives were:
1. Improving matrix-vector multiplication in compressed space when indexing the matrix is allowed
2. Practically engineering a compact hash table for storing a dynamic set of integers of small bit widths
1. We provided three different compression approaches for indexing matrices in compressed space. The first approach uses the WebGraph framework of Vigna, which applies a row-based referencing compression to binary matrices. The second approach extracts bicliques when interpreting the binary matrix as the adjacency matrix of a graph. The third approach works on arbitrary-valued matrices: it linearises the matrix into a string by concatenating the rows, separated by unique delimiters, and applies a grammar compressor to this string. This approach is favorable if the data stored in the matrix is structured such that a grammar compressor can exploit the repetitions. Small grammars lead to small arithmetic circuits, which we build as a compressed index for accelerating matrix-vector multiplication.
2. We devised a compact hash table that gains speed-ups from modern SIMD instructions. The hash table layout is based on hashing with chaining, but stores each chain in an array instead of a list. While a sequential linear scan of these arrays is slow, the scan can be accelerated with SIMD instructions. This approach is promising because modern computer architectures are rapidly gaining larger caches and larger bit widths for SIMD instructions, while improvements in processor clock speed have become marginal.
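
The grammar-based approach of item 1 can be illustrated with a minimal sketch. It is not the implementation from the project: the naive Re-Pair-style compressor, the (column, value) symbol encoding, and the names `build_grammar` and `multiply` are assumptions made for illustration. Rows are kept as separate sequences, which has the same effect as unique delimiters because no rule can span two rows. Each grammar rule is evaluated once per multiplication, so repeated row fragments are not recomputed.

```python
from collections import Counter


def _replace_pair(seq, pair, nt):
    """Replace every occurrence of `pair` in `seq` by the nonterminal `nt`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(nt)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def build_grammar(rows):
    """Naive Re-Pair-style compression of matrix rows (illustrative only).

    Terminals are (column, value) tuples of the nonzero entries;
    nonterminals are ("N", k) tuples, where rules[k] = (left, right).
    """
    seqs = [[(j, v) for j, v in enumerate(row) if v != 0] for row in rows]
    rules = []
    while True:
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:  # no pair repeats any more: stop
            break
        nt = ("N", len(rules))
        rules.append(pair)
        seqs = [_replace_pair(s, pair, nt) for s in seqs]
    return rules, seqs


def multiply(rules, seqs, x):
    """Compute M*x from the grammar, evaluating each rule exactly once."""
    rule_val = []

    def val(sym):
        if sym[0] == "N":        # nonterminal: look up its precomputed value
            return rule_val[sym[1]]
        col, v = sym             # terminal: one matrix entry times x[col]
        return v * x[col]

    for left, right in rules:    # rules only refer to earlier rules
        rule_val.append(val(left) + val(right))
    return [sum(val(s) for s in seq) for seq in seqs]


M = [[1, 2, 0, 3], [1, 2, 0, 3], [0, 2, 0, 3], [1, 2, 5, 0]]
x = [1, 2, 3, 4]
rules, seqs = build_grammar(M)
naive = [sum(v * xi for v, xi in zip(row, x)) for row in M]
print(multiply(rules, seqs, x) == naive)
```

The rule values play the role of the gates of an arithmetic circuit: the fewer rules the grammar has, the fewer additions are needed per multiplication.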

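The bucket layout of the hash table in item 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the engineered data structure: the class name `BucketedIntSet`, the bucket count, and the 16-bit element type are hypothetical, and storing only the quotient `key // num_buckets` in bucket `key % num_buckets` is an assumption added here to show why such a layout can be compact. Python has no direct SIMD access, so the membership test is a plain linear scan over a contiguous typed array; this scan is the step that a SIMD comparison would process several elements at a time.

```python
from array import array


class BucketedIntSet:
    """Dynamic set of small integers; each bucket is a flat typed array."""

    def __init__(self, num_buckets=64, typecode="H"):
        # "H" = unsigned 16-bit elements (hypothetical choice for this sketch)
        self.num_buckets = num_buckets
        self.buckets = [array(typecode) for _ in range(num_buckets)]

    def _locate(self, key):
        # The bucket index is implied by the position, so only the quotient
        # needs to be stored: key == quotient * num_buckets + bucket.
        return key % self.num_buckets, key // self.num_buckets

    def __contains__(self, key):
        b, q = self._locate(key)
        # Linear scan of one contiguous array.
        return q in self.buckets[b]

    def add(self, key):
        b, q = self._locate(key)
        if q not in self.buckets[b]:
            self.buckets[b].append(q)

    def discard(self, key):
        b, q = self._locate(key)
        if q in self.buckets[b]:
            self.buckets[b].remove(q)

    def __len__(self):
        return sum(len(bucket) for bucket in self.buckets)


s = BucketedIntSet()
for key in (5, 69, 133, 5):   # 5, 69 and 133 all fall into bucket 5
    s.add(key)
print(len(s), 69 in s, 6 in s)
```
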
Current Status of Research Progress

2: Research has progressed on the whole more than it was originally planned.

Reason

We found three solutions for the matrix-vector multiplication problem addressed in the project proposal. These achievements raise several new questions we would like to address in the future: Can the indices be extended to support not only matrix-vector multiplication, but also vector-matrix multiplication?
Is it possible to use two indices for a matrix-matrix multiplication?
We have also observed some remarkable characteristics when devising hybrid indices, as we did in [Deng et al. DCC'22] and [Akagi et al. SPIRE'21]. We also improved classic compression algorithms such as Huffman coding theoretically,
while solving the practical part of the project proposal with an implementation of a hash table for integers.
In the future, we want to address the dynamic part of the project proposal.

Strategy for Future Research Activity

We want to continue with the following three problems:
First, our plan is to propose compressed text indices for approximate pattern matching. Here, we start with allowing one error in the pattern, and then try to generalize the proposed approach if possible. Further, we want to study how to speed up the indexing of multiple copies of sequences with small variations simulating SNPs (single-nucleotide polymorphisms) in biological sequences. Subsequently, we plan to compare this approach with the first approach for the approximate pattern matching index, possibly obtaining different trade-offs when combining both approaches. The second goal is to study how much space is needed for a compressed representation of a text that supports access to a single character in logarithmic time. This is possible with grammars or block trees, but such compressed representations can be relatively large. We wonder whether other compressed representations with better theoretical guarantees exist.
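
To make the random-access question concrete, the following sketch shows the textbook way of accessing a character in a grammar-compressed text. It assumes a straight-line program in which every rule is either a single character or a pair of earlier rules; the function names and the rule encoding are chosen for illustration only. The access time is proportional to the height of the grammar, so it is logarithmic in the text length only if the grammar is balanced.

```python
def expansion_lengths(rules):
    """rules[k] is a character (terminal) or a pair (i, j) with i, j < k."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def access(rules, lengths, pos):
    """Return the character at 0-based position pos of the text
    derived from the last rule, without decompressing the text."""
    k = len(rules) - 1
    if not 0 <= pos < lengths[k]:
        raise IndexError(pos)
    while not isinstance(rules[k], str):
        left, right = rules[k]
        if pos < lengths[left]:   # target lies in the left child
            k = left
        else:                     # skip the left child, descend right
            pos -= lengths[left]
            k = right
    return rules[k]


# Rule 2 derives "ab", rule 3 derives "abab", rule 4 derives "ababa".
rules = ["a", "b", (0, 1), (2, 2), (3, 0)]
lengths = expansion_lengths(rules)
print("".join(access(rules, lengths, i) for i in range(lengths[-1])))
```
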
Finally, we continue with our framework on generalizing string regularities from substrings to subsequences. To this end, we have started working on finding the longest Lyndon subsequence with better space and time bounds than those presented at IWOCA last year. We also want to devise an efficient algorithm enumerating all distinct Lyndon subsequences of a given input. Such an enumeration algorithm, when practically efficient, can also help in improving the bound on the maximum number of distinct Lyndon subsequences that a string of length n with σ distinct characters can have.
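
As a point of reference for the objects involved, the sketch below enumerates the distinct Lyndon subsequences of a short string by brute force. It is an exponential-time baseline written for illustration, not the algorithm presented at IWOCA, and the function names are hypothetical. It uses the standard characterization that a non-empty word is a Lyndon word if and only if it is strictly smaller than each of its proper non-empty suffixes.

```python
from itertools import combinations


def is_lyndon(w):
    """A non-empty word is Lyndon iff it is strictly smaller
    than every proper non-empty suffix of itself."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def distinct_lyndon_subsequences(text):
    """Brute force over all index subsets: exponential in len(text)."""
    found = set()
    for r in range(1, len(text) + 1):
        for idx in combinations(range(len(text)), r):
            cand = "".join(text[i] for i in idx)
            if is_lyndon(cand):
                found.add(cand)
    return found


def longest_lyndon_subsequence(text):
    """Longest Lyndon subsequence; ties are broken lexicographically."""
    return max(distinct_lyndon_subsequences(text),
               key=lambda w: (len(w), w), default="")


print(sorted(distinct_lyndon_subsequences("aba")))
print(longest_lyndon_subsequence("aba"))
```

Such a baseline is only usable for very short inputs, but it can serve to cross-check an efficient enumeration algorithm on small test cases.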

  • Research Products

    (23 results)


All Int'l Joint Research (6 results) Journal Article (12 results) (of which Int'l Joint Research: 12 results,  Peer Reviewed: 12 results,  Open Access: 8 results) Presentation (4 results) Remarks (1 results)

  • [Int'l Joint Research] Dalhousie University (Canada)

    • Country Name
      CANADA
    • Counterpart Institution
      Dalhousie University
  • [Int'l Joint Research] University of A Coruna (Spain)

    • Country Name
      SPAIN
    • Counterpart Institution
      University of A Coruna
  • [Int'l Joint Research] University of Chile (Chile)

    • Country Name
      CHILE
    • Counterpart Institution
      University of Chile
  • [Int'l Joint Research] Max Planck Institute for Informatics (Germany)

    • Country Name
      GERMANY
    • Counterpart Institution
      Max Planck Institute for Informatics
  • [Int'l Joint Research] University of Helsinki (Finland)

    • Country Name
      FINLAND
    • Counterpart Institution
      University of Helsinki
  • [Int'l Joint Research]

    • # of Other Countries
      4
  • [Journal Article] Dynamic Skyline Computation with LSD Trees (2023)

    • Author(s)
      Dominik Koeppl
    • Journal Title

      Analytics

      Volume: 2 Pages: 146-162

    • DOI

      10.3390/analytics2010009

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Space-efficient Huffman codes revisited (2023)

    • Author(s)
      Szymon Grabowski and Dominik Koeppl
    • Journal Title

      Information Processing Letters

      Volume: 179 Pages: 1-8

    • DOI

      10.1016/j.ipl.2022.106274

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] c-trie++: A dynamic trie tailored for fast prefix searches (2022)

    • Author(s)
      Kazuya Tsuruta and Dominik Koeppl and Shunsuke Kanda and Yuto Nakashima and Shunsuke Inenaga and Hideo Bannai and Masayuki Takeda
    • Journal Title

      Inf. Comput.

      Volume: 285 Part B Pages: 1-22

    • DOI

      10.1016/j.ic.2021.104794

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Graph Compression for Adjacency-Matrix Multiplication (2022)

    • Author(s)
      Alexandre P. Francisco and Travis Gagie and Dominik Koeppl and Susana Ladra and Gonzalo Navarro
    • Journal Title

      SN Computer Science

      Volume: 3 Pages: 1-8

    • DOI

      10.1007/s42979-022-01084-2

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Computing Longest (Common) Lyndon Subsequences (2022)

    • Author(s)
      Hideo Bannai and Tomohiro I and Tomasz Kociumaka and Dominik Koeppl and Simon J. Puglisi
    • Journal Title

      Proceedings of IWOCA

      Volume: 13270 Pages: 128-142

    • DOI

      10.1007/978-3-031-06678-8_10

    • Peer Reviewed / Int'l Joint Research
  • [Journal Article] Space-Efficient B Trees via Load-Balancing (2022)

    • Author(s)
      Tomohiro I and Dominik Koeppl
    • Journal Title

      Proceedings of IWOCA

      Volume: 13270 Pages: 327-340

    • DOI

      10.1007/978-3-031-06678-8_24

    • Peer Reviewed / Int'l Joint Research
  • [Journal Article] Linking Off-Road Points to Routing Networks (2022)

    • Author(s)
      Dominik Koeppl
    • Journal Title

      Algorithms

      Volume: 15(5) Pages: 1-15

    • DOI

      10.3390/a15050163

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Fast and Simple Compact Hashing via Bucketing (2022)

    • Author(s)
      Dominik Koeppl and Simon J. Puglisi and Rajeev Raman
    • Journal Title

      Algorithmica

      Volume: 84 Pages: 2735-2766

    • DOI

      10.1007/s00453-022-00996-y

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Computing the Parameterized Burrows-Wheeler Transform Online (2022)

    • Author(s)
      Daiki Hashimoto and Diptarama Hendrian and Dominik Koeppl and Ryo Yoshinaka and Ayumi Shinohara
    • Journal Title

      Proceedings of SPIRE

      Volume: 13617 Pages: 70-85

    • DOI

      10.1007/978-3-031-20643-6_6

    • Peer Reviewed / Int'l Joint Research
  • [Journal Article] Accessing the Suffix Array via $\phi^{-1}$-Forest (2022)

    • Author(s)
      Christina Boucher and Dominik Koeppl and Herman Perera and Massimiliano Rossi
    • Journal Title

      Proceedings of SPIRE

      Volume: 13617 Pages: 86-98

    • DOI

      10.1007/978-3-031-20643-6_7

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Computing NP-hard Repetitiveness Measures via MAX-SAT (2022)

    • Author(s)
      Hideo Bannai and Keisuke Goto and Masakazu Ishihata and Shunsuke Kanda and Dominik Koeppl and Takaaki Nishimoto
    • Journal Title

      Proceedings of ESA

      Volume: 244 Pages: 12:1-12:16

    • DOI

      10.4230/LIPIcs.ESA.2022.12

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Improving Matrix-vector Multiplication via Lossless Grammar-Compressed Matrices (2022)

    • Author(s)
      Paolo Ferragina and Giovanni Manzini and Travis Gagie and Dominik Koeppl and Gonzalo Navarro and Manuel Striani and Francesco Tosoni
    • Journal Title

      Proc. VLDB

      Volume: 15 Pages: 2175-2187

    • DOI

      10.14778/3547305.3547321

    • Peer Reviewed / Int'l Joint Research
  • [Presentation] A Data Structure Emulating the Suffix Array in the r-Index (2023)

    • Author(s)
      Christina Boucher and Dominik Koeppl and Herman Perera and Massimiliano Rossi
    • Organizer
      Local Proceedings of the LA Symposium Winter 2022
  • [Presentation] Lex-parse Size Ratios under Alphabet Orderings (2023)

    • Author(s)
      Yuto Nakashima and Dominik Koeppl and Mitsuru Funakoshi and Shunsuke Inenaga
    • Organizer
      Local Proceedings of the 191st Algorithms Research Meeting (アルゴリズム研究会)
  • [Presentation] Computing Variants of LZ77 and the LPF Array Based on Suffix Trees (2022)

    • Author(s)
      Dominik Koeppl
    • Organizer
      Local Proceedings of the Computation Research Meeting (コンピュテーション研究会)
  • [Presentation] A Code Representing the Distances of Lempel-Ziv Factors in High-Order Entropy (2022)

    • Author(s)
      Dominik Koeppl and Gonzalo Navarro and Nicola Prezza
    • Organizer
      Local Proceedings of the 190th Algorithms Research Meeting (アルゴリズム研究会)
  • [Remarks] Personal Homepage

    • URL

      https://dkppl.de/


Published: 2023-12-25  
