2022 Fiscal Year Research-status Report
Indexing Massive Datasets with Algorithmic Engineered Compression Techniques on Modern Computer Architectures
Project/Area Number |
21K17701
|
Research Institution | Tokyo Medical and Dental University |
Principal Investigator |
Koeppl Dominik Tokyo Medical and Dental University, M&D Data Science Center, Assistant Professor (50897395)
|
Project Period (FY) |
2021-04-01 – 2025-03-31
|
Keywords | data compression / matrix multiplication / matrix compression / subsequences / compact hashing / SIMD instructions / compressed indexes / hybrid text indexes |
Outline of Annual Research Achievements |
Following the research plan for fiscal year 2022, our main objectives were:
1. Improving matrix-vector multiplication in compressed space when indexing the matrix is allowed
2. Practically engineering a compact hash table for storing a dynamic set of integers of small bit widths

For objective 1, we provided three different compression approaches for indexing matrices in compressed space. In the first approach, we used the WebGraph framework of Vigna, which applies a row-based referencing compression to binary matrices. The second approach extracts bicliques when interpreting the binary matrix as the adjacency matrix of a graph. The last approach works on arbitrarily-valued matrices: the rows are linearised to a single string by concatenating them, separated by unique delimiters, and a grammar compressor is applied to this string. This approach is favorable if the data stored in the matrix is structured such that a grammar compressor can exploit the repetitions. Small grammars lead to tiny arithmetic circuits, which we build as a compressed index for accelerating matrix-vector multiplication.

For objective 2, we devised a compact hash table that gains speed-ups from modern SIMD instructions. The hash table layout is based on hashing with chaining, but with arrays instead of lists. While a linear scan of these arrays is slow, this operation can be accelerated via SIMD instructions. This approach is promising given that modern computer architectures are rapidly gaining larger caches and wider SIMD registers, while improvements in processor clock speed have become marginal.
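The grammar-based approach for objective 1 can be illustrated with a minimal Python sketch. It is not the project's implementation: the grammar is written by hand instead of being produced by a grammar compressor, a single row delimiter `ROW_END` stands in for the unique delimiters, and the encoding of each nonzero entry as a `(column, value)` pair as well as the names `multiply`, `rules`, and `sequence` are assumptions made for illustration. The point it shows is that the contribution of every nonterminal to the product is evaluated only once and then reused for each row in which the nonterminal occurs.

```python
from typing import Dict, List, Tuple, Union

# Assumed encoding: a terminal is one nonzero matrix entry (column, value).
Terminal = Tuple[int, float]
# A symbol is a terminal, the name of a nonterminal, or the row delimiter.
Symbol = Union[Terminal, str]

ROW_END = "$"  # hypothetical delimiter; it never occurs inside a rule


def multiply(rules: Dict[str, List[Symbol]],
             sequence: List[Symbol],
             x: List[float]) -> List[float]:
    """Compute y = M x, where M is given by a grammar over its rows."""
    cache: Dict[str, float] = {}

    def value(sym: Symbol) -> float:
        # A terminal contributes value * x[column] to its row.
        if isinstance(sym, tuple):
            col, val = sym
            return val * x[col]
        # A nonterminal contributes the sum of its right-hand side;
        # the result is memoised, so each rule is evaluated once.
        if sym not in cache:
            cache[sym] = sum(value(s) for s in rules[sym])
        return cache[sym]

    y: List[float] = []
    acc = 0.0
    for sym in sequence:
        if sym == ROW_END:      # a delimiter closes the current row
            y.append(acc)
            acc = 0.0
        else:
            acc += value(sym)
    return y


# Hand-made grammar for the 5x4 matrix
#   [1 2 0 3]
#   [1 2 0 0]
#   [0 0 5 3]
#   [1 2 5 0]
#   [1 2 5 0]
rules = {
    "A": [(0, 1.0), (1, 2.0)],   # the repeated row prefix "1 2"
    "B": ["A", (2, 5.0)],        # the repeated row "1 2 5 0"
}
sequence = [
    "A", (3, 3.0), ROW_END,
    "A", ROW_END,
    (2, 5.0), (3, 3.0), ROW_END,
    "B", ROW_END,
    "B", ROW_END,
]
x = [1.0, 1.0, 2.0, 3.0]
y = multiply(rules, sequence, x)
```

In this toy example the rule `A` is evaluated once although it is used by four of the five rows, which is the kind of saving a small grammar would yield on a matrix with structured repetitions.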
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
We found three solutions to the matrix-vector multiplication problem addressed in the project proposal. These achievements have spawned several new ideas we would like to address in the future: Can the indices be extended to support not only matrix-vector multiplication, but also vector-matrix multiplication? Is it possible to use two indices for a matrix-matrix multiplication? We have also observed some remarkable characteristics when devising hybrid indices, as we did in [Deng et al. DCC'22] and [Akagi et al. SPIRE'21]. We could also improve classic compression algorithms such as Huffman coding theoretically, while completing the practical part of the project proposal with an implementation of a hash table for integers. In the future, we want to address the dynamic part of the project proposal.
|
Strategy for Future Research Activity |
We want to continue with the following three problems.

First, we plan to propose compressed text indices for approximate pattern matching. Here, we start by allowing one error in the pattern, and then try to generalize the proposed approach if possible. Further, we want to study how to speed up the indexing of multiple copies of sequences with small perturbations simulating single-nucleotide polymorphisms (SNPs) in biological sequences. Subsequently, we plan to compare this approach with the first approach for the approximate pattern matching index, possibly obtaining different trade-offs when combining both approaches.

Second, we want to study how much space is needed to compress a text while still supporting access to a single character of the text in logarithmic time. This is possible with grammars or block trees, but such compressed representations can be relatively large. We wonder whether other compressed representations with better theoretical guarantees exist.

Finally, we continue with our framework for generalizing string regularities from substrings to subsequences. To this end, we have started working on finding the longest Lyndon subsequence with better space and time bounds than those presented at IWOCA last year. We also want to devise an efficient algorithm enumerating all distinct Lyndon subsequences of a given input. Such an enumeration algorithm, when practically efficient, can also help in improving the bound on the maximal number of distinct Lyndon subsequences that a string of length n with σ distinct characters can have.
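To make the notion of Lyndon subsequences concrete, the following Python sketch enumerates them by brute force. It is only an exponential-time baseline for tiny inputs and not the algorithm presented at IWOCA or the one planned here; the function names are hypothetical. It relies on the standard definition that a non-empty string is a Lyndon word if it is lexicographically strictly smaller than each of its proper non-empty suffixes.

```python
from itertools import combinations
from typing import Set


def is_lyndon(w: str) -> bool:
    """True if w is non-empty and strictly smaller than all its proper suffixes."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def distinct_lyndon_subsequences(text: str) -> Set[str]:
    """Brute force: test every subsequence (exponentially many candidates)."""
    found: Set[str] = set()
    n = len(text)
    for k in range(1, n + 1):
        for positions in combinations(range(n), k):
            candidate = "".join(text[i] for i in positions)
            if is_lyndon(candidate):
                found.add(candidate)   # the set removes duplicates
    return found


def longest_lyndon_subsequence(text: str) -> str:
    """A longest Lyndon subsequence; ties go to the lexicographically smallest."""
    return max(sorted(distinct_lyndon_subsequences(text)), key=len, default="")
```

For example, the distinct Lyndon subsequences of `"aba"` are `a`, `b`, and `ab`, so the longest one is `ab`. Counting the size of the returned set for all strings of a small length n over σ characters is one way to explore the bound mentioned above experimentally.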
|
Research Products
(23 results)