2021 Fiscal Year Research-status Report
Indexing Massive Datasets with Algorithmic Engineered Compression Techniques on Modern Computer Architectures
Project/Area Number |
21K17701
|
Research Institution | Tokyo Medical and Dental University |
Principal Investigator |
Koeppl Dominik, Tokyo Medical and Dental University, M&D Data Science Center, Assistant Professor (50897395)
|
Project Period (FY) |
2021-04-01 – 2024-03-31
|
Keywords | compression techniques / indexing data structures / matrix compression / algorithm engineering |
Outline of Annual Research Achievements |
This research lies at the intersection of compression techniques, indexing data structures, and algorithm engineering on modern computer architectures. Regarding compression, we proposed a space-efficient algorithm that computes the reversed LZ-factorization in linear time. This algorithm can be modified to compute the longest previous non-overlapping reverse factor table. Next, we proposed a new representation of Lempel-Ziv 77 factors, which are usually represented by pairs of text offset and length. We replace each text offset with an offset into the list of co-lexicographically sorted prefixes read up to the starting position of the respective factor. Empirically, these offsets tend to be smaller than the text offsets, which improves the final compressed size when the pairs are encoded with a universal coder. We also provided a linear-time construction algorithm for the bijective Burrows-Wheeler transform (BBWT), which can be used for data compression and for compressed text indexes. Regarding compressed text indexes, we presented text indexes built on grammars based on suffix sorting, and showed that these grammars exhibit locality-sensitive properties: a pattern can be found efficiently in the text by constructing the same grammar on the pattern and searching for the non-terminals of the pattern in the grammar tree of the text. Finally, we departed from one-dimensional data and proposed a vector-matrix multiplication on adjacency matrices compressed by extracting their bicliques space-efficiently.
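To illustrate why extracting bicliques helps with vector-matrix multiplication, the following is a minimal sketch and not the algorithm from the publications. It assumes a 0/1 adjacency matrix stored as a list of edge-disjoint bicliques (S, T), meaning every row in S has a 1 in every column in T, plus a list of residual edges not covered by any biclique. The function names and the storage layout are hypothetical. For each biclique, the sum of the vector entries over S is computed once and added to every column in T, which costs |S| + |T| additions instead of |S| * |T|.

```python
from typing import List, Sequence, Tuple

# A biclique (S, T): every row in S has a 1 in every column in T.
Biclique = Tuple[Sequence[int], Sequence[int]]


def vector_times_compressed(
    x: Sequence[int],
    bicliques: Sequence[Biclique],
    residual_edges: Sequence[Tuple[int, int]],
    num_cols: int,
) -> List[int]:
    """Compute y = x * A for a 0/1 matrix A given in biclique form.

    Assumption: bicliques and residual edges are pairwise edge-disjoint,
    so that no matrix entry is counted twice.
    """
    y = [0] * num_cols
    for rows, cols in bicliques:
        partial = sum(x[i] for i in rows)  # shared by all columns in T
        for j in cols:
            y[j] += partial
    for i, j in residual_edges:  # entries not covered by any biclique
        y[j] += x[i]
    return y


def vector_times_dense(x: Sequence[int], matrix: Sequence[Sequence[int]]) -> List[int]:
    """Reference implementation on the uncompressed matrix."""
    num_cols = len(matrix[0])
    return [
        sum(x[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(num_cols)
    ]
```

For example, with rows {0, 1} and columns {1, 2, 3} forming a biclique and residual edges (0, 0), (2, 0), (3, 3), the vector x = [1, 2, 3, 4] yields [4, 3, 3, 7] with both functions.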
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
Thanks to my research collaborators, I could smoothly advance both our joint research plans and my individual research. The research achievements resulted in several journal articles (MDPI Algorithms, MDPI Information, SN Computer Science, Information and Computation) and international conference papers (SPIRE'21, DCC'22, CPM'21). I am confident that these achievements will serve as a solid foundation for further progress in the new fiscal year.
|
Strategy for Future Research Activity |
Although we have proposed algorithms constructing the BBWT [Koeppl et al., CPM'20][Bannai et al., CPM'21], and an index upon the BBWT [Bannai et al., CPM'19], the compression quality of the BBWT, which is expressed by the number of its character runs, is still unknown to us. Therefore, we want to study the relation between the number of character runs in the BBWT and in the traditional BWT. A first step in this direction would be the study of particular string families. Here, we want to study the special shape of the BWT for strings whose suffix arrays form arithmetic progressions. Regarding compression, we want to study space-efficient ways to perform Huffman-based compression/decompression in constant time per character/codeword. Ideally, we want to find lower bounds on the space, and give solutions with space requirements close to these bounds. Another compression technique useful for matrix-vector multiplication could be grammar compression. Here, we want to study grammars that support such a multiplication time-efficiently. Additionally, we want to devise a pre-computation step that improves the compressibility of a given input matrix by leveraging the fact that we are allowed to shuffle its columns and rows. Finally, we want to extend our practical implementation of compact hash tables [Koeppl et al., SEA'20] with SIMD instructions to improve query times. SIMD instructions should help us to sustain practical performance if we partition a hash table into relatively large unsorted buckets, on which we perform linear search.
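The comparison of character runs between the BWT and the BBWT can be tried out on small inputs with the following sketch. It is a naive illustration, not one of the linear-time constructions cited above, and all function names are hypothetical. It assumes the BWT is taken over the sorted rotations of the text without a terminator symbol, and it computes the BBWT from the definition: factorize the text into Lyndon words with Duval's algorithm, collect all rotations of all factors, sort them by their infinite repetitions, and read off the last characters.

```python
from typing import List


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def bwt_rotations(text: str) -> str:
    """Naive BWT: last characters of the sorted rotations (no terminator)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def lyndon_factors(text: str) -> List[str]:
    """Lyndon factorization via Duval's algorithm."""
    factors = []
    n, i = len(text), 0
    while i < n:
        j, k = i + 1, i
        while j < n and text[k] <= text[j]:
            k = i if text[k] < text[j] else k + 1
            j += 1
        while i <= k:  # emit the repeated Lyndon word of length j - k
            factors.append(text[i:i + j - k])
            i += j - k
    return factors


def bbwt_naive(text: str) -> str:
    """Naive BBWT: sort all rotations of all Lyndon factors by the order
    of their infinite repetitions and take the last characters."""
    factors = lyndon_factors(text)
    if not factors:
        return ""
    # Two distinct infinite repetitions u^w and v^w differ within their
    # first |u| + |v| characters, so a prefix of this width suffices.
    width = 2 * max(len(f) for f in factors)

    def omega_key(rot: str) -> str:
        return (rot * (width // len(rot) + 1))[:width]

    rotations = [f[i:] + f[:i] for f in factors for i in range(len(f))]
    rotations.sort(key=omega_key)
    return "".join(rot[-1] for rot in rotations)
```

For the text "banana", the Lyndon factors are b, an, an, a; the sketch gives the BWT "nnbaaa" with 3 runs and the BBWT "annbaa" with 4 runs.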
|
Causes of Carryover |
The planned research, with its focus on modern computer hardware, makes it necessary to invest in recent computer architectures featuring, among others, SIMD instruction sets like AVX-512 or graphics card computation. The research funding will also be used to conduct research stays at the domestic and international level, as well as to participate in domestic workshops and international conferences.
|
Research Products
(19 results)