Project/Area Number |
20K06612
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Multi-year Fund |
Section | General |
Review Section |
Basic Section 43060:System genome science-related
|
Research Institution | National Institute of Genetics |
Principal Investigator |
Kryukov Kirill National Institute of Genetics, Biological Networks Laboratory, Project Associate Professor (20806202)
|
Project Period (FY) |
2020-04-01 – 2023-03-31
|
Project Status |
Completed (Fiscal Year 2022)
|
Budget Amount |
¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000)
Fiscal Year 2022: ¥910,000 (Direct Cost: ¥700,000, Indirect Cost: ¥210,000)
Fiscal Year 2021: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
Fiscal Year 2020: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
|
Keywords | Data compression / NAF / GenomeSync / Genome database / DNA compression / Sequence analysis |
Outline of Research at the Start |
Biological and medical research relies on huge databases of genome sequences. Currently all such databases use outdated compression technology. The NAF compression format that we recently invented makes databases more compact and much faster to access. This project focuses on developing infrastructure that will allow the field to transition to the NAF format. Such infrastructure includes a reference genome database in NAF format and software tools supporting this format. This project will improve the efficiency of biological and medical research, contributing to science and public health.
|
Outline of Final Research Achievements |
The achievements of this project: (1) Continued development, maintenance, and popularization of the Nucleotide Archival Format (NAF). Additions include improved compression strength, improved customization of the decompressed format, support for storing multiple files, and a Bioconda installation option. (2) Evaluation of the performance of various compressors in the Sequence Compression Benchmark, the most comprehensive benchmark of available compressors for biological sequence data. This benchmark clearly shows that NAF is a superior format for storing and working with sequence data. The benchmark paper has 25 Google Scholar citations. (3) Distribution of NAF-compressed genome sequences via the GenomeSync database, one of the largest genome databases. GenomeSync now offers convenient access to over 640,000 genomes, thanks to the efficiency of the NAF format. (4) Addition of NAF support to bioinformatics tools such as Genome Search Toolkit and Primer Tester. (5) Publication of 9 papers related to this project.
|
Academic Significance and Societal Importance of the Research Achievements |
Genome data is increasingly used across many fields of science. NAF greatly increases the efficiency of working with such data compared to previous formats. This project applied, improved, and advanced NAF towards becoming the fundamental infrastructure tool for the next generation of genome databases.
|