Project/Area Number | 18K11423
Research Category | Grant-in-Aid for Scientific Research (C)
Allocation Type | Multi-year Fund
Section | General
Review Section | Basic Section 61030: Intelligent informatics-related
Research Institution | University of Tsukuba
Principal Investigator | YAMAMOTO Mikio, University of Tsukuba, Faculty of Engineering, Information and Systems, Professor (40210562)
Project Period (FY) | 2018-04-01 – 2021-03-31
Project Status | Completed (Fiscal Year 2020)
Budget Amount | ¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000)
Fiscal Year 2020: ¥1,560,000 (Direct Cost: ¥1,200,000, Indirect Cost: ¥360,000)
Fiscal Year 2019: ¥1,560,000 (Direct Cost: ¥1,200,000, Indirect Cost: ¥360,000)
Fiscal Year 2018: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000)
Keywords | ngram language model / double array / bidirectional placement / string matching / fine-grained parallelization / Boyer-Moore / Elias-Fano encoding / remapping / partially transposed double array
Outline of Final Research Achievements |
Implementing an ngram language model with the partially transposed double array is excellent in terms of both access speed and model size, but it has the drawback that building the model (the data structure) takes a very long time. The essential difficulty is that hundreds of millions to billions of child-node arrays, each containing gaps, must be packed into a single array without colliding with one another. Because these placements are strongly interdependent, it is hard to gain speed through techniques such as naive parallelization. In this study, we examined the properties of the partially transposed double array in depth, achieved faster model construction through a combination of acceleration methods, and at the same time attained a higher compression ratio.
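The collision-free placement just described is the bottleneck of double-array construction. As a rough illustration only, under the assumption of the classic brute-force placement step (not the acceleration methods developed in this project), the sketch below searches, for each trie node, a base offset such that every child label lands on a free slot of the single shared array. The function name find_base and the toy label sets are purely illustrative.

```cpp
// Minimal sketch of collision-free child placement in double-array
// construction (illustrative assumption; not the project's actual code).
#include <cstddef>
#include <cstdio>
#include <vector>

// Find the smallest base such that base + c is a free slot for every
// child label c of one node. Each candidate base must be checked against
// slots already occupied by other nodes, which is why the step is slow
// for billions of child arrays and hard to parallelize naively.
static std::size_t find_base(const std::vector<bool>& used,
                             const std::vector<int>& child_labels) {
    for (std::size_t base = 1;; ++base) {
        bool ok = true;
        for (int c : child_labels) {
            std::size_t slot = base + static_cast<std::size_t>(c);
            if (slot < used.size() && used[slot]) { ok = false; break; }
        }
        if (ok) return base;
    }
}

int main() {
    std::vector<bool> used(1 << 16, false);  // occupancy of the shared array
    used[0] = true;                          // slot 0 reserved

    // Child label sets (with gaps) of three hypothetical trie nodes.
    std::vector<std::vector<int>> nodes = {{1, 5, 9}, {2, 5}, {1, 3, 9}};

    for (const auto& labels : nodes) {
        std::size_t base = find_base(used, labels);
        for (int c : labels) used[base + static_cast<std::size_t>(c)] = true;
        std::printf("placed node at base %zu\n", base);
    }
    return 0;
}
```

Because each placement changes the occupancy seen by every later placement, independent workers cannot simply place nodes in parallel without coordination, which is the interdependence noted in the outline above.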
Academic Significance and Societal Importance of the Research Achievements |
Because ngram language models are a foundational technology for speech recognition and statistical machine translation, the significance of this work lies in making it possible to build fast, compact ngram language models in a short time. From a broader perspective, the double array is one way of implementing the trie, a widely used dictionary data structure, so this research is also useful for the broad range of applications that require huge dictionaries, in the sense that it realizes fast, compact tries over very large data.