Project/Area Number | 18K11423
Research Category | Grant-in-Aid for Scientific Research (C)
Allocation Type | Multi-year Fund
Section | General
Review Section | Basic Section 61030: Intelligent informatics-related
Research Institution | University of Tsukuba
Principal Investigator | YAMAMOTO Mikio, University of Tsukuba, Faculty of Engineering, Information and Systems, Professor (40210562)
Project Period (FY) | 2018-04-01 – 2021-03-31
Project Status | Completed (Fiscal Year 2020)
Budget Amount | ¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000)
Fiscal Year 2020: ¥1,560,000 (Direct Cost: ¥1,200,000, Indirect Cost: ¥360,000)
Fiscal Year 2019: ¥1,560,000 (Direct Cost: ¥1,200,000, Indirect Cost: ¥360,000)
Fiscal Year 2018: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000)
Keywords | ngram language model / double array / bidirectional placement / string matching / fine-grained parallelization / Boyer-Moore / Elias-Fano encoding / remapping / partially transposed double array
Outline of Final Research Achievements |
Implementing an ngram language model with the partially transposed double array is excellent in terms of both access speed and model size, but it has the drawback that building the model (the data structure) takes a very long time. The essential difficulty is that hundreds of millions to billions of child-node arrays, each containing gaps, must be packed into a single array without colliding with one another. Because these placements are strongly interdependent, it is hard to gain speed through techniques such as naive parallelization. In this study, we examined the properties of the partially transposed double array in depth, achieved faster model construction through a combination of acceleration methods, and at the same time attained a higher compression ratio.
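The collision-free placement just described is the bottleneck of double-array construction. As a rough illustration only, under the assumption of the classic brute-force placement step (not the acceleration methods developed in this project), the sketch below searches, for each trie node, a base offset such that every child label lands on a free slot of the single shared array. The function name find_base and the toy label sets are purely illustrative.

```cpp
// Minimal sketch of collision-free child placement in double-array
// construction (illustrative assumption; not the project's actual code).
#include <cstddef>
#include <cstdio>
#include <vector>

// Find the smallest base such that base + c is a free slot for every
// child label c of one node. Each candidate base must be checked against
// slots already occupied by other nodes, which is why the step is slow
// for billions of child arrays and hard to parallelize naively.
static std::size_t find_base(const std::vector<bool>& used,
                             const std::vector<int>& child_labels) {
    for (std::size_t base = 1;; ++base) {
        bool ok = true;
        for (int c : child_labels) {
            std::size_t slot = base + static_cast<std::size_t>(c);
            if (slot < used.size() && used[slot]) { ok = false; break; }
        }
        if (ok) return base;
    }
}

int main() {
    std::vector<bool> used(1 << 16, false);  // occupancy of the shared array
    used[0] = true;                          // slot 0 reserved

    // Child label sets (with gaps) of three hypothetical trie nodes.
    std::vector<std::vector<int>> nodes = {{1, 5, 9}, {2, 5}, {1, 3, 9}};

    for (const auto& labels : nodes) {
        std::size_t base = find_base(used, labels);
        for (int c : labels) used[base + static_cast<std::size_t>(c)] = true;
        std::printf("placed node at base %zu\n", base);
    }
    return 0;
}
```

Because each placement changes the occupancy seen by every later placement, independent workers cannot simply place nodes in parallel without coordination, which is the interdependence noted in the outline above.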
Academic Significance and Societal Importance of the Research Achievements |
Because ngram language models are a foundational technology for speech recognition and statistical machine translation, the significance of this work lies in making it possible to build fast, compact ngram language models in a short time. From a broader perspective, the double array is one way of implementing the trie, a widely used dictionary data structure, so this research is also useful for the broad range of applications that require huge dictionaries, in the sense that it realizes fast, compact tries over very large data.