Data Compression: theoretical and practical approaches to the smallest grammar problem

Research Project

Project/Area Number	21K11745
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	General
Review Section	Basic Section 60010:Theory of informatics-related
Research Institution	Tohoku University
Principal Investigator	Ayumi Shinohara (篠原歩), Tohoku University, Graduate School of Information Sciences, Professor (00226151)
Project Period (FY)	2021-04-01 – 2025-03-31
Project Status	Granted (Fiscal Year 2023)
Budget Amount	¥4,290,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥990,000); Fiscal Year 2024: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000); Fiscal Year 2023: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000); Fiscal Year 2022: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000); Fiscal Year 2021: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000)
Keywords	data compression / string processing / machine learning / grammatical inference / query learning / smallest grammar problem / grammar compression / algorithms
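Outline of Research at the Start	This project addresses grammar compression, a representative form of lossless data compression, from both theoretical and applied perspectives. The smallest grammar problem is the combinatorial optimization problem of finding the smallest context-free grammar that generates only the string given as input. Various approximation algorithms for this problem have been proposed, and they form the technical foundation of high-performance data compression methods. This research generalizes existing grammar compression in two directions, stochastic grammar compression and higher-order compression, and explores methods for solving the resulting problems. On the theoretical side, we clarify the hardness of approximating this optimization problem and develop approximation algorithms. On the applied side, we implement the approximation algorithms developed here and build a lossless compression system that is useful on real data.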
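As a rough illustration of the grammar compression setting described in the outline above, the following sketch builds a small context-free grammar that generates exactly one input string. It uses a greedy pair-replacement heuristic in the style of Re-Pair, a well-known approximation approach; it is not the method developed in this project. The function names `repair`, `expand`, and `grammar_size` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def repair(text):
    """Greedy Re-Pair-style sketch (illustrative only).

    Returns (seq, rules): seq is the right-hand side of the start symbol,
    rules maps each integer nonterminal to a pair of symbols.
    Terminals are single-character strings; nonterminals are ints.
    """
    seq = list(text)
    rules = {}
    next_id = 0
    while len(seq) >= 2:
        # Count adjacent pairs (overlapping occurrences are counted too).
        pairs = Counter(zip(seq, seq[1:]))
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another rule would not help
        rules[next_id] = pair
        # Replace occurrences of the pair left to right, without overlap.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return seq, rules


def expand(symbols, rules):
    """Expand a sequence of symbols back into the string it derives."""
    parts = []
    for s in symbols:
        if s in rules:
            parts.append(expand(rules[s], rules))
        else:
            parts.append(s)
    return "".join(parts)


def grammar_size(seq, rules):
    """Grammar size measured as the total length of all right-hand sides."""
    return len(seq) + 2 * len(rules)
```

For the input "abababab", this sketch yields a grammar of size 6 (two rules plus a start rule of length 2) against 8 input characters, and `expand` recovers the original string. The heuristic gives no optimality guarantee; finding the true smallest grammar is the hard optimization problem the project studies.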
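Outline of Annual Research Achievements	This research addresses grammar compression, a representative form of lossless data compression, from both theoretical and applied perspectives. This fiscal year saw major progress in the related areas of string processing and machine learning theory. First, for the parameterized pattern matching problem, which finds structural matches while allowing characters to be renamed, we developed an algorithm that runs in small space without substantially sacrificing running time. To that end, we carried out a combinatorial analysis of the repetitive structure of parameterized strings and obtained a lemma characterizing the periodicity inherent in them. Using this property, we obtained a pattern matching algorithm that runs in linear time and logarithmic space when the alphabet size is regarded as constant. Second, for grammatical inference of formal languages, which is closely related to grammar compression, we proposed learning algorithms in two representative learning models that remain applicable even when the alphabet is extremely large or infinite. The first works in the framework of identification in the limit, which studies learnability when observations are given passively, and efficiently learns context-free languages satisfying substitutability from positive examples only. Because it requires no negative examples, it is the more broadly useful approach. The second works in the framework of query learning, which models exact identification of a target concept through queries to a teacher. In particular, under an incomplete teacher that may answer neither yes nor no to a query, it learns a succinct automaton separating two regular languages. We proved its correctness and demonstrated through computational experiments that it learns more efficiently than existing algorithms.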
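To make the notion of parameterized matching in the achievements above concrete, here is a minimal sketch based on the classical prev-encoding: each parameter character is replaced by the distance to its previous occurrence (0 if it has not occurred yet), and two equal-length strings match under a renaming of parameter characters exactly when their encodings are equal. The matcher below is deliberately naive, re-encoding every window in O(nm) time and O(m) extra space; it is not the linear-time, small-space algorithm reported by the project. The names `prev_encode`, `p_match_positions`, and the `static` argument are hypothetical.

```python
def prev_encode(s, static=frozenset()):
    """Prev-encoding of s.

    Characters in `static` are kept as-is; every other character is treated
    as a parameter and replaced by the distance to its previous occurrence,
    or 0 if this is its first occurrence.
    """
    last = {}
    out = []
    for i, c in enumerate(s):
        if c in static:
            out.append(c)
        else:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
    return out


def p_match_positions(text, pattern, static=frozenset()):
    """Naive parameterized matching: compare prev-encodings window by window."""
    m = len(pattern)
    target = prev_encode(pattern, static)
    return [
        i
        for i in range(len(text) - m + 1)
        if prev_encode(text[i:i + m], static) == target
    ]
```

For example, the pattern "aba" matches the text "xyxzyz" at positions 0 ("xyx") and 3 ("zyz"), because both windows have the same prev-encoding [0, 0, 2] as the pattern.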
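Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: The smallest grammar problem is the combinatorial optimization problem of finding the smallest context-free grammar that generates only the string given as input. Various approximation algorithms for this problem have been proposed, and they form the technical foundation of high-performance data compression methods. As described in the Outline of Annual Research Achievements, this fiscal year saw major progress in string processing and machine learning theory. In particular, for the parameterized pattern matching problem, we developed the first sublinear-space algorithm that runs in linear time when the alphabet size is regarded as constant. Lowering space complexity without worsening time complexity is an important challenge in processing large volumes of data, and a range of ripple effects can be expected. Meanwhile, the two approaches obtained for grammatical inference of formal languages both remain useful when the alphabet is extremely large or infinite, and they target learning problems over collections of numerical sequence data as a practical application. In addition, the incomplete-teacher model in query learning assumes an environment in which not all data can necessarily be observed correctly, and it provides a foundational technique for explicitly extracting grammatical structure from highly accurate black-box predictors obtained with techniques such as deep learning. Furthermore, regarding compression of real data, we are continuing to examine several approaches that help when many data items of the same kind are available, and we have implemented prototypes and are running preliminary experiments.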
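Strategy for Future Research Activity	Building on this fiscal year's results, we plan to develop the research further toward the final fiscal year. Regarding efficient implementation of grammar compression, including higher-order and stochastic grammars, we plan to investigate the string processing that is key to it more broadly and deeply. We will also consider connections with methods for solving string equations. In addition, we will conduct implementation experiments while looking for ways to use parameterized matching effectively in data compression, and explore its potential. Meanwhile, within the framework of formal language theory, we view data compression as grammatical inference and aim to further develop the learning theory of inferring, from given data, the grammar or finite automaton inherent in it. In settings such as inductive inference and query learning, we aim to deepen the theory and, from a practical standpoint, run computational experiments on numerical and time-series data to verify usefulness. Finally, we will continue experiments on the attempt to solve the combinatorial optimization problem of grammar minimization in practice using deep reinforcement learning and computing power, and plan to summarize the findings.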

Report

(3 results)

Research Products
(13 results)


Journal Article (4 results) (of which Peer Reviewed: 4 results, Open Access: 1 result); Presentation (9 results) (of which Int'l Joint Research: 7 results)

[Journal Article] Query Learning of Minimal Deterministic Symbolic Finite Automata Separating Regular Languages (2024)
- Author(s)
  Yoshito Kawasaki, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
- Journal Title
  Lecture Notes in Computer Science
  Volume: 14519 Pages: 340-354
- DOI
  10.1007/978-3-031-52113-3_24
- ISBN
  9783031521126, 9783031521133
- Related Report
  2023 Research-status Report
- Peer Reviewed
[Journal Article] Identification of Substitutable Context-Free Languages over Infinite Alphabets from Positive Data (2023)
- Author(s)
  Yutaro Numaya, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
- Journal Title
  Proceedings of Machine Learning Research
  Volume: 217 Pages: 23-34
- Related Report
  2023 Research-status Report
- Peer Reviewed / Open Access
[Journal Article] Efficient Parameterized Pattern Matching in Sublinear Space (2023)
- Author(s)
  Haruki Ideguchi, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
- Journal Title
  Lecture Notes in Computer Science
  Volume: 14240 Pages: 271-283
- DOI
  10.1007/978-3-031-43980-3_22
- ISBN
  9783031439797, 9783031439803
- Related Report
  2023 Research-status Report
- Peer Reviewed
[Journal Article] Parameterized DAWGs: Efficient constructions and bidirectional pattern searches (2022)
- Author(s)
  Katsuhito Nakashima, Noriki Fujisato, Diptarama Hendrian, Yuto Nakashima, Ryo Yoshinaka, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, Masayuki Takeda
- Journal Title
  Theoretical Computer Science
  Volume: 933 Pages: 21-42
- DOI
  10.1016/j.tcs.2022.09.008
- Related Report
  2022 Research-status Report
- Peer Reviewed
[Presentation] Inferring Strings from Position Heaps in Linear Time (2023)
- Author(s)
  Kumagai Koshiro, Hendrian Diptarama, Yoshinaka Ryo, Shinohara Ayumi
- Organizer
  The 17th International Conference and Workshops on Algorithms and Computation 2023 (WALCOM2023)
- Related Report
  2022 Research-status Report
- Int'l Joint Research
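[Presentation] Computing Minimal Unique Substrings and Maximal Repeats on Necklace Strings (ネックレス文字列上の極小単出現と極大反復出現の計算) (2023)
- Author(s)
  森竹涼樹, Koshiro Kumagai, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
- Organizer
  Winter LA Symposium (冬のLAシンポジウム)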
- Related Report
  2022 Research-status Report
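[Presentation] The Inverse Problem for EMOW-Type Position Heaps (EMOW型ポジションヒープの逆問題) (2023)
- Author(s)
  Koshiro Kumagai, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
- Organizer
  Winter LA Symposium (冬のLAシンポジウム)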
- Related Report
  2022 Research-status Report
[Presentation] Computing the Parameterized Burrows-Wheeler Transform Online (2022)
- Author(s)
  Hashimoto Daiki, Hendrian Diptarama, Koeppl Dominik, Yoshinaka Ryo, Shinohara Ayumi
- Organizer
  The 29th International Symposium on String Processing and Information Retrieval 2022 (SPIRE2022)
- Related Report
  2022 Research-status Report
- Int'l Joint Research
[Presentation] Parallel Algorithm for Pattern Matching Problems Under Substring Consistent Equivalence Relations (2022)
- Author(s)
  Jargalsaikhan Davaajav, Hendrian Diptarama, Yoshinaka Ryo, Shinohara Ayumi
- Organizer
  The 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)
- Related Report
  2022 Research-status Report
- Int'l Joint Research
[Presentation] Query Learning Algorithm for Symbolic Weighted Finite Automata (2021)
- Author(s)
  Kaito Suzuki, Diptarama Hendrian, Ryo Yoshinaka, Ayumi Shinohara
- Organizer
  The 15th International Conference on Grammatical Inference
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Inside-Outside Algorithm for Macro Grammars (2021)
- Author(s)
  Ryuta Kambe, Naoki Kobayashi, Ryosuke Sato, Ayumi Shinohara and Ryo Yoshinaka
- Organizer
  The 15th International Conference on Grammatical Inference
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Query Learning of Symbolic Weighted Finite Automata (2021)
- Author(s)
  Kaito Suzuki, Diptarama Hendrian, Ryo Yoshinaka and Ayumi Shinohara
- Organizer
  The 14th Annual Meeting of the Asian Association for Algorithms and Computation
- Related Report
  2021 Research-status Report
- Int'l Joint Research
[Presentation] Efficient Construction of Cryptarithm Catalogues over Deterministic Finite Automata (2021)
- Author(s)
  Koya Watanabe, Diptarama Hendrian, Ryo Yoshinaka, Takashi Horiyama and Ayumi Shinohara
- Organizer
  The 14th Annual Meeting of the Asian Association for Algorithms and Computation
- Related Report
  2021 Research-status Report
- Int'l Joint Research
