Fast Text Mining Based on Data Compression

Research Project

Project/Area Number	13780248
Research Category	Grant-in-Aid for Young Scientists (B)
Allocation Type	Single-year Grants
Research Field	Computer Science
Research Institution	Kyushu University
Principal Investigator	Masayuki Takeda, Kyushu University, Faculty of Information Science and Electrical Engineering, Associate Professor (50216909)
Project Period (FY)	2001 – 2002
Project Status	Completed (Fiscal Year 2002)
Budget Amount	¥2,400,000 (Direct Cost: ¥2,400,000) Fiscal Year 2002: ¥900,000 (Direct Cost: ¥900,000) Fiscal Year 2001: ¥1,500,000 (Direct Cost: ¥1,500,000)
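Keywords	Data compression / Machine discovery / Similarity measure / Pattern discovery / Computational complexity / Index / Compression and discovery / Pattern matching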
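Research Abstract	This project, titled "Fast Text Mining Based on Data Compression," pursued three research topics: (A) development of knowledge discovery methods based on string processing; (B) string data compression and knowledge discovery; (C) development of fundamental techniques for speeding up knowledge discovery. For (A), similarity measures usable in Japanese literature research and in music information processing were defined within SRS, the formal framework introduced in this project, and were then implemented and shown to be effective. In addition, the pattern discovery algorithm developed in this project was applied to genome data and yielded biological findings. For (B), a new implementation of Lempel-Ziv compression based on the DAWG, a representative index structure, was developed. Also, for the collage system, the formal framework introduced in this project to represent compressed texts, an efficient algorithm was developed that computes the edit distance on inputs given in that form. This method is expected to speed up homologous sequence search. For (C), work concentrated on text indexing: MASDAWG, a new index structure built on the DAWG, was developed and implemented, and with it the project solved the long-standing problem of enabling the machine learning system BONSAI to handle regular patterns. As a result, computations that had been practically infeasible now finish in realistic time, which makes more advanced knowledge discovery from nucleotide and amino acid sequences possible. Also under (C), recognizing the importance of discovery from semi-structured text data, the project devised an original method for efficiently processing large volumes of XML documents and built a prototype system. Even at its current stage the system is very fast and can be used not only for search but also for many concrete tasks such as data aggregation, transformation, and extraction.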
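For readers unfamiliar with the edit distance mentioned under (B), the following is a minimal sketch of the textbook dynamic program on ordinary, uncompressed strings with unit costs. It is not the project's algorithm, which operates on collage-system representations of compressed text; the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (unit-cost insert/delete/substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

This runs in time proportional to the product of the two string lengths, which is the cost that a method working directly on compressed input aims to reduce.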

Report

(2 results)

2002 Annual Research Report
2001 Annual Research Report

Research Products
(18 results)

All Publications (18 results)

[Publications] M.Takeda et al.: "Discovering instances of poetic allusion from anthologies of classical Japanese poems" Theoretical Computer Science. 292(2). 497-524 (2003)
- Related Report
  2002 Annual Research Report
[Publications] M.Takeda et al.: "Discovering characteristic expressions from literary works" Theoretical Computer Science. 292(2). 525-546 (2003)
- Related Report
  2002 Annual Research Report
[Publications] Y.Hayashi et al.: "Uniform characterization of polynomial-query learnabilities" Theoretical Computer Science. 292(2). 377-385 (2003)
- Related Report
  2002 Annual Research Report
[Publications] M.Hirao et al.: "A practical algorithm to find the best subsequence patterns" Theoretical Computer Science. 292(2). 465-479 (2003)
- Related Report
  2002 Annual Research Report
[Publications] T.Kida et al.: "A unifying framework for compressed pattern matching" Theoretical Computer Science. (to appear).
- Related Report
  2002 Annual Research Report
[Publications] H.Bannai et al.: "A String Pattern Regression Algorithm and Its Application to Pattern Discovery in Long Introns" Genome Informatics. 13. 3-11 (2002)
- Related Report
  2002 Annual Research Report
[Publications] S.Inenaga et al.: "Discovering Best Variable-Length-Don't-Care Patterns" Lecture Notes in Artificial Intelligence. 2534. 86-97 (2002)
- Related Report
  2002 Annual Research Report
[Publications] K.Baba et al.: "A Note on Randomized Algorithm for String Matching with Mismatches" Proc. The Prague Stringology Conference '02 (PSC'02). 9-17 (2002)
- Related Report
  2002 Annual Research Report
[Publications] S.Inenaga et al.: "Compact Directed Acyclic Word Graphs for a Sliding Window" Lecture Notes in Computer Science. 2476. 310-324 (2002)
- Related Report
  2002 Annual Research Report
[Publications] M.Takeda et al.: "Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts" Lecture Notes in Computer Science. 2476. 170-186 (2002)
- Related Report
  2002 Annual Research Report
[Publications] S.Inenaga et al.: "Space-Economical Construction of Index Structures for All-Suffixes of a String" Lecture Notes in Computer Science. 2420. 341-352 (2002)
- Related Report
  2002 Annual Research Report
[Publications] S.Inenaga et al.: "The Minimum DAWG for All Suffixes of a String and Its Applications" Lecture Notes in Computer Science. 2373. 151-165 (2002)
- Related Report
  2002 Annual Research Report
[Publications] M.Takeda et al.: "Mining from Literary Texts: Pattern Discovery and Similarity Computation" Lecture Notes in Computer Science. 2281. 520-533 (2002)
- Related Report
  2001 Annual Research Report
[Publications] T.Kadota et al.: "Musical Sequence Comparison for Melodic and Rhythmic Similarities" Proc. 8th International Symposium on String Processing and Information Retrieval. 111-122 (2001)
- Related Report
  2001 Annual Research Report
[Publications] S.Inenaga et al.: "On-Line Construction of Symmetric Compact Directed Acyclic Word Graphs" Proc. 8th International Symposium on String Processing and Information Retrieval. 96-110 (2001)
- Related Report
  2001 Annual Research Report
[Publications] H.Hori et al.: "Fragmentary Pattern Matching: Complexity, Algorithms and Applications for Analyzing Classic Literary Works" Proc. 12th Annual International Symposium on Algorithms and Computation. 719-730 (2001)
- Related Report
  2001 Annual Research Report
[Publications] K.Yamamoto et al.: "Discovering Repetitive Expressions and Affinities from Anthologies of Classical Japanese Poems" Lecture Notes in Artificial Intelligence. 2226. 413-425 (2001)
- Related Report
  2001 Annual Research Report
[Publications] M.Takeda: "String Resemblance System: A Unifying Framework for String Similarity with Applications to Literature and Music" Lecture Notes in Computer Science. 2089. 147-151 (2001)
- Related Report
  2001 Annual Research Report
