Pattern Discovery from Large Text Data Based on the Property of Languages Being Scale-Free

Research Project

Project/Area Number	19700150
Research Category	Grant-in-Aid for Young Scientists (B)
Allocation Type	Single-year Grants
Research Field	Intelligent informatics
Research Institution	Kyushu University
Principal Investigator	IKEDA Daisuke Kyushu University, Graduate School, Faculty of Information Science and Electrical Engineering, Associate Professor (00294992)
Project Period (FY)	2007 – 2008
Project Status	Completed (Fiscal Year 2008)
Budget Amount	¥3,750,000 (Direct Cost: ¥3,300,000, Indirect Cost: ¥450,000) Fiscal Year 2008: ¥1,950,000 (Direct Cost: ¥1,500,000, Indirect Cost: ¥450,000) Fiscal Year 2007: ¥1,800,000 (Direct Cost: ¥1,800,000)
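Keywords	Knowledge discovery and data mining / Text mining / Frequency estimation by substrings / Mining with background sets / Exceptional string discovery / Spam detection / Word salad / Z-score / Suffix tree
Research Abstract	The overarching goal of this research is to establish text mining methods that exploit the scale-free property and depend on neither the language nor the target domain. Toward this goal, we proposed two methods that discover patterns as combinations of variable-length strings, and demonstrated their effectiveness experimentally. The patterns used by the first method consist of multiple variable-length substrings that overlap one another. This method made it possible to detect artificially generated spam known as word salad, which had previously been difficult to detect. Because it extracts the parts that deviate from the ordinary frequency distribution, this method is close to the widely used approach based on deviation from the standard normal distribution (z-score). In addition, we applied the exceptional-pattern discovery framework studied in the data mining field to text, and showed through experiments on DNA sequences that it can discover patterns that the z-score approach could not find.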
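As a rough illustration of the z-score baseline that the abstract compares against (not the project's own method), the sketch below scores each length-k substring by how far its observed count deviates from the count expected under a simple background model. The function name `substring_zscores`, the independent-character background model, and the binomial variance that ignores overlapping occurrences are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def substring_zscores(text, k):
    """Return {substring: z-score} for every length-k substring of text.

    Assumption: characters are drawn independently with their empirical
    frequencies, and occurrence counts are treated as binomial.
    """
    n = len(text) - k + 1  # number of length-k windows
    if n <= 0:
        return {}
    # Empirical character probabilities serve as the background model.
    char_prob = {c: cnt / len(text) for c, cnt in Counter(text).items()}
    observed = Counter(text[i:i + k] for i in range(n))
    scores = {}
    for w, obs in observed.items():
        p = 1.0
        for c in w:
            p *= char_prob[c]
        expected = n * p
        var = n * p * (1.0 - p)  # binomial approximation, overlaps ignored
        if var > 0:
            scores[w] = (obs - expected) / sqrt(var)
    return scores


if __name__ == "__main__":
    scores = substring_zscores("ab" * 10, 2)
    for w, z in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(w, round(z, 2))
```

Substrings with a large absolute z-score are over- or under-represented relative to the background model. A suffix tree, listed among the keywords, would be the usual way to obtain counts for all substring lengths at once; the fixed-length `Counter` here is only a simplification.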

Report

(3 results)

2008 Annual Research Report
Final Research Report (PDF)
2007 Annual Research Report

Research Products
(5 results)

Journal Article (1 result) / Presentation (4 results)

[Journal Article] Unsupervised Spam Detection by Document Complexity Estimation (2008)
- Author(s)
  Takashi Uemura, Daisuke Ikeda and Hiroki Arimura
- Journal Title
  Proceedings of the 11th International Conference on Discovery Science, Lecture Notes in Artificial Intelligence, Springer-Verlag Vol. 5255
  Pages: 319-331
- Related Report
  2008 Final Research Report
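[Presentation] A Frequent Movement Pattern Mining Method Using CF-Suffix Trie (2009)
- Author(s)
  Yasuhiro Inada (稲田泰裕), Daisuke Ikeda (池田大輔), Einoshin Suzuki (鈴木英之進)
- Organizer
  9th Workshop on Data Mining and Statistical Mathematics
- Place of Presentation
  Kyoto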
- Year and Date
  2009-03-03
- Related Report
  2008 Final Research Report
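[Presentation] Knowledge Discovery from Dynamic Heterogeneous Systems by Time-Series Data Mining: Toward Building a Large-Scale Inductive Processing System for Space Weather Research (2009)
- Author(s)
  Terumasa Tokunaga (徳永旭将), Kazuyuki Nakamura (中村和幸), Tomoyuki Higuchi (樋口知之), Daisuke Ikeda (池田大輔), Sho Okubo (大久保翔), Akiko Fujimoto (藤本昌子), Akimasa Yoshikawa (吉川顕正), Kiyohumi Yumoto (湯元清文), MAGDAS/CPMN Group (Kiyohumi Yumoto)
- Organizer
  Japan Geoscience Union Meeting 2009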
- Related Report
  2008 Final Research Report
[Presentation] Unsupervised Spam Detection by Document Complexity Estimation (2008)
- Author(s)
  Uemura, Ikeda, and Arimura
- Organizer
  Discovery Science
- Place of Presentation
  Budapest (Hungary)
- Year and Date
  2008-10-16
- Related Report
  2008 Annual Research Report
[Presentation] Unsupervised Spam Detection by Document Complexity Estimation (2008)
- Author(s)
  Takashi Uemura, Daisuke Ikeda and Hiroki Arimura
- Organizer
  Proceedings of the 11th International Conference on Discovery Science, Lecture Notes in Artificial Intelligence, Springer-Verlag
- Related Report
  2008 Final Research Report
