Research on Speeding Up String Processing for Semi-Structured Data

Research Project

Project/Area Number	14780224
Research Category	Grant-in-Aid for Young Scientists (B)
Allocation Type	Single-year Grants
Research Field	Computer Science
Research Institution	Hokkaido University (2004); Kyushu University (2002-2003)
Principal Investigator	Takuya Kida (喜田拓也), Hokkaido University, Graduate School of Information Science and Technology, Associate Professor (70343316)
Project Period (FY)	2002 – 2004
Project Status	Completed (Fiscal Year 2004)
Budget Amount	¥3,000,000 (Direct Cost: ¥3,000,000); Fiscal Year 2004: ¥600,000 (Direct Cost: ¥600,000); Fiscal Year 2003: ¥1,600,000 (Direct Cost: ¥1,600,000); Fiscal Year 2002: ¥800,000 (Direct Cost: ¥800,000)
Keywords	String Matching / Semi-Structured Data / HTML, XML / Data Compression / Pattern Matching / String Processing / Grammar-Transform-Based Compression / Ontology / VLDC Patterns / Hamming Distance / Aho-Corasick Pattern Matching Machine / Japanese String Matching
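Research Abstract	HTML files, widely used on the WWW, are semi-structured data whose internal representation is a tree structure built from tags. XML files, which emerged as a successor to HTML and are now drawing attention as a common format for data exchange between applications, are semi-structured data of the same kind. Until now, string processing on semi-structured data has mainly proceeded by first extracting the tree structure from the text, then applying morphological analysis to the text contained in the tag elements, or cutting out substrings and N-grams, then building an index structure from the result and using that index for operations such as string matching. However, this approach takes time to build the index structure, and the index must be rebuilt every time the original data changes. The aim of this research is to develop methods that perform string processing directly on semi-structured data without using an index structure. To that end, we identify the string matching operations required for semi-structured data and develop faster algorithms for each of them. For example, search requests on semi-structured data may include string matching restricted to data under a particular hierarchical structure, or locating tags with special names in an XML file. Realizing such matching operations leads to applications such as fast replacement of tags and data, and fast data mining from large-scale text. In this fiscal year, we worked on grammar-transform-based data compression, one of the frameworks for data compression suited to string matching on semi-structured data, and proposed a method that, in theory, compresses data in linear time using little space. In addition, as one application of string processing on semi-structured data, we addressed a new problem, string processing that takes ontologies into account, and developed an algorithm for string matching that takes into account taxonomic (classification hierarchy) data, one of the various kinds of ontology data.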
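
As a rough illustration of the kind of operation the abstract describes (string matching restricted to data under a particular tag hierarchy, performed without first building a tree or an index), the following is a minimal Python sketch. It is not the project's algorithm: the function name `find_under_path`, the `TOKEN` pattern, and the sample document are all assumptions made for this example. The tokenizer is deliberately simplified and does not handle comments, CDATA sections, or malformed markup.

```python
import re

# Hypothetical, simplified tokenizer: a token is either a tag
# (opening, closing, or self-closing) or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>|([^<]+)")

def find_under_path(text, path, keyword):
    """Yield offsets in `text` where `keyword` occurs inside character
    data whose enclosing open-tag stack ends with `path`.

    `path` is a non-empty list of tag names, outermost first.
    The text is scanned once, left to right; no tree or index is built.
    """
    stack = []  # names of the currently open tags
    for m in TOKEN.finditer(text):
        closing, name, self_closing, data = m.groups()
        if data is not None:
            # Character data: search it only if we are under `path`.
            if stack[-len(path):] == path:
                pos = data.find(keyword)
                while pos != -1:
                    yield m.start(4) + pos
                    pos = data.find(keyword, pos + 1)
        elif closing:
            # Closing tag: pop only when it matches the innermost open tag.
            if stack and stack[-1] == name:
                stack.pop()
        elif not self_closing:
            # Opening tag: remember it until its closing tag is seen.
            stack.append(name)

sample = ("<doc><title>XML basics</title>"
          "<body><p>XML and HTML</p></body></doc>")
hits = list(find_under_path(sample, ["body", "p"], "XML"))
print(hits)
```

In this sample, only the occurrence of "XML" inside `<body><p>` is reported; the occurrence in `<title>` is skipped because the open-tag stack at that point does not end with the requested path. Because the scan keeps nothing but the stack of open tags, no structure needs rebuilding when the underlying text changes, which is the motivation the abstract gives for avoiding an index.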

Report

(3 results)

Research Products

(5 results)

[Journal Article] Mining Techniques for Data Streams (データストリームのためのマイニング技術) (2005)
- Author(s)
  Hiroki Arimura, Takuya Kida
- Journal Title
  
  Special Feature "Data Mining Technology", IPSJ Magazine (情報処理) (Einoshin Suzuki, Hisashi Kashima (eds.)) Vol. 46(1)
  
  Pages: 4-11
- NAID
  110002768327
- Related Report
  2004 Annual Research Report
[Journal Article] Pattern Matching with Taxonomic Information (2004)
- Author(s)
  T.Kida, H.Arimura
- Journal Title
  
  Proceedings of Asia Information Retrieval Symposium (AIRS2004)
  
  Pages: 265-268
- NAID
  120000959147
- Related Report
  2004 Annual Research Report
[Journal Article] A Space-Saving Linear-Time Algorithm for Grammar-Based Compression (2004)
- Author(s)
  H.Sakamoto, T.Kida, S.Shimozono
- Journal Title
  
  Proceedings of the 11th Symposium on String Processing and Information Retrieval (SPIRE2004) LNCS3246
  
  Pages: 218-229
- NAID
  110003178856
- Related Report
  2004 Annual Research Report
[Publications] T. Kida, et al.: "Collage system: A unifying framework for compressed pattern matching" Theoretical Computer Science. 298. 253-272 (2003)
- Related Report
  2003 Annual Research Report
[Publications] Masayuki Takeda, Satoru Miyamoto, Takuya Kida, et al.: "Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts" Proc. 9th International Symposium on String Processing and Information Retrieval. LNCS2476. 170-186 (2002)
- Related Report
  2002 Annual Research Report
