Project/Area Number |
11558040
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | Developmental Research |
Research Field |
Intelligent informatics
|
Research Institution | Kyushu University |
Principal Investigator |
ARIMURA Hiroki Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Assoc. Prof. (20222763)
|
Co-Investigator(Kenkyū-buntansha) |
SHINOHARA Ayumi Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Assoc. Prof. (00226151)
TAKEDA Masayuki Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Assoc. Prof. (50216909)
SHOUDAI Takayoshi Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Assoc. Prof. (50226304)
HIRATA Kouichi Department of Artificial Intelligence, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Assoc. Prof. (20274558)
ISHINO Akira Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Res. Assoc. (10315129)
|
Project Period (FY) |
1999 – 2001
|
Keywords | Web Mining / Semi-structured data / HTML / XML / Information extraction / Machine learning / Data compression / Pattern matching |
Research Abstract |
The goal of this research project is to devise an efficient semi-automatic tool that supports human discovery from large unstructured and semi-structured text data. To achieve this goal, we pursued the following three directions.
1. The central process of text mining is pattern discovery. We employed the framework of optimized pattern discovery and developed efficient and robust text mining algorithms that find simple combinatorial patterns in large unstructured texts. To implement these algorithms, we developed a text index structure based on suffix arrays that is suited to text mining. Based on these technologies, we implemented a prototype system and ran computer experiments on Web data.
2. Another important technology for text is efficient pattern matching. We proposed a unified theoretical framework, called the Collage system, for representing various dictionary-based compression methods. We developed both Knuth-Morris-Pratt type and Boyer-Moore type pattern matching algorithms within this framework. We also applied the framework to the Byte-Pair-Encoding compression method and to Sequitur; the former yields the fastest compressed pattern matching algorithm.
3. The final process of text mining is information extraction. From a theoretical point of view, we first formalized the problem of information extraction from semi-structured data and then analyzed the power and the limitations of such tasks. We then developed efficient information extraction algorithms for various types of extraction rules, including tree wrappers and hedge patterns, and evaluated them through experiments on real-life semi-structured data on the Internet.
|
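As an illustration of the suffix-array text index mentioned in direction 1 of the abstract, the following is a minimal sketch of how such an index can answer "where does this substring occur?" queries. It is not the project's implementation: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction stands in for the more efficient construction a real text mining system would need.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction (sorting whole suffixes); assumed here only for
    clarity, not efficiency.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of `pattern` in `text`, in increasing order.

    Suffixes that begin with `pattern` form one contiguous block of the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    # Lower boundary: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper boundary: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "abra"))
```

The size of the matching block (`lo - start`) gives the occurrence count of a pattern without scanning the text, which is the kind of frequency information a pattern discovery procedure would query repeatedly.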