2001 Fiscal Year Final Research Report Summary

Development of Efficient Data Mining Systems for Large Semi-Structured Text Data

Research Project

Project/Area Number	11558040
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	Developmental Research
Research Field	Intelligent informatics
Research Institution	Kyushu University
Principal Investigator	ARIMURA Hiroki, Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Associate Professor (20222763)
Co-Investigator (Kenkyū-buntansha)	SHINOHARA Ayumi, Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Associate Professor (00226151); TAKEDA Masayuki, Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Associate Professor (50216909); SHOUDAI Takayoshi, Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Associate Professor (50226304); HIRATA Kouichi, Department of Artificial Intelligence, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Associate Professor (20274558); ISHINO Akira, Department of Informatics, Faculty of Information Science and Electrical Engineering, Kyushu University, Research Associate (10315129)
Project Period (FY)	1999 – 2001
Keywords	Web Mining / Semi-structured data / HTML / XML / Information extraction / Machine learning / Data compression / Pattern matching
Research Abstract	The goal of this research project is to devise an efficient semi-automatic tool that supports human discovery from large unstructured and semi-structured text data. To achieve this goal, we pursued the following three directions. 1. The central process of text mining is pattern discovery. We employed the framework of optimized pattern discovery and developed efficient and robust text mining algorithms that find simple combinatorial patterns in large unstructured texts. To implement these algorithms, we developed a text index structure, based on suffix arrays, that is suitable for text mining. Based on these technologies, we implemented a prototype system and ran computer experiments on Web data. 2. Another important technology for text is efficient pattern matching. As a theoretical foundation, we proposed a unified framework, called the collage system, for representing various dictionary-based compression methods. We developed both Knuth-Morris-Pratt type and Boyer-Moore type pattern matching algorithms within this framework. We also applied the framework to the Byte-Pair-Encoding compression method and to Sequitur; the former yields the fastest compressed pattern matching algorithm. 3. The final process of text mining is information extraction. From a theoretical point of view, we first formalized the information extraction problem for semi-structured data and then analyzed the power and the limitations of such tasks. We then developed efficient information extraction algorithms for various types of extraction rules, including tree wrappers and hedge patterns, and evaluated them through experiments on real-life semi-structured data on the Internet.
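
As a rough illustration of the suffix-array text index mentioned in direction 1, the sketch below builds a suffix array and uses binary search to locate all occurrences of a pattern, which is the basic counting step a pattern discovery algorithm could build on. It is a minimal sketch under stated assumptions, not the project's implementation: the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive sort-based construction is chosen for brevity rather than efficiency.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Illustrative naive construction: sorts the suffixes directly, which is
    simple but slow on large texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return the sorted start positions of pattern in text.

    Suffixes that begin with pattern form one contiguous block in the
    suffix array, so two binary searches locate the block boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

For example, after `sa = build_suffix_array("abracadabra")`, calling `find_occurrences("abracadabra", sa, "abra")` returns the two start positions of "abra". The length of the returned list is the pattern's frequency, which is the kind of statistic a pattern discovery procedure would evaluate for each candidate pattern.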

Research Products
(34 results)

[Publications] H. Arimura et al.: "Efficient Learning of Semi-Structured Data from Queries" Lecture Notes in Artificial Intelligence. 2225. 315-331 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] M. Takeda et al.: "Mining from Literary Texts: Pattern Discovery and Similarity Computation" Lecture Notes in Computer Science. 2281. 520-533 (2002)
- Description
  From the Research Report Summary (Japanese)
[Publications] T. Shoudai et al.: "Polynomial Time Algorithms for Finding Unordered Tree Patterns with Internal Variables" Lecture Notes in Computer Science. 2138. 335-346 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] K. Yamamoto et al.: "Discovering Repetitive Expressions and Affinities from Anthologies of Classical Japanese Poems" Lecture Notes in Artificial Intelligence. 2226. 413-425 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] A. Yamamoto et al.: "Deductive and Inductive Reasoning on Semi-Structured Documents Modeled with Hedges" Lecture Notes in Artificial Intelligence. 2157. 140-147 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] K. Hirata et al.: "Prediction-Preserving Reducibility with Membership Queries on Formal Languages" Lecture Notes in Computer Science. 2138. 172-183 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] T. Kadota et al.: "Musical Sequence Comparison for Melodic and Rhythmic Similarities" Proc. 8th International Symposium on String Processing and Information Retrieval (SPIRE2001). 111-122 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] S. Inenaga et al.: "On-Line Construction of Symmetric Compact Directed Acyclic Word Graphs" Proc. 8th International Symposium on String Processing and Information Retrieval (SPIRE2001). 96-110 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] H. Hori et al.: "Fragmentary Pattern Matching: Complexity, Algorithms and Applications for Analyzing Classic Literary Works" Proc. 12th Annual International Symposium on Algorithms and Computation (ISAAC'01). 719-730 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] M. Takeda: "String resemblance system: A unifying framework for string similarity with applications to literature and music" Lecture Notes in Computer Science. 2089. 147-151 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] T. Kida et al.: "Multiple pattern matching algorithms on collage system" Lecture Notes in Computer Science. 2089. 193-206 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] Tetsuya Nasukawa et al.: "Base Technology for Text Mining" Journal of Japanese Society for Artificial Intelligence. 16(2). 201-211 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] Hiroshi Sakamoto et al.: "Web Mining" Journal of Japanese Society for Artificial Intelligence. 16(2). 233-238 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] Hiroshi Sakamoto et al.: "Extracting Partial Structures from HTML Documents" Proc. the 14th Florida Artificial Intelligence Research Symposium (FLAIRS'2001). 264-268 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] Hiroki Arimura et al.: "Efficient Discovery of Proximity Patterns with Suffix Arrays" Lecture Notes in Computer Science. 2089. 152-156 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] Toru Kasai et al.: "Linear-time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications" Lecture Notes in Computer Science. 2089. 181-192 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] Katsuaki Taniguchi et al.: "Mining Semi-Structured Data by Path Expressions" Lecture Notes in Artificial Intelligence. 2226. 378-388 (2001)
- Description
  From the Research Report Summary (Japanese)
[Publications] H. Arimura et al.: "Efficient Learning of Semi-Structured Data from Queries" Lecture Notes in Artificial Intelligence. 2225. 315-331 (2001)
- Description
  From the Research Report Summary (English)
[Publications] M. Takeda et al.: "Mining from Literary Texts: Pattern Discovery and Similarity Computation" Lecture Notes in Computer Science. 2281. 520-533 (2002)
- Description
  From the Research Report Summary (English)
[Publications] T. Shoudai et al.: "Polynomial Time Algorithms for Finding Unordered Tree Patterns with Internal Variables" Lecture Notes in Computer Science. 2138. 335-346 (2001)
- Description
  From the Research Report Summary (English)
[Publications] K. Yamamoto et al.: "Discovering Repetitive Expressions and Affinities from Anthologies of Classical Japanese Poems" Lecture Notes in Artificial Intelligence. 2226. 413-425 (2001)
- Description
  From the Research Report Summary (English)
[Publications] A. Yamamoto et al.: "Deductive and Inductive Reasoning on Semi-Structured Documents Modeled with Hedges" Lecture Notes in Artificial Intelligence. 2157. 140-147 (2001)
- Description
  From the Research Report Summary (English)
[Publications] K. Hirata et al.: "Prediction-Preserving Reducibility with Membership Queries on Formal Languages" Lecture Notes in Computer Science. 2138. 172-183 (2001)
- Description
  From the Research Report Summary (English)
[Publications] T. Kadota et al.: "Musical Sequence Comparison for Melodic and Rhythmic Similarities" Proc. 8th International Symposium on String Processing and Information Retrieval (SPIRE2001). 111-122 (2001)
- Description
  From the Research Report Summary (English)
[Publications] S. Inenaga et al.: "On-Line Construction of Symmetric Compact Directed Acyclic Word Graphs" Proc. 8th International Symposium on String Processing and Information Retrieval (SPIRE2001). 96-110 (2001)
- Description
  From the Research Report Summary (English)
[Publications] H. Hori et al.: "Fragmentary Pattern Matching: Complexity, Algorithms and Applications for Analyzing Classic Literary Works" Proc. 12th Annual International Symposium on Algorithms and Computation (ISAAC'01). 719-730 (2001)
- Description
  From the Research Report Summary (English)
[Publications] M. Takeda: "String resemblance system: A unifying framework for string similarity with applications to literature and music" Lecture Notes in Computer Science. 2089. 147-151 (2001)
- Description
  From the Research Report Summary (English)
[Publications] T. Kida et al.: "Multiple pattern matching algorithms on collage system" Lecture Notes in Computer Science. 2089. 193-206 (2001)
- Description
  From the Research Report Summary (English)
[Publications] Tetsuya Nasukawa et al.: "Base Technology for Text Mining" Journal of Japanese Society for Artificial Intelligence. 16(2). 201-211 (2001)
- Description
  From the Research Report Summary (English)
[Publications] Hiroshi Sakamoto et al.: "Web Mining" Journal of Japanese Society for Artificial Intelligence. 16(2). 233-238 (2001)
- Description
  From the Research Report Summary (English)
[Publications] Hiroshi Sakamoto et al.: "Extracting Partial Structures from HTML Documents" Proc. the 14th Florida Artificial Intelligence Research Symposium (FLAIRS'2001). 264-268 (2001)
- Description
  From the Research Report Summary (English)
[Publications] Hiroki Arimura et al.: "Efficient Discovery of Proximity Patterns with Suffix Arrays" Lecture Notes in Computer Science. 2089. 152-156 (2001)
- Description
  From the Research Report Summary (English)
[Publications] Toru Kasai et al.: "Linear-time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications" Lecture Notes in Computer Science. 2089. 181-192 (2001)
- Description
  From the Research Report Summary (English)
[Publications] Katsuaki Taniguchi et al.: "Mining Semi-Structured Data by Path Expressions" Lecture Notes in Artificial Intelligence. 2226. 378-388 (2001)
- Description
  From the Research Report Summary (English)
