2003 Fiscal Year Final Research Report Summary

Development of an Intelligent Full-Text Retrieval System Based on Data Compression and Fast String Pattern Matching Algorithms

Research Project

Project/Area Number 13558029
Research Category

Grant-in-Aid for Scientific Research (B)

Allocation Type Single-year Grants
Section Developmental Research
Research Field Computer Science
Research Institution Kyushu University

Principal Investigator

SHINOHARA Ayumi  Kyushu University, Graduate School, Faculty of Information Science and Electrical Engineering, Department of Informatics, Associate Professor (00226151)

Co-Investigator(Kenkyū-buntansha) KIDA Takuya  Kyushu University Library, Lecturer (70343316)
SAKAMOTO Hiroshi  Kyushu Institute of Technology, Faculty of Computer Science and Systems Engineering, Associate Professor (50315123)
TAKEDA Masayuki  Kyushu University, Graduate School, Faculty of Information Science and Electrical Engineering, Department of Informatics, Associate Professor (50216909)
SHIMOZONO Shinichi  Kyushu Institute of Technology, Faculty of Computer Science and Systems Engineering, Associate Professor (70243988)
Project Period (FY) 2001 – 2003
Keywords Pattern matching algorithm / Data compression / Full-text retrieval system / Knowledge discovery / Optimal pattern discovery / Suffix tree / Indexing structure / Machine learning
Research Abstract

Suffix trees and Directed Acyclic Word Graphs (DAWGs) are well-known, efficient indexing structures for strings. We focused on Compact Directed Acyclic Word Graphs (CDAWGs), which are more compact indexing structures, and presented online construction algorithms for them. We also presented an online construction algorithm for an indexing structure that comprises the DAWGs of all prefixes of a given string, and proved a lower bound on the number of states of subsequence automata, which accept all subsequences of a given string. We then introduced a new implementation technique for DAWGs based on ternary trees, which balances space efficiency and search time for texts over a large alphabet, such as Japanese.
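To make the notion of online DAWG construction concrete, the following is a minimal sketch of the classic textbook construction of a plain DAWG (also known as a suffix automaton), extended one character at a time. It is not the CDAWG algorithm developed in this project, and the names `Dawg`, `extend`, `contains`, and `build_dawg` are hypothetical, chosen only for illustration.

```python
class Dawg:
    """Plain DAWG (suffix automaton) built online, one character at a time."""

    def __init__(self):
        self.next = [{}]      # outgoing transitions of each state
        self.link = [-1]      # suffix link of each state
        self.length = [0]     # length of the longest string reaching each state
        self.last = 0         # state that represents the whole text read so far

    def _new_state(self, transitions, length, link):
        self.next.append(transitions)
        self.length.append(length)
        self.link.append(link)
        return len(self.next) - 1

    def extend(self, ch):
        """Append one character and update the automaton."""
        cur = self._new_state({}, self.length[self.last] + 1, -1)
        p = self.last
        # Add the new transition along the suffix-link chain.
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q by cloning it with a shorter length.
                clone = self._new_state(dict(self.next[q]),
                                        self.length[p] + 1, self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def contains(self, pattern):
        """Return True if pattern is a substring of the indexed text."""
        state = 0
        for ch in pattern:
            if ch not in self.next[state]:
                return False
            state = self.next[state][ch]
        return True


def build_dawg(text):
    dawg = Dawg()
    for ch in text:
        dawg.extend(ch)
    return dawg
```

For example, `build_dawg("abcbc").contains("bcb")` is true, while `contains("cbb")` is false. Transitions are stored in per-state dictionaries here for brevity; the ternary-tree technique mentioned above is a different representation aimed at large alphabets.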
We proposed an inverse problem: inferring an original string from a given unlabelled graph that corresponds to an indexing structure of the string. We presented linear-time algorithms for DAWGs, subsequence automata, and suffix arrays in this setting. Moreover, we proved a tight upper bound on the length of solutions of word equations containing one variable.
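For the suffix-array case, the inverse problem can be illustrated with a short sketch: given a permutation, produce a string whose suffix array is that permutation. The greedy rule below is a commonly described approach and is offered as an illustration only; it is not claimed to be the exact algorithm of this project. The names `infer_string` and `suffix_array` are hypothetical, and `suffix_array` is a naive helper used only to check the result.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def infer_string(sa):
    """Return a string whose suffix array equals sa (a permutation of 0..n-1).

    Walk the suffixes in sorted order and reuse the current letter as long
    as the order of the two suffixes is already forced by their tails;
    otherwise move on to the next letter.
    """
    n = len(sa)
    rank = [0] * (n + 1)
    for r, pos in enumerate(sa):
        rank[pos] = r
    rank[n] = -1                      # the empty suffix sorts before all others
    out = [""] * n
    code = 0
    for r, pos in enumerate(sa):
        if r > 0:
            prev = sa[r - 1]
            # If the tail of the previous suffix does not sort before the
            # tail of this one, equal first letters would break the order.
            if rank[prev + 1] > rank[pos + 1]:
                code += 1
        out[pos] = chr(ord("a") + code)
    return "".join(out)
```

As a usage example, the suffix array of "banana" is `[5, 3, 1, 0, 4, 2]`, and `infer_string([5, 3, 1, 0, 4, 2])` returns "bacaca", a different string with the same suffix array. The sketch runs in time linear in the length of the permutation.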
Concerning data compression, we presented a space-efficient algorithm that outputs a compact context-free grammar representing a given string, and proved its approximation ratio. We also presented a linear-time compression algorithm based on a longest-first replacement heuristic.
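The idea of longest-first replacement can be sketched as follows: repeatedly find the longest factor that occurs at least twice without overlap and replace its occurrences with a new nonterminal. This is a deliberately naive, polynomial-time illustration, not the linear-time algorithm of this project; as a simplifying assumption it searches for repeats only in the start sequence and not inside rule bodies. All function names are hypothetical.

```python
def longest_repeat(seq):
    """Longest factor (length >= 2) with two non-overlapping occurrences."""
    n = len(seq)
    for length in range(n // 2, 1, -1):
        first_seen = {}
        for i in range(n - length + 1):
            key = tuple(seq[i:i + length])
            first = first_seen.setdefault(key, i)
            if i >= first + length:       # does not overlap the first occurrence
                return key
    return None


def replace_all(seq, factor, symbol):
    """Replace non-overlapping occurrences of factor, scanning left to right."""
    out, i, k = [], 0, len(factor)
    while i < len(seq):
        if tuple(seq[i:i + k]) == factor:
            out.append(symbol)
            i += k
        else:
            out.append(seq[i])
            i += 1
    return out


def longest_first_grammar(text):
    """Return (start sequence, rules); nonterminals are ints, terminals are chars."""
    seq, rules, next_id = list(text), {}, 0
    while True:
        factor = longest_repeat(seq)
        if factor is None:
            return seq, rules
        rules[next_id] = list(factor)
        seq = replace_all(seq, factor, next_id)
        next_id += 1


def expand(seq, rules):
    """Decompress: expand every nonterminal back into text."""
    return "".join(expand(rules[s], rules) if isinstance(s, int) else s
                   for s in seq)
```

For the input "abcabcabcxyzxyz", this sketch yields the start sequence `[0, 0, 0, 1, 1]` with rules `0 -> abc` and `1 -> xyz`, and `expand` restores the original text.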
To find patterns in large databases in reasonable time, we developed several algorithms for classes of generalized patterns. In particular, we proposed an efficient pattern discovery algorithm that allows a small number of mismatches between the pattern and the data, and verified its practicality through a series of computational experiments.
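A small sketch may help to show what pattern discovery with mismatches means. The code below uses naive matching under Hamming distance and an exhaustive search over fixed-length candidates; it is not the efficient algorithm of this project. The scoring function (matched positive examples minus matched negative examples) and the names `matches_with_mismatches` and `best_pattern` are assumptions made for illustration.

```python
def matches_with_mismatches(text, pattern, k):
    """Start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break             # too many mismatches at this position
        if mismatches <= k:
            hits.append(i)
    return hits


def best_pattern(positives, negatives, length, k):
    """Pick the fixed-length substring that best separates the two sets.

    Candidates are all substrings of the given length taken from the
    positive examples. Ties are broken by lexicographic order.
    """
    candidates = {s[i:i + length]
                  for s in positives
                  for i in range(len(s) - length + 1)}

    def score(p):
        hit_pos = sum(bool(matches_with_mismatches(s, p, k)) for s in positives)
        hit_neg = sum(bool(matches_with_mismatches(s, p, k)) for s in negatives)
        return hit_pos - hit_neg

    return max(sorted(candidates), key=score)
```

For instance, `matches_with_mismatches("abcabd", "abd", 1)` returns `[0, 3]`, and `best_pattern(["xxabcxx", "yabdyy"], ["xxyyxy"], 3, 1)` returns "abc", which matches both positive examples within one mismatch and no negative example.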

Research Products

All Publications (62 results)

  • [Publications] Zdenek Tronicek et al.: "The Size of Subsequence Automaton" Lecture Notes in Computer Science. 2857. 304-310 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Shunsuke Inenaga et al.: "Linear-Time Off-Line Text Compression by Longest First Substitution" Lecture Notes in Computer Science. 2857. 137-152 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Masayuki Takeda et al.: "Discovering Most Classificatory Patterns for Very Expressive Pattern Classes" Lecture Notes in Computer Science. 2843. 486-493 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Makoto Toyomasu et al.: "Developing Dynamic Gaits for Four Legged Robots" Proc.International Symposium on Information Science and Electrical Engineering 2003. 577-580 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Hideo Bannai et al.: "Inferring Strings from Graphs and Arrays" Lecture Notes in Computer Science. 2747. 208-217 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Kensuke Baba et al.: "On the Length of the Minimum Solution of Word Equations in One Variable" Lecture Notes in Computer Science. 2747. 189-197 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Satoru Miyamoto et al.: "Ternary Directed Acyclic Word Graphs" Lecture Notes in Computer Science. 2843. 486-493 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Hiroshi Sakamoto: "A Fully Linear-Time Approximation Algorithm for Grammar-Based Compression" Proc.14th Annual Symposium on Combinatorial Pattern Matching (CPM 2003). 348-360 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Kensuke Baba et al.: "A Note on Randomized Algorithm for String Matching with Mismatches" Nordic Journal of Computing. Vol.10. 2-10 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Takuya Kida et al.: "Collage system : A unifying framework for compressed pattern matching" Theoretical Computer Science. Vol.298. 253-272 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Masahiro Hirao et al.: "A practical algorithm to find the best subsequences patterns" Theoretical Computer Science. Vol.292. 465-479 (2003)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Hideo Bannai et al.: "A String Pattern Regression Algorithm and Its Application to Pattern Discovery in Long Introns" In Genome Informatics (GIW2002). Vol.13. 3-11 (2002)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Shunsuke Inenaga et al.: "Discovering Best Variable-Length-Don't-Care Patterns" Lecture Notes in Computer Science. 2534. 86-97 (2002)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Kensuke Baba et al.: "A Note on Randomized Algorithm for String Matching with Mismatches" Proc.The Prague Stringology Conference '02 (PSC'02). 29-30 (2002)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Shunsuke Inenaga et al.: "Compact Directed Acyclic Word Graphs for a Sliding Window" Lecture Notes in Computer Science. 2476. 310-324 (2002)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Masayuki Takeda et al.: "Processing Text Files as Is : Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts" Lecture Notes in Computer Science. 2476. 170-186 (2002)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Shunsuke Inenaga et al.: "Space-Economical Construction of Index Structures for All Suffixes of a String" Lecture Notes in Computer Science. 2420. 341-352 (2002)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Shunsuke Inenaga et al.: "The Minimum DAWG for All Suffixes of a String and its Applications" Lecture Notes in Computer Science. 2373. 153-167 (2002)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Ayumi Shinohara et al.: "Finding Best Patterns Practically" Lecture Notes in Artificial Intelligence. 2281. 307-317 (2002)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Hideo Bannai et al.: "More Speed and More Pattern Variations for Knowledge Discovery System BONSAI" In Genome Informatics (GIW2001). Vol.12. 454-455 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Hideaki Hori et al.: "Fragmentary Pattern Matching : Complexity, Algorithms and Applications for Analyzing Classic Literary Works" Lecture Notes in Computer Science. 2223. 719-730 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Koichiro Yamamoto et al.: "Discovering repetitive expressions and affinities from anthologies of classical Japanese poems" Lecture Notes in Artificial Intelligence. 2226. 416-428 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Masahiro Hirao et al.: "A practical algorithm to find the best episode patterns" Lecture Notes in Artificial Intelligence. 2226. 435-440 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] T.Kadota et al.: "Musical Sequence Comparison for Melodic and Rhythmic Similarities" Proc.8th Symposium on String Processing and Information Retrieval (SPIRE2001). 111-122 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Shunsuke Inenaga et al.: "On-Line Construction of Symmetric Compact Directed Acyclic Word Graphs" Proc.8th Symposium on String Processing and Information Retrieval (SPIRE2001). 96-110 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Shunsuke Inenaga et al.: "Construction of the CDAWG for a Trie" Proc.the Prague Stringology Conference '01 (PSC'01). 37-48 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Shunsuke Inenaga et al.: "On-Line Construction of Compact Directed Acyclic Word Graphs" Lecture Notes in Computer Science. 2089. 169-180 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Takuya Kida et al.: "Multiple Pattern Matching Algorithms on Collage System" Lecture Notes in Computer Science. 2089. 193-206 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Hiroshi Sakamoto et al.: "Extracting Partial Structures from HTML Documents" Proc.14th International FLAIRS Conference : Knowledge Discovery and Data Mining. 264-268 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Katsuaki Taniguchi et al.: "Mining Semi-Structured Data by Path Expressions" Lecture Notes in Artificial Intelligence. 2226. 378-388 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Hiroki Arimura et al.: "Efficient Discovery of Proximity Patterns with Suffix Arrays" Lecture Notes in Computer Science. 2089. 152-156 (2001)

    • Description
      From the Final Research Report Summary (Japanese)
  • [Publications] Zdenek Tronicek, Ayumi Shinohara: "The Size of Subsequence Automaton" Lecture Notes in Computer Science. 2857(SPIRE 2003). 304-310 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Shunsuke Inenaga, Takashi Funamoto, Masayuki Takeda, Ayumi Shinohara: "Linear-Time Off-Line Text Compression by Longest-First Substitution" Lecture Notes in Computer Science. 2857(SPIRE 2003). 137-152 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Masayuki Takeda, Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, Setsuo Arikawa: "Discovering Most Classificatory Patterns for Very Expressive Pattern Classes" Lecture Notes in Computer Science. 2843(DS 2003). 486-493 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Makoto Toyomasu, Ayumi Shinohara: "Developing Dynamic Gaits for Four Legged Robots" Proc.International Symposium on Information Science and Electrical Engineering. 2003. 577-580 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Hideo Bannai, Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda: "Inferring Strings from Graphs and Arrays" Lecture Notes in Computer Science. 2747(MFCS2003). 208-217 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Kensuke Baba, Satoshi Tsuruta, Ayumi Shinohara, Masayuki Takeda: "On the Length of the Minimum Solution of Word Equations in One Variable" Lecture Notes in Computer Science. 2747(MFCS2003). 189-197 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Satoru Miyamoto, Shunsuke Inenaga, Masayuki Takeda, Ayumi Shinohara: "Ternary Directed Acyclic Word Graphs" Lecture Notes in Computer Science. 2843(CIAA2003). 486-493 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Hiroshi Sakamoto: "A Fully Linear-Time Approximation Algorithm for Grammar-Based Compression" Proc.14th Annual Symposium on Combinatorial Pattern Matching. (CPM 2003). 348-360 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Kensuke Baba, Ayumi Shinohara, Masayuki Takeda, Shunsuke Inenaga, Setsuo Arikawa: "A Note on Randomized Algorithm for String Matching with Mismatches" Nordic Journal of Computing. Vol.10. 2-10 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Takuya Kida, Tetsuya Matsumoto, Y.Shibata, Masayuki Takeda, Ayumi Shinohara, Setsuo Arikawa: "Collage system : A unifying framework for compressed pattern matching" Theoretical Computer Science. Vol.298, Issue 1. 253-272 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Masahiro Hirao, Hiromasa Hoshino, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: "A practical algorithm to find the best subsequences patterns" Theoretical Computer Science. Vol.292, Issue 2. 465-479 (2003)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Hideo Bannai, Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, Satoru Miyano: "A String Pattern Regression Algorithm and Its Application to Pattern Discovery in Long Introns" In Genome Informatics. Vol.13, (GIW2002). 3-11 (2002)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Shunsuke Inenaga, Hideo Bannai, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: "Discovering Best Variable-Length-Don't-Care Patterns" Lecture Notes in Computer Science. 2534(DS2002). 86-97 (2002)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Kensuke Baba, Ayumi Shinohara, Masayuki Takeda, Shunsuke Inenaga, Setsuo Arikawa: "A Note on Randomized Algorithm for String Matching with Mismatches" Proc.The Prague Stringology Conference '02. (PSC'02). 29-30 (2002)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: "Compact Directed Acyclic Word Graphs for a Sliding Window" Lecture Notes in Computer Science. 2476(SPIRE2002). 310-324 (2002)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Masayuki Takeda, Satoru Miyamoto, Takuya Kida, Ayumi Shinohara, Shuichi Fukamachi, Takeshi Shinohara, Setsuo Arikawa: "Processing Text Files as Is : Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts" Lecture Notes in Computer Science. 2476(SPIRE2002). 170-186 (2002)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, Hideo Bannai, Setsuo Arikawa: "Space-Economical Construction of Index Structures for All Suffixes of a String" Lecture Notes in Computer Science. 2420(MFCS2002). 341-352 (2002)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Shunsuke Inenaga, Masayuki Takeda, Ayumi Shinohara, Hiromasa Hoshino, Setsuo Arikawa: "The Minimum DAWG for All Suffixes of a String and its Applications" Lecture Notes in Computer Science. 2373(CPM2002). 153-167 (2002)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa, Masahiro Hirao, Hiromasa Hoshino, Shunsuke Inenaga: "Finding Best Patterns Practically" Lecture Notes in Artificial Intelligence (Final Report of the Japanese Discovery Science Project). 2281. 307-317 (2002)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Hideo Bannai, Keisuke Iida, Ayumi Shinohara, Masayuki Takeda, Satoru Miyano: "More Speed and More Pattern Variations for Knowledge Discovery System BONSAI" In Genome Informatics. Vol.12(GIW2001). 454-455 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Hideaki Hori, Shinichi Shimozono, Masayuki Takeda, Ayumi Shinohara: "Fragmentary Pattern Matching : Complexity, Algorithms and Applications for Analyzing Classic Literary Works" Lecture Notes in Computer Science. 2223(ISAAC'01). 719-730 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Koichiro Yamamoto, Masayuki Takeda, Ayumi Shinohara, Tomoko Fukuda, Ichiro Nanri: "Discovering repetitive expressions and affinities from anthologies of classical Japanese poems" Lecture Notes in Artificial Intelligence. 2226(DS2001). 416-428 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Masahiro Hirao, Shunsuke Inenaga, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: "A practical algorithm to find the best episode patterns" Lecture Notes in Artificial Intelligence. 2226(DS2001). 435-440 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] T.Kadota, Masahiro Hirao, A.Ishino, Masayuki Takeda, Ayumi Shinohara, F.Matsuo: "Musical Sequence Comparison for Melodic and Rhythmic Similarities" Proc.8th Symposium on String Processing and Information Retrieval. (SPIRE2001). 111-122 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Shunsuke Inenaga, Hiromasa Hoshino, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa, Giancarlo Mauri, Giulio Pavesi: "On-Line Construction of Symmetric Compact Directed Acyclic Word Graphs" Proc.8th Symposium on String Processing and Information Retrieval. (SPIRE2001). 96-110 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Shunsuke Inenaga, Hiromasa Hoshino, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: "Construction of the CDAWG for a Trie" Proc.the Prague Stringology Conference '01. (PSC'01). 37-48 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Shunsuke Inenaga, Hiromasa Hoshino, Ayumi Shinohara, Masayuki Takeda, Setsuo Arikawa: "On-Line Construction of Compact Directed Acyclic Word Graphs" Lecture Notes in Computer Science. 2089(CPM2001). 169-180 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Takuya Kida, Tetsuya Matsumoto, Masayuki Takeda, Ayumi Shinohara, Setsuo Arikawa: "Multiple Pattern Matching Algorithms on Collage System" Lecture Notes in Computer Science. 2089(CPM2001). 193-206 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Hiroshi Sakamoto, Hiroki Arimura, Setsuo Arikawa: "Extracting Partial Structures from HTML Documents" Proc.14th International FLAIRS Conference : Knowledge Discovery and Data Mining. 264-268 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Katsuaki Taniguchi, Hiroshi Sakamoto, Hiroki Arimura, Shinichi Shimozono, Setsuo Arikawa: "Mining Semi-Structured Data by Path Expressions" Lecture Notes in Artificial Intelligence. 2226(DS2001). 378-388 (2001)

    • Description
      From the Final Research Report Summary (English)
  • [Publications] Hiroki Arimura, Hiroki Asaka, Hiroshi Sakamoto, Setsuo Arikawa: "Efficient Discovery of Proximity Patterns with Suffix Arrays" Lecture Notes in Computer Science. 2089(CPM2001). 152-156 (2001)

    • Description
      From the Final Research Report Summary (English)

Published: 2005-04-19  