Project/Area Number |
13558029
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | Developmental Research |
Research Field |
Computer Science
|
Research Institution | Kyushu University |
Principal Investigator |
SHINOHARA Ayumi Kyushu University, Department of Informatics (Graduate School, Faculty of Information Science and Electrical Engineering), Associate Professor (00226151)
|
Co-Investigator (Kenkyū-buntansha) |
KIDA Takuya Kyushu University Library, Lecturer (70343316)
SAKAMOTO Hiroshi Kyushu Institute of Technology, Faculty of Computer Science and Systems Engineering, Associate Professor (50315123)
TAKEDA Masayuki Kyushu University, Department of Informatics (Graduate School, Faculty of Information Science and Electrical Engineering), Associate Professor (50216909)
SHIMOZONO Shinichi Kyushu Institute of Technology, Faculty of Computer Science and Systems Engineering, Associate Professor (70243988)
|
Project Period (FY) |
2001 – 2003
|
Project Status |
Completed (Fiscal Year 2003)
|
Budget Amount |
¥11,200,000 (Direct Cost: ¥11,200,000)
Fiscal Year 2003: ¥2,600,000 (Direct Cost: ¥2,600,000)
Fiscal Year 2002: ¥4,000,000 (Direct Cost: ¥4,000,000)
Fiscal Year 2001: ¥4,600,000 (Direct Cost: ¥4,600,000)
|
Keywords | Pattern matching algorithm / Data compression / Full-text retrieval system / Knowledge discovery / Optimal pattern discovery / Suffix tree / Indexing structure / Machine learning |
Research Abstract |
Suffix trees and Directed Acyclic Word Graphs (DAWGs) are well-known, efficient indexing structures for strings. We focused on Compact Directed Acyclic Word Graphs (CDAWGs), which are more compact indexing structures, and presented online construction algorithms for them. We also presented an online construction algorithm for an indexing structure consisting of the DAWGs for all prefixes of the given strings, and proved a lower bound on the number of states of subsequence automata accepting all subsequences of a given string. We then introduced a new implementation technique for DAWGs based on ternary trees, which balances space efficiency and search time for large alphabets, such as those of Japanese texts. We proposed an inverse problem in which the original string is inferred from a given unlabelled graph corresponding to an indexing structure of the string, and presented linear-time algorithms for DAWGs, subsequence automata, and suffix arrays in this setting. Moreover, we proved a tight upper bound on the length of solutions of word equations containing one variable. Concerning data compression, we presented a space-efficient algorithm that outputs a compact context-free grammar representing a given string, and proved its approximation ratio. We also presented a linear-time compression algorithm using a longest-first replacement heuristic. To find patterns in large databases in reasonable time, we developed several algorithms for classes of generalized patterns. In particular, we proposed an efficient pattern discovery algorithm that allows small mismatches between the pattern and the data, and verified its practicality through a series of computational experiments.
|