2007 Fiscal Year Final Research Report Summary
Development of efficient knowledge discovery systems for large semistructured data
Project/Area Number | 17200011 |
Research Category | Grant-in-Aid for Scientific Research (A) |
Allocation Type | Single-year Grants |
Section | General |
Research Field | Intelligent informatics |
Research Institution | FUJITSU LABORATORIES LTD. |
Principal Investigator |
OKAMOTO Seishi FUJITSU LABORATORIES LTD., Knowledge Research Center, Senior Researcher (90399717)
|
Co-Investigator (Kenkyū-buntansha) |
TAKEDA Masayuki Kyushu University, Department of Informatics, Professor (50216909)
SHINOHARA Ayumi Tohoku University, Department of System Information Science, Professor (00226151)
KIDA Takuya Hokkaido University, Division of Computer Science, Associate Professor (70343316)
SAKAMOTO Hiroshi Kyushu Institute of Technology, Graduate School of Computer Science and Systems Engineering, Associate Professor (50315123)
HIRATA Kouichi Kyushu Institute of Technology, Graduate School of Computer Science and Systems Engineering, Associate Professor (20274558)
|
Project Period (FY) | 2005 – 2007 |
Keywords | Semistructured data / XML / Knowledge discovery / Pattern discovery / Pattern matching / Data compression |
Research Abstract |
With the rapid progress of Internet and Web service technologies, a new kind of massive data called semistructured data has emerged: collections of weakly structured electronic data such as Web pages and XML documents. In this research project, we studied efficient knowledge discovery systems for large semistructured data. First, we studied the theoretical foundations of learning and discovery for semistructured data. One of our main contributions concerns kernels for trees: we introduced a new kernel function for labeled ordered trees and showed a hardness result for designing tree kernels for more general labeled trees (JSAI Best Paper Award in 2006). Another important contribution concerns episode mining: we showed that an episode is parallel-free if and only if it is serially constructive. Next, we studied practical processing methods for semistructured data, such as pattern matching, text compression, and index structures. The main contributions are as follows. We devised efficient matching algorithms for path patterns based on one-way sequential processing. These algorithms run 2 to 6 times faster and are 6 times more space-efficient than XMLTK. We also proposed and implemented an efficient index structure for fast reachability tests on directed graphs (DEWS2007 Best Paper Award). Furthermore, we developed a new compressed pattern matching (CPM) algorithm that improves both the compression ratio and the search time ratio in comparison with a BPE-type CPM algorithm. Finally, we applied the theoretical and practical results of this project to knowledge discovery systems. We demonstrated that these applications work effectively in various areas such as bioinformatics, pharmacy, music, traffic, and security.
|
Research Products
(141 results)