2010 Fiscal Year Final Research Report
Development of a classification system for data analysis methods based on natural language processing and collective intelligence
Project/Area Number |
21700315
|
Research Category |
Grant-in-Aid for Young Scientists (B)
|
Allocation Type | Single-year Grants |
Research Field |
Statistical science
|
Research Institution | National Institute of Genetics |
Principal Investigator |
OGASAWARA Osamu National Institute of Genetics, Center for Information Biology and DNA Data Bank of Japan, Assistant Professor (00435512)
|
Project Period (FY) |
2009 – 2010
|
Keywords | database / natural language processing / statistical processing system |
Research Abstract |
As data measurement technology has advanced, data-intensive approaches have attracted increasing attention, especially in biology. At the same time, as computer performance has increased, so has the sophistication of statistical analysis and other data analysis methods. The fusion of data measurement and data analysis technologies is expected to have a profound impact on future biological research. In practice, however, it is difficult for experimental scientists, who are devoted to making the measurements that generate massive amounts of data but are not specialists in statistics, to make full use of cutting-edge statistical analysis methods. To remedy this problem, I have been publishing a database of statistical analysis procedures (the R Graphical Manual) since 2006. This database lets users browse the functionality of procedures in the R statistical system through images generated by running all the examples in the R statistical system, and it also provides full-text search of all documents in the R statistical system. The database has been highly acclaimed by users worldwide; in 2008 its visit statistics were 100,000 to 500,000 page views/month and 8,000 to 10,000 unique IPs/month. However, sufficient hardware and software resources had not been allocated to the database, despite the high computational demand of preparing its data. Thanks to the improved hardware and software environment provided by this project, the number of unique IPs has increased notably, to about 50,000 unique IPs/month (about 200,000 page views/month) in May 2011. Since DDBJ (maintained by the National Institute of Genetics) has about 17,000 unique IPs/month and KEGG (at Kyoto University) has about 200,000, the R Graphical Manual has grown in Japan into a database whose popularity is comparable to that of these well-known databases. In this project, I developed a classification system of statistical procedures taken from statistical dictionaries, textbooks, and manuals contained in the R Graphical Manual. To map the functions in the R Graphical Manual to the categories of this classification system, I developed a novel algorithm that improves the performance of named entity recognition (NER). I applied this algorithm to every manual entry in the R Graphical Manual to extract technical statistical terms, and then mapped each procedure entry to the classification categories.
|
Research Products
(4 results)