Project/Area Number |
21300095
|
Research Category |
Grant-in-Aid for Scientific Research (B)
|
Allocation Type | Single-year Grants |
Section | General |
Research Field |
Library and information science/Humanistic social informatics
|
Research Institution | Keio University |
Principal Investigator |
UEDA Shuichi Keio University, Faculty of Letters, Professor (50134218)
|
Co-Investigator(Kenkyū-buntansha) |
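AGATA Teru Asia University, Faculty of International Relations, Associate Professor (80306505)
EIKEUCHI Atsushi University of Tsukuba, Graduate School of Library, Information and Media Studies, Associate Professor (80338607)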
|
Co-Investigator(Renkei-kenkyūsha) |
ISHIDA Emi Kyushu University, University Library, Associate Professor (50364815)
NOZUE Michiko Railway Technical Research Institute (foundation), Other Departments, Researcher (40426044)
|
Project Period (FY) |
2009 – 2011
|
Project Status |
Completed (Fiscal Year 2011)
|
Budget Amount |
¥17,940,000 (Direct Cost: ¥13,800,000, Indirect Cost: ¥4,140,000)
Fiscal Year 2011: ¥5,070,000 (Direct Cost: ¥3,900,000, Indirect Cost: ¥1,170,000)
Fiscal Year 2010: ¥6,890,000 (Direct Cost: ¥5,300,000, Indirect Cost: ¥1,590,000)
Fiscal Year 2009: ¥5,980,000 (Direct Cost: ¥4,600,000, Indirect Cost: ¥1,380,000)
|
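Keywords | academic papers / search engine (検索エンジン) / web structure / information retrieval / automatic classification / machine learning / scholarly information / search engine (サーチエンジン) / web |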
Research Abstract |
Open access scientific papers available on the Web can be searched through several search engines. For example, Google Scholar has higher coverage of the literature, although it does not necessarily guarantee free access to full text. We have developed and evaluated "Aletheia", a search engine for full-text academic papers. The system collects PDF files on a broad range of topics and automatically detects academic papers using classifiers based on text and structure features. We built a PDF database containing 3 million Japanese PDF files. To detect academic papers automatically, five types of Weka classifiers (AdaBoost, Decision Tree (C4.5), Naive Bayes, Random Forest, and Support Vector Machine) were trained separately on a 20,000-item test collection using 10-fold cross-validation. The features were generated using hand-built rules and consisted of three types: structure, URL, and content.
|
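The abstract describes rule-based features in three groups (structure, URL, content) and evaluation by 10-fold cross-validation. The following is a minimal sketch of both ideas in plain Python, not the project's implementation, which used Weka. The function names `extract_features` and `k_fold_indices`, every individual rule (page count, section numbering, `.ac.jp`/`.edu` hosts, cue words), and the example URL are assumptions made for illustration; the abstract does not list the actual rules.

```python
import re
from urllib.parse import urlparse


def extract_features(url, page_count, text):
    """Build one feature dict for a PDF from hypothetical hand-built rules.

    The three prefixes mirror the three feature types named in the
    abstract; the concrete rules below are illustrative assumptions.
    """
    host = urlparse(url).hostname or ""
    lowered = text.lower()
    return {
        # Structure features (assumed): length and numbered section headings.
        "struct_page_count": page_count,
        "struct_has_sections": int(
            bool(re.search(r"^\s*\d+\.?\s+\w+", text, re.MULTILINE))
        ),
        # URL feature (assumed): host looks like an academic institution.
        "url_academic_domain": int(host.endswith((".ac.jp", ".edu"))),
        # Content features (assumed): typical cue words of a paper.
        "content_has_abstract": int("abstract" in lowered),
        "content_has_references": int("references" in lowered),
    }


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Samples are dealt round-robin into k folds, so every sample is held
    out exactly once across the k iterations.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [
            j for f, fold in enumerate(folds) if f != held_out for j in fold
        ]
        yield train_idx, test_idx
```

In use, each of the 20,000 labelled items would be turned into a feature dict, and a classifier would be trained on `train_idx` and scored on `test_idx` for each of the ten splits, with the scores averaged. A real run would normally shuffle the items before splitting, which this sketch omits for brevity.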