Project/Area Number |
17K12739
|
Research Category |
Grant-in-Aid for Young Scientists (B)
|
Allocation Type | Multi-year Fund |
Research Field |
Intelligent informatics
|
Research Institution | Tokyo Institute of Technology |
Principal Investigator |
Drozd Aleksandr Tokyo Institute of Technology, School of Computing, Researcher (90740126)
|
Project Period (FY) |
2017-04-01 – 2019-03-31
|
Project Status |
Completed (Fiscal Year 2018)
|
Budget Amount |
¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Fiscal Year 2018: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
Fiscal Year 2017: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
|
Keywords | corpora / evaluation / text representations / embeddings / benchmarking / natural language processing |
Outline of Final Research Achievements |
We have developed a set of evaluation benchmarks for text representations. These benchmarks can be used to estimate the quality of text corpora, of the methods used to construct the representations, and of the corresponding hyperparameters. We have implemented heuristics for filtering good-quality text fragments, along with utility scripts that help parse different formats of archived internet documents. Since internet documents might contain copyrighted and private information, we could not publish the raw corpora. Instead, we are publishing all of our source code for extracting raw text from different archive formats, as well as models (such as word embeddings) that we have trained on large-scale text. All resources and source code are available at the project website: http://vecto.space
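The record does not say which filtering heuristics the project used. As a rough illustration of what heuristic filtering of text fragments can look like, here is a minimal sketch. The function names, the three checks, and all thresholds are hypothetical, chosen only for this example, and are not taken from the project's code.

```python
# Hypothetical thresholds, chosen for illustration only.
MIN_WORDS = 5            # shorter fragments are treated as non-sentences
MIN_ALPHA_RATIO = 0.7    # minimum share of letters and whitespace
MAX_REPEAT_RATIO = 0.5   # maximum share taken by the most frequent word


def looks_like_good_text(fragment: str) -> bool:
    """Return True if a fragment passes a few simple quality heuristics."""
    words = fragment.split()
    if len(words) < MIN_WORDS:
        return False  # too short (also covers empty fragments)

    # Share of characters that are letters or whitespace.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in fragment)
    if alpha / len(fragment) < MIN_ALPHA_RATIO:
        return False  # mostly digits, markup, or symbols

    # Share of the fragment taken up by its most frequent word.
    lowered = [w.lower() for w in words]
    most_common = max(lowered.count(w) for w in set(lowered))
    if most_common / len(words) > MAX_REPEAT_RATIO:
        return False  # likely a menu, a list of links, or spam

    return True


def filter_fragments(fragments):
    """Keep only the fragments that pass the heuristics."""
    return [f for f in fragments if looks_like_good_text(f)]
```

Calling `filter_fragments` on a list of extracted strings would drop short navigation snippets, number-heavy lines, and highly repetitive text while keeping ordinary sentences. A real pipeline would likely tune such thresholds per language and per source.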
|
Academic Significance and Societal Importance of the Research Achievements |
This research project contributed to the development of methods for filtering textual data and evaluating text representations.
|