2018 Fiscal Year Final Research Report
Corpora on Demand: Scalable Methods of Obtaining Linguistic Data
Project/Area Number | 17K12739 |
Research Category | Grant-in-Aid for Young Scientists (B) |
Allocation Type | Multi-year Fund |
Research Field | Intelligent informatics |
Research Institution | Tokyo Institute of Technology |
Principal Investigator | Drozd Aleksandr, Tokyo Institute of Technology, School of Computing, Researcher (90740126) |
Project Period (FY) | 2017-04-01 – 2019-03-31 |
Keywords | corpora / evaluation / text representations |
Outline of Final Research Achievements |
We developed a set of evaluation benchmarks for text representations. These benchmarks can be used to estimate the quality of text corpora, and to compare methods for constructing representations and their hyperparameters. We implemented heuristics for filtering good-quality text fragments, along with utility scripts for parsing different formats of archived internet documents. Because internet documents may contain copyrighted and private information, we could not publish the raw corpora. Instead, we are publishing all our source code for extracting raw text from different archive formats, as well as models (such as word embeddings) that we trained on large-scale text. All resources and source code are available at the project website, http://vecto.space
|
Free Research Field | natural language processing |
Academic Significance and Societal Importance of the Research Achievements |
This research project contributed to the development of methods for filtering textual data and for evaluating text representations.
|