2018 Fiscal Year Annual Research Report
Corpora on Demand: Scalable Methods of Obtaining Linguistic Data
Project/Area Number |
17K12739
|
Research Institution | Tokyo Institute of Technology |
Principal Investigator |
ドローズド アレクサンドロ (Aleksandr Drozd), Tokyo Institute of Technology, School of Computing, Researcher (90740126)
|
Project Period (FY) |
2017-04-01 – 2019-03-31
|
Keywords | corpora / embeddings / evaluation |
Outline of Annual Research Achievements |
We have developed a set of evaluation benchmarks for text representations. These benchmarks can be used to estimate the quality of text corpora, of the methods used to construct the representations, and of the corresponding hyperparameters. We have also implemented heuristics for filtering good-quality text fragments, along with utility scripts that help parse the different formats of archived internet documents.
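As an illustration of what a filtering heuristic for text fragments can look like, the following is a minimal sketch in Python. It is not the project's actual implementation: the function name `looks_like_good_text`, the three rules, and all threshold values are assumptions chosen only for this example.

```python
from collections import Counter


def looks_like_good_text(fragment, min_words=5, min_alpha_ratio=0.7,
                         max_repeat_ratio=0.5):
    """Hypothetical quality check for a text fragment.

    All rules and thresholds are illustrative assumptions,
    not the heuristics used in the project.
    """
    words = fragment.lower().split()
    # Rule 1: very short fragments are usually menus, captions or debris.
    if len(words) < min_words:
        return False
    # Rule 2: most non-whitespace characters should be letters.
    non_space = [c for c in fragment if not c.isspace()]
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    if alpha_ratio < min_alpha_ratio:
        return False
    # Rule 3: a single token should not dominate the fragment.
    most_common_count = Counter(words).most_common(1)[0][1]
    if most_common_count / len(words) > max_repeat_ratio:
        return False
    return True


candidates = [
    "The quick brown fox jumps over the lazy dog.",
    "click click click click click click",
    "12345 67890 $$$ ### 00000 11111",
    "Too short",
]
kept = [c for c in candidates if looks_like_good_text(c)]
```

In this sketch only the first candidate is kept. In practice the thresholds would have to be tuned per language and per source, for example against the evaluation benchmarks described above.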
Because internet documents may contain copyrighted and private information, we could not publish the raw corpora. Instead, we are publishing all of our source code for extracting raw text from different archive formats, as well as the models (such as word embeddings) that we have trained on large-scale text.
All resources and source code are available at the project website: http://vecto.space
|