FY2018 Annual Research Report

Corpora on Demand: Scalable Methods of Obtaining Linguistic Data

Research Project

Project/Area Number	17K12739
Research Institution	Tokyo Institute of Technology
Principal Investigator	Aleksandr Drozd, Tokyo Institute of Technology, School of Computing, Researcher (90740126)
Project Period (FY)	2017-04-01 – 2019-03-31
Keywords	corpora / embeddings / evaluation
Outline of Annual Research Achievements	We have developed a set of evaluation benchmarks for text representations. These benchmarks can be used to estimate the quality of text corpora, and to compare methods for constructing representations and their hyperparameters. We have implemented heuristics for filtering good-quality text fragments, together with utility scripts for parsing different formats of archived internet documents. Because internet documents may contain copyrighted and private information, we could not publish the raw corpora. Instead, we are publishing all of our source code for extracting raw text from different archive formats, as well as models (such as word embeddings) that we have trained on large-scale text. All resources and source code are available at the project website http://vecto.space