2017 Fiscal Year Research-status Report
Corpora on Demand: Scalable Methods of Obtaining Linguistic Data
Project/Area Number |
17K12739
|
Research Institution | Tokyo Institute of Technology |
Principal Investigator |
Drozd Aleksandr, Tokyo Institute of Technology, Global Scientific Information and Computing Center, Researcher (90740126)
|
Project Period (FY) |
2017-04-01 – 2019-03-31
|
Keywords | benchmarking / evaluation / embeddings |
Outline of Annual Research Achievements |
During FY 2017 we implemented a set of benchmarks that can be applied to evaluating the quality of text corpora. Our toolbox includes benchmarks for part-of-speech tagging, named entity recognition, chunking, sentiment polarity classification, word similarity judgment and analogical reasoning. All of these benchmarks are integrated into a library called VSMlib, which is publicly available. Additionally, we have done some work on the interpretability of the results of these benchmarks, i.e. linking them to certain linguistic properties captured by word embeddings. We have also done some preparatory engineering work on extracting raw text from Common Crawl archives, performing deduplication, etc.
|
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
The project was implemented roughly as planned in the original proposal. The only impediment is that the amount of funding was reduced compared to what was requested in the application, particularly
|
Strategy for Future Research Activity |
In FY 2018 we will focus on using the evaluation tools we have developed to produce actual corpora. We will keep improving the quality of the text filtering algorithms and evaluation benchmarks. Additionally, we will work on the performance and scalability of our toolchain.
|
Causes of Carryover |
We had to move 100000 yen from the FY 2018 budget to the FY 2017 budget, as it was the minimum transferable amount, but we needed only about half of this sum. Transferring the remaining 48143 yen to the FY 2018 budget would be very helpful, as the purchase of additional equipment was planned for FY 2018.
|