FY 2017 Research-status Report

Corpora on Demand: Scalable Methods of Obtaining Linguistic Data

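Research Project

Project/Area Number	17K12739
Research Institution	Tokyo Institute of Technology
Principal Investigator	Aleksandr Drozd, Tokyo Institute of Technology, Global Scientific Information and Computing Center, Researcher (90740126)
Project Period (FY)	2017-04-01 – 2019-03-31
Keywords	benchmarking / evaluation / embeddings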
Outline of Annual Research Achievements	During FY 2017 we implemented a set of benchmarks applicable to evaluating the quality of text corpora. Our toolbox includes benchmarks for part-of-speech tagging, named entity recognition, chunking, sentiment polarity classification, word similarity judgment, and analogical reasoning. All of these benchmarks are integrated into a publicly available library called VSMlib. In addition, we worked on the interpretability of the benchmark results, i.e. linking them to specific linguistic properties captured by word embeddings. We also did preparatory engineering work on extracting raw text from Common Crawl archives, performing deduplication, etc.
Current Status of Research Progress	2: Progressing largely as planned. Reason: The project has been implemented roughly as planned in the original proposal. The only impediment is that the amount of funding was decreased compared to what was requested in the application, particularly
Strategy for Future Research Activity	In FY 2018 we will focus on using the evaluation tools we have developed to produce actual corpora. We will keep improving the quality of the text filtering algorithms and evaluation benchmarks. Additionally, we will work on the performance and scalability of our toolchain.
Causes of Carryover	We had to move 100000 yen from the FY 2018 budget to the FY 2017 budget, as it was the minimum transferable amount, but we needed only half of this sum. Transferring the remaining 48143 yen to the FY 2018 budget would be very helpful, as the purchase of additional equipment is planned for FY 2018.

Research Products
(5 results)

Journal Article (2 results; of which Int'l Joint Research: 2, Peer Reviewed: 2, Open Access: 2) / Presentation (1 result) / Remarks (2 results)

[Journal Article] Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings (2017)
- Author(s)
  Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, Xiaoyong Du
- Journal Title
  Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
  Volume: EMNLP 2017 Pages: 2421-2431
- DOI
  10.18653/v1/D17-1257
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] The (Too Many) Problems of Analogical Reasoning with Word Vectors (2017)
- Author(s)
  Anna Rogers, Aleksandr Drozd, Bofang Li
- Journal Title
  Proceedings of the 6th Joint Conference on Lexical and Computational Semantics
  Volume: *SEM 2017 Pages: 135-148
- DOI
  10.18653/v1/S17-1017
- Peer Reviewed / Open Access / Int'l Joint Research
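[Presentation] Overview of TSUBAME3.0, a Green Cloud Supercomputer Fusing HPC with Big Data and AI (HPCとビッグデータ・AIを融合するグリーン・クラウドスパコンTSUBAME3.0の概要) (2017)
- Author(s)
  松岡聡, 遠藤敏夫, 額田彰, 三浦信一, 野村哲弘, 佐藤仁, 實本英之, Aleksandr Drozd
- Organizer
  IPSJ SIG Technical Reports (情報処理学会研究報告) 2017-HPC-160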
[Remarks] Source code for the VSMlib library
- URL
  https://github.com/undertherain/vsmlib
[Remarks] Tutorials and datasets related to VSMlib
- URL
  http://vsm.blackbird.pw/
