
FY 2017 Research Status Report

Corpora on Demand: Scalable Methods of Obtaining Linguistic Data

Research Project

Project/Area Number: 17K12739
Research Institution: Tokyo Institute of Technology

Principal Investigator

Aleksandr Drozd, Tokyo Institute of Technology, Global Scientific Information and Computing Center, Researcher (90740126)

Project Period (FY): 2017-04-01 – 2019-03-31
Keywords: benchmarking / evaluation / embeddings
Outline of Research Achievements

During FY 2017 we implemented a set of benchmarks for evaluating the quality of text corpora. Our toolbox includes benchmarks for part-of-speech tagging, named entity recognition, chunking, sentiment polarity classification, word similarity judgment, and analogical reasoning. All of these benchmarks are integrated into a publicly available library called VSMlib. Additionally, we worked on the interpretability of the benchmark results, i.e. linking them to specific linguistic properties captured by word embeddings.
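As an illustration of what a word similarity benchmark measures, here is a minimal sketch in plain Python. It is not VSMlib's API: the function names, the toy two-dimensional vectors, and the human scores are all hypothetical. It assumes the common setup in which cosine similarities between word vectors are compared with human similarity ratings using Spearman rank correlation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def word_similarity_score(embeddings, judged_pairs):
    """Correlate model similarities with human ratings, skipping unknown words."""
    model, human = [], []
    for w1, w2, rating in judged_pairs:
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            human.append(rating)
    return spearman(model, human)


# Hypothetical toy data, for illustration only.
embeddings = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "truck": [0.3, 0.9],
}
judged_pairs = [
    ("cat", "dog", 9.0),
    ("car", "truck", 8.5),
    ("cat", "car", 2.0),
    ("dog", "truck", 2.5),
]
score = word_similarity_score(embeddings, judged_pairs)
print(f"Spearman correlation: {score:.3f}")
```

A higher correlation means the embeddings order word pairs more like human annotators do. Under this setup, embeddings trained on different corpora could be compared on the same set of judged pairs.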
We also carried out preparatory engineering work on extracting raw text from Common Crawl archives, performing deduplication, etc.
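The report does not say which deduplication method was used. As a minimal sketch of one common approach, the Python example below removes exact duplicates by hashing a normalized form of each document. The normalization rules and the function names are assumptions made for illustration, not the project's actual pipeline.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Yield each document whose normalized content has not been seen before."""
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Hypothetical input, standing in for texts extracted from web archives.
docs = [
    "Hello   world.",
    "hello world.",
    "A different page.",
    "Hello world.\n",
]
unique = list(deduplicate(docs))
print(unique)
```

Storing fixed-size digests instead of full texts keeps memory use per document constant, which matters at web scale. Near-duplicate detection (for example, pages that differ only in boilerplate) would need a different technique such as shingling with MinHash, which this sketch does not attempt.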

Current Status of Research Progress (Category)

2: Progressing rather smoothly

Reason

The project was implemented roughly as planned in the original proposal.
The only impediment is that the amount of funding was reduced compared to what was requested in the application, particularly

Strategy for Future Research Activity

In FY 2018 we will focus on using the evaluation tools we have developed to produce actual corpora. We will keep improving the quality of our text filtering algorithms and evaluation benchmarks. Additionally, we will work on the performance and scalability of our toolchain.

Reason for Carryover to the Next Fiscal Year

We had to move 100,000 yen from the FY 2018 budget to the FY 2017 budget, as it was the minimum transferable amount, but we needed only about half of this sum. Transferring the remaining 48,143 yen to the FY 2018 budget would be very helpful, as the purchase of additional equipment is planned for FY 2018.

  • Research Products

    (5 results)


All: Journal Article (2 results; of which Int'l Joint Research: 2, Peer Reviewed: 2, Open Access: 2), Presentation (1 result), Remarks (2 results)

  • [Journal Article] Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings (2017)

    • Author(s)/Presenter(s)
      Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, Xiaoyong Du
    • Journal Title

      Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

      Volume: EMNLP 2017 Pages: 2421-2431

    • DOI

      10.18653/v1/D17-1257

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] The (Too Many) Problems of Analogical Reasoning with Word Vectors (2017)

    • Author(s)/Presenter(s)
      Anna Rogers, Aleksandr Drozd, Bofang Li
    • Journal Title

      Proceedings of the 6th Joint Conference on Lexical and Computational Semantics

      Volume: *SEM 2017 Pages: 135-148

    • DOI

      10.18653/v1/S17-1017

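    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Presentation] Overview of TSUBAME3.0, a Green Cloud Supercomputer Fusing HPC with Big Data and AI (HPCとビッグデータ・AIを融合するグリーン・クラウドスパコンTSUBAME3.0の概要) (2017)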

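    • Author(s)/Presenter(s)
      Satoshi Matsuoka, Toshio Endo, Akira Nukada, Shinichi Miura, Akihiro Nomura, Hitoshi Sato, Hideyuki Jitsumoto, Aleksandr Drozd
    • Organizer
      Information Processing Society of Japan (IPSJ) SIG Technical Report 2017-HPC-160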
  • [Remarks] Source code for the VSMlib library

    • URL

      https://github.com/undertherain/vsmlib

  • [Remarks] Tutorials and datasets related to VSMlib

    • URL

      http://vsm.blackbird.pw/


Published: 2018-12-17
