
2017 Fiscal Year Research-status Report

Corpora on Demand: Scalable Methods of Obtaining Linguistic Data

Research Project

Project/Area Number 17K12739
Research Institution: Tokyo Institute of Technology

Principal Investigator

Aleksandr Drozd, Tokyo Institute of Technology, Global Scientific Information and Computing Center, Researcher (90740126)

Project Period (FY) 2017-04-01 – 2019-03-31
Keywords: benchmarking / evaluation / embeddings
Outline of Annual Research Achievements

During FY 2017 we implemented a set of benchmarks that can be applied to evaluating the quality of text corpora. Our toolbox includes benchmarks for part-of-speech tagging, named entity recognition, chunking, sentiment polarity classification, word similarity judgment, and analogical reasoning. All these benchmarks are integrated into a publicly available library called VSMlib. Additionally, we did some work on the interpretability of the benchmark results, i.e. linking them to specific linguistic properties captured by word embeddings.
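To illustrate what an analogical reasoning benchmark measures, the following is a minimal sketch of the common vector-offset approach: for a question "a is to a* as b is to ?", the answer is taken to be the word whose vector is closest by cosine similarity to a* - a + b. This is not VSMlib's API; the function names `solve_analogy` and `analogy_accuracy` and the toy vectors are assumptions made purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, a_star, b):
    """Answer 'a : a_star :: b : ?' with the vector-offset method."""
    # Target vector: a_star - a + b, computed component-wise.
    target = [
        y - x + z
        for x, y, z in zip(embeddings[a], embeddings[a_star], embeddings[b])
    ]
    # The three question words are excluded from the candidate answers.
    candidates = [w for w in embeddings if w not in {a, a_star, b}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def analogy_accuracy(embeddings, questions):
    """Fraction of (a, a_star, b, expected) questions answered correctly."""
    correct = sum(
        solve_analogy(embeddings, a, a_star, b) == expected
        for a, a_star, b, expected in questions
    )
    return correct / len(questions)


# Hypothetical toy embeddings; real benchmarks use trained vectors.
TOY = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, -1.0],
}

QUESTIONS = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]

if __name__ == "__main__":
    print(analogy_accuracy(TOY, QUESTIONS))
```

A higher accuracy on such questions is usually read as a sign that the embeddings, and indirectly the corpus they were trained on, capture the relation being tested.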
We also did preparatory engineering work on extracting raw text from Common Crawl archives, performing deduplication, etc.
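As a rough illustration of the deduplication step, the sketch below drops documents whose normalized text has already been seen, using a hash of the content as the key. The report does not describe the method actually used; exact-match hashing, the `normalize` rule, and the function names are assumptions chosen for simplicity.

```python
import hashlib


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Yield each document whose normalized content was not seen before."""
    seen = set()
    for doc in documents:
        # Hashing keeps memory use bounded by digest size, not text size.
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


if __name__ == "__main__":
    pages = ["Hello  world", "hello world", "Other text"]
    for page in deduplicate(pages):
        print(page)
```

Exact-match hashing only catches copies that are identical after normalization; near-duplicate pages would need a fuzzier technique.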

Current Status of Research Progress

2: Research has progressed on the whole more than it was originally planned.

Reason

The project was implemented roughly as planned in the original proposal.
The only impediment is that the amount of funding was reduced compared to what was requested in the application, particularly

Strategy for Future Research Activity

In FY 2018 we will focus on using the developed evaluation tools to produce actual corpora. We will keep improving the quality of the text filtering algorithms and evaluation benchmarks. Additionally, we will work on the performance and scalability of our toolchain.

Causes of Carryover

We had to move 100000 yen from the FY 2018 budget to the FY 2017 budget, as it was the minimal transferable amount, but we needed only half of this sum. Transferring the remaining 48143 yen to the FY 2018 budget would be very helpful, as the purchase of additional equipment was planned for FY 2018.

  • Research Products

    (5 results)


All Journal Article (2 results) (of which Int'l Joint Research: 2 results,  Peer Reviewed: 2 results,  Open Access: 2 results) Presentation (1 results) Remarks (2 results)

  • [Journal Article] Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings (2017)

    • Author(s)
      Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, Xiaoyong Du
    • Journal Title

      Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

      Volume: EMNLP 2017 Pages: 2421-2431

    • DOI

      10.18653/v1/D17-1257

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] The (too Many) Problems of Analogical Reasoning with Word Vectors (2017)

    • Author(s)
      Anna Rogers, Aleksandr Drozd and Bofang Li
    • Journal Title

      Proceedings of the 6th Joint Conference on Lexical and Computational Semantics

      Volume: *SEM 2017 Pages: 135-148

    • DOI

      10.18653/v1/S17-1017

    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Presentation] Overview of TSUBAME3.0, a Green Cloud Supercomputer Converging HPC with Big Data and AI (2017)

    • Author(s)
      Satoshi Matsuoka, Toshio Endo, Akira Nukada, Shinichi Miura, Akihiro Nomura, Hitoshi Sato, Hideyuki Jitsumoto, Aleksandr Drozd
    • Organizer
      Information Processing Society of Japan (IPSJ), SIG Technical Report 2017-HPC-160
  • [Remarks] source codes for VSMlib library

    • URL

      https://github.com/undertherain/vsmlib

  • [Remarks] tutorials and datasets related to VSMlib

    • URL

      http://vsm.blackbird.pw/


Published: 2018-12-17  
