Corpora on Demand: Scalable Methods of Obtaining Linguistic Data

Research Project

Project/Area Number 17K12739
Research Category

Grant-in-Aid for Young Scientists (B)

Allocation Type Multi-year Fund
Research Field Intelligent informatics
Research Institution Tokyo Institute of Technology

Principal Investigator

Drozd Aleksandr  Tokyo Institute of Technology, School of Computing, Researcher (90740126)

Project Period (FY) 2017-04-01 – 2019-03-31
Project Status Completed (Fiscal Year 2018)
Budget Amount
¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Fiscal Year 2018: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
Fiscal Year 2017: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Keywords corpora / evaluation / text representations / embeddings / benchmarking / natural language processing
Outline of Final Research Achievements

We developed a set of evaluation benchmarks for text representations. These benchmarks can be used to estimate the quality of text corpora, as well as of the methods used to construct the representations and their hyperparameters. We implemented heuristics for selecting good-quality text fragments, together with utility scripts for parsing the various formats of archived internet documents. Because internet documents may contain copyrighted and private information, we could not publish the raw corpora. Instead, we are publishing all of our source code for extracting raw text from the different archive formats, as well as the models (such as word embeddings) that we trained on large-scale text. All resources and source code are available at the project website, http://vecto.space.
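
The outline mentions heuristics for selecting good-quality text fragments but does not describe them. As a rough illustration of what such a filter might look like, the sketch below applies three simple checks (minimum length, share of alphabetic characters, share of repeated tokens). The function names and every threshold are hypothetical and are not taken from the project's code.

```python
# Hypothetical thresholds, chosen for illustration only; the project's
# actual heuristics are not described in this record.
MIN_WORDS = 5           # shorter fragments are unlikely to be running text
MIN_ALPHA_RATIO = 0.7   # minimum share of letters and whitespace
MAX_REPEAT_RATIO = 0.5  # maximum share of repeated tokens


def looks_like_good_text(fragment: str) -> bool:
    """Return True if a fragment passes three simple quality checks."""
    words = fragment.split()
    if len(words) < MIN_WORDS:
        return False
    # A low share of letters and spaces suggests markup, numbers, or debris.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in fragment)
    if alpha / len(fragment) < MIN_ALPHA_RATIO:
        return False
    # Many repeated tokens suggest menus or spam-like content.
    unique = len({w.lower() for w in words})
    if 1 - unique / len(words) > MAX_REPEAT_RATIO:
        return False
    return True


def filter_fragments(fragments):
    """Keep only the fragments that pass all checks, preserving order."""
    return [f for f in fragments if looks_like_good_text(f)]


samples = [
    "The quick brown fox jumps over the lazy dog.",
    "Home | Login",
    "click click click click click click click click",
    "12 34 56 78 90 11 22 33",
]
kept = filter_fragments(samples)
```

With these assumed thresholds only the first sample is kept: the second is too short, the third is almost entirely repeated tokens, and the fourth contains no letters. Real filters would likely need language-specific rules, since whitespace tokenization does not suit Japanese text, for example.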

Academic Significance and Societal Importance of the Research Achievements

This research project contributed to the development of methods for filtering textual data and for evaluating text representations.

Report

(3 results)
  • 2018 Annual Research Report / Final Research Report (PDF)
  • 2017 Research-status Report

Research Products

(11 results)

Journal Article (2 results; of which Int'l Joint Research: 2, Peer Reviewed: 2, Open Access: 2), Presentation (3 results; of which Int'l Joint Research: 2), Remarks (3 results), Funded Workshop (3 results)

  • [Journal Article] Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings (2017)

    • Author(s)
      Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, Xiaoyong Du
    • Journal Title

      Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

      Volume: EMNLP 2017 Pages: 2421-2431

    • DOI

      10.18653/v1/d17-1257

    • Related Report
      2017 Research-status Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] The (too Many) Problems of Analogical Reasoning with Word Vectors (2017)

    • Author(s)
      Anna Rogers, Aleksandr Drozd, Bofang Li
    • Journal Title

      Proceedings of the 6th Joint Conference on Lexical and Computational Semantics

      Volume: *SEM 2017 Pages: 135-148

    • DOI

      10.18653/v1/s17-1017

    • Related Report
      2017 Research-status Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Presentation] Subcharacter information in Japanese embeddings: when is it worth it? (2018)

    • Author(s)
      Marzena Karpinska, Bofang Li, Anna Rogers, Aleksandr Drozd
    • Organizer
      Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP (RELNLP)
    • Related Report
      2018 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Subword-level composition functions for learning word embeddings (2018)

    • Author(s)
      Bofang Li, Aleksandr Drozd, Tao Liu, Xiaoyong Du
    • Organizer
      2nd Workshop on Subword and Character level models in NLP (SCLeM)
    • Related Report
      2018 Annual Research Report
    • Int'l Joint Research
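  • [Presentation] Overview of TSUBAME3.0, a Green Cloud Supercomputer Converging HPC with Big Data and AI (2017)

    • Author(s)
      松岡 聡, 遠藤 敏夫, 額田 彰, 三浦 信一, 野村 哲弘, 佐藤 仁, 實本 英之, Drozd Aleksandr
    • Organizer
      Information Processing Society of Japan (IPSJ), SIG Technical Report 2017-HPC-160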
    • Related Report
      2017 Research-status Report
  • [Remarks] vecto library

    • URL

      http://vecto.space

    • Related Report
      2018 Annual Research Report
  • [Remarks] source codes for VSMlib library

    • URL

      https://github.com/undertherain/vsmlib

    • Related Report
      2017 Research-status Report
  • [Remarks] tutorials and datasets related to VSMlib

    • URL

      http://vsm.blackbird.pw/

    • Related Report
      2017 Research-status Report
  • [Funded Workshop] Deep Learning from HPC Perspectives: Opportunities and Challenges (2018)

    • Related Report
      2018 Annual Research Report
  • [Funded Workshop] Distributional Compositional Semantics in the Age of Word Embeddings: Tasks, Resources and Methodology (2018)

    • Related Report
      2018 Annual Research Report
  • [Funded Workshop] The Third Workshop on Evaluating Vector Space Representations for NLP (2018)

    • Related Report
      2018 Annual Research Report

Published: 2017-04-28   Modified: 2020-03-30  
