Corpora on Demand: Scalable Methods of Obtaining Linguistic Data

Research Project

Project/Area Number	17K12739
Research Category	Grant-in-Aid for Young Scientists (B)
Allocation Type	Multi-year Fund
Research Field	Intelligent informatics
Research Institution	Tokyo Institute of Technology
Principal Investigator	Drozd Aleksandr Tokyo Institute of Technology, School of Computing, Researcher (90740126)
Project Period (FY)	2017-04-01 – 2019-03-31
Project Status	Completed (Fiscal Year 2018)
Budget Amount	¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000); Fiscal Year 2018: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000); Fiscal Year 2017: ¥780,000 (Direct Cost: ¥600,000, Indirect Cost: ¥180,000)
Keywords	corpora / evaluation / text representations / embeddings / benchmarking / natural language processing
Outline of Final Research Achievements	We developed a set of evaluation benchmarks for text representations. These benchmarks can be used to estimate the quality of text corpora, of the methods used to construct representations, and of the corresponding hyperparameters. We implemented heuristics for selecting good-quality text fragments, along with utility scripts for parsing the different formats of archived internet documents. Because internet documents may contain copyrighted and private information, we could not publish the raw corpora. Instead, we are publishing all of our source code for extracting raw text from different archive formats, as well as the models (such as word embeddings) that we trained on large-scale text. All resources and source code are available at the project website, http://vecto.space
Academic Significance and Societal Importance of the Research Achievements	This research project contributed to the development of methods of filtering textual data and evaluating text representations.

Report

(3 results)

2018 Annual Research Report
Final Research Report (PDF)
2017 Research-status Report

Research Products
(11 results)

All Journal Article (2 results) (of which Int'l Joint Research: 2 results, Peer Reviewed: 2 results, Open Access: 2 results) Presentation (3 results) (of which Int'l Joint Research: 2 results) Remarks (3 results) Funded Workshop (3 results)

[Journal Article] Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings (2017)
- Author(s)
  Bofang Li, Tao Liu, Zhe Zhao, Buzhou Tang, Aleksandr Drozd, Anna Rogers, Xiaoyong Du
- Journal Title
  Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
  Volume: EMNLP 2017 Pages: 2421-2431
- DOI
  10.18653/v1/d17-1257
- Related Report
  2017 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] The (too Many) Problems of Analogical Reasoning with Word Vectors (2017)
- Author(s)
  Anna Rogers, Aleksandr Drozd and Bofang Li
- Journal Title
  Proceedings of the 6th Joint Conference on Lexical and Computational Semantics
  Volume: *SEM 2017 Pages: 135-148
- DOI
  10.18653/v1/s17-1017
- Related Report
  2017 Research-status Report
- Peer Reviewed / Open Access / Int'l Joint Research
[Presentation] Subcharacter information in Japanese embeddings: when is it worth it? (2018)
- Author(s)
  Marzena Karpinska, Bofang Li, Anna Rogers, and Aleksandr Drozd
- Organizer
  Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP (RELNLP)
- Related Report
  2018 Annual Research Report
- Int'l Joint Research
[Presentation] Subword-level composition functions for learning word embeddings (2018)
- Author(s)
  Bofang Li, Aleksandr Drozd, Tao Liu, and Xiaoyong Du
- Organizer
  2nd Workshop on Subword and Character level models in NLP (SCLeM)
- Related Report
  2018 Annual Research Report
- Int'l Joint Research
[Presentation] Overview of TSUBAME3.0, a Green Cloud Supercomputer Converging HPC with Big Data and AI (2017)
- Author(s)
  松岡聡, 遠藤敏夫, 額田彰, 三浦信一, 野村哲弘, 佐藤仁, 實本英之, Drozd Aleksandr
- Organizer
  IPSJ SIG Technical Report 2017-HPC-160
- Related Report
  2017 Research-status Report
[Remarks] vecto library
- URL
  http://vecto.space
- Related Report
  2018 Annual Research Report
[Remarks] source codes for VSMlib library
- URL
  https://github.com/undertherain/vsmlib
- Related Report
  2017 Research-status Report
[Remarks] tutorials and datasets related to VSMlib
- URL
  http://vsm.blackbird.pw/
- Related Report
  2017 Research-status Report
[Funded Workshop] Deep Learning from HPC Perspectives: Opportunities and Challenges (2018)
- Related Report
  2018 Annual Research Report
[Funded Workshop] Distributional Compositional Semantics in the Age of Word Embeddings: Tasks, Resources and Methodology (2018)
- Related Report
  2018 Annual Research Report
[Funded Workshop] The Third Workshop on Evaluating Vector Space Representations for NLP (2018)
- Related Report
  2018 Annual Research Report
