
Neural Machine Translation for User-Generated Content

Research Project

Project/Area Number 20K19879
Research Category

Grant-in-Aid for Early-Career Scientists

Allocation Type: Multi-year Fund
Review Section: Basic Section 61030: Intelligent informatics-related
Research Institution: National Institute of Information and Communications Technology

Principal Investigator

MARIE BENJAMIN  National Institute of Information and Communications Technology, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Researcher (30869433)

Project Period (FY) 2020-04-01 – 2022-03-31
Project Status Discontinued (Fiscal Year 2021)
Budget Amount
¥4,160,000 (Direct Cost: ¥3,200,000, Indirect Cost: ¥960,000)
Fiscal Year 2021: ¥520,000 (Direct Cost: ¥400,000, Indirect Cost: ¥120,000)
Fiscal Year 2020: ¥3,640,000 (Direct Cost: ¥2,800,000, Indirect Cost: ¥840,000)
Keywords: machine translation / COVID-19 / user-generated content / language model / Asian languages / user-generated text / deep learning / unsupervised learning / social media
Outline of Research at the Start

Machine translation has achieved significant advances during the last decade thanks to deep learning technologies and the establishment of neural machine translation (NMT). However, noisy user-generated content (UGC), for instance from online social networks, can still cause disastrous mistranslations in most NMT systems. NMT for UGC is an under-studied and challenging topic. This research will create new datasets of UGC for evaluating state-of-the-art NMT systems and will propose new methods to improve NMT for UGC through unsupervised machine translation and style-transfer technologies.

Outline of Annual Research Achievements

The main achievement of the second year of this research is a new method for extending monolingual data in a low-resource domain and style (e.g., tweets on the topic of COVID-19) to generate larger training data for NMT. For instance, given a small set of Japanese tweets about the COVID-19 crisis (e.g., 1,000 tweets), which is too small to train NMT, this method artificially extends it to millions of tweets on the same topic, making it useful for training better NMT for translating tweets. Training NMT on this artificial data yields systems that translate better even for domains and styles for which very little data is available. Experiments have been successfully conducted in various domains and styles (medical, IT, news, tweets, online discussions) and languages (French, German, Japanese). This work has also been extended to "personalizing" NMT, i.e., adapting NMT so that it translates texts written by a specific person while preserving that person's writing characteristics.
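The report does not detail the method itself, so as a rough, hypothetical illustration of the general idea of growing a small in-domain seed set (e.g., 1,000 COVID-19 tweets) from a much larger general corpus, the sketch below uses a standard cross-entropy-difference ranking (in the style of Moore-Lewis data selection): sentences from the general pool are kept when a simple language model trained on the seed data scores them much higher than a language model trained on the general data. The character-bigram models and all function names are assumptions for this example, not the project's actual implementation.

```python
import math
from collections import Counter

def char_bigram_lm(texts):
    """Train a character-bigram scorer with add-one smoothing.

    Returns a function mapping a string to its length-normalised
    log-probability under this toy language model.
    """
    bigrams, unigrams = Counter(), Counter()
    for t in texts:
        s = "^" + t  # "^" marks the start of a sentence
        unigrams.update(s)
        bigrams.update(zip(s, s[1:]))
    vocab = len(unigrams) + 1  # smoothing denominator

    def logprob(text):
        s = "^" + text
        return sum(
            math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
            for a, b in zip(s, s[1:])
        ) / max(len(text), 1)

    return logprob

def select_in_domain(seed, general_pool, k):
    """Rank the general pool by in-domain minus general-domain score
    and keep the top k sentences most similar to the seed set."""
    in_lm = char_bigram_lm(seed)
    gen_lm = char_bigram_lm(general_pool)
    ranked = sorted(general_pool, key=lambda s: in_lm(s) - gen_lm(s), reverse=True)
    return ranked[:k]

# Toy usage: COVID-19 tweets as the seed, a mixed pool as the general corpus.
seed = ["covid vaccine news", "covid cases rise"]
pool = ["covid vaccine update", "stock market falls today", "weather is sunny"]
print(select_in_domain(seed, pool, 1))
```

In practice the seed would be the small set of in-domain tweets, the pool a web-scale monolingual corpus, and the language models far stronger than character bigrams; the selected sentences would then serve as additional monolingual data for training NMT, e.g., via back-translation.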

Report

(2 results)
  • 2021 Annual Research Report
  • 2020 Research-status Report
  • Research Products

    (6 results)


Journal Article (2 results) (of which Int'l Joint Research: 2, Peer Reviewed: 2, Open Access: 2); Presentation (4 results) (of which Int'l Joint Research: 2)

  • [Journal Article] Extremely low-resource neural machine translation for Asian languages (2020)

    • Author(s)
      Rubino Raphael, Marie Benjamin, Dabre Raj, Fujita Atsushi, Utiyama Masao, Sumita Eiichiro
    • Journal Title

      Machine Translation

      Volume: 34 Issue: 4 Pages: 347-382

    • DOI

      10.1007/s10590-020-09258-6

    • Related Report
      2020 Research-status Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] Synthesizing Parallel Data of User-Generated Texts with Zero-Shot Neural Machine Translation (2020)

    • Author(s)
      Marie Benjamin, Fujita Atsushi
    • Journal Title

      Transactions of the Association for Computational Linguistics

      Volume: 8 Pages: 710-725

    • DOI

      10.1162/tacl_a_00341

    • Related Report
      2020 Research-status Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Presentation] Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers (2021)

    • Author(s)
      Benjamin Marie, Atsushi Fujita, Raphael Rubino
    • Organizer
      The 13th Advanced NLP Study Group (最先端NLP勉強会)
    • Related Report
      2021 Annual Research Report
  • [Presentation] Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers (2021)

    • Author(s)
      Benjamin Marie, Atsushi Fujita, Raphael Rubino
    • Organizer
      The 11th International Joint Conference on Natural Language Processing
    • Related Report
      2021 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Altering Parallel Data into User-Generated Texts with Zero-Shot Neural Machine Translation (2021)

    • Author(s)
      Marie Benjamin, Fujita Atsushi
    • Organizer
      The 27th Annual Meeting of the Association for Natural Language Processing (NLP2021)
    • Related Report
      2020 Research-status Report
  • [Presentation] Tagged Back-translation Revisited: Why Does It Really Work? (2020)

    • Author(s)
      Marie Benjamin, Rubino Raphael, Fujita Atsushi
    • Organizer
      The 58th Annual Meeting of the Association for Computational Linguistics
    • Related Report
      2020 Research-status Report
    • Int'l Joint Research
Published: 2020-04-28   Modified: 2022-12-28  
