
Unifying Pre-training and Multilingual Semantic Representation Learning for Low-resource Neural Machine Translation

Research Project

Project/Area Number 22KJ1843
Project/Area Number (Other) 22J13719 (2022)
Research Category

Grant-in-Aid for JSPS Fellows

Allocation Type Multi-year Fund (2023)
Single-year Grants (2022)
Section Domestic
Review Section Basic Section 61030: Intelligent informatics-related
Research Institution Kyoto University

Principal Investigator

Mao Zhuoyuan (毛 卓遠)  Kyoto University, Graduate School of Informatics, JSPS Research Fellow (DC2)

Project Period (FY) 2023-03-08 – 2024-03-31
Project Status Completed (Fiscal Year 2023)
Budget Amount
¥1,700,000 (Direct Cost: ¥1,700,000)
Fiscal Year 2023: ¥800,000 (Direct Cost: ¥800,000)
Fiscal Year 2022: ¥900,000 (Direct Cost: ¥900,000)
Keywords low-resource translation / sentence embedding / multilingual translation / multilingual embedding / model efficiency
Outline of Research at the Start

As globalization progresses, demand for automatic multilingual language understanding and translation is rising sharply across many settings.
We aim to tackle the technical barriers in low-resource machine translation (LMT) and to design a robust multilingual translation system that supports a large number of languages, including several low-resource languages.
(Low-resource language: a language for which there is not enough data to train a translation model.)

Outline of Annual Research Achievements

In the last fiscal year, we developed LEALLA, a state-of-the-art lightweight sentence embedding model. With this pre-trained sentence-level semantic model, new parallel corpora can be constructed more efficiently. We also analyzed the Transformer architecture for low-resource translation and published a paper at a top conference. Finally, we compiled all of this work into a thesis.
Overall, this research is a comprehensive exploration of multilingual representation learning, especially for low-resource translation, and addresses three challenges in this domain:
(1) To address the high computational demand that accompanies expanding the language coverage of multilingual models, we proposed an efficient and effective multilingual sentence embedding (MSE) model. We also introduced a new knowledge distillation method for training lightweight MSE models.
(2) To tackle data scarcity in low-resource languages, we proposed new pre-training objectives for low-resource NMT. Additionally, we introduced word-level contrastive learning for low-resource NMT that uses statistical word alignments. We also introduced AlignInstruct to improve the translation accuracy of large language models in low-resource languages.
(3) To address the limitations of the Transformer architecture for zero-shot NMT, we first proposed a new Transformer architecture that constructs interlingual representations on top of the Transformer encoder. We also comprehensively examined the effects of layer normalization in zero-shot NMT.
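
As an illustration of the word-level contrastive learning mentioned in (2), the following is a minimal sketch only. This record does not give the project's actual formulation, so the sketch assumes a generic InfoNCE-style objective in which each source word is pulled toward its aligned target word and pushed away from the other target words in the sentence. The function name `info_nce_aligned_words`, its parameters, and the temperature value are hypothetical.

```python
import numpy as np

def info_nce_aligned_words(src, tgt, alignments, temperature=0.1):
    """Generic InfoNCE loss over aligned word pairs (illustrative assumption,
    not the formulation used in the project).

    src: (n_src, d) array of source-word vectors.
    tgt: (n_tgt, d) array of target-word vectors.
    alignments: non-empty list of (i, j) pairs, meaning source word i is
        aligned to target word j (e.g. from a statistical word aligner).
    """
    # L2-normalize so that dot products are cosine similarities.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)

    # Similarity of every source word to every target word.
    logits = src @ tgt.T / temperature  # shape (n_src, n_tgt)

    # Numerically stable log-softmax over the target words.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-probability of the aligned target word.
    return float(-np.mean([log_probs[i, j] for i, j in alignments]))
```

Under these assumptions, the loss is small when aligned words already have similar vectors and large when they do not, so minimizing it encourages aligned words across languages to share a representation.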

Report

(2 results)
  • 2023 Annual Research Report
  • 2022 Annual Research Report

Research Products

(19 results)


Journal Article (3 results) (of which Int'l Joint Research: 2 results, Peer Reviewed: 3 results, Open Access: 2 results)
Presentation (13 results) (of which Int'l Joint Research: 12 results)
Funded Workshop (3 results)

  • [Journal Article] DiverSeg: Leveraging Diverse Segmentations with Cross-granularity Alignment for Neural Machine Translation 2024

    • Author(s)
      Song Haiyue, Mao Zhuoyuan, Dabre Raj, Chu Chenhui, Kurohashi Sadao
    • Journal Title

      Journal of Natural Language Processing

      Volume: 31 Issue: 1 Pages: 155-188

    • DOI

      10.5715/jnlp.31.155

    • ISSN
      1340-7619, 2185-8314
    • Related Report
      2023 Annual Research Report
    • Peer Reviewed / Open Access
  • [Journal Article] Linguistically Driven Multi-Task Pre-Training for Low-Resource Neural Machine Translation 2022

    • Author(s)
      Zhuoyuan Mao, Chenhui Chu and Sadao Kurohashi
    • Journal Title

      ACM Transactions on Asian and Low-Resource Language Information Processing

      Volume: 21 Issue: 4 (Article 68) Pages: 1-29

    • DOI

      10.1145/3491065

    • Related Report
      2022 Annual Research Report
    • Peer Reviewed / Open Access / Int'l Joint Research
  • [Journal Article] SCTB-V2: the 2nd Version of the Chinese Treebank in the Scientific Domain 2022

    • Author(s)
      Chenhui Chu, Zhuoyuan Mao, Toshiaki Nakazawa, Daisuke Kawahara and Sadao Kurohashi
    • Journal Title

      Language Resources and Evaluation

      Volume: Oct. 2022 Issue: 3 Pages: 1-15

    • DOI

      10.1007/s10579-022-09615-2

    • Related Report
      2022 Annual Research Report
    • Peer Reviewed / Int'l Joint Research
  • [Presentation] GPT-RE: In-context Learning for Relation Extraction using Large Language Models 2023

    • Author(s)
      Zhen Wan, Fei Cheng, Zhuoyuan Mao, Qianying Liu, Haiyue Song, Jiwei Li and Sadao Kurohashi
    • Organizer
      Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation 2023

    • Author(s)
      Zhuoyuan Mao, Raj Dabre, Qianying Liu, Haiyue Song, Chenhui Chu and Sadao Kurohashi
    • Organizer
      Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Variable-length Neural Interlingua Representations for Zero-shot Neural Machine Translation 2023

    • Author(s)
      Zhuoyuan Mao, Haiyue Song, Raj Dabre, Chenhui Chu and Sadao Kurohashi
    • Organizer
      Workshop on Multilingual, Multimodal and Multitask Language Generation (Multi3Generation)
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] LEALLA: Learning Lightweight Language-agnostic Sentence Embedding with Knowledge Distillation 2023

    • Author(s)
      Zhuoyuan Mao and Tetsuji Nakagawa
    • Organizer
      Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023)
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision 2023

    • Author(s)
      Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song and Sadao Kurohashi
    • Organizer
      Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023): Findings Volume
    • Related Report
      2023 Annual Research Report
    • Int'l Joint Research
  • [Presentation] When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation? 2022

    • Author(s)
      Zhuoyuan Mao, Chenhui Chu, Raj Dabre, Haiyue Song, Zhen Wan and Sadao Kurohashi
    • Organizer
      Findings of the Association for Computational Linguistics: NAACL 2022
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Seeking Diverse Reasoning Logic: Controlled Equation Expression Generation for Solving Math Word Problems 2022

    • Author(s)
      Yibin Shen, Qianying Liu, Zhuoyuan Mao, Zhen Wan, Fei Cheng and Sadao Kurohashi
    • Organizer
      Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation 2022

    • Author(s)
      Haiyue Song, Raj Dabre, Zhuoyuan Mao, Chenhui Chu and Sadao Kurohashi
    • Organizer
      Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Textual Enhanced Contrastive Learning for Solving Math Word Problems 2022

    • Author(s)
      Yibin Shen, Qianying Liu, Zhuoyuan Mao, Fei Cheng and Sadao Kurohashi
    • Organizer
      Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Rescue Implicit and Long-tail Cases: Nearest Neighbor Relation Extraction 2022

    • Author(s)
      Zhen Wan, Qianying Liu, Zhuoyuan Mao, Fei Cheng, Sadao Kurohashi and Jiwei Li
    • Organizer
      Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] LEALLA: Learning Lightweight Language-agnostic Sentence Embedding with Knowledge Distillation 2022

    • Author(s)
      Zhuoyuan Mao and Tetsuji Nakagawa
    • Organizer
      Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision 2022

    • Author(s)
      Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song and Sadao Kurohashi
    • Organizer
      Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
    • Related Report
      2022 Annual Research Report
    • Int'l Joint Research
  • [Presentation] Efficiently Learning Multilingual Sentence Representation for Cross-lingual Sentence Classification 2022

    • Author(s)
      Zhuoyuan Mao, Chenhui Chu and Sadao Kurohashi
    • Organizer
      The 29th Annual Meeting of the Association for Natural Language Processing (言語処理学会 第29回年次大会)
    • Related Report
      2022 Annual Research Report
  • [Funded Workshop] Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023) 2023

    • Related Report
      2023 Annual Research Report
  • [Funded Workshop] The 24th Annual Conference of The European Association for Machine Translation 2023

    • Related Report
      2023 Annual Research Report
  • [Funded Workshop] Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 2022

    • Related Report
      2022 Annual Research Report


Published: 2022-04-28   Modified: 2024-12-25  
