Project/Area Number |
22KJ1843
|
Project/Area Number (Other) |
22J13719 (2022)
|
Research Category |
Grant-in-Aid for JSPS Fellows
|
Allocation Type | Multi-year Fund (2023) / Single-year Grants (2022) |
Section | Domestic |
Review Section |
Basic Section 61030:Intelligent informatics-related
|
Research Institution | Kyoto University |
Principal Investigator |
毛 卓遠, Kyoto University, Graduate School of Informatics, JSPS Research Fellow (DC2)
|
Project Period (FY) |
2023-03-08 – 2024-03-31
|
Project Status |
Completed (Fiscal Year 2023)
|
Budget Amount |
¥1,700,000 (Direct Cost: ¥1,700,000)
Fiscal Year 2023: ¥800,000 (Direct Cost: ¥800,000)
Fiscal Year 2022: ¥900,000 (Direct Cost: ¥900,000)
|
Keywords | low-resource translation / sentence embedding / multilingual translation / multilingual embedding / model efficiency |
Outline of Research at the Start |
With the progress of globalization, the demand for automatic multilingual language understanding and translation is increasing dramatically in many scenarios. We aim to tackle the technical barriers in low-resource machine translation and to design a robust multilingual translation system that supports a large number of languages, including several low-resource languages (i.e., languages for which we do not have sufficient data resources to train translation models).
|
Outline of Annual Research Achievements |
In the last fiscal year, we developed a state-of-the-art lightweight sentence embedding model, LEALLA. With this pre-trained sentence-level semantic model, new parallel corpora can be constructed more efficiently (a minimal illustration of such embedding-based mining is sketched below). We also analyzed the Transformer architecture for low-resource translation and published a paper at a top conference. Finally, we compiled all of this work into a thesis. Overall, this research is a comprehensive exploration of multilingual representation learning, especially for low-resource translation, addressing three identified challenges in this domain: (1) To address the high computational demand that accompanies expanding the language coverage of multilingual models, we proposed an efficient and effective multilingual sentence embedding (MSE) model and introduced a new knowledge distillation method for training lightweight MSE models. (2) To tackle the data scarcity of low-resource languages, we proposed new pre-training objectives for low-resource NMT, introduced word-level contrastive learning for low-resource NMT utilizing statistical word alignments, and introduced AlignInstruct to enhance translation accuracy in low-resource languages for large language models. (3) To address the limitations of the Transformer architecture for zero-shot NMT, we proposed a new Transformer architecture that constructs interlingual representations on top of the Transformer encoder, and we comprehensively examined the effects of layer normalization in zero-shot NMT.
|
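Note: the following Python sketch illustrates, under stated assumptions, how a lightweight multilingual sentence embedding model of the kind described above can be used to mine parallel sentence pairs by cosine similarity. The encode function is a hypothetical placeholder for any encoder that maps a list of sentences to one embedding vector per sentence; it is not LEALLA's actual interface, and the greedy thresholding shown here is only one simple mining strategy (margin-based scoring is a common refinement).

import numpy as np

def cosine_similarity_matrix(src_emb: np.ndarray, tgt_emb: np.ndarray) -> np.ndarray:
    # Pairwise cosine similarity between two (n, d) and (m, d) embedding matrices.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return src @ tgt.T

def mine_pairs(src_sents, tgt_sents, encode, threshold=0.8):
    # Greedily align each source sentence to its best-scoring target sentence,
    # keeping only pairs whose cosine similarity exceeds the threshold.
    # `encode` is any user-supplied function returning an (n, d) numpy array
    # of sentence embeddings (a hypothetical stand-in for a multilingual
    # sentence embedding model).
    sims = cosine_similarity_matrix(encode(src_sents), encode(tgt_sents))
    pairs = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs

# Example usage (with a user-supplied encoder):
# pairs = mine_pairs(english_sentences, swahili_sentences, my_encoder_fn)

The design choice here is deliberately minimal: normalized dot products and a fixed threshold keep the example self-contained, while in practice corpus-level mining typically adds margin-based scoring and nearest-neighbor search to scale to large monolingual collections.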