Multilingual corpus construction and domain adaptation for low-resource machine translation

Research Project

Project/Area Number	22KJ1724
Project/Area Number (Other)	21J23124 (2021-2022)
Research Category	Grant-in-Aid for JSPS Fellows
Allocation Type	Multi-year Fund (2023) Single-year Grants (2021-2022)
Section	Domestic
Review Section	Basic Section 61030:Intelligent informatics-related
Research Institution	National Institute of Information and Communications Technology (2023) Kyoto University (2021-2022)
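Principal Investigator	Song Haiyue (宋海越), National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Technical Researcher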
Project Period (FY)	2023-03-08 – 2024-03-31
Project Status	Completed (Fiscal Year 2023)
Budget Amount	¥2,200,000 (Direct Cost: ¥2,200,000); Fiscal Year 2023: ¥700,000 (Direct Cost: ¥700,000); Fiscal Year 2022: ¥700,000 (Direct Cost: ¥700,000); Fiscal Year 2021: ¥800,000 (Direct Cost: ¥800,000)
Keywords	machine translation / low-resource languages / subword segmentation / subword encoding / decoding algorithm / corpora creation / ChatGPT / Machine translation / Parallel corpus creation / Pre-training / Data selection
Outline of Research at the Start	We focus on improving neural machine translation quality by leveraging large language models such as ChatGPT (the current version is GPT-4). We will first evaluate the current GPT-4 model on the translation task and identify its main problems. We will then improve the GPT-4-based method through better prompts, for example by providing similar examples. We also plan to fine-tune our own GPT model for machine translation, starting from open-source models such as LLaMA. In addition, we will continue to explore better subword segmentation in the neural machine translation model.
Outline of Annual Research Achievements	Our research focused on enhancing machine translation in low-resource scenarios, such as translation between Asian languages and English and translation in specific domains such as education. To achieve this, we proposed to 1) create bilingual corpora for the low-resource domain, mainly in the first year, and 2) optimize subword segmentation during the encoding phase in the second year and during the decoding phase in the last year. In terms of publications, 3 first-authored journal papers and 1 conference paper were published or submitted during the last year. Over the past three years, there have been 4 journal papers and 9 international conference papers, including co-authored papers. Additionally, one patent application is underway. This research improved translation quality in low-resource scenarios: in our experiments, BLEU scores improved by more than 3 points. Low-resource translation systems are indispensable for cross-cultural communication at international events such as EXPO 2025. Our approach makes such systems more practical for participants who speak low-resource languages.

Report

(3 results)

Research Products
(24 results)

All Int'l Joint Research (1 results) Journal Article (3 results) (of which Peer Reviewed: 3 results, Open Access: 3 results) Presentation (16 results) (of which Int'l Joint Research: 12 results) Remarks (3 results) Patent(Industrial Property Rights) (1 results)

[Int'l Joint Research] University of Cape Town (South Africa)
- Related Report
  2023 Annual Research Report
[Journal Article] DiverSeg: Leveraging Diverse Segmentations with Cross-granularity Alignment for Neural Machine Translation2024
- Author(s)
  Song Haiyue, Mao Zhuoyuan, Dabre Raj, Chu Chenhui, Kurohashi Sadao
- Journal Title
  Journal of Natural Language Processing
  Volume: 31 Issue: 1 Pages: 155-188
- DOI
  10.5715/jnlp.31.155
- ISSN
  1340-7619, 2185-8314
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Open Access
[Journal Article] SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation2023
- Author(s)
  Song Haiyue, Dabre Raj, Chu Chenhui, Kurohashi Sadao, Sumita Eiichiro
- Journal Title
  ACM Transactions on Asian and Low-Resource Language Information Processing
  Volume: 22 Issue: 8 Pages: 1-24
- DOI
  10.1145/3610611
- Related Report
  2023 Annual Research Report
- Peer Reviewed / Open Access
[Journal Article] Spatial Hierarchical Attention Network Based Video-guided Machine Translation2023
- Author(s)
  Weiqi Gu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
- Journal Title
  Journal of Information Processing
  Volume: 31
- Related Report
  2022 Annual Research Report
- Peer Reviewed / Open Access
[Presentation] SubMerge: Merging Equivalent Subword Tokenizations for Subword Regularized Models in Neural Machine Translation.2024
- Author(s)
  Haiyue Song, Francois Meyer, Raj Dabre, Hideki Tanaka, Chenhui Chu, and Sadao Kurohashi.
- Organizer
  The 25th Annual Conference of the European Association for Machine Translation (EAMT 2024)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Linguistically Motivated Neural Machine Translation.2024
- Author(s)
  Haiyue Song, Hour Kaing, and Raj Dabre.
- Organizer
  The 25th Annual Conference of the European Association for Machine Translation (EAMT 2024)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages.2024
- Author(s)
  Francois Meyer, Haiyue Song, Abhisek Chakrabarty, Jan Buys, Raj Dabre and Hideki Tanaka.
- Organizer
  The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks.2024
- Author(s)
  Yahui Fu, Haiyue Song, Tianyu Zhao, Tatsuya Kawahara.
- Organizer
  The 14th International Workshop on Spoken Dialogue Systems Technology (IWSDS2024)
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Robust Neural Machine Translation for Abugidas by Glyph Perturbation2024
- Author(s)
  Hour Kaing, Chenchen Ding, Haiyue Song, Jiannan Mao, Hideki Tanaka, and Masao Utiyama.
- Organizer
  The 30th Annual Meeting of the Association for Natural Language Processing (言語処理学会第30回年次大会)
- Related Report
  2023 Annual Research Report
[Presentation] GPT-RE: In-context Learning for Relation Extraction using Large Language Models.2023
- Author(s)
  Zhen Wan, Fei Cheng, Zhuoyuan Mao, Qianying Liu, Haiyue Song, Jiwei Li, Sadao Kurohashi.
- Organizer
  The 2023 Conference on Empirical Methods in Natural Language Processing
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation.2023
- Author(s)
  Zhuoyuan Mao, Raj Dabre, Qianying Liu, Haiyue Song, Chenhui Chu, and Sadao Kurohashi.
- Organizer
  The 61st Annual Meeting of the Association for Computational Linguistics
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Variable-length Neural Interlingua Representations for Zero-shot Neural Machine Translation.2023
- Author(s)
  Zhuoyuan Mao, Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi.
- Organizer
  Proceedings of the 1st International Workshop on Multilingual, Multimodal and Multitask Language Generation (Multi3Generation), held in conjunction with EAMT 2023
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision.2023
- Author(s)
  Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song and Sadao Kurohashi.
- Organizer
  The 17th Conference of the European Chapter of the Association for Computational Linguistics
- Related Report
  2023 Annual Research Report
- Int'l Joint Research
[Presentation] BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation2022
- Author(s)
  Haiyue Song, Raj Dabre, Zhuoyuan Mao, Chenhui Chu, and Sadao Kurohashi
- Organizer
  2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Large Pre-trained Language Models with Multilingual Prompt for Japanese Natural Language Tasks2022
- Author(s)
  Haiyue Song, Raj Dabre, Chenhui Chu and Sadao Kurohashi
- Organizer
  The 29th Annual Meeting of the Association for Natural Language Processing (言語処理学会第29回年次大会)
- Related Report
  2022 Annual Research Report
[Presentation] When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation?2022
- Author(s)
  Zhuoyuan Mao, Chenhui Chu, Raj Dabre, Haiyue Song, Zhen Wan, and Sadao Kurohashi
- Organizer
  Findings of the Association for Computational Linguistics: NAACL 2022
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision2022
- Author(s)
  Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song, Sadao Kurohashi
- Organizer
  17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023)
- Related Report
  2022 Annual Research Report
- Int'l Joint Research
[Presentation] Representative Data Selection for Sequence-to-Sequence Pre-training2022
- Author(s)
  Haiyue Song, Raj Dabre, Zhuoyuan Mao, Chenhui Chu, Sadao Kurohashi
- Organizer
  The 28th Annual Meeting of the Association for Natural Language Processing (言語処理学会第28回年次大会)
- Related Report
  2021 Annual Research Report
[Presentation] Improving Medical Relation Extraction with Distantly Supervised Pre-training2022
- Author(s)
  Zhen Wan, Fei Cheng, Qianying Liu, Zhuoyuan Mao, Haiyue Song, Sadao Kurohashi
- Organizer
  The 28th Annual Meeting of the Association for Natural Language Processing (言語処理学会第28回年次大会)
- Related Report
  2021 Annual Research Report
[Presentation] Video-guided Machine Translation with Spatial Hierarchical Attention Network2021
- Author(s)
  Weiqi Gu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
- Organizer
  ACL-IJCNLP 2021 Student Research Workshop
- Related Report
  2021 Annual Research Report
- Int'l Joint Research
[Remarks] Haiyue Song's Homepage
- URL
  https://shyyhs.github.io/
- Related Report
  2023 Annual Research Report
[Remarks] Language Media Processing Lab: List of Research Presentations (言語メディア研究室研究発表一覧)
- URL
  https://nlp.ist.i.kyoto-u.ac.jp/?%E7%A0%94%E7%A9%B6%E7%99%BA%E8%A1%A8%E4%B8%80%E8%A6%A7
- Related Report
  2023 Annual Research Report
[Remarks] Advanced Translation Technology Laboratory: Publications (先進的翻訳技術研究室論文)
- URL
  https://att-astrec.nict.go.jp/result/
- Related Report
  2023 Annual Research Report
[Patent(Industrial Property Rights)] BERTSeg: BERT Based Subword Segmentation2022
- Inventor(s)
  Song Haiyue (ソウカイエツ)
- Industrial Property Rights Holder
  National Institute of Information and Communications Technology (国立研究開発法人情報通信研究機構)
- Industrial Property Rights Type
  Patent
- Filing Date
  2022
- Related Report
  2022 Annual Research Report
