Project/Area Number |
22KJ1724
|
Project/Area Number (Other) |
21J23124 (2021-2022)
|
Research Category |
Grant-in-Aid for JSPS Fellows
|
Allocation Type | Multi-year Fund (2023) / Single-year Grants (2021-2022) |
Section | Domestic |
Review Section |
Basic Section 61030:Intelligent informatics-related
|
Research Institution | National Institute of Information and Communications Technology (2023) / Kyoto University (2021-2022) |
Principal Investigator |
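Song Haiyue (宋 海越), National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Technical Researcher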
|
Project Period (FY) |
2023-03-08 – 2024-03-31
|
Project Status |
Completed (Fiscal Year 2023)
|
Budget Amount |
¥2,200,000 (Direct Cost: ¥2,200,000)
Fiscal Year 2023: ¥700,000 (Direct Cost: ¥700,000)
Fiscal Year 2022: ¥700,000 (Direct Cost: ¥700,000)
Fiscal Year 2021: ¥800,000 (Direct Cost: ¥800,000)
|
Keywords | machine translation / low-resource languages / subword segmentation / subword encoding / decoding algorithm / corpora creation / ChatGPT / Parallel corpus creation / Pre-training / Data selection |
Outline of Research at the Start |
We focus on improving neural machine translation quality by leveraging large language models such as ChatGPT (the current version is GPT-4). We will first evaluate the translation ability of the current GPT-4 model and identify its main problems on the translation task. We will then improve the GPT-4-based method through better prompts, for example by providing similar examples. We also plan to fine-tune our own GPT model for machine translation, starting from open-source models such as LLaMA. In addition, we will continue to use better subword segmentation in the neural machine translation model.
|
Outline of Annual Research Achievements |
Our research focused on enhancing machine translation in low-resource scenarios, such as translation between Asian languages and English and translation in specific domains such as education. To achieve this, we 1) created bilingual corpora for the low-resource domain, mainly in the first year, and 2) optimized the subword segmentation information in the encoding phase (second year) and in the decoding phase (last year). In terms of publications, 3 first-authored journal papers and 1 conference paper were published or submitted during the last year. Over the three years, the project produced 4 journal papers and 9 international conference papers, including co-authored papers. Additionally, one patent application is underway. This research has significantly improved translation quality in low-resource scenarios: in our experiments, the BLEU score improved by more than 3 points. Low-resource translation systems are indispensable for cross-cultural communication at international events such as EXPO 2025. Our approach makes such systems more practical for participants who speak low-resource languages.
|