2023 Fiscal Year Annual Research Report
Low-Resource Machine Translation through Multilingual Corpus Construction and Domain Adaptation
Project/Area Number |
22KJ1724
|
Allocation Type | Multi-year Fund |
Research Institution | National Institute of Information and Communications Technology |
Principal Investigator |
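宋 海越 National Institute of Information and Communications Technology, Universal Communication Research Institute, Advanced Speech Translation Research and Development Promotion Center, Advanced Translation Technology Laboratory, Technical Researcher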
|
Project Period (FY) |
2023-03-08 – 2024-03-31
|
Keywords | machine translation / low-resource languages / subword segmentation / subword encoding / decoding algorithm / corpora creation |
Outline of Annual Research Achievements |
Our research aims to improve machine translation in low-resource scenarios, such as translation between Asian languages and English and translation in specific domains such as education. To this end, we proposed to 1) create bilingual corpora for low-resource domains, mainly in the first year, and 2) optimize subword segmentation during the encoding phase in the second year and during the decoding phase in the final year. In terms of publications, 3 first-authored journal papers and 1 conference paper were published or submitted during the last year. Over the past three years, the work has produced 4 journal papers and 9 international conference papers, including co-authored papers. In addition, one patent application is underway. In our experiments, translation quality in low-resource scenarios improved by more than 3 BLEU points. Low-resource translation systems are indispensable for cross-cultural communication at international events such as EXPO 2025, and our approach makes such systems more practical for participants who speak low-resource languages.
|