FY2016 Research-status Report

Linguistic productivity: fast extraction of effective analogical clusters and their evaluation in statistical machine translation

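Research Project

Project/Area Number	15K00317
Research Institution	Waseda University
Principal Investigator	LEPAGE YVES, Waseda University, Faculty of Science and Engineering (Graduate School of Information, Production and Systems / Center), Professor (70573608)
Project Period (FY)	2015-04-01 – 2018-03-31
Keywords	natural language processing / artificial intelligence / data structures / morphologically rich languages / Chinese, Japanese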
Outline of Annual Research Achievements	Analogical clusters are series of pairs of words or word sequences such as: `invariable` : `lack of variability` :: `invisible` : `lack of visibility` :: ... :: `intractable` : `lack of tractability`. Given `incredible` : x, x can be deduced automatically with high confidence: x = `lack of credibility`. Given `inedible` : x, `lack of edibility` can be created, but with less confidence, as it is rare in English. Given `in double` : x, the production of the ridiculous word `doubility` should be barred. This research addresses three questions: How can analogical clusters be collected automatically and rapidly from data? How can the reliability of new words or word sequences be estimated? Can such new word sequences be useful in machine translation? For Chinese--Japanese machine translation, redesigning our production pipeline and eliminating unpromising short word sequences made the production of new short sentences in Japanese or Chinese 50 times as fast. Results in Chinese-to-Japanese translation using quasi-parallel data produced with analogical clusters have been published in a scientific journal. Analogical clusters have been produced for various languages using various corpora. A large amount of data has been released on our web site. A method to reduce numerous individual analogical equations to one massive analogical equation has been proposed and implemented. Results on comparing the morphological richness of languages and on predicting new words have been published as workshop and conference papers.
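The way `incredible` : x is resolved against the cluster above can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which the two sides of a pair share one stem and differ only by a prefix and a suffix; this is not the project's actual solver, and the function names `split_around_core` and `solve_analogy` are hypothetical.

```python
from difflib import SequenceMatcher


def split_around_core(a: str, b: str):
    """Split a and b around their longest common substring (the shared stem).

    Returns ((prefix_a, suffix_a), (prefix_b, suffix_b)), or None if the
    two strings have nothing in common.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size == 0:
        return None
    return (a[:m.a], a[m.a + m.size:]), (b[:m.b], b[m.b + m.size:])


def solve_analogy(a: str, b: str, c: str):
    """Return x such that a : b :: c : x under affix substitution, or None.

    Simplifying assumption: c must carry the same prefix and suffix as a;
    they are then swapped for those observed in b.
    """
    parts = split_around_core(a, b)
    if parts is None:
        return None
    (pre_a, suf_a), (pre_b, suf_b) = parts
    if len(c) < len(pre_a) + len(suf_a):
        return None
    if not (c.startswith(pre_a) and c.endswith(suf_a)):
        return None
    stem = c[len(pre_a):len(c) - len(suf_a)]
    return pre_b + stem + suf_b


if __name__ == "__main__":
    pair = ("invariable", "lack of variability")
    for word in ("incredible", "inedible", "in double", "unseen"):
        print(word, "->", solve_analogy(*pair, word))
```

Under these assumptions the sketch yields `lack of credibility` for `incredible` and `lack of edibility` for `inedible`, but it also produces a form ending in `doubility` for `in double`. A purely formal solver cannot tell the three cases apart, which is why reliability has to be estimated separately.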
Current Status of Research Progress (Category)	2: Progressing generally smoothly. Reason: The project is now on schedule, and even ahead of schedule in some parts. Promising new tracks which were not scheduled have also been explored. Making up for a delay: Analogical clusters for all n-grams (from 1 to 6) in 11 languages of the Europarl corpus have been uploaded to our web server (http://lepage-lab.ips.waseda.ac.jp/ > Projects > Kakenhi 15K00317 > Experimental results). This task had been delayed in the first year. Task finished in advance: All the work on applying analogical clusters to improve statistical machine translation, scheduled for the third year, was done during the second year. Very significant improvements in Chinese--Japanese machine translation obtained with quasi-parallel corpora produced using analogical clusters have been reported in a scientific journal. Continuous development and exploration of new tracks: The programs for the automatic extraction of clusters implemented during the first year were improved during the second year. The reduction of numerous individual analogical equations to one massive analogical equation has been implemented on schedule but not yet extensively tested. The automatic production of analogical grids is an additional technique which has been presented in two papers. It was not on our schedule, but it will contribute to our work during the last year on estimating the probabilities of new word sequences. Third year on schedule: The work on estimating the probability of new sequences of words using analogical clusters or grids is scheduled for the third year. It started during the second year.
Strategy for Future Research Activity	One approach was to reduce analogical clusters to only a few representative ratios. We use the notion of the median string of a set of strings to measure representativeness. This is not a true pre-compilation, but solving a small number of analogical equations has been made much faster. The reduction of numerous individual analogical equations to one massive analogical equation has been implemented on schedule but not yet extensively tested. It uses a special compact representation. This technique, which partially relies on maximum entropy and on solving an assignment problem, will be reported in coming workshops or conferences. The third year will concentrate on the estimation of probabilities of new words inside analogical clusters or grids. Filling holes in an analogical grid will be safer than generating a word or a word sequence by solving one analogical equation with one analogical cluster, because more information from more clusters will contribute to the generation. This was not announced in the plan, but it is a very promising aid in assigning more exact probability scores to newly generated sequences. We intend to participate in the SIGMORPHON campaign 2017 to test our use of analogical clusters and, if possible, the resolution of massive analogical equations. This also was not on our schedule, but it offers a way to compare our techniques with those of other international research centres.
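The notion of a median string used for representativeness can be sketched as follows. The sketch assumes the set median (the member of the set with the smallest total edit distance to all other members) and plain Levenshtein distance; the report does not say which variant of median string or which distance is actually used, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def set_median(strings):
    """Return the member of strings minimising the sum of distances to all members.

    Ties are resolved in favour of the earliest candidate.
    """
    if not strings:
        raise ValueError("set_median() needs at least one string")
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))


if __name__ == "__main__":
    print(set_median(["walk", "walks", "walked", "talk"]))
```

Restricting candidates to members of the set keeps the computation quadratic in the number of strings, whereas searching over all possible strings for a generalised median is much more expensive.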
Reason for Carryover to Next Fiscal Year	The amount carried over to the next fiscal year (56,810) comes from underuse of manpower (regular checks of experiment machines, global server and network; see (2) in the usage plan).
Usage Plan for Carryover Amount	The budget will mainly be spent on a desktop computer (1), manpower (2) and travel (3). (1) A desktop computer for experiments will be bought. (2) Experiments will be run by 1 PhD student and master's students; 1 technician will regularly inspect the experiment machines, global server and network. (3) One participation (1 person / 1 week) in an international conference is scheduled to present results.
Remarks	Analogical clusters in 11 languages, for N-grams of different sizes, computed over all N-grams in the corresponding first thousand lines of version 3 of the Europarl corpus.

Research Products
(8 results)

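All: Journal Article (2 results) (of which Peer Reviewed: 2 results, Open Access: 2 results, Acknowledgement Compliant: 2 results), Presentation (5 results), Remarks (1 result)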

[Journal Article] Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese--Japanese machine translation (2017)
- Author(s)/Presenter(s)
  W. Yang, H. Shen, and Y. Lepage
- Journal Title
  Journal of Information Processing
  Volume: 25 Pages: 88--99
- DOI
  10.2197/ipsjjip.25.88
- Peer Reviewed / Open Access / Acknowledgement Compliant
[Journal Article] A method of generating translations of unseen n-grams by using proportional analogy (2016)
- Author(s)/Presenter(s)
  J. Luo and Y. Lepage
- Journal Title
  IEEJ Transactions in Electronics, Information and Systems
  Volume: 11(3) Pages: 325--330
- DOI
  10.1002/tee.22221
- Peer Reviewed / Open Access / Acknowledgement Compliant
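[Presentation] Indonesian unseen words explained by form, morphology and distributional semantics at the same time (2017)
- Author(s)/Presenter(s)
  R. Fam and Y. Lepage
- Conference Name
  Proceedings of the 23rd Annual Meeting of the Association for Natural Language Processing (NLP2017), pages 178--181
- Place of Presentation
  University of Tsukuba
- Date
  2017-03-14 – 2017-03-16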
[Presentation] A study in explaining unseen words in Indonesian using analogical clusters (2017)
- Author(s)/Presenter(s)
  R. Fam and Y. Lepage
- Conference Name
  Proceedings of the 15th International Conference on Computer Applications (ICCA 2017), pages 416--421
- Place of Presentation
  Yangon, Myanmar
- Date
  2017-02-16 – 2017-02-17
[Presentation] Production of analogical clusters between marker-based chunks in Chinese and Japanese (2016)
- Author(s)/Presenter(s)
  W. Yang, M. Gao, and Y. Lepage
- Conference Name
  Proceedings of the 10th International Collaboration Symposium on Information, Production and Systems (ISIPS 2016), pages 238--241
- Place of Presentation
  Kitakyushu
- Date
  2016-11-09 – 2016-11-11
[Presentation] Morphological predictability of unseen words using computational analogy (2016)
- Author(s)/Presenter(s)
  R. Fam and Y. Lepage
- Conference Name
  Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-16), pages 51--60
- Place of Presentation
  Atlanta, Georgia, USA
- Date
  2016-10-31 – 2016-10-31
[Presentation] Solving analogical equations between strings of symbols using neural networks (2016)
- Author(s)/Presenter(s)
  V. Kaveeta and Y. Lepage
- Conference Name
  Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-16), pages 67--76
- Place of Presentation
  Atlanta, Georgia, USA
- Date
  2016-10-31 – 2016-10-31
[Remarks] Projects / Kakenhi 15K00317 / Experimental results
- URL
  http://lepage-lab.ips.waseda.ac.jp/index.php/2016-08-01-06-37-56/kakenhi-2/kakenhi-2-experiment-result
