2016 Fiscal Year Research-status Report

Linguistic productivity: rapid extraction of effective analogical clusters and their evaluation in statistical machine translation

Research Project

Project/Area Number	15K00317
Research Institution	Waseda University
Principal Investigator	LEPAGE YVES Waseda University, Faculty of Science and Engineering (Graduate School of Information, Production and Systems / Center), Professor (70573608)
Project Period (FY)	2015-04-01 – 2018-03-31
Keywords	Natural language processing / Artificial intelligence / Data structures / Morphologically rich languages / Chinese and Japanese
Outline of Annual Research Achievements	Analogical clusters are series of pairs of words or word sequences such as: `invariable` : `lack of variability` :: `invisible` : `lack of visibility` :: ... :: `intractable` : `lack of tractability`. Given `incredible` : x, x can be deduced automatically with high confidence: x = `lack of credibility`. Given `inedible` : x, `lack of edibility` can be created, but with less confidence, as it is rare in English. Given `in double` : x, the production of the ridiculous word `doubility` should be barred. The goals of this research are: How can analogical clusters be collected automatically and rapidly from data? How can the reliability of new words or word sequences be estimated? Can such new word sequences be useful in machine translation? For Chinese--Japanese machine translation, redesigning our production pipeline and eliminating unpromising short word sequences made the production of new short sentences in Japanese or Chinese 50 times faster. Results in Chinese-to-Japanese translation using quasi-parallel data produced with analogical clusters have been published in a scientific journal. Analogical clusters have been produced for various languages using various corpora. A large amount of data has been released on our web site. A method to reduce numerous individual analogical equations to one massive analogical equation has been proposed and implemented. Results on comparing the morphological richness of languages and on predicting new words have been published as workshop and conference papers.
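The deduction of x from one pair of a cluster can be illustrated with a minimal sketch. It is not the solver used in the project: the function name `solve_analogy` and its prefix/suffix strategy are assumptions made for illustration only, and it handles just the simple case where the first and third terms differ in a single middle segment that occurs exactly once in the second term.

```python
def solve_analogy(a, b, c):
    """Solve a : b :: c : x for the simple case where a and c share a
    prefix and a suffix and differ only in one middle segment.
    Illustrative simplification, not the project's actual algorithm."""
    # Length of the longest common prefix of a and c.
    p = 0
    while p < min(len(a), len(c)) and a[p] == c[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    s = 0
    while s < min(len(a), len(c)) - p and a[len(a) - 1 - s] == c[len(c) - 1 - s]:
        s += 1
    a_mid = a[p:len(a) - s]
    c_mid = c[p:len(c) - s]
    # Give up unless the differing segment of a occurs exactly once in b.
    if not a_mid or b.count(a_mid) != 1:
        return None
    return b.replace(a_mid, c_mid)


print(solve_analogy("invariable", "lack of variability", "incredible"))
print(solve_analogy("invariable", "lack of variability", "inedible"))
```

The sketch carries no notion of confidence: it would just as readily build a form containing `doubility` from `in double`, which is the kind of output that a reliability estimate is meant to bar.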
Current Status of Research Progress	2: Research has progressed on the whole more than originally planned. Reason: The project is now on schedule, and even ahead of schedule in some parts. Promising new tracks which were not scheduled have also been explored. Making up for delay: Analogical clusters for all n-grams (from 1 to 6) in 11 languages of the Europarl corpus have been uploaded to our web server (http://lepage-lab.ips.waseda.ac.jp/ > Projects > Kakenhi 15K00317 > Experimental results). This had been delayed in the first year. Task finished in advance: All the work on applying analogical clusters to improve statistical machine translation, scheduled for the third year, was done during the second year. Very significant improvements in Chinese--Japanese machine translation obtained by using quasi-parallel corpora produced with analogical clusters have been reported in a scientific journal. Continuous development and exploration of new tracks: The programs for the automatic extraction of clusters implemented during the first year were improved during the second year. The reduction of numerous individual analogical equations to one massive analogical equation has been implemented on schedule but not yet extensively tested. The automatic production of analogical grids is an additional technique which has been presented in two papers. It was not on our schedule, but it will contribute to our work during the last year on estimating the probabilities of new word sequences. Third year on schedule: The work on estimating the probability of new sequences of words using analogical clusters or grids is scheduled for the third year. It started during the second year.
Strategy for Future Research Activity	One approach was to reduce analogical clusters to only a few representative ratios. We use the notion of the median string of a set of strings for representativeness. This is not a true pre-compilation, but solving a small number of analogical equations has been made much faster. The reduction of numerous individual analogical equations to one massive analogical equation has been implemented on schedule but not yet extensively tested. It uses a special compact representation. This technique, which partially relies on maximum entropy and on solving the assignment problem, will be reported in coming workshops or conferences. The third year will concentrate on the estimation of probabilities of new words inside analogical clusters or grids. Filling holes in an analogical grid will be safer than generating a word or a word sequence by solving one analogical equation with one analogical cluster, because more information from more clusters will contribute to the generation. This was not announced in the plan, but it is a very promising help in the work on assigning more exact probability scores to newly generated sequences. We intend to participate in the SIGMORPHON campaign 2017 to test our use of analogical clusters and, if possible, the resolution of massive analogical equations. This also was not on our schedule, but it offers a way to compare our techniques with those of other international research centres.
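The notion of a median string mentioned above can be sketched as follows. This is only an approximation under stated assumptions: it computes the set median (the member of the set with the smallest total edit distance to all the others), whereas a true median string need not belong to the set. The function names `levenshtein` and `set_median` are hypothetical, and the report does not say which distance or which variant of the median was actually used.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def set_median(strings):
    """Return the member of `strings` that minimises the sum of edit
    distances to all members (ties go to the first such member)."""
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))


print(set_median(["walk", "walked", "walks", "talk"]))
```

Choosing a member of the set keeps the cost at a quadratic number of distance computations, which is one plausible reason to prefer it when only a few representatives are needed.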
Causes of Carryover	The amount to be used in the next fiscal year (56,810) comes from underuse of manpower (regular check of experiment machines, global server and network, see (2) in Usage plan).
Expenditure Plan for Carryover Budget	The main usage will be for a desktop computer (1), manpower (2) and travel (3). (1) A desktop computer for experiments will be bought. (2) Experiments will be run by 1 PhD student and master's students; 1 technician will perform regular inspections of the experiment machines, global server and network. (3) One participation (1 person / 1 week) in an international conference is scheduled for the presentation of results.
Remarks	Analogical clusters in 11 languages, for N-grams of different sizes, computed over all N-grams in the corresponding first thousand lines of the Europarl corpus, version 3.

Research Products
(8 results)


All Journal Article (2 results) (of which Peer Reviewed: 2 results, Open Access: 2 results, Acknowledgement Compliant: 2 results) Presentation (5 results) Remarks (1 results)

[Journal Article] Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese--Japanese machine translation (2017)
- Author(s)
  W. Yang, H. Shen, and Y. Lepage
- Journal Title
  Journal of Information Processing
  Volume: 25 Pages: 88--99
- DOI
  10.2197/ipsjjip.25.88
- Peer Reviewed / Open Access / Acknowledgement Compliant
[Journal Article] A method of generating translations of unseen n-grams by using proportional analogy (2016)
- Author(s)
  J. Luo and Y. Lepage
- Journal Title
  IEEJ Transactions in Electronics, Information and Systems
  Volume: 11(3) Pages: 325--330
- DOI
  10.1002/tee.22221
- Peer Reviewed / Open Access / Acknowledgement Compliant
[Presentation] Indonesian unseen words explained by form, morphology and distributional semantics at the same time (2017)
- Author(s)
  R. Fam and Y. Lepage
- Organizer
  Proceedings of the 23rd Annual Meeting of the Association for Natural Language Processing (NLP2017), pages 178--181.
- Place of Presentation
  University of Tsukuba
- Year and Date
  2017-03-14 – 2017-03-16
[Presentation] A study in explaining unseen words in Indonesian using analogical clusters (2017)
- Author(s)
  R. Fam and Y. Lepage
- Organizer
  In Proceedings of 15th International Conference on Computer Applications (ICCA 2017), pages 416--421.
- Place of Presentation
  Yangon, Myanmar
- Year and Date
  2017-02-16 – 2017-02-17
[Presentation] Production of analogical clusters between marker-based chunks in Chinese and Japanese (2016)
- Author(s)
  W. Yang, M. Gao, and Y. Lepage
- Organizer
  In Proceedings of the 10th International Collaboration Symposium on Information, Production and Systems (ISIPS 2016), pages 238--241.
- Place of Presentation
  Kitakyushu
- Year and Date
  2016-11-09 – 2016-11-11
[Presentation] Morphological predictability of unseen words using computational analogy (2016)
- Author(s)
  R. Fam and Y. Lepage
- Organizer
  Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-16), pages 51--60.
- Place of Presentation
  Atlanta, Georgia, USA.
- Year and Date
  2016-10-31 – 2016-10-31
[Presentation] Solving analogical equations between strings of symbols using neural networks (2016)
- Author(s)
  V. Kaveeta and Y. Lepage
- Organizer
  In Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-16), pages 67--76.
- Place of Presentation
  Atlanta, Georgia, USA.
- Year and Date
  2016-10-31 – 2016-10-31
[Remarks] Projects / Kakenhi 15K00317 / Experimental results
- URL
  http://lepage-lab.ips.waseda.ac.jp/index.php/2016-08-01-06-37-56/kakenhi-2/kakenhi-2-experiment-result
