Project/Area Number |
21K12038
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Multi-year Fund |
Application Section | General |
Review Section |
Basic Section 61030: Intelligent informatics-related
|
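Research Institution | Waseda University |
Principal Investigator |
LEPAGE YVES, Waseda University, Faculty of Science and Engineering (Graduate School and Research Center of Information, Production and Systems), Professor (70573608)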
|
Project Period (FY) |
2021-04-01 – 2024-03-31
|
Project Status |
Granted (Fiscal Year 2022)
|
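Budget Amount *note |
¥4,030,000 (Direct Cost: ¥3,100,000, Indirect Cost: ¥930,000)
Fiscal Year 2023: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
Fiscal Year 2022: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Fiscal Year 2021: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000)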
|
Keywords | natural language processing / analogy / embedding representations / analogy datasets / algorithms / deep learning |
Outline of Research at the Start |
The most important recent breakthrough in Natural Language Processing (NLP) is the vector representation of words or parts of sentences. To assess the quality of vector representations of words, analogy test sets are used (France : Paris :: Japan : x => x = Tokyo). Up to now, such data sets have not been produced automatically. This research will study, explore and release theoretically well-founded methods to automatically extract analogy test sets, not only between words but also between parts of sentences and, ideally, for any language.
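The analogy test mentioned above (France : Paris :: Japan : x) is commonly solved with vector arithmetic: compute b - a + c and return the nearest remaining word. The following is a minimal sketch of that idea only. The two-dimensional vectors in `TOY_VECTORS` are invented for illustration and do not come from any real embedding space, and the function names are hypothetical rather than taken from the project's programs.

```python
import math

# Toy 2-dimensional vectors, invented for illustration only (an assumption,
# not data from the project or from any trained embedding model).
TOY_VECTORS = {
    "France": (1.0, 0.0),
    "Paris": (1.0, 1.0),
    "Japan": (3.0, 0.0),
    "Tokyo": (3.0, 1.0),
    "Berlin": (5.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(a, b, c, vectors):
    """Return the word x closest (by cosine) to b - a + c, excluding a, b, c."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = solve_analogy("France", "Paris", "Japan", TOY_VECTORS)
print(answer)
```

With these toy vectors the target b - a + c coincides exactly with the vector for "Tokyo"; in a real embedding space the match is only approximate, which is why a nearest-neighbour search is used.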
|
Outline of Annual Research Achievements |
During the second year, work on tasks (a) to (c) was pursued in parallel. (a) Two approaches were tried to cast strings into vector representations: the first directly used Parikh vector representations, and the second used a neural network with one hidden layer. Recall and precision were measured on various data. (b) Several directions were explored. (b.1) A series of experiments approximating real-valued vectors by integer-valued vectors was run on several analogy test sets in several languages, using the accelerated version of the programs implemented during the first fiscal year. No parallelogram representing an analogy between vectors could be discovered in any of the settings. This result has been published at an international conference. (b.2) Words from word analogy test sets were cast into their definitions, i.e., sentences. The definitions, with the analogical structure induced by the word analogies, were used to fine-tune a sentence embedding space with contrastive learning. The fine-tuned spaces delivered better performance in semantic similarity tasks. (b.3) Programs were written to automatically extract series of analogies from a subspace around a given word. Preliminary experiments were run on classical examples. The analogies obtained are almost always formal, although they originate from an embedding space built using the distributional hypothesis. (c) Parallelisation of the programs is considered to have been completed in the first fiscal year.
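To make the notions in (a) and (b.1) concrete, the sketch below builds Parikh vectors (per-character occurrence counts) for strings and tests whether four integer vectors form a parallelogram, i.e. whether b - a equals d - c in every component. It is only an illustration under stated assumptions: the report does not say how real-valued vectors were approximated by integer-valued ones, so `round_to_integers` (scale, then round) is a guess, and all function names are hypothetical, not those of the project's C programs.

```python
from collections import Counter


def parikh_vector(word, alphabet):
    """Parikh vector of a string: number of occurrences of each alphabet symbol."""
    counts = Counter(word)
    return tuple(counts[ch] for ch in alphabet)


def is_parallelogram(a, b, c, d):
    """True if b - a == d - c in every component (exact integer equality)."""
    return all(vb - va == vd - vc for va, vb, vc, vd in zip(a, b, c, d))


def round_to_integers(vector, scale=1.0):
    """Assumed approximation scheme: multiply by a scale factor, then round."""
    return tuple(round(x * scale) for x in vector)


words = ["walk", "walked", "talk", "talked"]
alphabet = sorted(set("".join(words)))
a, b, c, d = (parikh_vector(w, alphabet) for w in words)

# walk : walked :: talk : talked adds the same characters on both sides.
print(is_parallelogram(a, b, c, d))

# Replacing the last term breaks the equality of differences.
print(is_parallelogram(a, b, c, parikh_vector("talks", alphabet)))
```

On Parikh vectors the parallelogram test only compares character counts, so it should be read as a necessary condition for a formal analogy between strings rather than a sufficient one. With real-valued embeddings, exact equality almost never holds, which is presumably why some integer approximation is needed before such a test can be applied.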
|
Current Status of Research Progress (Category) |
3: Progress in research has been slightly delayed.
Reason
(a) The fact that no analogy can be found with existing tools was explored further: several analogy test sets, several languages, and several word embedding spaces were used. A paper on these results has been published at an international conference. (b) Although various directions were explored, one of them with positive outcomes (fine-tuning a sentence embedding space by relying on word analogies), the existing C programs still need to be modified. Preliminary work on re-engineering the C programs has started, in order to identify the places where the use of real values, as opposed to integers, entails modifications. (c) is considered finished, as no new parallelisation could be introduced beyond what was done in the first fiscal year.
|
Strategy for Future Research Activity |
In the third fiscal year, work on tasks (a) and (b) will continue. New computing power in the form of a GPU machine has been acquired. The acquisition of a new GPU card for this machine will be considered, if the budget permits. Work on the problem of converting programs that operate on integer values so that they operate on real values will continue. Work in the third fiscal year should benefit from the work done during the second fiscal year with various existing analogy test sets in various languages.
|