Theoretically founded algorithms for the automatic production of analogy tests in NLP
Project/Area Number |
21K12038
|
Research Category |
Grant-in-Aid for Scientific Research (C)
|
Allocation Type | Multi-year Fund |
Section | General |
Review Section |
Basic Section 61030:Intelligent informatics-related
|
Research Institution | Waseda University |
Principal Investigator |
LEPAGE YVES Waseda University, Faculty of Science and Engineering (Graduate School of Information, Production and Systems / Center), Professor (70573608)
|
Project Period (FY) |
2021-04-01 – 2024-03-31
|
Project Status |
Granted (Fiscal Year 2022)
|
Budget Amount |
¥4,030,000 (Direct Cost: ¥3,100,000, Indirect Cost: ¥930,000)
Fiscal Year 2023: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
Fiscal Year 2022: ¥1,820,000 (Direct Cost: ¥1,400,000, Indirect Cost: ¥420,000)
Fiscal Year 2021: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000)
|
Keywords | Natural language processing / Analogical relations / Embedding representations / Analogy datasets / Algorithms / Deep learning |
Outline of Research at the Start |
The most important breakthrough in recent Natural Language Processing (NLP) is the vector representation of words or parts of sentences. To assess the quality of vector representations of words, analogy test sets are used (France : Paris :: Japan : x => x = Tokyo). Until now, the production of such data sets has not been automatic. This research will study, explore and release theoretically well-founded methods to automatically extract analogy test sets, not only between words but also between parts of sentences and, it is hoped, for any language.
|
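As an illustration of how an analogy test such as France : Paris :: Japan : x is typically scored against word vectors, here is a minimal sketch. It assumes the common vector-offset formulation (the answer is the word closest to b - a + c by cosine similarity), which is not necessarily the evaluation procedure used in this project. The two-dimensional vectors and the names `EMBEDDINGS`, `cosine` and `solve_analogy` are invented for the example.

```python
import math

# Toy 2-D embeddings, invented for illustration only (not from a real model).
EMBEDDINGS = {
    "France": (1.0, 0.0),
    "Paris": (1.0, 1.0),
    "Japan": (3.0, 0.0),
    "Tokyo": (3.0, 1.0),
    "Kyoto": (2.0, 2.5),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word x that best completes a : b :: c : x.

    Assumption: the vector-offset method, i.e. x maximises
    cos(x, b - a + c), with the three query words excluded.
    """
    va, vb, vc = embeddings[a], embeddings[b], embeddings[c]
    target = tuple(y - x + z for x, y, z in zip(va, vb, vc))
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


print(solve_analogy("France", "Paris", "Japan"))
```

An analogy test set is then a list of such quadruples, and the score of an embedding space is the proportion of quadruples for which the returned word equals the expected fourth term.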
Outline of Annual Research Achievements |
During the second year, work on tasks (a) to (c) was pursued in parallel. (a) Two approaches were tried to cast vector representations of strings: the first directly used Parikh vector representations, and the second used a neural network with one hidden layer. Recall and precision were measured on various data. (b) Several directions were explored. (b.1) A series of experiments was run to approximate real-valued vectors by integer-valued vectors. Several analogy test sets in several languages were used, together with the accelerated version of the programs implemented during the first fiscal year. No parallelogram representing an analogy between vectors could be discovered in any of the settings. This result has been published at an international conference. (b.2) Work was done on casting words from word analogy test sets into their definitions, i.e., sentences. The definitions, with the analogical structure induced by the word analogies, were used to fine-tune a sentence embedding space with contrastive learning. The fine-tuned spaces delivered better performance in semantic similarity tasks. (b.3) Programs were written to automatically extract series of analogies from a subspace around a given word. Preliminary experiments were run on classical examples. The analogies obtained are almost always formal, although they originate from an embedding space built using the distributional hypothesis. (c) Parallelisation of the programs is considered to have been finished in the first fiscal year.
|
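The notions of Parikh vector and of a parallelogram between vectors mentioned in (a) and (b.1) can be sketched as follows. This is a hedged illustration, not the project's C programs: it assumes a Parikh vector is the count of each character in a string, and that A : B :: C : D forms a parallelogram when B - A = D - C in every dimension. Plain rounding stands in for the approximation of real values by integers, whose actual method is not described in the report. The function names are hypothetical.

```python
from collections import Counter


def parikh_vector(word):
    """Map each character of the string to its number of occurrences."""
    return Counter(word)


def is_parallelogram(a, b, c, d):
    """True when b - a == d - c on every dimension of the four vectors.

    Vectors are mappings from dimension to value; a missing key counts as 0.
    """
    keys = set(a) | set(b) | set(c) | set(d)
    return all(
        b.get(k, 0) - a.get(k, 0) == d.get(k, 0) - c.get(k, 0) for k in keys
    )


def round_to_integers(vector):
    """Assumption: approximate a real-valued vector by rounding each value."""
    return {k: round(v) for k, v in vector.items()}


words = ["walk", "walked", "talk", "talked"]
print(is_parallelogram(*(parikh_vector(w) for w in words)))
```

Because character counts ignore character order, this check is only a coarse test on strings. With real-valued embeddings, exact equality almost never holds, which is one reason to move to integer-valued vectors before searching for parallelograms.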
Current Status of Research Progress |
3: Progress in research has been slightly delayed.
Reason
(a) The fact that no analogy can be found with existing tools has been further explored: several analogy test sets, several languages, and several word embedding spaces have been used. A paper on these results has been published at an international conference. (b) Although various directions were explored, one with positive outcomes (fine-tuning of a sentence embedding space by relying on word analogies), the existing C programs still need to be modified. Preliminary work on re-engineering the C programs has started, to identify the places where the use of real values, as opposed to integers, entails modifications. (c) is considered finished, as no new parallelisation could be introduced beyond that made in the first fiscal year.
|
Strategy for Future Research Activity |
In the third fiscal year, work on tasks (a) and (b) will continue. New computing power in the form of a GPU machine has been acquired. The acquisition of a new GPU card for this machine will be considered, if the budget permits. Work on the problem of converting programs that work on integer values into programs that work on real values will continue. Work in the third fiscal year should benefit from the work done during the second fiscal year with various existing analogy test sets in various languages.
|
Report
(2 results)
Research Products
(8 results)
[Presentation] Analogy on text data (2022)
Author(s)
Yves Lepage
Organizer
Invited talk at the workshop Interaction between Analogical Reasoning and Machine Learning (IARML 2022), 23rd of July 2022.
Related Report
Int'l Joint Research / Invited