2018 Fiscal Year Research-status Report
Natural language processing for academic writing in English
Project/Area Number |
18K11446
|
Research Institution | The University of Kitakyushu |
Principal Investigator |
Goh ChooiLing 北九州市立大学, 国際環境工学部, 特任准教授 (90531616)
|
Co-Investigator(Kenkyū-buntansha) |
LEPAGE YVES 早稲田大学, 理工学術院(情報生産システム研究科・センター), 教授 (70573608)
|
Project Period (FY) |
2018-04-01 – 2021-03-31
|
Keywords | academic writing / plagiarism / word embeddings / text generation / summarization / lexical bundles |
Outline of Annual Research Achievements |
The first fiscal year was dedicated to setting up the infrastructure of the project and conducting preliminary experiments. An experiment server has been set up; for that purpose, a DeepLearningBox was acquired. A web site has been set up with a mock-up interface for the writing aid. The integration of the tools and pipelines mentioned below under this interface is in progress. As announced in the plan, the ACL Anthology Reference Corpus (ACL-ARC) has been downloaded. Plain texts have been extracted from the OmniPage XML files, and the parts of each paper have been classified into Introduction, Body and Conclusion. A word embedding model has been built from the plain texts extracted from ACL-ARC. It has been compared with large pre-trained models for proposing alternative formulations or paraphrases. An experiment involving human evaluation has been conducted. The results show that this model can be used as an alternative to pre-trained models. These results have been published in a paper at the annual conference of the Association for Natural Language Processing. Techniques for summarization have been tested on papers from the ACL-ARC data. Several pipelines have been built and compared. The final goal is to generate new sentences that avoid plagiarism. A paper on this work has also been published at the annual conference of the Association for Natural Language Processing. A text search engine has been built to retrieve sentences similar to a given source sentence. The results are not yet satisfactory, and further experiments need to be carried out in the next fiscal year.
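To make the similar-sentence search concrete, the following is a minimal sketch of one common baseline: ranking corpus sentences by cosine similarity over TF-IDF weighted bag-of-words vectors. The report does not say which retrieval method the project's search engine uses, so the technique, the function names (`tokenize`, `build_idf`, `vectorize`, `cosine`, `search`) and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def build_idf(corpus):
    """Compute a smoothed inverse document frequency for every corpus word."""
    n = len(corpus)
    df = Counter()
    for sentence in corpus:
        df.update(set(tokenize(sentence)))
    return {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}


def vectorize(sentence, idf):
    """Map a sentence to a sparse TF-IDF vector; unknown words are dropped."""
    tf = Counter(tokenize(sentence))
    return {word: count * idf[word] for word, count in tf.items() if word in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, corpus, top_k=3):
    """Return the top_k (score, sentence) pairs most similar to the query."""
    idf = build_idf(corpus)
    query_vector = vectorize(query, idf)
    scored = [(cosine(query_vector, vectorize(s, idf)), s) for s in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

A purely lexical baseline of this kind tends to degrade on long and complex source sentences, because the shared vocabulary is spread over many terms. Embedding-based sentence representations are one possible alternative to compare against.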
|
Current Status of Research Progress |
3: Progress in research has been slightly delayed.
Reason
The slight delay is due to the change of affiliation of the principal investigator since the start of the research period in April 2018. The new position reduced the time available for research at the beginning. However, the foundation of the project has been built: the research environment has been set up, data has been collected and pre-processed, and preliminary experiments have been carried out, which provide a basis for more concrete results in subsequent research. In addition, as the principal investigator is settling into the new position, more time and energy can be dedicated to this project in the second fiscal year.
|
Strategy for Future Research Activity |
The basic infrastructure for the research project has been deployed. More experiments will be carried out in the next fiscal year. Firstly, a survey of lexical bundles extracted from ACL-ARC will be carried out. The lexical bundles will be freely usable for text generation. Secondly, the search for similar texts in the text database was carried out in the first fiscal year, but the results are not satisfactory. It is especially difficult to find similar texts in the database when the source sentence is long and complex. Different methods will be surveyed so that more acceptable results can be obtained. Thirdly, metrics used for plagiarism will be surveyed, and the algorithms to be used for detecting plagiarism will be determined. A workstation is currently being set up as a web server, in preparation for open access to the public. On this web server, we will provide a search engine that looks for similar words using word embedding techniques, and a space for summarization based on Introduction and Conclusion. We will continue to develop a web application for academic writing aid that includes: a pane for text drafting, functionality to translate from the native language to English, dictionary lookup from the native language to English, lookup of synonyms, and search of similar sentences from ACL-ARC for writing references.
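As an illustration of the kind of metric that such a survey might cover, the sketch below computes word n-gram containment: the fraction of a candidate text's n-grams that also occur in a source text. This is only one commonly used overlap measure, and the report does not state which metrics the project will adopt. The function names (`word_ngrams`, `containment`) and the default of n = 3 are assumptions for this example.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) found in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(candidate, source, n=3):
    """Fraction of the candidate's n-grams that also appear in the source.

    Returns 0.0 when the candidate is too short to contain any n-gram.
    """
    candidate_ngrams = word_ngrams(candidate, n)
    if not candidate_ngrams:
        return 0.0
    shared = candidate_ngrams & word_ngrams(source, n)
    return len(shared) / len(candidate_ngrams)
```

A value close to 1 means that nearly every n-gram of the candidate is copied from the source, while a value close to 0 suggests little verbatim overlap. Paraphrased text can still score low under this measure, which is one reason why several metrics would need to be compared.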
|
Causes of Carryover |
Fiscal year 2018 was mostly spent setting up the infrastructure of the project. Preliminary experiments led to the publication of two papers at a domestic conference, but there was not enough material for an international publication. Consequently, part of the overseas travel expenses was carried over to the next fiscal year. We plan to submit two papers to international conferences in the second fiscal year. The acquisition of a workstation, initially planned for fiscal year 2019, was brought forward. Thanks to this, the setup of a web server started earlier than expected. Expenses initially planned for the remuneration of students were carried over to the next fiscal year because not enough qualified manpower was available for the preparation of the data. Data preparation will continue in the next fiscal year, and more of the remuneration budget will be used for developing the web application.
|
Research Products
(2 results)