Improving Measures of Lexical Diversity and Multi-word Expressions for Japanese EFL Learners

Research Project

Project/Area Number	23K00597
Research Category	Grant-in-Aid for Scientific Research (C)
Allocation Type	Multi-year Fund
Section	General
Review Section	Basic Section 02080:English linguistics-related
Research Institution	Kyoto Sangyo University
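Principal Investigator	Brooks Gavin, Kyoto Sangyo University, Faculty of Foreign Studies, Lecturer (10610818)
Co-Investigator(Kenkyū-buntansha)	Jordan Jennifer, Kwansei Gakuin University, School of Policy Studies, Full-time Lecturer (00469264); Higginbotham George, Eikei University of Hiroshima, Faculty of Social System Design, Associate Professor (20885090); CLENTON JONATHAN, Hiroshima University, Graduate School of Humanities and Social Sciences (Integrated Arts and Sciences), Associate Professor (80762434)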
Project Period (FY)	2023-04-01 – 2026-03-31
Project Status	Granted (Fiscal Year 2023)
Budget Amount	¥4,680,000 (Direct Cost: ¥3,600,000, Indirect Cost: ¥1,080,000); Fiscal Year 2025: ¥2,210,000 (Direct Cost: ¥1,700,000, Indirect Cost: ¥510,000); Fiscal Year 2024: ¥1,430,000 (Direct Cost: ¥1,100,000, Indirect Cost: ¥330,000); Fiscal Year 2023: ¥1,040,000 (Direct Cost: ¥800,000, Indirect Cost: ¥240,000)
Keywords	lexical diversity / multi-word expressions / NLP / learner corpora / corpus linguistics
Outline of Research at the Start	This project addresses the accuracy gap that arises when part-of-speech (POS) taggers trained and validated on L1 corpora are applied to learner corpora produced by L1 Japanese learners of English. To do this, we will create a tagged corpus of written essays, transcribed discussions, and transcribed presentations from a cohort of L1 Japanese university students. This corpus will be used to create a POS model, which will then be tested on a similar corpus of texts. The resulting POS tagger will be made available for public use.
Outline of Annual Research Achievements	Following the research plan, this year I began to collect and analyze the data necessary for this project. I set up the two corpora (one of L1 Japanese English language learners and one covering a diverse range of L1 backgrounds). I also performed a preliminary analysis of how well existing NLP and LLM libraries handle L2 learner data. In doing so, I identified some areas where existing packages struggle with L2 spoken and written texts, and I presented these findings at a number of conferences. Identifying the shortcomings of existing packages serves the purpose of the research, because it allows me to begin addressing these issues in the next stage of the project.
Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: As stated above, this semester I was able to meet most of the goals set out in the initial proposal. I was able to organize the two corpora into a format that will allow them to be analyzed effectively in the next stage of the project. Two items were slightly more difficult than expected. First, the transcription of spoken texts was slower and slightly more expensive than originally anticipated, which resulted in a smaller spoken corpus than planned. However, in the preliminary analysis, this did not seem to affect my ability to use the corpus to evaluate the effectiveness of the tools. Second, I was not able to get the GUI from the previous POS tagger application to work with the additional features. This meant the coding had to be done manually, which made it difficult to find RAs who were able to assist with this part of the project. This year, I hope to rewrite the GUI so that it works with the additional features that are necessary for this project.
Strategy for Future Research Activity	This year I intend to finish updating the POS tagger and begin to validate it on four test corpora, two spoken and two written. If necessary, I will also continue to add to the spoken corpus by having more texts transcribed for analysis. One method I intend to investigate for increasing the number of spoken texts is a revised version of Whisper that has been updated to improve its performance on L2 speaker texts. While it will still be necessary to check and clean the resulting texts, having an RA do this will be faster and more economical than hiring a transcriber to complete the whole process. After this has been completed, my goal is to use the updated tagger to replicate three existing studies involving lexical diversity and the use of multi-word expressions (MWEs). The first two will look at lexical diversity and the third will examine MWE usage over time. My hope is to complete and present these studies by the end of the year and to submit them for publication. After I have tested the updated tagger, I will begin to examine how best to make it available to other researchers.
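
As an illustration of the kind of lexical diversity measure that replication studies like these typically work with, the following is a minimal sketch only. The report does not say which measures the studies use, so the choice of the type-token ratio (TTR) and its moving-average variant is an assumption, as are the function names, the window size, and the sample sentence.

```python
from collections import Counter


def type_token_ratio(tokens):
    """Plain TTR: number of distinct word forms divided by number of tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)


def moving_average_ttr(tokens, window=50):
    """Mean TTR over every window of `window` consecutive tokens.

    Averaging over fixed-size windows makes the score less sensitive to
    text length than plain TTR. The default window of 50 is an assumption
    for illustration, not a value taken from the project.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        # Text shorter than one window: fall back to plain TTR.
        return type_token_ratio(tokens)

    counts = Counter(tokens[:window])  # word-form counts in the current window
    type_total = len(counts)           # running sum of types per window
    n_windows = len(tokens) - window + 1

    for i in range(window, len(tokens)):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]       # form no longer present in the window
        counts[tokens[i]] += 1
        type_total += len(counts)

    return type_total / (window * n_windows)


# Hypothetical sample text, lowercased and split on whitespace.
sample = "the cat sat on the mat and the dog sat too".split()
ttr = type_token_ratio(sample)
mattr = moving_average_ttr(sample, window=5)
```

Both functions take a list of already tokenized word forms, so the same code could be run on raw tokens or on lemmas produced by a tagger; which of the two a given study counts is a design decision the report leaves open.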