Project/Area Number |
18K00656
|
Research Institution | University of Aizu |
Principal Investigator |
Heo Younghyon University of Aizu, School of Computer Science and Engineering, Senior Associate Professor (10631476)
|
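Co-Investigators |
Perkins Jeremy University of Aizu, School of Computer Science and Engineering, Senior Associate Professor (30725635)
Paik Incheon University of Aizu, School of Computer Science and Engineering, Professor (70336478)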
|
Project Period (FY) |
2018-04-01 – 2021-03-31
|
Keywords | machine-translated text / keyword analysis |
Summary of Research Achievements |
We first compared two text classification methods based on different machine learning techniques: deep learning (Deep MPN) and supervised learning (SVM) with keyword analysis (TF-IDF). The SVM with TF-IDF predicted machine-translated texts better, with accuracy above 80% across the board, whereas the accuracy of Deep MPN ranged from 49.9% to 66.7%. This suggests that analyzing the use of words (keyword or n-gram analysis) is the key aspect of the machine learning analysis of machine-translated texts. In the follow-up study, the goal was to identify the best candidate model for the analysis. We calculated the accuracy of document classification and the similarity of feature vectors for each number of words using two types of n-gram (unigram and bigram). For both types, setting the minimum number of sentences in a document to 50 and the number of words to 1200 yielded high classification accuracy (unigram: 0.98 for 50/1200; bigram: 0.979 for 50/1200). We concluded that the best detection model can be established with a document size of 50 lines and a word count of 1200.
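
The n-gram features and feature-vector similarity mentioned above can be illustrated with a minimal sketch. This is not the project's implementation: the function names (`ngrams`, `tfidf_vectors`, `cosine`), the whitespace tokenization, the smoothed IDF weighting, and the toy documents are all assumptions made for illustration, and the SVM classification step is left out.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams (as tuples) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n):
    """Build one sparse TF-IDF vector (a dict) per document from word n-grams."""
    counts = [Counter(ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(term for c in counts for term in c)
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero for empty documents
        vectors.append({
            # Relative term frequency times a smoothed IDF (an assumed weighting).
            term: (freq / total)
            * (math.log((1 + num_docs) / (1 + doc_freq[term])) + 1)
            for term, freq in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical toy documents; the study used documents of at least 50 sentences.
    docs = [
        "the cat sat on the mat",
        "the cat sat on the sofa",
        "stock prices fell sharply today",
    ]
    for n, name in [(1, "unigram"), (2, "bigram")]:
        vecs = tfidf_vectors(docs, n)
        print(name, cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In a full pipeline, vectors like these would be passed to a classifier such as an SVM. Switching `n` from 1 to 2 changes only the feature set, which is the kind of unigram/bigram comparison reported above.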
|
Current Status of Research Progress (Category) |
2: Progressing rather smoothly
Reason
In this research, two linguists and one computer scientist work together as a team. The research methodology and results involve technical terms and concepts from computer science that are quite challenging for the linguists to understand, so we rely heavily on the computer scientist's general interpretation of the results. Team members managed to communicate about the experimental settings and results by making a considerable effort to explain them in general terms. We think that drawing out the linguistic implications of the highly technical experimental results would be much easier if the team included a member with knowledge of both linguistics and computer science.
|
Strategy for Future Research Activity |
Based on the results of our studies in 2018 and 2019, we will develop class materials for teaching the proper use of Google Translate for academic English writing in the Thesis Writing and Presentation class for 4th-year students at our university. The class materials will consist of four parts: 1) machine learning detection of Google-translated documents, 2) linguistic features of machine-translated texts, 3) how to use Google Translate properly, and 4) how to use Google Translate to learn about writing skills. Our advice to students on proper use will draw on our findings from 2018 and 2019. Over several teaching sessions using these materials, students will have one session in which they use Google Translate on sentences from their own drafts. Learning about the features of Google-translated sentences and good ways to produce natural sentences with Google Translate will help them write better English sentences in their theses.
|
Reason for the Amount Carried Over to the Next Fiscal Year |
We originally planned to participate in two conferences in 2019 but ended up attending only one. In 2020, we hope to spend more of the budget on creating good-quality teaching materials and to participate in more than two international or domestic conferences to report on the entire three-year project.
|