Summary of Research Achievements
In the previous fiscal year, we began systematically comparing the corpus's accent annotations (two human and two machine annotations for every sentence) and resolving any disagreements in consultation with a phonetically trained native speaker of Tokyo Japanese. This work was completed during the current fiscal year, so all 26,000+ entries in the database have now been hand-corrected.
While this work was ongoing, Chamame's four types of accent rules (C, F, M, and P) were tested extensively in three stages. First, using the cleaned-up dataset of sentences from the dialogues in Genki, the accent rules were analyzed in terms of frequency, co-occurrence, and relationship to part of speech. Second, "P" rules (triggered upon prefixation) were examined in greater depth as a case study, using data from both the Genki corpus and the appendix to the Shinmeikai Japanese Accent Dictionary. Third, Unidic (the dictionary Chamame draws on for accent information) was analyzed in order to delineate the full range of possible accent rules and their combinations, with a carefully selected illustrative example for each.
With the corpus completed and the rule system clarified, the original plan was to run learning simulations and disseminate the results during the second half of this fiscal year. Unfortunately, because a PI changed affiliation to an institution outside Japan, the grant had to be terminated automatically. However, the research itself has continued, and the completed corpus opens up many possibilities for future work.