Visual speech recognition using ultrasound tongue and video lip/face images
Project/Area Number | 23520467 |
Research Category | Grant-in-Aid for Scientific Research (C) |
Allocation Type | Multi-year Fund |
Section | General |
Research Field | Linguistics |
Research Institution | The University of Aizu |
Principal Investigator | WILSON Ian, The University of Aizu, School of Computer Science and Engineering, Professor (50444930) |
Project Period (FY) | 2011 – 2013 |
Project Status | Completed (Fiscal Year 2013) |
Budget Amount | ¥4,810,000 (Direct Cost: ¥3,700,000, Indirect Cost: ¥1,110,000) |
Fiscal Year 2013: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
Fiscal Year 2012: ¥1,300,000 (Direct Cost: ¥1,000,000, Indirect Cost: ¥300,000)
Fiscal Year 2011: ¥2,210,000 (Direct Cost: ¥1,700,000, Indirect Cost: ¥510,000)
Keywords | ultrasound / video / tongue / articulation / jaw / acoustics / optical flow / pixel-wise / computer lipreading / stress / pitch |
Research Abstract |
Our research produced three main results. (1) In video measurements of jaw movement, the amount of skin stretching over the mandible during the vowel of a CVC syllable is significantly affected by the onset consonant, but not by the coda consonant. (2) In ultrasound measurements of tongue position during English speech, native (L1) speakers rest their tongue in a more efficient location, closer to the median position of English speech sounds, than Japanese (L2) speakers do. (3) Regarding how best to construct and interpret the feature space we call MUTIS (midsagittal ultrasound tongue image space), the higher dimensions of MUTIS are most effective for identifying speakers, whereas primarily the lower dimensions of VSS (vocal sound space) data are most effective for identifying phonemes. Trajectories in the VSS data show clear differences between L1 and L2 speakers, but trajectories in the MUTIS data alone do not.
|
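To illustrate what a pixel-wise feature space such as MUTIS involves, here is a minimal Python sketch assuming randomly generated stand-in ultrasound frames and scikit-learn's PCA; the image size, component counts, and the split between "phoneme" and "speaker" dimensions are hypothetical assumptions for illustration, not the project's actual pipeline.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in data: 500 midsagittal ultrasound frames, each a 64 x 64 image,
# flattened to one pixel-wise feature vector per frame.
frames = rng.random((500, 64, 64))
X = frames.reshape(len(frames), -1)   # shape (500, 4096)

# Project the frames into a low-dimensional image space.
pca = PCA(n_components=20)
scores = pca.fit_transform(X)         # shape (500, 20)

# Hypothetical split mirroring the reported finding: lower-order components
# for phoneme identification, higher-order components for speaker identification.
phoneme_features = scores[:, :5]
speaker_features = scores[:, 5:]

# A trajectory is the sequence of scores over the frames of one utterance,
# which can then be compared between L1 and L2 speakers.
utterance_trajectory = scores[:120]   # e.g. the first 120 frames of one utterance
print(pca.explained_variance_ratio_[:5])

Splitting the component range in this way simply mirrors the abstract's finding that lower dimensions carry mostly phonemic information while higher dimensions carry mostly speaker identity; the exact cutoff used in the study is not given here.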
Report | 4 results |
Research Products | 22 results |