Grant-in-Aid for International Scientific Research
|Allocation Type||Single-year Grants|
|Research Institution||The University of TOKYO|
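WARD Nigel The University of TOKYO Department of Mechano-Informatics Associate Professor, Graduate School of Engineering, Associate Professor (00242008)
WARD Nigel G. (1995) The University of TOKYO, Faculty of Engineering, Associate Professor
WARD Nigel G. (1994) The University of TOKYO, Faculty of Engineering, Lecturer
TAJCHMAN Gary International Computer Science Institute, Speech Group, Researcher
MORGAN Nelson International Computer Science Institute, Speech Group; University of California, Faculty of Engineering, Researcher, Professor
JURAFSKY Dan International Computer Science Institute, Speech Group; University of California, Faculty of Engineering, Researcher, Associate Professor
TERADA Minoru The University of TOKYO Department of Mechano-Informatics, Graduate School of Engineering, Associate Professor (80163921)
INOUE Hirochika The University of TOKYO Department of Mechano-Informatics, Graduate School of Engineering, Professor (50111464)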
DAN Jurafsky International Computer Science Institute
NELSON Morgan International Computer Science Institute
GARY Tajchman International Computer Science Institute
|Project Period (FY)
1994 – 1995
Completed(Fiscal Year 1995)
|Budget Amount
¥4,200,000 (Direct Cost : ¥4,200,000)
Fiscal Year 1995 : ¥1,400,000 (Direct Cost : ¥1,400,000)
Fiscal Year 1994 : ¥2,800,000 (Direct Cost : ¥2,800,000)
|Keywords||User Interface / Speech Understanding / Speech Input / Natural Language / Understanding / AIZUCHI / Multimodal / Speech / Noise / English / Japanese / Grammar|
We are interested in the use of spoken language in human-computer interaction. The inspiration is the fact that, in human-human interaction, meaningful exchanges can take place even without accurate recognition of the words the other person is saying. This is possible because of shared knowledge and complementary communication channels, especially gesture and prosody. We want to exploit this fact for man-machine interfaces.
Therefore we are doing three things:
1. Using simple speech recognition to augment graphical user interfaces, well integrated with other input modalities: keyboard, mouse, and touch screen.
2. Building systems able to engage in simple conversations, using mostly prosodic clues. To sketch out our latest success:
We conjectured that, in Japanese, it would be possible to decide when to produce many back-channel utterances (aizuchi) based on prosodic clues alone, without reference to meaning.
We found that vowel lengthening, volume changes, and energy level (used to detect when the other speaker had finished) were not, by themselves, good predictors of when to produce an aizuchi. The best predictor was a low pitch level.
Specifically, upon detection of the end of a region of pitch less than 0.9 times the local median pitch and continuing for 150 ms, coming after at least 600 ms of speech, the system predicted an aizuchi 200 ms to 300 ms later, provided it had not done so within the preceding 1 second.
We also built a real-time system based on the above decision rule. A human stooge steered the conversation to a suitable topic and then switched on the system. After switch-on, the stooge's utterances and the system's outputs, mixed together, formed one side of the conversation. We found that none of the 5 subjects had realized that their conversation partner had become partially automated.
3. Building tools and collecting data to help do 1 and 2.
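The low-pitch decision rule described under item 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's implementation. The thresholds (0.9 times the local median, 150 ms, 600 ms, 1 second) come from the text above. Everything else is an assumption made for the example: the function name `predict_aizuchi`, the 10 ms frame step, the 2-second window for the local median, a fixed 250 ms response delay (the midpoint of the stated 200 ms to 300 ms), counting the low-pitch region itself toward the 600 ms of speech, and treating unvoiced frames (`None`) as non-speech.

```python
from collections import deque
from statistics import median

# Values taken from the decision rule in the text.
LOW_RATIO = 0.9          # pitch below 0.9 x local median counts as "low"
MIN_LOW_MS = 150         # low-pitch region must last at least this long
MIN_SPEECH_MS = 600      # region must follow at least this much speech
REFRACTORY_MS = 1000     # no new aizuchi within 1 s of the previous one

# Assumptions made for this sketch (not stated in the text).
FRAME_MS = 10            # hypothetical frame step of the pitch tracker
MEDIAN_WINDOW_MS = 2000  # hypothetical span of the "local" median
RESPONSE_DELAY_MS = 250  # midpoint of the stated 200-300 ms delay


def predict_aizuchi(pitch_track):
    """Return the times (ms) at which an aizuchi would be produced.

    pitch_track holds one pitch value in Hz per frame, or None when the
    frame is unvoiced or silent.
    """
    window = deque(maxlen=MEDIAN_WINDOW_MS // FRAME_MS)  # recent voiced pitch
    outputs = []
    low_ms = 0          # length of the current low-pitch region
    speech_ms = 0       # length of the current run of voiced frames
    last_output = None  # time of the most recent predicted aizuchi

    for i, f0 in enumerate(pitch_track):
        now = i * FRAME_MS  # start time of this frame
        if f0 is not None:
            window.append(f0)
        is_low = f0 is not None and f0 < LOW_RATIO * median(window)

        if is_low:
            low_ms += FRAME_MS
        else:
            # Any low-pitch region has just ended; test the rule.
            rested = last_output is None or now - last_output >= REFRACTORY_MS
            if low_ms >= MIN_LOW_MS and speech_ms >= MIN_SPEECH_MS and rested:
                last_output = now + RESPONSE_DELAY_MS
                outputs.append(last_output)
            low_ms = 0

        # Update the speech run after the check, so it covers earlier frames only.
        speech_ms = speech_ms + FRAME_MS if f0 is not None else 0

    return outputs
```

As a usage example, a synthetic track of 600 ms at 200 Hz followed by 200 ms at 150 Hz and then silence, such as `[200.0] * 60 + [150.0] * 20 + [None] * 10`, satisfies every condition when the low region ends at 800 ms, so the sketch schedules one aizuchi 250 ms later. A real pitch track would be noisier, and the window length and delay would need tuning against recorded conversations.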