Development of a Multimodal Language Generation Model Capable of Integrated Understanding of Document Images and Speech

Research Project

Project/Area Number	24K20829
Research Category	Grant-in-Aid for Early-Career Scientists
Allocation Type	Multi-year Fund
Review Section	Basic Section 61030: Intelligent informatics-related
Research Institution	Tohoku University
Principal Investigator	高橋いつみ (斉藤いつみ) Tohoku University, Graduate School of Information Sciences, Associate Professor (90984287)
Project Period (FY)	2024-04-01 – 2027-03-31
Project Status	Granted (Fiscal Year 2024)
Budget Amount	¥4,550,000 (Direct Cost: ¥3,500,000, Indirect Cost: ¥1,050,000); Fiscal Year 2026: ¥1,170,000 (Direct Cost: ¥900,000, Indirect Cost: ¥270,000); Fiscal Year 2025: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000); Fiscal Year 2024: ¥1,690,000 (Direct Cost: ¥1,300,000, Indirect Cost: ¥390,000)
Keywords	Multimodal
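Outline of Research at the Start	This research project works on building a multimodal language generation model that can understand images, speech, and language in an integrated manner and generate highly reliable text. In particular, it aims to realize artificial intelligence (AI) that supports human intellectual activities by deeply understanding speech information together with document images, such as slides and PDF materials, and figure and chart images used in lectures and meetings in academic and business settings. The project combines a large language generation model with image and speech models to build a multimodal language generation model that understands and generates complex multimodal information at a high level in accordance with language instructions. It also examines methods for improving output reliability by evaluating the reliability of the generated text and using the evaluation results to improve the output.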