2022 Fiscal Year Final Research Report

e-Phenotyping from clinical text for hereditary disorders and feasibility evaluation for clinical applications

Project/Area Number	20H04279
Research Category	Grant-in-Aid for Scientific Research (B)
Allocation Type	Single-year Grants
Section	General
Review Section	Basic Section 62010: Life, health and medical informatics-related
Research Institution	The University of Tokyo
Principal Investigator	Kawazoe Yoshimasa, The University of Tokyo, The University of Tokyo Hospital, Project Associate Professor (10621477)
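Co-Investigator (Kenkyū-buntansha)	関倫久, The University of Tokyo, The University of Tokyo Hospital, Assistant Professor (30528873) / 篠原恵美子, The University of Tokyo, The University of Tokyo Hospital, Project Assistant Professor (40582755)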
Project Period (FY)	2020-04-01 – 2023-03-31
Keywords	Clinical records / Hereditary diseases / Phenotype / Natural language processing / Phenotyping / Human Phenotype Ontology / Named Entity Recognition / Relation Extraction
Outline of Final Research Achievements	We collected case report texts for 362 cases covering 151 designated intractable diseases and developed criteria for annotating phenotypes using 70 types of named entity tags and 35 types of relation tags. We annotated 57,520 phenotypes and mapped them to term codes in medical terminologies (UMLS, HPO, and the MEDIS standard disease name master). Of these, the 179 cases for which redistribution permission was obtained were published as a corpus on the researchers' website. A machine learning model was also developed to reproduce the annotations, and its accuracy was evaluated. Although the accuracy of named entity recognition and relation extraction was relatively high, the accuracy of mapping phenotype strings to HPO codes was insufficient and remains future work.
Free Research Field	Medical informatics
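Academic Significance and Societal Importance of the Research Achievements	As a foundational technology for natural language processing, this research developed detailed annotation criteria for extracting phenotypes (patient conditions), and built and released a high-quality corpus annotated according to these criteria. Given clinical text as input, a computer that reproduces these annotations can automatically extract and aggregate patient phenotypes (for example, which body site a symptom affects, and whether the symptom is persisting or improving). Although machine-learning-based phenotype extraction performed well, the performance of entity linking, which maps the extracted phenotypes to terms in medical terminologies, was insufficient; developing methods to improve this performance was identified as future work.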