Research on Security of Characters

Research Project

Project/Area Number	20K21797
Research Category	Grant-in-Aid for Challenging Research (Exploratory)
Allocation Type	Multi-year Fund
Review Section	Medium-sized Section 60: Information science, computer engineering, and related fields
Research Institution	Waseda University
Principal Investigator	Tatsuya Mori Waseda University, Faculty of Science and Engineering, Professor (60708551)
Project Period (FY)	2020-07-30 – 2023-03-31
Project Status	Completed (Fiscal Year 2022)
Budget Amount	¥6,500,000 (Direct Cost: ¥5,000,000, Indirect Cost: ¥1,500,000); Fiscal Year 2021: ¥2,730,000 (Direct Cost: ¥2,100,000, Indirect Cost: ¥630,000); Fiscal Year 2020: ¥3,770,000 (Direct Cost: ¥2,900,000, Indirect Cost: ¥870,000)
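Keywords	homoglyph / security / machine learning / cognition / natural language processing / copyright protection / characters / Unicode / character codes / character security / homograph / coded character set
Outline of Research at the Start	This project addresses the security threats posed by homoglyphs: distinct encoded characters that look alike, such as the kanji "卜" and the katakana "ト". When a character is swapped with its homoglyph, most people do not notice, whereas natural language processing implemented in software always reflects the difference. A gap therefore arises between how humans perceive a document and how machines process it, and this gap creates a security threat in which improper processing can be induced deliberately. The project aims to (1) analyze the threat of homoglyph attacks against representative natural language processing applications and (2) develop effective countermeasures.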
Outline of Final Research Achievements	This research project focuses on "homoglyphs," pairs of characters such as the Latin "a" and the Cyrillic "а" that look similar but are assigned different code points. While many people overlook these homoglyphs, natural language processing software reflects their differences, creating unique security risks. The results of this study highlight the challenges of dealing with homoglyphs in machine translation systems and show that not only the neural network but also the preprocessing of the text significantly affects the results. Furthermore, as an application of this research, we developed a method for copyright protection of text: the browser displays human-readable text, while the underlying data uses different character codes, effectively masking the original content.
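The gap between what a reader sees and what software processes can be illustrated with a short Python sketch. It is not the project's method; it only uses the standard `unicodedata` module to show that the homoglyph pairs mentioned in this record have different code points. The helper names `describe` and `scripts_in` are hypothetical, and treating the first word of a character's Unicode name as its script is a rough assumption made for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def scripts_in(text: str) -> set[str]:
    """Approximate the scripts used in `text`.

    Assumption: the first word of a letter's Unicode name (LATIN,
    CYRILLIC, KATAKANA, CJK, ...) is a good enough stand-in for its script.
    """
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}


# Homoglyph pairs taken from this record: Latin/Cyrillic "a" and kanji/katakana.
PAIRS = [("a", "\u0430"), ("\u535c", "\u30c8")]

if __name__ == "__main__":
    for left, right in PAIRS:
        print(f"{left} {describe(left)} | {right} {describe(right)} | equal: {left == right}")

    # A word that looks like plain Latin but contains one Cyrillic letter.
    spoofed = "p\u0430ypal"
    print(spoofed, "uses scripts:", sorted(scripts_in(spoofed)))
```

Compatibility normalization such as NFKC leaves the Cyrillic "а" unchanged, so a check of this kind flags mixed-script strings but does not by itself map a homoglyph back to the character it imitates.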
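Academic Significance and Societal Importance of the Research Achievements	This research explored security issues related to homoglyphs. Because the range of applications is broad, ripple effects can be expected. Characters are handled in many settings, such as browsers and applications, and they also play an important role in large language models, which have recently attracted attention. The results of this research present new means needed to reduce the security risks of applications that handle characters and to provide a safer digital environment. For these reasons, the research has considerable societal significance in addition to its academic value.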