2016 Fiscal Year Annual Research Report

Statistical theory for string data analysis and its application to computational biochemistry

Research Project

Project/Area Number	26610037
Research Institution	Institute of Physical and Chemical Research
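Principal Investigator	Hitoshi Koyano (小谷野 仁), RIKEN (National Research and Development Agency), Quantitative Biology Center, Researcher (10570989)
Co-Investigator(Kenkyū-buntansha)	Morihiro Hayashida (林田 守広), Kyoto University, Institute for Chemical Research, Assistant Professor (40402929)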
Project Period (FY)	2014-04-01 – 2017-03-31
Keywords	strings / probability theory / statistics / machine learning / biological sequences / bioinformatics
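Outline of Annual Research Achievements	In this research project, we first extended the probability theory on A*, the noncommutative topological semigroup formed by all strings over an alphabet A, which we had constructed in our previous study (Koyano and Kishino, Physical Review E, 2010) in order to develop statistical methods for analyzing biological sequences, and we established the limit theorems needed later. Next, applying these results, we constructed a theory of machine learning that learns in A* under the margin maximization principle. We formulated learning algorithms for the hard-margin and soft-margin cases, evaluated their computational complexity, and then used the limit theorems above to prove that, under a certain regularity condition, the learning machine classifies string data in an asymptotically optimal manner. We also applied the learning machine to the problem of predicting RNA secondary structure from nucleotide sequences and to the problem of predicting protein-protein interactions from amino acid sequences, demonstrating its usefulness in real data analysis. Next, we introduced a parametric distribution on A* and, starting from an investigation of its basic properties, constructed a theory of the EM algorithm for its mixture model. By applying the limit theorems above, we constructed an unsupervised clustering procedure for string data, based on this mixture model, that is asymptotically optimal in the sense that the posterior probability of making a correct classification is maximized. We are currently using this method to analyze the gamma diversity of populations of homologous genes. Furthermore, we introduced median strings and center strings for a distribution on A*, considered the problem of searching for them when A* forms a metric space under the Levenshtein distance, and constructed algorithms that find them efficiently.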
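The median and center strings mentioned at the end of the outline can be illustrated with a small sketch. It is not the method of the project (the listed articles use integer linear programming). It assumes the common definitions, which the outline itself does not spell out: a median string minimizes the expected Levenshtein distance under the distribution, and a center string minimizes the maximum distance to any string with positive probability. As a further simplification, candidates are restricted to the support of the distribution rather than all of A*. The function names and the toy distribution are hypothetical.

```python
from typing import Dict, Tuple


def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def set_median_and_center(dist: Dict[str, float]) -> Tuple[str, str]:
    """Return (median, center), searching only the support of `dist`.

    Assumption: median = argmin of expected distance,
    center = argmin of maximum distance over the support.
    Ties are broken lexicographically.
    """
    support = [s for s, p in dist.items() if p > 0]
    median = min(
        support,
        key=lambda c: (sum(dist[s] * levenshtein(c, s) for s in support), c),
    )
    center = min(
        support,
        key=lambda c: (max(levenshtein(c, s) for s in support), c),
    )
    return median, center


# Hypothetical toy distribution on three strings over the alphabet {A, T}.
toy = {"AAAA": 0.7, "AATT": 0.2, "TTTT": 0.1}
median, center = set_median_and_center(toy)
print(median, center)
```

In this toy case the two notions differ: the median follows the probability mass, while the center ignores the probabilities and looks only at the worst-case distance over the support.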

Research Products
(5 results)

Journal Article (3 results; of which Peer Reviewed: 3, Acknowledgement Compliant: 1) / Presentation (2 results; of which Int'l Joint Research: 1)

[Journal Article] Finding median and center strings for a probability distribution on a set of strings under Levenshtein distance based on integer linear programming (2017)
- Author(s)
  Hayashida, M. and Koyano, H.
- Journal Title
  Communications in Computer and Information Science
  Volume: 690 Pages: in press
- DOI
  10.1007/978-3-319-54717-6_7
- Peer Reviewed
[Journal Article] Maximum margin classifier working in a set of strings (2016)
- Author(s)
  Koyano, H., Hayashida, M., and Akutsu, T.
- Journal Title
  Proceedings of the Royal Society A
  Volume: 472 Pages: in press
- DOI
  10.1098/rspa.2015.0551
- Peer Reviewed / Acknowledgement Compliant
[Journal Article] Integer linear programming approach to median and center strings for a probability distribution on a set of strings (2016)
- Author(s)
  Hayashida, M. and Koyano, H.
- Journal Title
  Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies
  Volume: 3 Pages: 35-41
- DOI
  10.5220/0005666400350041
- Peer Reviewed
[Presentation] Optimal string clustering based on a statistical theory on a topological monoid of strings (2017)
- Author(s)
  Koyano, H., Hayashida, M., and Akutsu, T.
- Organizer
  13th Workshop on Stochastic Models, Statistics and Their Applications
- Place of Presentation
  Berlin, Germany
- Year and Date
  2017-02-24 – 2017-02-24
- Int'l Joint Research
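[Presentation] Integer programming problems for median and center strings of a probability distribution on a set of strings (2016)
- Author(s)
  Hayashida, M. and Koyano, H.
- Organizer
  Joint workshop of the Information Processing Society of Japan (IPSJ) Special Interest Groups on Mathematical Modeling and Problem Solving and on Bioinformatics, and the Institute of Electronics, Information and Communication Engineers (IEICE) Technical Committees on Neurocomputing and on Information-Based Induction Sciences and Machine Learning
- Place of Presentation
  Okinawa, Japan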
- Year and Date
  2016-07-04 – 2016-07-04
