2015 Fiscal Year Annual Research Report

Ultrafast Big Data Similarity Search by SMAD and Its Applications

Research Project

Project/Area Number	25280002
Research Institution	The University of Tokyo
Principal Investigator	Tetsuo Shibuya (渋谷哲朗), The University of Tokyo, Institute of Medical Science, Associate Professor (60396893)
Project Period (FY)	2013-04-01 – 2017-03-31
Keywords	Algorithms / Data search / Big data / Protein 3D structure / Bioinformatics
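Outline of Annual Research Achievements	The aim of this research is to develop and apply techniques that achieve ultrafast search over the huge data sets of the increasingly diverse big-data era, by exploiting the statistical behavior of data that becomes visible precisely because the data are huge. In knowledge discovery from huge data and in learning theory, a variety of complex statistical models have been put to use, but exploiting them to speed up search algorithms is extremely difficult and has hardly been attempted. In this research we use SMAD (Statistical Model-based Algorithm Design), a recent algorithm design methodology developed by the principal investigator that has attracted attention worldwide, to exploit complex statistical models. The goal is ultrafast search in databases of complex, huge data, beginning with biological databases such as protein 3D structure databases, together with the opening up of new applications. For the SMAD approach pursued here, extracting models from large-scale data is an important problem. To address it, continuing from the previous fiscal year, we further advanced our research on extracting models of protein 3D structure by deep learning and succeeded in constructing a new protein 3D structure prediction model. As another approach to modeling, we also advanced research on modeling by neural networks, constructed a new neural network model, and examined its validity. Furthermore, similarity search of protein 3D structures that ignores the order of atoms within the structure is known to be far harder computationally than search that takes the order into account; for this problem too, we succeeded in constructing a new algorithm based on the SMAD framework.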
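The gap in difficulty between order-aware and order-free comparison noted above can be illustrated with a minimal sketch. It is not the project's SMAD-based algorithm: it assumes the two point sets are already superimposed (no rotation or translation is optimized), it scores them by root-mean-square deviation (RMSD), and the function names `rmsd` and `rmsd_order_free` are hypothetical. With a known order the score is a single pass over the points; without it, the naive approach shown here has to try every pairing, which grows factorially with the number of points.

```python
import itertools
import math


def rmsd(a, b):
    """RMSD between two equal-length lists of 3D points, paired by index."""
    if len(a) != len(b) or not a:
        raise ValueError("point lists must be non-empty and of equal length")
    # Sum of squared coordinate differences over all index-paired points.
    total = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(total / len(a))


def rmsd_order_free(a, b):
    """Minimum RMSD over all pairings of a with b (brute force, O(n!) pairings)."""
    return min(rmsd(a, perm) for perm in itertools.permutations(b))


# Illustrative data: b holds the same three points as a, in a different order.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
b = [(0.0, 2.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

print(rmsd(a, b))             # index-paired score: nonzero, the order differs
print(rmsd_order_free(a, b))  # best pairing: 0.0, the point sets are identical
```

The brute-force search is only practical for a handful of points, which is why order-free structure search calls for dedicated algorithms.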
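Current Status of Research Progress	2: Research has progressed on the whole more than it was originally planned. Reason: The aim of this research is to develop and apply techniques that achieve ultrafast search over the huge data sets of the increasingly diverse big-data era, by exploiting the statistical behavior of data that becomes visible precisely because the data are huge. To date we have steadily produced results in both foundational technology and applied research, including techniques for extracting models from large-scale data and search-based high-accuracy RNA functional analysis. First, to find the statistical behavior that becomes visible precisely because the data are huge, we applied several methods for extracting models from large-scale data and succeeded in obtaining results. We realized new algorithms for RNA structure prediction, protein 3D structure prediction, and related tasks based on deep learning and neural networks, which have been in the spotlight in recent years, and these made a deeper understanding of the target models possible. On the application side, to make our existing protein 3D structure search more capable, we also succeeded in developing a new fast protein 3D structure search technique that places no constraint on amino acid sequence order. In addition, for the problem of determining where in the reference genome each output read of a next-generation sequencer belongs, a problem tied directly to real biological applications, we succeeded in constructing a high-accuracy mapping algorithm. We have also succeeded in developing a new protein function prediction algorithm based on gapped seed search.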
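Strategy for Future Research Activity	Going forward, we will push this research further and broaden its practical applications. Protein function is usually analyzed from sequence. In reality, however, protein function is said to be determined by 3D structure, and predicting function directly from structure may reveal new functions of proteins whose structures have been determined; high-accuracy algorithms for such structure-based function prediction are therefore needed. As an application of protein 3D structure search, we will aim to construct a new algorithm that predicts protein function from protein 3D structure by making use of protein 3D structure search algorithms. In addition, enormous amounts of personal genome data are now accumulating thanks to the dramatic drop in the cost of next-generation sequencing. One technology needed to use such data seamlessly for analyzing the causes of familial diseases and for genetic testing services is relative search on genome databases. At present only older statistical techniques, such as those using microsatellites, are known, and new techniques specialized for the next-generation sequencer era are needed. As a new application of SMAD, we will therefore pursue algorithm development with fast relative search on genome databases as a further goal. Continuing from the previous fiscal year, we will also further examine the applicability of SMAD to privacy-aware search, aiming to connect it to new developments after this project ends.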
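Causes of Carryover	Spending was lower than planned for two reasons: a student who was to receive an honorarium suspended research because of poor health, so the honorarium could not be paid; and the purchase of PCs was postponed on information that the next version of Windows 10 would be released.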
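Expenditure Plan for Carryover Budget	Since the student has recovered, the student will take part in research activities and the honorarium will be paid as planned. As for PCs, we plan to acquire the number needed to compile the final results, regardless of whether the next Windows PCs are released.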

Research Products
(9 results)

All Int'l Joint Research (1 results) Journal Article (4 results) (of which Int'l Joint Research: 1 results, Peer Reviewed: 4 results, Open Access: 1 results, Acknowledgement Compliant: 1 results) Presentation (4 results) (of which Int'l Joint Research: 2 results, Invited: 2 results)

[Int'l Joint Research] National University of Singapore (Singapore)
- Country Name
  SINGAPORE
- Counterpart Institution
  National University of Singapore
[Journal Article] An O(m log m)-time algorithm for detecting superbubbles (2015)
- Author(s)
  Wing-Kin Sung, Kunihiko Sadakane, Tetsuo Shibuya, Abha Belorkar and Iana Pyrogova
- Journal Title
  IEEE/ACM Transactions on Computational Biology and Bioinformatics
  Volume: 12 Pages: 770-777
- DOI
  10.1109/TCBB.2014.2385696
- Peer Reviewed / Open Access / Int'l Joint Research
[Journal Article] Malphite: A Convolutional Neural Network and Ensemble Learning Based Protein Secondary Structure Predictor (2015)
- Author(s)
  Yang Li, Tetsuo Shibuya
- Journal Title
  Proc. IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
  Volume: 1 Pages: 1260-1266
- DOI
  10.1109/BIBM.2015.7359861
- Peer Reviewed / Acknowledgement Compliant
[Journal Article] Efficient Approximate 3-Dimensional Point Set Matching Using Root-Mean-Square Deviation Score (2015)
- Author(s)
  Yoichi Sasaki, Tetsuo Shibuya, Kimihito Ito, and Hiroki Arimura
- Journal Title
  Lecture Notes in Computer Science (LNCS)
  Volume: 9371 Pages: 191-203
- DOI
  10.1007/978-3-319-25087-8_18
- Peer Reviewed
[Journal Article] Locating Controlling Regions of Neural Networks Using Constrained Evolutionary Computation (2015)
- Author(s)
  Mohammad A. Eita, Tetsuo Shibuya, and Amin A. Shoukry
- Journal Title
  Proc. 2015 IEEE Congress on Evolutionary Computation (CEC2015)
  Volume: 1 Pages: 1581-1588
- Peer Reviewed
[Presentation] Malphite: A Convolutional Neural Network and Ensemble Learning Based Protein Secondary Structure Predictor (2015)
- Author(s)
  Yang Li, Tetsuo Shibuya
- Organizer
  Information Processing Society of Japan (IPSJ), Special Interest Group on Bioinformatics (バイオ情報学研究会)
- Place of Presentation
  Center for iPS Cell Research and Application (CiRA), Kyoto University (Kyoto City, Kyoto Prefecture)
- Year and Date
  2015-12-07
[Presentation] Algorithm Design Paradigm Shift Needed for Bio Big Data (2015)
- Author(s)
  Tetsuo Shibuya
- Organizer
  Genomic Medicine 2015
- Place of Presentation
  Ho Chi Minh, Vietnam
- Year and Date
  2015-07-21
- Int'l Joint Research / Invited
[Presentation] Efficient Approximate 3-Dimensional Point Set Matching and Its Application to Molecular Pattern Matching (2015)
- Author(s)
  Yoichi Sasaki, Tetsuo Shibuya, Kimihito Ito, Hiroki Arimura
- Organizer
  Information Processing Society of Japan (IPSJ), Special Interest Group on Bioinformatics (バイオ情報学研究会)
- Place of Presentation
  Okinawa Institute of Science and Technology Graduate University (Kunigami District, Okinawa Prefecture)
- Year and Date
  2015-06-15
[Presentation] Algorithmic Challenges to Bio Big Data (2015)
- Author(s)
  Tetsuo Shibuya
- Organizer
  The 11th International Workshop on Advanced Genomics
- Place of Presentation
  Hitotsubashi Hall (Chiyoda, Tokyo)
- Year and Date
  2015-05-22
- Int'l Joint Research / Invited
