2018 Fiscal Year Annual Research Report

Solvers for Systems of Linear Equations with High Affinity for Machine Learning Hardware

Research Project

Project/Area Number	18H03248
Research Institution	Tokyo Institute of Technology
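Principal Investigator	Rio Yokota, Tokyo Institute of Technology, Global Scientific Information and Computing Center, Associate Professor (20760573)
Co-Investigator(Kenkyū-buntansha)	Akihiro Ida, The University of Tokyo, Information Technology Center, Project Associate Professor (80742121); Satoshi Ohshima, Kyushu University, Research Institute for Information Technology, Assistant Professor (40570081)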
Project Period (FY)	2018-04-01 – 2021-03-31
Keywords	H-matrix / hierarchical low-rank approximation / Tensor Cores / machine learning hardware
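Outline of Annual Research Achievements	This research optimizes fast solvers for systems of linear equations, a representative algorithm in computational science applications, for the machine-learning-oriented processors that are expected to become mainstream. We focus in particular on preconditioning for iterative linear solvers and propose hierarchical low-rank approximation as a preconditioner that is well suited to such processors. In its original form, however, hierarchical low-rank approximation cannot perform the tensor product operations that machine learning processors require. In FY2018, to identify the optimal data structure for using the Tensor Cores of Volta GPUs within hierarchical low-rank approximation, we developed C++ code that can flexibly represent all hierarchical low-rank matrix formats: H-matrices, H^2-matrices, HSS, HODLR, and BLR. By using only three classes, Dense, LowRank, and Hierarchical, and defining the operators between them, we obtained a very simple code that can flexibly accommodate the algorithms of all hierarchical low-rank approximation methods. This made it easier to search for the data structure best suited to Tensor Cores, which was the goal for FY2018. We then implemented batched QR factorization using Tensor Cores. Because the inner kernel of QR factorization contains several matrix multiplications, we implemented that part on Tensor Cores. To verify accuracy, we ran six experiments in total: three precision settings (input/output and norm computation both in single precision; input/output in half precision with norm computation in single precision; both in half precision), each with and without the matrix products executed on Tensor Cores. As a result, we found that QR factorization can be performed on a single V100 at 10 TFLOPS with accuracy comparable to single precision.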
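The three-class design described above can be illustrated with a small sketch. The code below is not the project's C++ implementation; it is a hypothetical Python toy that assumes a fixed 2x2 block partition and supports only matrix-vector products. The class names Dense, LowRank, and Hierarchical come from the report, while the method name `matvec`, the helper `dense_matvec`, and the example matrix are assumptions made for illustration.

```python
from typing import List, Union

Vector = List[float]
Matrix = List[List[float]]


def dense_matvec(a: Matrix, x: Vector) -> Vector:
    """Plain row-by-row matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


class Dense:
    """Full block stored entry by entry."""

    def __init__(self, data: Matrix) -> None:
        self.data = data
        self.rows, self.cols = len(data), len(data[0])

    def matvec(self, x: Vector) -> Vector:
        return dense_matvec(self.data, x)


class LowRank:
    """Block stored as the product U @ Vt, with U (rows x k) and Vt (k x cols)."""

    def __init__(self, u: Matrix, vt: Matrix) -> None:
        self.u, self.vt = u, vt
        self.rows, self.cols = len(u), len(vt[0])

    def matvec(self, x: Vector) -> Vector:
        # Apply Vt first so the cost is O(k * (rows + cols)), not O(rows * cols).
        return dense_matvec(self.u, dense_matvec(self.vt, x))


Block = Union[Dense, LowRank, "Hierarchical"]


class Hierarchical:
    """2x2 block matrix whose blocks are Dense, LowRank, or Hierarchical."""

    def __init__(self, blocks: List[List[Block]]) -> None:
        self.blocks = blocks
        self.rows = blocks[0][0].rows + blocks[1][0].rows
        self.cols = blocks[0][0].cols + blocks[0][1].cols

    def matvec(self, x: Vector) -> Vector:
        split = self.blocks[0][0].cols
        parts = [x[:split], x[split:]]
        y: Vector = []
        for block_row in self.blocks:
            # Every block type provides matvec(), so this loop never needs
            # to know which of the three classes it is dealing with.
            partial = [b.matvec(p) for b, p in zip(block_row, parts)]
            y.extend(s + t for s, t in zip(partial[0], partial[1]))
        return y


# HODLR-like 4x4 example (hypothetical data): dense diagonal blocks and
# rank-1 off-diagonal blocks.
d = Dense([[4.0, 1.0], [1.0, 4.0]])
lr = LowRank([[1.0], [2.0]], [[0.5, 0.5]])
a = Hierarchical([[d, lr], [lr, d]])
y = a.matvec([1.0, 1.0, 1.0, 1.0])
print(y)  # [6.0, 7.0, 6.0, 7.0]
```

In this toy, the choice of format reduces to the choice of which blocks are Dense, LowRank, or Hierarchical: replacing a diagonal Dense block with another Hierarchical object adds a level of recursion. Formats with nested bases, such as H^2 and HSS, would need shared basis objects that this sketch does not model.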
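Current Status of Research Progress	1: Research has progressed more than originally planned. Reason: Not only did we achieve the goals for FY2018, but in the process we also built, as a tool, a framework that can flexibly accommodate all hierarchical low-rank approximation algorithms. Furthermore, GPU support and MPI support for this code, and its application not only to matrix multiplication but also to LU and QR factorization, are nearly complete, which makes it functionally superior to any other code in the world. We have also already obtained notable results in the effective use of low-precision arithmetic, which was the goal for FY2019. In particular, performing QR factorization at 10 TFLOPS on a single V100 with accuracy comparable to single precision is unprecedented worldwide, and it is significant that a kernel with this performance, for what will ultimately be the hot spot of low-rank approximation computations, could be built at such an early stage.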
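Why low-precision inputs can still give near-single-precision results depends largely on where rounding happens. The following sketch is a CPU emulation in plain Python, not Tensor Core code, and it does not reproduce the experiments or the performance reported here. It assumes a simplified model in which inputs are rounded to half precision and every product and partial sum is rounded to the accumulator precision. The function names `to_half`, `to_single`, and `matmul`, and the test data, are hypothetical.

```python
import struct
from typing import Callable, List

Matrix = List[List[float]]
Rounder = Callable[[float], float]


def to_half(x: float) -> float:
    """Round to the nearest IEEE 754 binary16 (half precision) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x: float) -> float:
    """Round to the nearest IEEE 754 binary32 (single precision) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def matmul(a: Matrix, b: Matrix, round_in: Rounder, round_acc: Rounder) -> Matrix:
    """C = A @ B with inputs rounded by round_in and every product and
    partial sum rounded by round_acc (a simplified rounding model)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                prod = round_acc(round_in(a[i][p]) * round_in(b[p][j]))
                acc = round_acc(acc + prod)
            c[i][j] = acc
    return c


MODES = {
    "fp32 in, fp32 accumulate": (to_single, to_single),
    "fp16 in, fp32 accumulate": (to_half, to_single),  # Tensor Core style
    "fp16 in, fp16 accumulate": (to_half, to_half),
}

# 2048 + 1 is not representable in half precision, so a half-precision
# accumulator drops the small term while a single-precision one keeps it.
a = [[2048.0, 1.0]]
b = [[1.0], [1.0]]
results = {
    name: matmul(a, b, r_in, r_acc)[0][0] for name, (r_in, r_acc) in MODES.items()
}
for name, value in results.items():
    print(f"{name}: {value}")
# fp32 in, fp32 accumulate: 2049.0
# fp16 in, fp32 accumulate: 2049.0
# fp16 in, fp16 accumulate: 2048.0
```

In this example both inputs are exactly representable in half precision, so the only error comes from the accumulator. With general data the half-precision rounding of the inputs also contributes, which is why the accuracy of each configuration has to be measured rather than assumed.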
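Strategy for Future Research Activity	Competition in the development of machine-learning-oriented processors has become fiercer than it was at the time of the application, and we expect more and more such processors to be developed. The techniques this project has built up to make effective use of the Tensor Cores of Volta GPUs may also be applicable to these new processors, so we would like to run experiments on as many processors as possible. As one example, we are negotiating a contract that will give us access to the PEZY SC2. For the hierarchical low-rank approximation code, we plan to complete GPU support, MPI support, and the extension to LU and QR factorization at an early stage, and then move from the phase of extending functionality to the phase of performance tuning. For GPU and MPI support, we have adopted an approach that uses runtimes such as StarPU to process tasks asynchronously while analyzing the dependencies between tasks and between data. GPU and MPI support for LU and QR factorization is more difficult than for matrix multiplication, and no other group in the world has realized this combination. If these implementations are completed within this fiscal year, there is a high likelihood of producing many results that can be published in top journals and international conferences. Specifically, these include distributed-memory parallelization of H-matrix LU factorization, distributed-memory parallelization of H-matrix QR factorization, GPU implementation of H-matrix LU factorization, and GPU implementation of H-matrix QR factorization, each of which has sufficient novelty and significance.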
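The idea of letting a runtime derive task dependencies from data accesses can be sketched as follows. This is a hypothetical pure-Python illustration and does not use StarPU's API. It assumes that each task declares the data blocks it touches with an access mode of "R" (read) or "RW" (read-write), and that the order of submission defines the sequential semantics. The function names, the block names `A00` to `A11`, and the kernel names `getrf`, `trsm`, and `gemm` (borrowed from LAPACK/BLAS naming) are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

Access = Tuple[str, str]  # (data handle name, mode), mode is "R" or "RW"
Task = Tuple[str, List[Access]]


def infer_dependencies(tasks: List[Task]) -> Dict[int, Set[int]]:
    """Return, for each task index, the set of earlier tasks it must wait for."""
    last_writer: Dict[str, int] = {}
    readers_since_write: Dict[str, List[int]] = {}
    deps: Dict[int, Set[int]] = {}
    for tid, (_, accesses) in enumerate(tasks):
        deps[tid] = set()
        for handle, mode in accesses:
            if handle in last_writer:
                # Read-after-write or write-after-write dependency.
                deps[tid].add(last_writer[handle])
            if mode == "RW":
                # Write-after-read: wait for everyone still reading the old value.
                deps[tid].update(readers_since_write.get(handle, []))
                last_writer[handle] = tid
                readers_since_write[handle] = []
            else:
                readers_since_write.setdefault(handle, []).append(tid)
        deps[tid].discard(tid)  # a task never waits for itself
    return deps


def schedule_waves(deps: Dict[int, Set[int]]) -> List[List[int]]:
    """Group tasks into waves; tasks within one wave are mutually independent."""
    if not deps:
        return []
    level: Dict[int, int] = {}
    for tid in sorted(deps):  # dependencies always point to earlier tasks
        level[tid] = 1 + max((level[d] for d in deps[tid]), default=-1)
    waves: List[List[int]] = [[] for _ in range(max(level.values()) + 1)]
    for tid, lv in level.items():
        waves[lv].append(tid)
    return waves


# Task list for a 2x2 tiled LU factorization (illustrative only).
lu_tasks: List[Task] = [
    ("getrf", [("A00", "RW")]),
    ("trsm", [("A00", "R"), ("A01", "RW")]),
    ("trsm", [("A00", "R"), ("A10", "RW")]),
    ("gemm", [("A10", "R"), ("A01", "R"), ("A11", "RW")]),
    ("getrf", [("A11", "RW")]),
]
deps = infer_dependencies(lu_tasks)
waves = schedule_waves(deps)
print(waves)  # [[0], [1, 2], [3], [4]]
```

The two `trsm` tasks land in the same wave because they only read `A00` and write different blocks, so a runtime could execute them concurrently. A real runtime would start each task as soon as its own dependencies finish instead of waiting for a whole wave; the waves here only make the available parallelism visible.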

Research Products
(12 results)

Int'l Joint Research (2 results) / Journal Article (2 results) (of which Peer Reviewed: 2 results) / Presentation (8 results) (of which Int'l Joint Research: 7 results, Invited: 2 results)

[Int'l Joint Research] University of Tennessee/Sandia National Laboratories (USA)
- Country Name
  U.S.A.
- Counterpart Institution
  University of Tennessee/Sandia National Laboratories
[Int'l Joint Research] KAUST (Saudi Arabia)
- Country Name
  SAUDI ARABIA
- Counterpart Institution
  KAUST
[Journal Article] Highly Productive, High-Performance Application Frameworks for Post-Petascale Computing (2018)
- Author(s)
  N. Maruyama, T. Aoki, K. Taura, R. Yokota, M. Wahib, M. Matsuda, K. Fukuda, T. Shimokawabe, N. Onodera, M. Muller, S. Iwasaki
- Journal Title
  Advanced Software Technologies for Post-Peta Scale Computing
  Pages: 77-98
- DOI
  https://doi.org/10.1007/978-981-13-1924-2_5
- Peer Reviewed
[Journal Article] Application of hierarchical matrices to large-scale electromagnetic field analyses of coils wound with coated conductors (2018)
- Author(s)
  N. Tominaga, T. Mifune, A. Ida, Y. Sogabe, T. Iwashita, N. Amemiya
- Journal Title
  IEEE Transactions on Applied Superconductivity
  Volume: 28 Pages: 1-5
- DOI
  https://doi.org/10.1109/TASC.2017.2780821
- Peer Reviewed
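[Presentation] Batched QR Factorization Using Tensor Cores (2019)
- Author(s)
  Hiroyuki Ootomo (大友広幸), Rio Yokota
- Organizer
  81st National Convention of the Information Processing Society of Japan (IPSJ)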
[Presentation] Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU clusters (2018)
- Author(s)
  Ichitaro Yamazaki, Ahmad Abdelfattah, Akihiro Ida, Satoshi Ohshima, Stanimire Tomov, Rio Yokota, Jack Dongarra
- Organizer
  32nd IEEE International Parallel & Distributed Processing Symposium
- Int'l Joint Research
[Presentation] Optimization of Hierarchical Matrix Computation on GPU (2018)
- Author(s)
  Satoshi Ohshima, Ichitaro Yamazaki, Akihiro Ida, Rio Yokota
- Organizer
  SC Asia
- Int'l Joint Research
[Presentation] Accelerating Convolutional Neural Networks Using Low Precision Arithmetic (2018)
- Author(s)
  Hiroki Naganuma, Rio Yokota
- Organizer
  HPC Asia
- Int'l Joint Research
[Presentation] Energy Conserving Fast Multipole Methods for the Calculation of Long-range Interactions (2018)
- Author(s)
  Rio Yokota
- Organizer
  Mathematics in Action: Modeling and analysis in molecular biology and electrophysiology
- Int'l Joint Research / Invited
[Presentation] Can we use Hierarchical Low-Rank Approximation for Deep Learning? (2018)
- Author(s)
  Rio Yokota
- Organizer
  HPC Saudi
- Int'l Joint Research / Invited
[Presentation] Design of Parallel BEM Analyses Framework for SIMD Processors (2018)
- Author(s)
  Tetsuya Hoshino, Akihiro Ida, Toshihiro Hanawa, Kengo Nakajima
- Organizer
  The International Conference on Computational Science
- Int'l Joint Research
[Presentation] Lattice H-Matrices on Distributed-Memory Systems (2018)
- Author(s)
  Akihiro Ida
- Organizer
  32nd IEEE International Parallel & Distributed Processing Symposium
- Int'l Joint Research
