Software optimization by synergy of machine learning and high performance computing
Project/Area Number | 18F18786 |
Research Category | Grant-in-Aid for JSPS Fellows |
Allocation Type | Single-year Grants |
Section | Foreign |
Research Field | High performance computing |
Research Institution | The University of Tokyo |
Principal Investigator | SUDA Reiji (須田 礼仁), The University of Tokyo, Graduate School of Information Science and Technology, Professor (40251392) |
Co-Investigator (Kenkyū-buntansha) | VATAI Emil, The University of Tokyo, Graduate School of Information Science and Technology, Foreign Postdoctoral Researcher |
Project Period (FY) | 2018-11-09 – 2021-03-31 |
Project Status | Declined (Fiscal Year 2020) |
Budget Amount | ¥1,500,000 (Direct Cost: ¥1,500,000)
¥1,500,000 (Direct Cost: ¥1,500,000)
Fiscal Year 2020: ¥400,000 (Direct Cost: ¥400,000)
Fiscal Year 2019: ¥700,000 (Direct Cost: ¥700,000)
Fiscal Year 2018: ¥400,000 (Direct Cost: ¥400,000)
|
Keywords | communication-avoiding algorithms / matrix powers kernel / parallel computing / sparse matrix / SpMV / iterative methods / deep learning / dense layers / pruning / data compression |
Outline of Annual Research Achievements |
We implemented DMPK (Diamond Matrix Powers Kernel) with MPI parallelization and optimized assignment of tasks to processors. We also analyzed the amounts of communication and redundant computation for different numbers of phases and compared them with PA1 and PA2, the known matrix powers kernel methods. These results were presented at HPCAsia 2020 in Fukuoka. Matrix powers kernel (MPK) algorithms compute the vector A^k x, obtained by multiplying an initial vector x by the k-th power of a matrix A (an illustrative single-node sketch of this computation is given after this section). Our algorithm, the Diamond Matrix Powers Kernel (DMPK), generalizes the MPK algorithms PA1 and PA2 by Demmel et al. PA1 and PA2 can be used for general matrices. They improve performance by reducing the amount of communication, which is often the bottleneck, but they introduce redundant computations. In scientific computations with regular access patterns, diamond tiling algorithms achieve similar communication avoidance without introducing any redundant computation, by using moving index domains. By combining these two approaches, DMPK is applicable to general matrices and makes it possible to reduce the amount of redundant computation at the price of a slightly higher amount of communication. This is done by translating the concept of moving index domains to general matrices: the algorithm is performed in “phases”, and after each phase the graph (corresponding to the matrix) is repartitioned.
|
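As a point of reference, the following is a minimal, single-node sketch of the computation that any matrix powers kernel reorganizes: y = A^k x obtained by k repeated SpMV operations over a CSR matrix. The data layout and names are illustrative assumptions, not taken from the project's code; PA1, PA2 and DMPK differ in how they distribute and communicate exactly this computation.

```cpp
// Minimal single-node illustration of what a matrix powers kernel computes:
// y = A^k x via k repeated sparse matrix-vector products (SpMV) over a CSR matrix.
// PA1/PA2 and DMPK reorganize exactly this computation across processes so that
// the k SpMVs need fewer communication rounds. Names and layout are illustrative.
#include <cstddef>
#include <vector>

struct CSRMatrix {
    std::vector<std::size_t> row_ptr;  // size n+1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> val;           // size nnz
};

// One SpMV: y = A * x.
static void spmv(const CSRMatrix& A, const std::vector<double>& x,
                 std::vector<double>& y) {
    const std::size_t n = A.row_ptr.size() - 1;
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (std::size_t j = A.row_ptr[i]; j < A.row_ptr[i + 1]; ++j)
            sum += A.val[j] * x[A.col_idx[j]];
        y[i] = sum;
    }
}

// Naive MPK: returns A^k * x. A distributed version must communicate halo
// entries of x before every level; MPK algorithms batch or restructure that
// communication, which is where PA1, PA2 and DMPK differ.
std::vector<double> matrix_powers(const CSRMatrix& A, std::vector<double> x, int k) {
    std::vector<double> y(x.size());
    for (int level = 0; level < k; ++level) {
        spmv(A, x, y);
        x.swap(y);
    }
    return x;
}
```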
Current Status of Research Progress |
2: Research has progressed on the whole more than it was originally planned.
Reason
We are actively developing a new MPK algorithm which improves on our previous result by removing redundant computations altogether. The idea is to allow the transmission of partially computed vertices, hence the name Partial DMPK (PDMPK). The new algorithm requires more elaborate management of communication and computation patterns. It has difficulty finding new partitions for larger matrices, and it also increases the amount of communication, so we are looking for ways to reduce it. PDMPK additionally requires a mirroring algorithm and a weight update algorithm. The mirroring algorithm transfers the vertices back to their initial partitions, which allows PDMPK to be useful in the long run. The weight update algorithm is needed because METIS works with edge weights, so the vertex levels have to be transformed into edge weights (a hedged sketch of this step is given after this section). The current implementation supports easy swapping of these algorithms, but we have not yet found the optimal ones.
|
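To make the weight update step concrete, the following is a hedged sketch of the plumbing we assume: per-vertex levels (how far each vertex has already been computed) are mapped to edge weights and passed to METIS_PartGraphKway for repartitioning between phases. The particular weighting rule shown is a hypothetical example, not a rule the project has settled on, and the helper names are ours.

```cpp
// Sketch of a "weight update" step between PDMPK phases: METIS partitions
// graphs using edge weights, so per-vertex progress (the level up to which
// each vertex has been computed) has to be mapped to edge weights. The rule
// below (edges between finished vertices are cheap to cut, edges that still
// carry remote dependencies are expensive) is a hypothetical example only.
#include <metis.h>
#include <algorithm>
#include <vector>

std::vector<idx_t> repartition_by_levels(
    const std::vector<idx_t>& xadj,   // CSR-style adjacency of the graph
    const std::vector<idx_t>& adjncy,
    const std::vector<int>& level,    // level[v]: levels already computed at v
    int target_level,
    idx_t nparts) {
  idx_t nvtxs = static_cast<idx_t>(xadj.size()) - 1;
  idx_t ncon = 1;

  // Hypothetical weight rule; symmetric in the two endpoints, as METIS requires.
  std::vector<idx_t> adjwgt(adjncy.size());
  for (idx_t u = 0; u < nvtxs; ++u)
    for (idx_t e = xadj[u]; e < xadj[u + 1]; ++e) {
      idx_t v = adjncy[e];
      int remaining = 2 * target_level - level[u] - level[v];
      adjwgt[e] = static_cast<idx_t>(std::max(1, remaining));
    }

  std::vector<idx_t> part(nvtxs);
  idx_t objval = 0;
  METIS_PartGraphKway(&nvtxs, &ncon,
                      const_cast<idx_t*>(xadj.data()),
                      const_cast<idx_t*>(adjncy.data()),
                      /*vwgt=*/nullptr, /*vsize=*/nullptr, adjwgt.data(),
                      &nparts, /*tpwgts=*/nullptr, /*ubvec=*/nullptr,
                      /*options=*/nullptr, &objval, part.data());
  return part;  // part[v]: partition assigned to vertex v for the next phase
}
```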
Strategy for Future Research Activity |
We are looking for good mirroring and weight update algorithms which are able to handle all matrices. The program is also not yet optimized, and we are looking for ways to implement optimizations so it can reach state-of-the-art performance. The required optimizations include communication hiding and multi-core processing alongside the MPI communication. We already have ideas on how to implement communication hiding using asynchronous MPI calls at each level (a sketch of the intended pattern is given after this section). The multi-core optimization can follow existing methods, since the computation/execution part basically performs regular SpMV operations. For a realistic performance comparison, we plan to incorporate (P)DMPK into existing optimized libraries with linear solvers, such as PETSc. Another point of optimization is the preparation program, which needs to be run only once per matrix but performs a significant amount of computation and is currently single-threaded.
|
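The following is a sketch of the level-by-level communication hiding pattern mentioned above, using nonblocking MPI: post the halo receives and sends a level needs, compute the rows that depend only on local data while the messages are in flight, then wait and finish the boundary rows. The HaloPlan structure and the pack/compute hooks are illustrative placeholders, not the project's actual interfaces.

```cpp
// Sketch of level-by-level communication hiding with nonblocking MPI (an
// assumption about how it might be implemented, not the project's current code).
// Pattern: post halo Irecv/Isend for a level, compute rows that need only
// local data while messages are in flight, then wait and finish boundary rows.
#include <mpi.h>
#include <cstddef>
#include <vector>

struct HaloPlan {
    std::vector<int> recv_ranks, send_ranks;            // neighbour processes
    std::vector<std::vector<double>> recv_buf, send_buf;
};

// Hypothetical per-level hooks; in a real kernel these are local SpMV loops.
void pack_sends(int /*level*/, HaloPlan& /*plan*/) { /* gather outgoing halo values */ }
void compute_interior(int /*level*/) { /* SpMV on rows with no remote dependencies */ }
void compute_boundary(int /*level*/, const HaloPlan& /*plan*/) { /* SpMV on halo rows */ }

void mpk_level_overlapped(int level, HaloPlan& plan, MPI_Comm comm) {
    std::vector<MPI_Request> reqs;
    reqs.reserve(plan.recv_ranks.size() + plan.send_ranks.size());

    // 1. Post receives first so matching sends can complete as they arrive.
    for (std::size_t i = 0; i < plan.recv_ranks.size(); ++i) {
        reqs.emplace_back();
        MPI_Irecv(plan.recv_buf[i].data(), (int)plan.recv_buf[i].size(), MPI_DOUBLE,
                  plan.recv_ranks[i], /*tag=*/level, comm, &reqs.back());
    }
    // 2. Pack and post sends of the values neighbours need at this level.
    pack_sends(level, plan);
    for (std::size_t i = 0; i < plan.send_ranks.size(); ++i) {
        reqs.emplace_back();
        MPI_Isend(plan.send_buf[i].data(), (int)plan.send_buf[i].size(), MPI_DOUBLE,
                  plan.send_ranks[i], /*tag=*/level, comm, &reqs.back());
    }
    // 3. Overlap: interior rows do not touch halo data, so compute them now.
    compute_interior(level);
    // 4. Finish communication, then compute rows that read received halo values.
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    compute_boundary(level, plan);
}
```

This structure also leaves room for the planned multi-core optimization: the interior and boundary computations are ordinary local SpMV loops, so they can be threaded with existing techniques independently of the MPI layer.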
Report (2 results)
Research Products (2 results)