Outline of Annual Research Achievements |
In this fiscal year we developed a method for auto-generating GPU kernels for iterative memory-bound solvers, which commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel once per time/algorithm step. The termination of each kernel implicitly acts as the barrier required after advancing the solution at every time step. We propose an execution model for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this model, the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in otherwise unused registers and shared memory. PERKS can be generalized to any iterative solver: it is largely independent of the solver's implementation. We demonstrated the effectiveness of PERKS on a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.12x for 2D stencils and 1.24x for 3D stencils over state-of-the-art libraries) and on a Krylov subspace conjugate gradient solver (geomean speedup of 4.86x on smaller SpMV datasets from SuiteSparse and 1.43x on larger SpMV datasets over a state-of-the-art library). All PERKS-based implementations are available at: https://github.com/neozhang307/PERKS
We believe auto-generated PERKS kernels will be widely used in GPU programming in the future.
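The difference between the host-side loop and the persistent-kernel model described above can be illustrated with a small CPU-side traffic model. This is only a sketch under stated assumptions, not the PERKS implementation: the 3-point Jacobi stencil, the function names (`stencil_step`, `run_host_loop`, `run_persistent`) and the `cached_cells` parameter are all hypothetical, and the model ignores the halo exchange that a real GPU kernel would need between cached and uncached regions.

```python
# Illustrative CPU-side model only; NOT the PERKS implementation.
# All names here are hypothetical and chosen for this sketch.


def stencil_step(src, i):
    """3-point Jacobi update; the two boundary cells keep their values."""
    n = len(src)
    if i == 0 or i == n - 1:
        return src[i]
    return (src[i - 1] + src[i] + src[i + 1]) / 3.0


def run_host_loop(grid, steps):
    """Baseline: one 'kernel launch' per time step.

    Every cell is read from and written to device memory in every step.
    """
    traffic = 0  # counted in cell-sized device-memory accesses
    cur = list(grid)
    for _ in range(steps):  # time loop on the host side
        nxt = [stencil_step(cur, i) for i in range(len(cur))]
        traffic += 2 * len(cur)  # one read and one write per cell
        cur = nxt  # kernel termination acts as the implicit barrier
    return cur, traffic


def run_persistent(grid, steps, cached_cells):
    """Persistent-kernel style: the time loop lives inside the 'kernel'.

    Assumption of this sketch: `cached_cells` cells stay on chip
    (registers / shared memory) for the whole run, so they touch device
    memory only for the initial load and the final write-back.
    """
    n = len(grid)
    cached_cells = min(cached_cells, n)
    cur = list(grid)
    traffic = cached_cells  # initial load of the cached subset
    for _ in range(steps):  # time loop inside the kernel
        nxt = [stencil_step(cur, i) for i in range(n)]
        traffic += 2 * (n - cached_cells)  # only uncached cells hit memory
        cur = nxt  # a device-wide barrier would be placed here
    traffic += cached_cells  # final write-back of the cached subset
    return cur, traffic
```

Both functions compute the same result; only the counted traffic differs. For an 8-cell grid, 10 steps, and 4 cached cells, the model counts 160 accesses for the host-side loop and 88 for the persistent variant, which is the kind of reduction the caching scheme targets.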
|
Strategy for Future Research Activity |
Our plan for the next fiscal year is to incorporate our PERKS GPU kernel execution method into a polyhedral compiler toolchain. In particular, we will use the polyhedral model to auto-generate code in four steps. First, we analyze the algorithm to be implemented and express it as affine loop nests. Next, we apply polyhedral transformations to these loop nests to optimize for factors such as parallelism, locality, and vectorization. Then, we use the transformed loop nests to generate code for the target architecture, leveraging tools like the Polyhedral Compilation Infrastructure (PCI). Finally, we validate the generated code through testing and performance profiling, iterating as necessary to refine both the model and the generated code for efficiency and correctness.
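The transformation and validation steps above can be made concrete with a toy example. The sketch below is a minimal illustration and does not use any polyhedral tool: it enumerates a rectangular iteration domain, applies rectangular tiling as one representative polyhedral transformation, and checks legality against dependence distance vectors. The function names and the example distance vectors are assumptions made for this illustration.

```python
# Hypothetical illustration; not tied to any specific polyhedral tool.


def original_schedule(n, m):
    """Domain {(i, j) : 0 <= i < n, 0 <= j < m} in lexicographic order."""
    return [(i, j) for i in range(n) for j in range(m)]


def tiled_schedule(n, m, ti, tj):
    """Same iteration domain after rectangular tiling with ti x tj tiles."""
    order = []
    for ii in range(0, n, ti):  # tile loops
        for jj in range(0, m, tj):
            for i in range(ii, min(ii + ti, n)):  # point loops
                for j in range(jj, min(jj + tj, m)):
                    order.append((i, j))
    return order


def respects_dependences(order, deps):
    """Return True if every dependence source runs before its target.

    `deps` holds distance vectors (di, dj): iteration (i, j) depends on
    iteration (i - di, j - dj) whenever that point lies in the domain.
    """
    pos = {point: k for k, point in enumerate(order)}
    for (i, j) in order:
        for (di, dj) in deps:
            src = (i - di, j - dj)
            if src in pos and pos[src] >= pos[(i, j)]:
                return False
    return True
```

With distance vectors `(1, 0)` and `(0, 1)`, 2 x 2 tiling of a 4 x 4 domain passes the check, since every distance component is non-negative. With the distance vector `(1, -1)` the same tiling fails the check, which is the situation where a polyhedral compiler would first skew the loops before tiling.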
|