SwFormer: Enabling Faster Foundation Models on new Sunway Supercomputer via Holistic Kernel Tiling and Scheduling
Ruohan Wu¹, Xianyu Zhu¹, Junshi Chen¹², Hong An¹²
¹School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
²Laoshan Laboratory, Qingdao, China
Published in Journal of Computer Science and Technology (JCST), 2025
Keywords: Hardware/software interfaces, Computer Systems Organization, Processor Architectures, Programming Techniques, Software Engineering
Abstract
Deep learning's continuous evolution has driven the creation of increasingly large foundation models, such as GPT-3, which require optimized performance on large-scale computing platforms. The new Sunway Supercomputer, equipped with numerous SW26010pro processors, supports AI workloads in both all-shared and single-CG modes. However, existing optimizations primarily target AI operators like Generalized Matrix Multiplication (GEMM) in the single-CG mode, leaving challenges in scaling performance across all 6 CGs in the all-shared mode. This paper introduces SwFormer, a framework designed to accelerate foundation models via intra-op tiling and inter-op scheduling. The intra-op tiling method breaks operators down into fine-grained tiled kernels and uses offline profiling to determine the optimal tiling strategy. The inter-op scheduling method uses heuristic graph traversal algorithms to automatically reorder the computation of these tiled kernels, thereby maximizing hardware utilization. Compared with operator libraries for the all-shared mode such as SWDNNv2 and SWattention, SwFormer's intra-op tiling method accelerates end-to-end training of the GPT-3 6.7B and 13B models by up to 1.27x. Evaluated on GPT-style models, the inter-op scheduling approach further outperforms the intra-op tiling method by up to 1.32x.
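The two ideas named in the abstract can be illustrated with a toy sketch. It is not SwFormer's implementation: the abstract does not specify the profiling procedure or the traversal heuristics, so everything here is an assumption. The function names `pick_tiling` and `list_schedule`, the cost table, the longest-kernel-first priority rule, and the idea of treating the 6 CGs as interchangeable execution units are all hypothetical and chosen only for illustration.

```python
import heapq

def pick_tiling(candidates, profile):
    """Offline-profiling stand-in: return the candidate tile shape with the
    lowest measured cost. `profile` is any callable mapping a tile shape to
    a cost (hypothetical; a real system would time the kernel on hardware)."""
    return min(candidates, key=profile)

def list_schedule(durations, deps, num_units=6):
    """Greedy list scheduling over a kernel dependency graph (an assumed
    heuristic, not the paper's). Kernels are visited in topological order,
    longest ready kernel first, and each goes to the unit that frees up
    earliest. Returns ({kernel: (unit, start, end)}, makespan)."""
    indegree = {k: len(deps.get(k, [])) for k in durations}
    children = {k: [] for k in durations}
    for kernel, parents in deps.items():
        for parent in parents:
            children[parent].append(kernel)

    # Max-heap on duration (negated), ties broken by kernel name.
    ready = [(-durations[k], k) for k in durations if indegree[k] == 0]
    heapq.heapify(ready)
    # Min-heap of (time the unit becomes free, unit id).
    units = [(0.0, u) for u in range(num_units)]
    heapq.heapify(units)

    finish, placement = {}, {}
    while ready:
        _, kernel = heapq.heappop(ready)
        free_at, unit = heapq.heappop(units)
        # A kernel cannot start before its unit is free or its inputs exist.
        start = max([free_at] + [finish[p] for p in deps.get(kernel, [])])
        finish[kernel] = start + durations[kernel]
        placement[kernel] = (unit, start, finish[kernel])
        heapq.heappush(units, (finish[kernel], unit))
        for child in children[kernel]:
            indegree[child] -= 1
            if indegree[child] == 0:
                heapq.heappush(ready, (-durations[child], child))

    if len(placement) != len(durations):
        raise ValueError("dependency graph contains a cycle")
    return placement, max(finish.values(), default=0.0)

# Hypothetical profiled costs (arbitrary units) for three tile shapes.
costs = {(32, 32): 1.9, (64, 64): 1.2, (128, 128): 1.5}
best_tile = pick_tiling(list(costs), costs.get)

# Two operators, each split into two tiles; b-tiles depend on a-tiles.
durations = {"a0": 2, "a1": 2, "b0": 1, "b1": 1}
deps = {"b0": ["a0"], "b1": ["a1"]}
_, makespan_parallel = list_schedule(durations, deps, num_units=2)
_, makespan_serial = list_schedule(durations, deps, num_units=1)
```

In this toy case, splitting each operator into tiles lets the second operator's tiles start as soon as their own inputs are ready, which is the kind of overlap that reordering tiled kernels is meant to expose. Real kernels on the SW26010pro would also contend for memory bandwidth and on-chip storage, which this sketch ignores.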
Recommended citation: Ruohan Wu, Xianyu Zhu, Junshi Chen, Hong An. "SwFormer: Enabling Faster Foundation Models on new Sunway Supercomputer via Holistic Kernel Tiling and Scheduling." Journal of Computer Science and Technology (JCST), 2025.
@article{wu2025swformer,
title={SwFormer: Enabling Faster Foundation Models on new Sunway Supercomputer via Holistic Kernel Tiling and Scheduling},
author={Wu, Ruohan and Zhu, Xianyu and Chen, Junshi and An, Hong},
journal={Journal of Computer Science and Technology},
year={2025},
publisher={Springer}
}
