swPredictor: A Data-Driven Performance Model for Distributed Data Parallelism Training on Large-Scale HPC Clusters

Xianyu Zhu¹, Ruohan Wu¹, Junshi Chen¹², Hong An¹²

¹School of Computer Science and Technology, University of Science and Technology of China, Hefei, China
²Laoshan Laboratory, Qingdao, China

Published in Performance Evaluation: An International Journal (PEVA), 2025


Keywords: swPredictor, FI-Net, FNO-Inception, Performance Modeling, HPC, Distributed Data Parallelism


Abstract

Given the complexity of heterogeneous architectures and multi-node collaboration, large-scale HPC (High-Performance Computing) clusters pose challenges in resource utilization and performance optimization during distributed data parallelism (DDP) training. Performance modeling aims to identify application bottlenecks and guide algorithm design, but existing performance models rarely consider the impact of system architecture on communication performance or provide a systematic analysis of distributed training. To address these issues, this paper proposes swPredictor, a data-driven performance model for accurately predicting the performance of DDP training. First, an original performance dataset is built from various communication patterns at runtime to avoid systematic errors. Next, a novel multi-branch module, FNO-Inception, is proposed; it combines an FNO (Fourier Neural Operator) layer with an Inception structure to exploit features at multiple frequencies simultaneously. Finally, a novel regression model, FI-Net, is built on the FNO-Inception module to fit complex nonlinear relationships. Experimental results show that FI-Net accurately predicts the performance of DDP training on the Sunway OceanLight supercomputer with an overall MAPE (mean absolute percentage error) of 0.93%, outperforming the baseline models.
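For readers unfamiliar with the reported metric, the following is a minimal sketch of how a MAPE value is computed from measured and predicted values. The function name `mape` and the sample numbers are illustrative assumptions and are not taken from the paper or its dataset.

```python
def mape(actual, predicted):
    """Return the mean absolute percentage error, expressed in percent.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one sample is required")

    total = 0.0
    for a, p in zip(actual, predicted):
        if a == 0:
            # The relative error is undefined when the measured value is zero.
            raise ValueError("actual values must be non-zero")
        total += abs((a - p) / a)
    return 100.0 * total / len(actual)


# Made-up per-iteration training times in seconds (not data from the paper).
measured = [2.0, 4.0, 5.0]
predicted = [2.02, 3.96, 5.05]

# Each prediction is off by 1% of its measured value, so the MAPE is about 1%.
error_percent = mape(measured, predicted)
```

A lower value means the predictions track the measurements more closely; the abstract's 0.93% corresponds to an average relative error of under one percent.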


Recommended citation: Xianyu Zhu, Ruohan Wu, Junshi Chen, and Hong An. "swPredictor: A Data-Driven Performance Model for Distributed Data Parallelism Training on Large-Scale HPC Clusters." Performance Evaluation: An International Journal (PEVA), 2025.

BibTeX
@article{zhu2025swPredictor,
  title={swPredictor: A Data-Driven Performance Model for Distributed Data Parallelism Training on Large-Scale HPC Clusters},
  author={Zhu, Xianyu and Wu, Ruohan and Chen, Junshi and An, Hong},
  journal={Performance Evaluation: An International Journal},
  year={2025},
  publisher={Elsevier}
}