High-radix network topologies such as the Dragonfly are being widely adopted in modern high-performance computing (HPC) systems. However, Dragonfly-based HPC systems are susceptible to performance variability because applications share network resources, which can cause significant problems for those applications. Recent studies have identified application interference as a dominant cause of performance variability. Unfortunately, application interference is hard to measure on real systems. To address this problem, we use a trace-based, event-driven simulation approach in which communication traces are collected from the target machine and replayed in fine-grained simulations. We then discuss our preliminary results on validating the simulation against real hardware and on exploring performance variability via simulation.
Short Bio:
Xin Wang is a Ph.D. student in the Computer Science Department at Illinois Institute of Technology (IIT), USA. She received her M.S. degree in Computer Science from IIT. Her research area is High Performance Computing (HPC), with a focus on resource management and job scheduling. In particular, her current work focuses on alleviating performance variability on future HPC systems.