Supercomputers and their I/O systems are built to host HPC applications. Typical HPC applications issue periodic bursts of output to the file system for intermediate results, checkpointing, and visualization. If the I/O system does not absorb an application's output fast enough, the memory available for buffering that output is exhausted, forcing the computation to stall before it can emit more data. Output stalls leave precious CPU resources underutilized, extending application runtime and compromising system throughput. One way to reduce output stalls is to add more memory and disk spindles, but these hardware resources are expensive, and supercomputers are designed with a careful balance of I/O and computational capabilities. In this talk, I will discuss an output performance study on the production supercomputer Titan and its predecessor Jaguar, ranging from quantitative behavior analysis to predictive performance modeling. Specifically, I will talk about the challenges of benchmarking, profiling, and modeling the output performance of supercomputer file systems under production load, and discuss the techniques and methods I proposed to analyze the target machine according to its design, deployment, and configuration. I will also present my work on workflow management on elastic virtual infrastructure, including the challenges, opportunities, and my approach.
Biography:
Bing Xie is a Computer Science PhD candidate at Duke University. Bing's research develops performance analysis and prediction methods for understanding the output behavior of supercomputer file systems. More broadly, her research interests span operating systems, distributed systems, file systems, machine learning, high-performance computing, and workflow management. Her paper on petascale file system analysis was nominated for both Best Paper and Best Student Paper at SC12.