Introducing On-demand in LCRC: Towards a Convergence of On-demand and Batch Resource Allocation

The LCRC Pilot Project aims to explore the convergence of on-demand availability and environment management on one side with batch scheduling on the other. The project seeks to develop methods that combine on-demand computing, currently requested by our APS users, with batch computing, currently the mode of resource management available in LCRC.

Our proposed architecture divides the nodes in the cluster between two pools, an on-demand pool and a batch (on-availability) pool, and implements a mechanism that dynamically moves nodes from one pool to the other to maximize on-demand availability and resource utilization and to reduce wait time for batch jobs.
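
As a rough illustration of the two-pool idea described above, the following Python sketch moves idle nodes between an on-demand pool and a batch pool to keep a fixed reserve of idle on-demand nodes. It is not the project's Balancer: the names `Cluster` and `rebalance`, the `reserve` parameter, and the simple threshold policy are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Hypothetical model: only idle nodes are tracked, since busy nodes cannot move.
    on_demand_idle: set = field(default_factory=set)
    batch_idle: set = field(default_factory=set)


def rebalance(cluster: Cluster, reserve: int) -> int:
    """Move idle nodes so the on-demand pool holds `reserve` idle nodes if possible.

    Returns the net number of nodes moved into the on-demand pool
    (negative when nodes were released to the batch pool).
    Assumed policy for illustration only, not the Balancer's actual algorithm.
    """
    moved = 0
    # Too few idle on-demand nodes: borrow idle nodes from the batch pool.
    while len(cluster.on_demand_idle) < reserve and cluster.batch_idle:
        cluster.on_demand_idle.add(cluster.batch_idle.pop())
        moved += 1
    # Too many idle on-demand nodes: release the surplus to batch jobs.
    while len(cluster.on_demand_idle) > reserve:
        cluster.batch_idle.add(cluster.on_demand_idle.pop())
        moved -= 1
    return moved
```

Calling `rebalance` periodically, or whenever a job starts or finishes, would keep on-demand capacity available while returning surplus nodes to batch work. A real policy would also need to account for running jobs, environment setup time, and expected demand.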

The talk will describe an evaluation of the Balancer architecture developed by the project, using real APS and LCRC workload traces from the past two years. Results show that the system can maintain high utilization and reduce batch job slowdown by ~50% while still meeting the SLA for on-demand users.

Argonne Leadership Computing Facility

08/30/2016, 7am CT