Sep 06, 2024; Colloquium
Echtzeit-AG
Distributed Gang Scheduling in Userland
In traditional HPC clusters, a compute node is usually assigned to only one application at a time, so a large share of computing resources may sit idle. A time-sharing approach, in which one node runs multiple applications, can make better use of them. Without coordinated scheduling, however, tightly coupled applications suffer, because their processes may end up in different time slices on different nodes. Distributed gang scheduling addresses this problem: it improves the performance of the entire system by scheduling the same application on all relevant nodes at the same time.
This work presents a userland implementation of a distributed gang scheduling system based on a client-server model. Schedules are distributed in advance to ensure that all nodes receive them in a timely manner. NTP is used to synchronise the clocks of all nodes.
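The idea of a schedule distributed in advance and evaluated against synchronised clocks can be illustrated with a minimal sketch. It is not the implementation presented in the talk: the `Slot` record, the `active_job` function, the slot lengths and the job names are all hypothetical and chosen only for illustration. Each node is assumed to hold the same schedule and to look up, from its own NTP-synchronised clock, which job should currently run.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Slot:
    """One entry of a hypothetical schedule: run job_id during [start, start + length)."""
    start: float   # seconds since the epoch, read from the NTP-synchronised clock
    length: float  # slot length in seconds
    job_id: str


def active_job(schedule: Sequence[Slot], now: float) -> Optional[str]:
    """Return the job whose slot contains `now`, or None if no slot does."""
    for slot in schedule:
        if slot.start <= now < slot.start + slot.length:
            return slot.job_id
    return None


# Assumption: the same schedule was sent to every node ahead of time.
schedule = [
    Slot(start=1000.0, length=0.5, job_id="A"),
    Slot(start=1000.5, length=0.5, job_id="B"),
    Slot(start=1001.0, length=0.5, job_id="A"),
]

# Two nodes whose clocks differ by 2 ms pick the same job,
# as long as neither reading falls right at a slot boundary.
node1 = active_job(schedule, now=1000.200)
node2 = active_job(schedule, now=1000.202)
```

Because every node evaluates the same schedule locally, no message has to be exchanged at the moment of a context switch; the remaining disagreement between nodes is bounded by their clock offset. How the selected job is actually resumed and the others suspended is not covered by this sketch.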
In this defence, I will outline the motivation, technical background, design and implementation of my distributed gang scheduling system, and present the results of evaluating its scalability and synchronisation on the Barnard and FFMK systems.