Gang scheduling is a critical feature for deep learning workloads because it provides all-or-nothing scheduling: most DL frameworks require all workers to be running before the training process can start. Without it, a partially scheduled job holds resources while it waits for its remaining workers, which wastes capacity and can lead to scheduling deadlock.
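The all-or-nothing behavior described above can be illustrated with a small simulation. This is a minimal sketch, not KubeDL or scheduler code: the `Job` class, the two scheduling functions, and the notion of interchangeable "slots" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    workers: int     # workers required before training can start
    placed: int = 0  # workers currently holding a slot

    def can_start(self) -> bool:
        return self.placed == self.workers


def per_pod_schedule(jobs: List[Job], free_slots: int) -> int:
    """Place workers one at a time, round-robin across jobs (no gang awareness)."""
    progress = True
    while free_slots > 0 and progress:
        progress = False
        for job in jobs:
            if free_slots > 0 and job.placed < job.workers:
                job.placed += 1
                free_slots -= 1
                progress = True
    return free_slots


def gang_schedule(jobs: List[Job], free_slots: int) -> int:
    """Place a job only if all of its workers fit at once; otherwise place none."""
    for job in jobs:
        if job.workers <= free_slots:
            job.placed = job.workers
            free_slots -= job.workers
    return free_slots


# Two jobs that each need 4 workers compete for 6 slots.
a, b = Job("a", 4), Job("b", 4)
per_pod_schedule([a, b], 6)
print(a.placed, b.placed, a.can_start(), b.can_start())  # 3 3 False False

a, b = Job("a", 4), Job("b", 4)
gang_schedule([a, b], 6)
print(a.placed, b.placed, a.can_start(), b.can_start())  # 4 0 True False
```

In the per-pod case both jobs hold three slots and neither can start, which is the deadlock scenario. In the gang case the first job starts and the second waits without holding any resources.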
KubeDL supports gang scheduling with different schedulers as the backend. Today, several Kubernetes schedulers support gang scheduling, including the Coscheduling scheduler plugin, YuniKorn, Volcano, and KubeBatch. Each has its own advantages and its own API protocols.
KubeDL provides a plugin framework so that different schedulers can serve as the backend. Currently, KubeDL supports kube-coscheduler (popular on Alibaba Cloud), volcano (a batch system under CNCF), and kube-batch.
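One way to picture a plugin framework of this kind is a registry that maps a scheduler name to a backend implementing a shared interface. The sketch below is an assumption for illustration only: KubeDL itself is written in Go, and the `GangScheduler` class, `register` decorator, `select_gang_scheduler` function, and the returned dictionary are hypothetical names, not KubeDL's API. Only the three backend names come from this document.

```python
from typing import Dict, Optional, Type


class GangScheduler:
    """Hypothetical plugin interface; not KubeDL's actual Go API."""

    name = ""

    def create_gang(self, job_name: str, min_members: int) -> dict:
        # A real backend would create its own API object here. This
        # placeholder only records which backend was asked to do so.
        return {"backend": self.name, "job": job_name, "min_members": min_members}


REGISTRY: Dict[str, Type[GangScheduler]] = {}


def register(cls: Type[GangScheduler]) -> Type[GangScheduler]:
    """Class decorator that makes a backend selectable by its name."""
    REGISTRY[cls.name] = cls
    return cls


@register
class KubeCoscheduler(GangScheduler):
    name = "kube-coscheduler"


@register
class Volcano(GangScheduler):
    name = "volcano"


@register
class KubeBatch(GangScheduler):
    name = "kube-batch"


def select_gang_scheduler(configured_name: str) -> Optional[GangScheduler]:
    """Return the configured backend, or None when no name is configured."""
    if not configured_name:
        return None  # an empty name is treated as "gang scheduling disabled"
    if configured_name not in REGISTRY:
        raise ValueError(f"unknown gang scheduler: {configured_name!r}")
    return REGISTRY[configured_name]()
```

With this shape, adding a backend means registering one more class, and the rest of the controller only depends on the shared interface.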
How to Enable
Enable gang scheduling using the KubeDL controller startup flag
By default, the flag is empty, meaning gang scheduling is not enabled. Supported values are