KubeDL supports running distributed training jobs on Kubernetes, including TensorFlow, PyTorch, XGBoost, Mars, and MPI jobs.
- Support different kinds of deep learning training jobs in a single controller, so you don't need to run a separate controller for each job kind.
- Expose unified Prometheus metrics for job statistics.
- Persist job metadata and events in external storage, such as MySQL or a dedicated events database, so they outlive the api-server's state.
- Sync files on container launch, so you no longer need to rebuild the image every time the code changes.
- Enable training-worker service discovery with host networking. Host networking improves worker communication performance and enables NVLink communication across containers.
- Support advanced scheduling features such as gang scheduling.
- Support attaching TensorBoard to a running or finished job.
- Support training acceleration with caching by integrating with Fluid.
- A user-friendly dashboard!
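To give a sense of what a KubeDL-managed job looks like, the manifest below sketches a two-worker TensorFlow training job. The `apiVersion`, image, and paths are illustrative assumptions, not taken from this document; consult the KubeDL API reference for the exact schema supported by your release.

```yaml
# Sketch only — field names and API version may differ across KubeDL releases.
apiVersion: training.kubedl.io/v1alpha1   # assumed API group/version
kind: TFJob
metadata:
  name: mnist-example                     # hypothetical job name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.registry/mnist:latest   # hypothetical training image
              command: ["python", "/app/train.py"]   # hypothetical entrypoint
```

Once the manifest is applied with `kubectl apply -f`, the single KubeDL controller reconciles the job the same way it would a PyTorch, XGBoost, Mars, or MPI job.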
There are two ways to install KubeDL.
- Install using Helm
- Install using YAML files
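The two installation paths above might look roughly like the following. The chart path, release name, namespace, and manifest location are assumptions for illustration; see the KubeDL installation docs for the exact commands for your version.

```
# Sketch only — paths and names below are assumptions, not from this document.

# Option 1: Helm (assumes the chart ships inside the KubeDL repository)
git clone https://github.com/kubedl-io/kubedl.git
helm install kubedl ./kubedl/helm/kubedl

# Option 2: plain YAML manifests (hypothetical manifest path)
kubectl apply -f ./kubedl/config/manifests/kubedl.yaml
```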
References for APIs, metrics, and more: Reference →