Introduction

KubeDL supports running distributed deep learning training jobs on Kubernetes, such as TensorFlow, PyTorch, XGBoost, Mars, and MPI jobs.
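
As an illustration, a minimal job manifest might look like the sketch below. The `apiVersion`, `kind`, and replica layout follow the Kubeflow-style job spec that KubeDL is compatible with, and the job name, image, and command are placeholders; consult the Reference section for the authoritative schema.

```yaml
# A minimal sketch of a distributed PyTorch training job managed by KubeDL.
# The apiVersion and field names below are assumptions based on the
# Kubeflow-style job spec; check the API reference for the exact schema.
apiVersion: training.kubedl.io/v1alpha1
kind: PyTorchJob
metadata:
  name: mnist-example                      # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/mnist:latest   # hypothetical training image
              command: ["python", "train.py"]
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: your-registry/mnist:latest
              command: ["python", "train.py"]
```

Once the KubeDL controller is installed, applying a manifest like this with `kubectl apply -f` should create the master and worker pods and track the job's status.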

Key Features

  • Support different kinds of deep learning training jobs in a single controller, so you don't need to run a separate controller for each job kind.
  • Expose unified Prometheus metrics for job statistics.
  • Persist job metadata and events in external storage such as MySQL or an event database, so they outlive the state kept in the api-server.
  • Sync files into the container at launch, so you no longer need to rebuild the image every time the code changes.
  • Enable service discovery for training workers running with host networking. Host networking improves worker-to-worker communication performance and supports NVLink communication across containers.
  • Support advanced scheduling features such as gang scheduling.
  • Support attaching Tensorboard to a running or finished job.
  • Accelerate training with data caching by integrating with Fluid.
  • A user-friendly dashboard!

Get started

There are two ways to install KubeDL.

Install using Helm

Go →

Install using YAML files

Go →

Reference

References for APIs, metrics, etc. Reference →