Introduction

Currently, KubeDL supports running TensorFlow, PyTorch, XGBoost, Mars and MPI distributed training jobs on Kubernetes.

Key Features

  • Support different kinds of deep learning training jobs in a single controller. You don’t need to run a separate controller for each job kind.
  • Expose unified Prometheus metrics for job stats.
  • Persist job metadata and events in external storage, such as MySQL or an event database, so they outlive API server state.
  • Sync files on container launch. You no longer need to rebuild the image every time the code changes.
  • Run jobs on the host network for better performance or for NVLink communication across containers.
  • Support advanced scheduling features such as gang scheduling.
  • Support TensorBoard out of the box.
  • A catchy dashboard!

Get started

There are two main ways to install KubeDL.

Install using Helm

Install KubeDL using Helm charts. Go →

Install using YAML files

Install KubeDL using YAML files. Go →

Features

Highlighted features in KubeDL. Features →

Reference

References for APIs, metrics, etc. Reference →

Contributing

Find out how to contribute to KubeDL. Contributing →

Help

Get help on KubeDL. Help →