Metrics

The KubeDL operator is instrumented with Prometheus metrics.

| Metric Name | Label | Description |
| --- | --- | --- |
| kubedl_jobs_created | kind | Counts number of jobs created |
| kubedl_jobs_deleted | kind | Counts number of jobs deleted |
| kubedl_jobs_successful | kind | Counts number of jobs successfully finished |
| kubedl_jobs_failed | kind | Counts number of jobs failed |
| kubedl_jobs_restarted | kind | Counts number of jobs restarted |
| kubedl_jobs_running | kind | Counts number of jobs currently running |
| kubedl_jobs_pending | kind | Counts number of jobs currently pending |
| kubedl_jobs_first_pod_launch_delay_seconds | kind, name, namespace, uid | Histogram for recording launch delay duration (from job created to first pod running) |
| kubedl_jobs_all_pods_launch_delay_seconds | kind, name, namespace, uid | Histogram for recording launch delay duration (from job created to all pods running) |
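
As a rough illustration of how the kind-labelled job metrics might be consumed, the following Python sketch parses a snippet in the Prometheus text exposition format and derives a per-kind failure ratio. The sample values and the helper names parse_samples and failure_ratio are hypothetical and not part of KubeDL; the parser is deliberately simplified and only handles lines of the form name{labels} value.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; the numbers are made up for illustration.
SAMPLE = """\
# hypothetical sample values
kubedl_jobs_created{kind="TFJob"} 10
kubedl_jobs_created{kind="PyTorchJob"} 4
kubedl_jobs_successful{kind="TFJob"} 7
kubedl_jobs_successful{kind="PyTorchJob"} 4
kubedl_jobs_failed{kind="TFJob"} 2
kubedl_jobs_failed{kind="PyTorchJob"} 0
"""

# Simplified patterns: label values containing '}' or escaped quotes are not handled.
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return {metric_name: {kind: value}} for samples that carry a kind label."""
    out = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        m = LINE.match(line)
        if m is None:
            continue  # skip anything this simplified parser does not understand
        labels = dict(LABEL.findall(m.group("labels")))
        if "kind" in labels:
            out[m.group("name")][labels["kind"]] = float(m.group("value"))
    return out


def failure_ratio(samples, kind):
    """failed / (successful + failed) for one job kind, or None if no job has finished."""
    ok = samples["kubedl_jobs_successful"].get(kind, 0.0)
    bad = samples["kubedl_jobs_failed"].get(kind, 0.0)
    total = ok + bad
    return bad / total if total else None


samples = parse_samples(SAMPLE)
for kind in sorted(samples["kubedl_jobs_created"]):
    print(kind, failure_ratio(samples, kind))
```

In a real deployment this kind of ratio would more likely be computed by a query in Prometheus itself; the sketch only shows how the kind label partitions each metric.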

The Label column specifies the labels supported by the corresponding Prometheus metric:

  • kind - the target job kind, e.g. TFJob, PyTorchJob, MarsJob, XGBoostJob
  • name - the name of the job
  • namespace - the namespace of the job
  • uid - the uid of the job
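
To show how these labels can be used for aggregation, here is a minimal Python sketch that averages the first-pod launch delay per job kind. It relies on the standard Prometheus convention that a histogram exposes series with the _sum and _count suffixes. The job names, namespaces, uids, and values in the sample are hypothetical, and mean_delay_by_kind is an illustrative helper, not a KubeDL API.

```python
import re
from collections import defaultdict

METRIC = "kubedl_jobs_first_pod_launch_delay_seconds"

# Hypothetical samples; label values and numbers are illustrative only.
SAMPLE = """\
# hypothetical sample values
kubedl_jobs_first_pod_launch_delay_seconds_sum{kind="TFJob",name="mnist",namespace="default",uid="uid-1"} 12.5
kubedl_jobs_first_pod_launch_delay_seconds_count{kind="TFJob",name="mnist",namespace="default",uid="uid-1"} 1
kubedl_jobs_first_pod_launch_delay_seconds_sum{kind="TFJob",name="resnet",namespace="team-a",uid="uid-2"} 7.5
kubedl_jobs_first_pod_launch_delay_seconds_count{kind="TFJob",name="resnet",namespace="team-a",uid="uid-2"} 1
kubedl_jobs_first_pod_launch_delay_seconds_sum{kind="PyTorchJob",name="bert",namespace="team-a",uid="uid-3"} 30
kubedl_jobs_first_pod_launch_delay_seconds_count{kind="PyTorchJob",name="bert",namespace="team-a",uid="uid-3"} 1
"""

# Simplified patterns: only lines of the form name{labels} value are matched.
LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def mean_delay_by_kind(text, metric=METRIC):
    """Sum the _sum and _count series over all jobs of a kind and return sum / count."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m is None:
            continue  # comments, blank lines, unsupported syntax
        name, raw_labels, value = m.groups()
        kind = dict(LABEL.findall(raw_labels)).get("kind")
        if kind is None:
            continue
        if name == metric + "_sum":
            sums[kind] += float(value)
        elif name == metric + "_count":
            counts[kind] += float(value)
    return {k: sums[k] / counts[k] for k in counts if counts[k] > 0}


print(mean_delay_by_kind(SAMPLE))
```

Grouping by kind collapses the per-job name, namespace, and uid labels; grouping by namespace instead would only require reading a different key from the parsed labels.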