Introduction
KubeDL Serving provides a group of user-friendly APIs for constructing online model inference services. It cooperates closely with the training and model stages, automating end-to-end deep learning development.
KubeDL provides the Inference CRD to accomplish this:
Inference With a Single Model
An Inference describes an expected inference service, including the adopted framework, predictor templates, autoscaling policies, and so on. The example YAML below shows an inference service that serves a single model:
apiVersion: serving.kubedl.io/v1alpha1
kind: Inference
metadata:
  name: hello-inference
spec:
  framework: TFServing
  predictors:
  - name: model-predictor
    modelVersion: model
    replicas: 3
    autoScale:
      minReplicas: 1
      maxReplicas: 10
    batching:
      batchSize: 32
    template:
      spec:
        containers:
        - name: tensorflow
          args:
          - --port=9000
          - --rest_api_port=8500
          - --model_name=mnist
          - --model_base_path=/kubedl-model/
          command:
          - /usr/bin/tensorflow_model_server
          image: tensorflow/serving:1.11.1
          imagePullPolicy: IfNotPresent
          ports:
          - containerPort: 9000
          - containerPort: 8500
          resources:
            limits:
              cpu: 2048m
              memory: 2Gi
            requests:
              cpu: 1024m
              memory: 1Gi
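Once this Inference is applied, the predictor speaks the standard TensorFlow Serving REST API on the rest_api_port configured above. Below is a minimal client sketch, assuming the predictor is exposed as a Service reachable at hello-inference:8500 and the model is named mnist as in the container args; the exact Service name and port may differ in your cluster.

import json
import urllib.request

# REST predict URL: the host "hello-inference" and port 8500 are assumptions
# taken from the example above; --model_name=mnist determines the model path segment.
url = "http://hello-inference:8500/v1/models/mnist:predict"

# One flattened 28x28 MNIST image (all zeros here, just to exercise the API).
payload = {"instances": [[0.0] * 784]}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["predictions"])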
Inference With Multiple Models
An Inference can also serve multiple models simultaneously, which is useful for serving different model versions and running A/B tests. In the example below, trafficPercentage splits traffic 90/10 between two predictors (a client-side sketch for exercising the split follows the example):
apiVersion: serving.kubedl.io/v1alpha1
kind: Inference
metadata:
  name: hello-inference
spec:
  framework: TFServing
  predictors:
  - name: model-a-predictor
    modelVersion: model-a
    replicas: 3
    trafficPercentage: 90  # 90% of traffic will be routed to this predictor.
    autoScale:
      minReplicas: 1
      maxReplicas: 10
    batching:
      batchSize: 32
    template:
      spec:
        containers:
        - name: tensorflow
          args:
          - --port=9000
          - --rest_api_port=8500
          - --model_name=mnist
          - --model_base_path=/kubedl-model/
          command:
          - /usr/bin/tensorflow_model_server
          image: tensorflow/serving:1.11.0
          imagePullPolicy: IfNotPresent
          ports:
          - containerPort: 9000
          - containerPort: 8500
          resources:
            limits:
              cpu: 2048m
              memory: 2Gi
            requests:
              cpu: 1024m
              memory: 1Gi
  - name: model-b-predictor
    modelVersion: model-b
    replicas: 3
    trafficPercentage: 10  # 10% of traffic will be routed to this predictor.
    autoScale:
      minReplicas: 1
      maxReplicas: 10
    batching:
      batchSize: 64
    template:
      spec:
        containers:
        - name: tensorflow
          args:
          - --port=9000
          - --rest_api_port=8500
          - --model_name=mnist
          - --model_base_path=/kubedl-model/
          command:
          - /usr/bin/tensorflow_model_server
          image: tensorflow/serving:1.11.1
          imagePullPolicy: IfNotPresent
          ports:
          - containerPort: 9000
          - containerPort: 8500
          resources:
            limits:
              cpu: 2048m
              memory: 2Gi
            requests:
              cpu: 1024m
              memory: 1Gi
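The trafficPercentage fields tell KubeDL how to weight traffic across the predictors behind the same endpoint, so clients keep calling a single address while roughly 90% of requests reach model-a-predictor and 10% reach model-b-predictor. Below is a rough load-generation sketch for exercising the split, again assuming the shared endpoint hello-inference:8500 from the example; which predictor answered is not part of the TF Serving response, so the split is best confirmed from per-predictor metrics or logs.

import json
import urllib.request

# Shared entrypoint for both predictors; the hostname and port are assumptions
# taken from the example spec above.
URL = "http://hello-inference:8500/v1/models/mnist:predict"
PAYLOAD = json.dumps({"instances": [[0.0] * 784]}).encode("utf-8")

def predict_once():
    # Same request every time; KubeDL's traffic weighting decides whether
    # model-a-predictor or model-b-predictor serves it.
    req = urllib.request.Request(
        URL, data=PAYLOAD, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]

# Fire 100 identical requests, then check the per-predictor replicas'
# logs or metrics to confirm the observed ~90/10 split.
for _ in range(100):
    predict_once()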