Introduction

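Kubeflow is an open-source ML workflow platform from Google, built on Kubernetes. It bundles a large number of machine learning tools, such as JupyterLab for interactive experiments, Katib for hyperparameter tuning, and Argo Workflows for pipeline orchestration. As a large toolbox, Kubeflow gives machine learning developers many tools to choose from and also offers practical tooling for taking machine learning into production.

That said, because Kubeflow is driven mainly by Google, many of its choices are tightly bound to Google's own products even though it is an open-source project. For example, Google's storage tool gsutil is treated as a first-class citizen, and most image addresses point to gcr.io, Google's own container registry.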

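Installation and Deployment

If you do not have good access to networks outside China, you can refer to an open-source project of mine that maintains manifests for use in China. All images are pulled from Alibaba Cloud mirror registries, so installation is quick and easy on a Chinese network as well: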

git clone https://github.com/shikanon/kubeflow-manifests.git
cd kubeflow-manifests
python install.py

After the installation finishes, check the pods and wait until all of them are Running.

$ kubectl get po -A
NAMESPACE                   NAME                                                              READY   STATUS                  RESTARTS   AGE
auth                        dex-6686f66f9b-54s96                                              1/1     Running                 0          2h6m
cattle-system               cattle-cluster-agent-5f695c79c-x9ql7                              1/1     Running                 0          3h
cert-manager                cert-manager-9d5774b59-4xjmk                                      1/1     Running                 0          2h23m
cert-manager                cert-manager-cainjector-67c8c5c665-nmcp6                          1/1     Running                 0          2h23m
cert-manager                cert-manager-webhook-75dc9757bd-z2k5c                             1/1     Running                 1          2h23m
fleet-system                fleet-agent-7d959597cb-q8ckq                                      1/1     Running                 0          3h
istio-system                authservice-0                                                     1/1     Running                 0          2h23m
istio-system                cluster-local-gateway-66bcf8bc5d-j9kvp                            1/1     Running                 0          2h23m
istio-system                istio-ingressgateway-85b49c758f-l4hgc                             1/1     Running                 0          2h22m
istio-system                istiod-5ff6cdbbcd-2v5kj                                           1/1     Running                 0          2h23m
knative-eventing            broker-controller-5c84984b97-86zkx                                1/1     Running                 0          2h23m
knative-eventing            eventing-controller-54bfbd5446-rx9ll                              1/1     Running                 0          2h23m
knative-eventing            eventing-webhook-58f56d9cf4-bnq9q                                 1/1     Running                 0          2h23m
knative-eventing            imc-controller-769896c7db-kzjv6                                   1/1     Running                 0          2h23m
knative-eventing            imc-dispatcher-86954fb4cd-9b6gz                                   1/1     Running                 0          2h23m
knative-serving             activator-75696c8c9-9c5ff                                         1/1     Running                 0          2h23m
knative-serving             autoscaler-6764f9b5c5-2gwqj                                       1/1     Running                 0          2h23m
knative-serving             controller-598fd8bfd7-bpn5k                                       1/1     Running                 0          2h23m
knative-serving             istio-webhook-785bb58cc6-ts9f2                                    1/1     Running                 0          2h23m
knative-serving             networking-istio-77fbcfcf9b-pg26h                                 1/1     Running                 0          2h23m
knative-serving             webhook-865f54cf5f-rzpjf                                          1/1     Running                 0          2h23m
kube-system                 coredns-5644d7b6d9-hwwnr                                          1/1     Running                 0          3h1m
kube-system                 coredns-5644d7b6d9-zds92                                          1/1     Running                 0          3h1m
kube-system                 etcd-kubeflow-control-plane                                       1/1     Running                 0          3h
kube-system                 kindnet-8tvm5                                                     1/1     Running                 0          3h1m
kube-system                 kindnet-zkmkq                                                     1/1     Running                 0          3h1m
kube-system                 kube-apiserver-kubeflow-control-plane                             1/1     Running                 0          3h
kube-system                 kube-controller-manager-kubeflow-control-plane                    1/1     Running                 0          3h
kube-system                 kube-proxy-c8zn7                                                  1/1     Running                 0          3h1m
kube-system                 kube-proxy-k7b8c                                                  1/1     Running                 0          3h1m
kube-system                 kube-scheduler-kubeflow-control-plane                             1/1     Running                 0          3h
kubeflow                    admission-webhook-deployment-6fb9d65887-pzvgc                     1/1     Running                 0          2h22m
kubeflow                    cache-deployer-deployment-7558d65bf4-jhgwg                        2/2     Running                 1          2h6m
kubeflow                    cache-server-c64c68ddf-stz72                                      2/2     Running                 0          22m
kubeflow                    centraldashboard-7b7676d8bd-g2s8j                                 1/1     Running                 0          2h7m
kubeflow                    jupyter-web-app-deployment-66f74586d9-scbsm                       1/1     Running                 0          2h5m
kubeflow                    katib-controller-77675c88df-mx4rh                                 1/1     Running                 0          2h22m
kubeflow                    katib-db-manager-646695754f-z797r                                 1/1     Running                 0          2h22m
kubeflow                    katib-mysql-5bb5bd9957-gbl5t                                      1/1     Running                 0          2h22m
kubeflow                    katib-ui-55fd4bd6f9-r98r2                                         1/1     Running                 0          2h22m
kubeflow                    kfserving-controller-manager-0                                    2/2     Running                 0          2h22m
kubeflow                    kubeflow-pipelines-profile-controller-5698bf57cf-btpn5            1/1     Running                 0          22m
kubeflow                    metacontroller-0                                                  1/1     Running                 0          2h7m
kubeflow                    metadata-envoy-deployment-76d65977f7-rmlzc                        1/1     Running                 0          2h7m
kubeflow                    metadata-grpc-deployment-697d9c6c67-j6dl2                         2/2     Running                 3          2h7m
kubeflow                    metadata-writer-58cdd57678-8t6gw                                  2/2     Running                 1          2h7m
kubeflow                    minio-6d6784db95-tqs77                                            2/2     Running                 0          2h7m
kubeflow                    ml-pipeline-85fc99f899-plsz2                                      2/2     Running                 1          2h7m
kubeflow                    ml-pipeline-persistenceagent-65cb9594c7-xvn4j                     2/2     Running                 1          2h7m
kubeflow                    ml-pipeline-scheduledworkflow-7f8d8dfc69-7wfs4                    2/2     Running                 0          2h7m
kubeflow                    ml-pipeline-ui-5c765cc7bd-4r2j7                                   2/2     Running                 0          2h7m
kubeflow                    ml-pipeline-viewer-crd-5b8df7f458-5b8qg                           2/2     Running                 1          2h7m
kubeflow                    ml-pipeline-visualizationserver-56c5ff68d5-92bkf                  2/2     Running                 0          2h7m
kubeflow                    mpi-operator-789f88879-n4xms                                      1/1     Running                 0          2h22m
kubeflow                    mxnet-operator-7fff864957-vq2bg                                   1/1     Running                 0          2h22m
kubeflow                    mysql-56b554ff66-kd7bd                                            2/2     Running                 0          2h7m
kubeflow                    notebook-controller-deployment-74d9584477-qhpp8                   1/1     Running                 0          2h22m
kubeflow                    profiles-deployment-67b4666796-k7t2h                              2/2     Running                 0          2h22m
kubeflow                    pytorch-operator-fd86f7694-dxbgf                                  2/2     Running                 0          2h22m
kubeflow                    tensorboard-controller-controller-manager-fd6bcffb4-k9qvx         3/3     Running                 1          2h22m
kubeflow                    tensorboards-web-app-deployment-78d7b8b658-dktc6                  1/1     Running                 0          2h22m
kubeflow                    tf-job-operator-7bc5cf4cc7-gk8tz                                  1/1     Running                 0          2h22m
kubeflow                    volumes-web-app-deployment-68fcfc9775-bz9gq                       1/1     Running                 0          2h22m
kubeflow                    workflow-controller-5449754fb4-tdg2t                              2/2     Running                 1          22m
kubeflow                    xgboost-operator-deployment-5c7bfd57cc-9rtq6                      2/2     Running                 1          2h22m
local-path-storage          local-path-provisioner-58f6947c7-mv4mg                            1/1     Running                 0          3h1m
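
With this many pods it is easy to miss one that is still starting. The following is a minimal sketch, not part of the install script, that scans the text output of kubectl get po -A and lists every pod that is not fully ready. The function name pods_not_ready and the last two sample rows are made up for illustration.

```python
def pods_not_ready(kubectl_output: str):
    """Return (namespace, name, ready, status) for pods that are not fully running."""
    pending = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        namespace, name, ready, status = fields[:4]
        ready_count, total = ready.split("/")
        # A pod counts as ready only if it is Running and all containers are up.
        if status != "Running" or ready_count != total:
            pending.append((namespace, name, ready, status))
    return pending


# The first row is taken from the output above; the other two are hypothetical.
sample = """\
NAMESPACE   NAME                    READY   STATUS            RESTARTS   AGE
auth        dex-6686f66f9b-54s96    1/1     Running           0          2h6m
kubeflow    example-pod-a           1/2     Running           0          22m
kubeflow    example-pod-b           0/2     PodInitializing   0          44s
"""

for pod in pods_not_ready(sample):
    print(pod)
```

In practice the text could come from subprocess.run(["kubectl", "get", "po", "-A"], capture_output=True, text=True).stdout. Note that this simple check would also flag pods of finished Jobs, whose status is Completed.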

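Access Control

Kubeflow uses dex for authentication. Once Kubeflow is installed, open a browser locally and you will see the dex login form, where you enter your account and password:

[Figure: dex login page]

The account and password can be set through the dex ConfigMap: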

apiVersion: v1
data:
  config.yaml: |
    issuer: http://dex.auth.svc.cluster.local:5556/dex
    storage:
      type: kubernetes
      config:
        inCluster: true
    web:
      http: 0.0.0.0:5556
    logger:
      level: "debug"
      format: text
    oauth2:
      skipApprovalScreen: true
    enablePasswordDB: true
    staticPasswords:
    - email: "admin@example.com"
      hash: "$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W"
      username: "admin"
      userID: "08a8684b-db88-4b73-90a9-3cd1661f5466"
    staticClients:
    - idEnv: OIDC_CLIENT_ID
      redirectURIs: ["/login/oidc"]
      name: 'Dex Login Application'
      secretEnv: OIDC_CLIENT_SECRET
kind: ConfigMap
metadata:
  name: dex
  namespace: auth

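The email field is the user name we log in with, and the hash field is the bcrypt hash of the password we set. The hash can be generated with the following Python code: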

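# Requires the third-party passlib package and a bcrypt backend.
from passlib.hash import bcrypt
import getpass
# Prompt for the password without echoing it, then print its bcrypt hash.
print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))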
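
The generated value will start with $2y$12$, while the sample in the ConfigMap starts with $2a$10$. Both are bcrypt hashes in the modular crypt format and differ only in the variant identifier and the cost factor. As a rough sketch using only the standard library, the fields of such a string can be pulled apart like this (the helper name describe_bcrypt is my own, and this only checks the shape of the string, it does not verify a password):

```python
import re

# Modular crypt format for bcrypt: $<ident>$<cost>$<22-char salt><31-char digest>
BCRYPT_RE = re.compile(
    r"^\$(2[abxy]?)\$(\d{2})\$([./A-Za-z0-9]{22})([./A-Za-z0-9]{31})$"
)


def describe_bcrypt(hash_value: str) -> dict:
    """Split a bcrypt hash string into its ident, cost, salt and digest parts."""
    match = BCRYPT_RE.match(hash_value)
    if match is None:
        raise ValueError("not a well-formed bcrypt hash")
    ident, cost, salt, digest = match.groups()
    return {"ident": ident, "cost": int(cost), "salt": salt, "digest": digest}


# The hash from the ConfigMap above.
info = describe_bcrypt("$2a$10$2b2cU8CPhOTaGrs1HRQuAueS7JTT5ZHsHSzYiFPm1leZck7Mc8T4W")
print(info["ident"], info["cost"])  # prints: 2a 10
```

A quick shape check like this can catch a truncated or mis-pasted hash before it goes into the ConfigMap.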

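Component Overview

The new version of Kubeflow comes with many more features.

Here is a module-by-module look at Kubeflow's core components.

Notebook Servers: a tool for managing and recording online interactive experiments. It helps algorithm engineers run experiments quickly and provides unified document management.
AutoML: provides automated services covering feature processing, feature selection, model selection, model parameter configuration, model training, and evaluation, enabling automated model building and reducing the number of manual experiments.
Pipeline: an engineering tool for algorithm pipelines. It combines the stages of an algorithm into a topology graph and, together with Argo, can be used to implement MLOps.
Serverless: publishes a model directly as an externally accessible service, shortening the path from experiment to production.

[Figure: Kubeflow components]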

Notebook Servers

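Notebooks are arguably the most popular tool for doing machine learning, since they make full use of the interactivity of dynamic languages. Kubeflow provides Jupyter notebooks for quickly setting up an experiment environment in the cloud. Here we use a custom image as an example:

We create a notebook server named test-for-jupyter and configure it with a TensorFlow image. After clicking Launch, we can see that the application has been created in the kubeflow-user-example-com namespace: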

kubectl get po -nkubeflow-user-example-com
NAME                                               READY   STATUS            RESTARTS   AGE
ml-pipeline-ui-artifact-6d7ffcc4b6-9kxkk           2/2     Running           0          48m
ml-pipeline-visualizationserver-84d577b989-5hl46   2/2     Running           0          48m
test-for-jupyter-0                                 0/2     PodInitializing   0          44s

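Once it has been created, click CONNECT to enter the application we just created.

In the JupyterLab environment, developers can run algorithm experiments conveniently. Because the notebook runs inside the cluster, they can also use the Kubernetes API to create Kubernetes resources, for example an ML service created through KFServing.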
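
To make the KFServing idea a bit more concrete, here is a small sketch that only builds an InferenceService manifest as a Python dict and prints it. It assumes the installed KFServing serves the serving.kubeflow.org/v1beta1 API. The service name demo-model and the storage URI are hypothetical placeholders, and nothing is sent to the cluster.

```python
import json


def build_inference_service(name: str, namespace: str, storage_uri: str) -> dict:
    """Build a KFServing InferenceService manifest as a plain dict."""
    return {
        "apiVersion": "serving.kubeflow.org/v1beta1",  # assumed API version
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"predictor": {"tensorflow": {"storageUri": storage_uri}}},
    }


manifest = build_inference_service(
    name="demo-model",  # hypothetical service name
    namespace="kubeflow-user-example-com",
    storage_uri="pvc://kubeflow-test-pv/models/demo",  # hypothetical model location
)
print(json.dumps(manifest, indent=2))
```

From a notebook, a dict like this could then be submitted with the kubernetes Python client (CustomObjectsApi.create_namespaced_custom_object), provided the notebook's service account is allowed to create InferenceService objects.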

AutoML

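AutoML is a fairly hot area of machine learning, used mainly for automatic model optimization and hyperparameter tuning. Kubeflow implements it with Katib, a Kubernetes-based AutoML project; see https://github.com/kubeflow/katib for details.

Katib mainly provides hyperparameter tuning, early stopping, and neural architecture search (NAS).

Here is an example that uses the random search algorithm: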

apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow-user-example-com
  name: random-example
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    additionalMetricNames:
      - Train-accuracy
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - sgd
          - adam
          - ftrl
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: numberLayers
        description: Number of training model layers
        reference: num-layers
      - name: optimizer
        description: Training model optimizer (sgd, adam or ftrl)
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
                command:
                  - "python3"
                  - "/opt/mxnet-mnist/mnist.py"
                  - "--batch-size=64"
                  - "--lr=${trialParameters.learningRate}"
                  - "--num-layers=${trialParameters.numberLayers}"
                  - "--optimizer=${trialParameters.optimizer}"
            restartPolicy: Never

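This example uses a simple neural network whose training program takes three parameters: lr, num-layers, and optimizer. The search algorithm is random search, and the objective is to maximize accuracy.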
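
To show what random search over this space amounts to, here is a toy sketch in plain Python. It is not how Katib is implemented: each trial just draws one value per parameter from the feasible space in the Experiment above, and fake_accuracy is a made-up stand-in for actually training the model and reading Validation-accuracy.

```python
import random

# Same feasible space as the Experiment's `parameters` section.
SEARCH_SPACE = {
    "lr": ("double", 0.01, 0.03),
    "num-layers": ("int", 2, 5),
    "optimizer": ("categorical", ["sgd", "adam", "ftrl"]),
}


def sample_trial(rng: random.Random) -> dict:
    """Draw one independent random value for every parameter."""
    trial = {}
    for name, spec in SEARCH_SPACE.items():
        kind = spec[0]
        if kind == "double":
            trial[name] = rng.uniform(spec[1], spec[2])
        elif kind == "int":
            trial[name] = rng.randint(spec[1], spec[2])  # inclusive on both ends
        else:
            trial[name] = rng.choice(spec[1])
    return trial


def fake_accuracy(trial: dict) -> float:
    """Hypothetical objective; a real trial would train the model instead."""
    bonus = {"sgd": 0.00, "adam": 0.02, "ftrl": 0.01}[trial["optimizer"]]
    return 0.90 + trial["lr"] + 0.01 * trial["num-layers"] + bonus


rng = random.Random(0)
trials = [sample_trial(rng) for _ in range(12)]  # maxTrialCount: 12
best = max(trials, key=fake_accuracy)  # objective type: maximize
print(best, round(fake_accuracy(best), 4))
```

Katib additionally runs trials in parallel (parallelTrialCount) and stops once the goal is reached or too many trials fail, which this sketch leaves out.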

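You can paste the YAML directly into the UI and submit it.

When the experiment finishes, the UI shows a chart of how each parameter relates to accuracy, along with the list of trials: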

Experiments and Pipelines

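Experiments give us a workspace in which runs can be grouped, while a pipeline defines a template for combining algorithm steps. With a pipeline, the processing modules of an algorithm can be combined according to a specific topology graph.

Here are a few of the official pipeline examples: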

Kubeflow Pipelines is essentially built on Argo Workflows. Our Kubeflow runs on kind, whose container runtime is containerd, whereas the default workflow executor is docker, so some features are incompatible. See the official Argo Workflows documentation: https://argoproj.github.io/argo-workflows/workflow-executors/.
I changed the workflow's containerRuntimeExecutor to k8sapi. However, k8sapi is a second-class citizen in Argo Workflows, so some features are unavailable. For example, Kubeflow Pipelines relies on docker cp for input/output artifacts; see this issue: https://github.com/argoproj/argo-workflows/issues/2685#issuecomment-613632304
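
For readers who want to try the same change, the sketch below only prints a kubectl patch command and does not touch the cluster. The ConfigMap name, the namespace, and the assumption that containerRuntimeExecutor is a top-level key under data are mine and are not stated above. Some Argo versions keep this setting inside a nested config block instead, so check your own manifests first.

```python
import json
import shlex

# Assumptions: ConfigMap name, namespace and key layout are not given in the
# article; verify them against your own installation before running anything.
CONFIGMAP = "workflow-controller-configmap"
NAMESPACE = "kubeflow"

# Merge patch that switches the executor to k8sapi.
patch = {"data": {"containerRuntimeExecutor": "k8sapi"}}

command = [
    "kubectl", "patch", "configmap", CONFIGMAP,
    "-n", NAMESPACE,
    "--type", "merge",
    "-p", json.dumps(patch),
]
print(shlex.join(command))  # requires Python 3.8+
```

Printing the command rather than running it leaves room to review it, and to adjust the key path if your ConfigMap is laid out differently.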

For these reasons, the default examples shipped with Kubeflow, which do not use volumes, cannot run on kind. So here we write our own pipeline based on the Argo Workflows syntax.

Building a Workflow with a Pipeline

Step 1: create a workflow pipeline file:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: kubeflow-test-
spec:
  entrypoint: kubeflow-test
  templates:
  - name: kubeflow-test
    dag:
      tasks:
      - name: print-text
        template: print-text
        dependencies: [repeat-line]
      - {name: repeat-line, template: repeat-line}
  - name: repeat-line
    container:
      args: [--line, Hello, --count, '15', --output-text, /gotest/outputs/output_text/data]
      command:
      - sh
      - -ec
      - |
        program_path=$(mktemp)
        printf "%s" "$0" > "$program_path"
        python3 -u "$program_path" "$@"
      - |
        def _make_parent_dirs_and_return_path(file_path: str):
            import os
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            return file_path

        def repeat_line(line, output_text_path, count = 10):
            '''Repeat the line specified number of times'''
            with open(output_text_path, 'w') as writer:
                for i in range(count):
                    writer.write(line + '\n')

        import argparse
        _parser = argparse.ArgumentParser(prog='Repeat line', description='Repeat the line specified number of times')
        _parser.add_argument("--line", dest="line", type=str, required=True, default=argparse.SUPPRESS)
        _parser.add_argument("--count", dest="count", type=int, required=False, default=argparse.SUPPRESS)
        _parser.add_argument("--output-text", dest="output_text_path", type=_make_parent_dirs_and_return_path, required=True, default=argparse.SUPPRESS)
        _parsed_args = vars(_parser.parse_args())

        _outputs = repeat_line(**_parsed_args)
      image: python:3.7
      volumeMounts:
      - name: workdir
        mountPath: /gotest/outputs/output_text/
    volumes:
      - name: workdir
        persistentVolumeClaim:
          claimName: kubeflow-test-pv
    metadata:
      annotations: 
  - name: print-text
    container:
      args: [--text, /gotest/outputs/output_text/data]
      command:
      - sh
      - -ec
      - |
        program_path=$(mktemp)
        printf "%s" "$0" > "$program_path"
        python3 -u "$program_path" "$@"
      - |
        def print_text(text_path): # The "text" input is untyped so that any data can be printed
            '''Print text'''
            with open(text_path, 'r') as reader:
                for line in reader:
                    print(line, end = '')

        import argparse
        _parser = argparse.ArgumentParser(prog='Print text', description='Print text')
        _parser.add_argument("--text", dest="text_path", type=str, required=True, default=argparse.SUPPRESS)
        _parsed_args = vars(_parser.parse_args())

        _outputs = print_text(**_parsed_args)
      image: python:3.7
      volumeMounts:
      - name: workdir
        mountPath: /gotest/outputs/output_text/
    volumes:
      - name: workdir
        persistentVolumeClaim:
          claimName: kubeflow-test-pv
    metadata:
      annotations:

For the Argo Workflows syntax, see: https://argoproj.github.io/argo-workflows/variables/

Here we define two tasks, repeat-line and print-text. The repeat-line task writes its output to the kubeflow-test-pv PVC, and print-text reads the data back from the PVC and prints it to stdout.
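
The data flow between the two tasks can be tried out locally before touching the cluster. The sketch below is only an approximation of what the workflow does: a temporary directory stands in for the PVC mount, the two functions mirror the embedded scripts (print_text additionally returns a line count so the result can be checked), and the execution order is derived from the same dependency declared in the dag section.

```python
import tempfile
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from pathlib import Path


def repeat_line(line: str, output_text_path: Path, count: int = 10) -> None:
    """Write `line` to the output file `count` times, like the repeat-line task."""
    with open(output_text_path, "w") as writer:
        for _ in range(count):
            writer.write(line + "\n")


def print_text(text_path: Path) -> int:
    """Print the file to stdout and return the number of lines read."""
    lines = 0
    with open(text_path) as reader:
        for line in reader:
            print(line, end="")
            lines += 1
    return lines


# Task name -> tasks it depends on, mirroring the `dag.tasks` section.
dag = {"repeat-line": set(), "print-text": {"repeat-line"}}
order = list(TopologicalSorter(dag).static_order())

results = {}
with tempfile.TemporaryDirectory() as workdir:  # stands in for the PVC mount
    data_file = Path(workdir) / "data"
    steps = {
        "repeat-line": lambda: repeat_line("Hello", data_file, count=15),
        "print-text": lambda: print_text(data_file),
    }
    for task in order:  # repeat-line runs first, then print-text
        results[task] = steps[task]()
```

In the real workflow the two steps run in separate pods, which is why the shared file has to live on a volume that both of them mount.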

Because a PVC is used, we first need to create a PVC named kubeflow-test-pv in the cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kubeflow-test-pv
  namespace: kubeflow-user-example-com
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 128Mi

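Step 2: with the pipeline file defined, create the pipeline:

Step 3: start a pipeline run:

Besides the one-off mode, a run can also be started in Recurring mode, which repeats on a schedule. Choose whichever fits your needs.

View the results:

When the run finishes, the experiment can be archived (Archived).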

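Some Thoughts on MLOps

Let's look at a simple ML workflow:

This is the level 1 machine learning pipeline automation described by Google. The whole pipeline consists of the following parts:

An environment for rapid algorithm experiments (experimentation). The steps are orchestrated and the transitions between them run automatically, which makes it possible to iterate on experiments quickly and better prepares the whole pipeline for the move to production. In this environment, algorithm researchers only work inside individual modules.
A reusable production pipeline. Component source code is modularized, so the modular pipeline from the experiment environment can be used directly in the staging and production environments.
Continuous delivery of models. The production ML pipeline continuously delivers prediction services backed by new models trained on new data.

Based on the capabilities described above, a simple MLOps release pipeline can be implemented fairly easily with Kubeflow's pipeline and KFServing features. Note, however, that DevOps is not just a technology but also an engineering culture, so putting it into practice requires the teams involved to cooperate and roll it out in stages. For further reading, see "MLOps: Continuous delivery and automation pipelines in machine learning" and "Hidden Technical Debt in Machine Learning Systems".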

Playing with Kubeflow, Chapter 1: Installing Kubeflow Locally in China, with Examples
