Trying Out Alibaba Cloud Container Service for Kubernetes 1.16: A Shared TensorFlow Lab - Alibaba Cloud Developer Community

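Introduction

TensorFlow is one of the most popular open-source frameworks for deep learning and machine learning. It was originally developed by a Google research team for machine learning research on deep neural networks, and it has been widely adopted since it was open-sourced in 2015. TensorBoard in particular is a powerful tool that helps data scientists work effectively.
Jupyter Notebook is a powerful data analysis tool. It speeds up development and makes it easy to share machine learning code, which makes it a staple for data science teams running experiments and collaborating, and a good starting point for newcomers to machine learning.
Many data scientists prefer to develop TensorFlow code in Jupyter. However, building such an environment from scratch, configuring GPU access, and keeping up with the latest TensorFlow release is both complex and a waste of a data scientist's effort.
On a Kubernetes cluster you can quickly deploy a complete Jupyter Notebook environment for model development. The only problem with this approach is that each GPU is allocated exclusively, which wastes a lot of capacity. Notebook experiments usually need little GPU memory, so letting several people share one GPU would lower the cost of model development.

The Alibaba Cloud Container Service team has introduced a GPU sharing solution that greatly improves GPU utilization in model development and model inference scenarios while still keeping GPU resources isolated between workloads.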

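The exclusive-GPU approach

First, let's review how GPUs were scheduled before.

Add a new GPU node to the cluster

Create a Container Service cluster.
Add a GPU node as a worker.

In this example we choose the GPU instance type "ecs.gn6i-c4g1.xlarge".
After it is added, the new node shows up as "cn-zhangjiakou.192.168.3.189":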

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get node -L cgpu,workload_type
NAME                           STATUS   ROLES    AGE     VERSION            CGPU   WORKLOAD_TYPE
cn-zhangjiakou.192.168.0.138   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.112   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.1.113   Ready    <none>   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.115   Ready    master   11d     v1.16.6-aliyun.1
cn-zhangjiakou.192.168.3.189   Ready    <none>   5m52s   v1.16.6-aliyun.1

Deploy the application

Deploy the application with kubectl apply -f gpu_deployment.yaml. The content of gpu_deployment.yaml is as follows:

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-gpu
  labels:
    app: tf-notebook-gpu
spec:
  replicas: 2
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-gpu
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-gpu
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-gpu
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-gpu
  type: LoadBalancer

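There is only one GPU node, but the YAML file above requests two Pods, so the Pods are scheduled as shown below.
The second Pod stays Pending because no node has a free GPU to schedule it on. In other words, a single Pod "monopolizes" the node's GPU.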

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get pod
NAME                               READY   STATUS    RESTARTS   AGE
tf-notebook-2-7b4d68d8f7-mb852     1/1     Running   0          15h
tf-notebook-3-86c48d4c7d-flz7m     1/1     Running   0          15h
tf-notebook-7cf4575d78-sxmfl       1/1     Running   0          23h
tf-notebook-gpu-695cb6cf89-dsjmv   1/1     Running   0          6s
tf-notebook-gpu-695cb6cf89-mwm98   0/1     Pending   0          6s
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl describe pod tf-notebook-gpu-695cb6cf89-mwm98
Name:           tf-notebook-gpu-695cb6cf89-mwm98
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app=tf-notebook-gpu
                pod-template-hash=695cb6cf89
Annotations:    kubernetes.io/psp: ack.privileged
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/tf-notebook-gpu-695cb6cf89
Containers:
  tf-notebook:
    Image:      tensorflow/tensorflow:1.4.1-gpu-py3
    Port:       8888/TCP
    Host Port:  0/TCP
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      PASSWORD:  mypassw0rd
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-wpwn8 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-wpwn8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-wpwn8
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   <unknown>          default-scheduler   0/6 nodes are available: 6 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling   <unknown>          default-scheduler   0/6 nodes are available: 6 Insufficient nvidia.com/gpu.
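The Pending Pod above can be reproduced with a toy model. The sketch below is not the real kube-scheduler; it is a simplified first-fit placement in which every Pod asks for exactly one whole GPU, and the shortened Pod names and the helper name schedule_whole_gpus are made up for illustration.

```python
from typing import Dict, List, Optional


def schedule_whole_gpus(free_gpus: Dict[str, int], pods: List[str]) -> Dict[str, Optional[str]]:
    """Greedy first-fit placement where every Pod requests one whole GPU.

    Returns Pod name -> node name, or None if the Pod would stay Pending.
    """
    remaining = dict(free_gpus)  # copy so the caller's dict is not mutated
    placement: Dict[str, Optional[str]] = {}
    for pod in pods:
        # Pick the first node that still has at least one unallocated GPU.
        target = next((node for node, free in remaining.items() if free >= 1), None)
        if target is not None:
            remaining[target] -= 1
        placement[pod] = target
    return placement


# One GPU node with a single GPU, two replicas of the Deployment above.
result = schedule_whole_gpus(
    {"cn-zhangjiakou.192.168.3.189": 1},
    ["tf-notebook-gpu-dsjmv", "tf-notebook-gpu-mwm98"],
)
for pod, node in result.items():
    print(pod, "->", node if node else "Pending (Insufficient nvidia.com/gpu)")
```

Because nvidia.com/gpu is counted in whole devices, the second replica has nowhere to go no matter how little GPU memory the first one actually uses.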

A real program

Run the following program in Jupyter:

import argparse

import tensorflow as tf


def train(fraction=1.0):
    # Cap this process at the given fraction of the visible GPU memory.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction

    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Create a session that uses the memory cap configured above.
    sess = tf.Session(config=config)
    # Run the op in an endless loop to keep the GPU busy.
    while True:
        sess.run(c)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--total', type=float, default=1000,
                        help='Total GPU memory.')
    parser.add_argument('--allocated', type=float, default=1000,
                        help='Allocated GPU memory.')
    # parse_known_args tolerates the extra arguments a Jupyter kernel is started with.
    FLAGS, unparsed = parser.parse_known_args()
    # fraction = FLAGS.allocated / FLAGS.total * 0.85
    fraction = round(FLAGS.allocated * 0.7 / FLAGS.total, 1)

    print(fraction)  # defaults to 0.7: the program uses at most 70% of the total GPU memory
    train(fraction)

The managed Prometheus service shows that at run time the program uses 70% of the machine's GPU resources.
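The 70% figure follows directly from how the script derives fraction from its two flags. The sketch below repeats that formula without TensorFlow so it can be tried anywhere; the function name memory_fraction and the sample values 4 and 14 are illustrative assumptions, not something the original script defines.

```python
def memory_fraction(allocated: float = 1000, total: float = 1000, factor: float = 0.7) -> float:
    """Same formula as the script: round(allocated * 0.7 / total, 1)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(allocated * factor / total, 1)


# With the default flags the script caps itself at 70% of the GPU memory it sees.
print(memory_fraction())  # 0.7

# Hypothetical flags: a process given 4 units out of a 14-unit card.
print(memory_fraction(allocated=4, total=14))
```

Note that per_process_gpu_memory_fraction is relative to the memory the process can see, so the absolute amount depends on whether the Pod sees the whole card or only its share.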

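Problems with the exclusive-GPU approach

In summary, the problem with exclusive GPU scheduling is that in scenarios with low GPU demand, such as inference and teaching, multiple Pods cannot be packed onto one GPU to share it.
To solve this we introduce the GPU sharing solution, which makes better use of GPU resources and offers denser deployment, higher GPU utilization, and full isolation.

GPU sharing solution

Environment preparation

Prerequisites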

Item	Supported version
Kubernetes	1.16.6; for dedicated clusters, the master nodes must be in the customer's VPC
Helm	3.0 or later
NVIDIA driver	418.87.01 or later
Docker	19.03.5
Operating system	CentOS 7.6, CentOS 7.7, Ubuntu 16.04, and Ubuntu 18.04
Supported GPUs	Tesla P4, Tesla P100, Tesla T4, and Tesla V100 (16 GB)
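The "or later" rows in the table are easy to get wrong with plain string comparison (for example, "418.9" sorts after "418.87" as text but the components must be compared as numbers). A minimal sketch of a numeric check follows; the helper names are made up, and the installed versions passed in are sample values that you would normally read from nvidia-smi or helm version yourself.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '418.87.01' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


MIN_DRIVER = "418.87.01"  # from the prerequisites table
MIN_HELM = "3.0"          # from the prerequisites table

# Sample inputs only; replace them with the versions found on your node.
print("driver 418.87.01 ok:", meets_minimum("418.87.01", MIN_DRIVER))
print("driver 410.104 ok:", meets_minimum("410.104", MIN_DRIVER))
print("helm 3.2.1 ok:", meets_minimum("3.2.1", MIN_HELM))
```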

Create the cluster

Add a GPU node

The GPU node used in this article has the instance type ecs.gn6i-c4g1.xlarge.

Mark the node as a GPU sharing node by labeling it

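Log on to the Container Service console.
In the left-side navigation pane, choose Clusters > Nodes.
On the Nodes page, select the target cluster and click Manage Labels in the upper-right corner.
On the Manage Labels page, select the nodes in batch and click Add Label.

In the Add dialog box that appears, enter the label name and value.

Note: make sure the name is set to cgpu and the value is set to true.

Click OK.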
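The console steps above boil down to setting one label on the Node object. As a rough sketch (an assumption about an alternative path, not part of the original walkthrough), the same change can be expressed as a JSON merge patch; the helper name cgpu_label_patch is hypothetical. Label values in Kubernetes are strings, so the value is the text "true", not a boolean.

```python
import json


def cgpu_label_patch(enabled: bool = True) -> str:
    """Build a JSON merge patch that sets the cgpu label on a Node object."""
    patch = {"metadata": {"labels": {"cgpu": "true" if enabled else "false"}}}
    return json.dumps(patch)


print(cgpu_label_patch())
```

The printed document could be passed to kubectl patch node with --type merge, or the label could be set with kubectl label node using cgpu=true, assuming you have kubectl access to the cluster.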

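Install the cGPU component in the cluster

Log on to the Container Service console.
In the left-side navigation pane, choose Marketplace > App Catalog.
On the App Catalog page, find and click ack-cgpu.
In the Deploy panel on the right side of the App Catalog - ack-cgpu page, select the target cluster and click Create. You do not need to set the namespace or release name; the default values are displayed.

You can run helm get manifest cgpu -n kube-system | kubectl get -f - to check whether the cGPU component was installed. Output like the following indicates that the installation succeeded.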

# helm get manifest cgpu -n kube-system | kubectl get -f -
NAME                                    SECRETS   AGE
serviceaccount/gpushare-device-plugin   1         39s
serviceaccount/gpushare-schd-extender   1         39s
NAME                                                           AGE
clusterrole.rbac.authorization.k8s.io/gpushare-device-plugin   39s
clusterrole.rbac.authorization.k8s.io/gpushare-schd-extender   39s
NAME                                                                  AGE
clusterrolebinding.rbac.authorization.k8s.io/gpushare-device-plugin   39s
clusterrolebinding.rbac.authorization.k8s.io/gpushare-schd-extender   39s
NAME                             TYPE       CLUSTER-IP    EXTERNAL-IP   PORT(S)           AGE
service/gpushare-schd-extender   NodePort   10.6.13.125   <none>        12345:32766/TCP   39s
NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR    AGE
daemonset.apps/cgpu-installer              4         4         4       4            4           cgpu=true        39s
daemonset.apps/device-plugin-evict-ds      4         4         4       4            4           cgpu=true        39s
daemonset.apps/device-plugin-recover-ds    0         0         0       0            0           cgpu=false   39s
daemonset.apps/gpushare-device-plugin-ds   4         4         4       4            4           cgpu=true        39s
NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpushare-schd-extender   1/1     1            1           38s
NAME                           COMPLETIONS   DURATION   AGE
job.batch/gpushare-installer   3/1 of 3      3s         38s

Install arena to inspect resources

Install arena

@ linux

wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.4.0-829b0e9-linux-amd64.tar.gz
tar -xzvf arena-installer-0.4.0-829b0e9-linux-amd64.tar.gz
sh ./arena-installer/install.sh

@ mac

wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.4.0-829b0e9-darwin-amd64.tar.gz
tar -xzvf arena-installer-0.4.0-829b0e9-darwin-amd64.tar.gz
sh ./arena-installer/install.sh

Check resource status

jumper(⎈ |zjk-gpu:default)➜  ~ arena top node
NAME                          IPADDRESS      ROLE    STATUS  GPU(Total)  GPU(Allocated)  GPU(Shareable)
cn-zhangjiakou.192.168.0.138  192.168.0.138  master  ready   0           0               No
cn-zhangjiakou.192.168.1.112  192.168.1.112  master  ready   0           0               No
cn-zhangjiakou.192.168.1.113  192.168.1.113  <none>  ready   0           0               No
cn-zhangjiakou.192.168.3.115  192.168.3.115  master  ready   0           0               No
cn-zhangjiakou.192.168.3.184  192.168.3.184  <none>  ready   1           0               Yes
------------------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/1 (0%)
jumper(⎈ |zjk-gpu:default)➜  ~ arena top node -s
NAME                          IPADDRESS      GPU0(Allocated/Total)
cn-zhangjiakou.192.168.3.184  192.168.3.184  0/14
---------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node:
0/14 (GiB) (0%)

As shown above, the node cn-zhangjiakou.192.168.3.184 has one GPU and is reported as GPU(Shareable) because it carries the label cgpu=true. It exposes 14 units (GiB) of GPU memory.
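If you want to consume these numbers in a script, the report can be parsed as text. The sketch below assumes the column layout shown in the sample output above, with a single GPU column per node; arena's text format is not a stable interface, so treat the regular expression as an illustration only.

```python
import re
from typing import Dict, Tuple

SAMPLE = """\
NAME                          IPADDRESS      GPU0(Allocated/Total)
cn-zhangjiakou.192.168.3.184  192.168.3.184  0/14
---------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node:
0/14 (GiB) (0%)
"""

# node name, IPv4 address, then "allocated/total" for GPU0
ROW = re.compile(r"^(\S+)\s+(\d+\.\d+\.\d+\.\d+)\s+(\d+)/(\d+)\s*$")


def parse_gpu_memory(report: str) -> Dict[str, Tuple[int, int]]:
    """Map node name -> (allocated GiB, total GiB) for single-GPU rows."""
    usage: Dict[str, Tuple[int, int]] = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            node, _ip, allocated, total = match.groups()
            usage[node] = (int(allocated), int(total))
    return usage


usage = parse_gpu_memory(SAMPLE)
for node, (allocated, total) in usage.items():
    print(f"{node}: {total - allocated} GiB free of {total}")
```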

Run the TensorFlow GPU lab environment

Save the following as mem_deployment.yaml and deploy it with kubectl apply -f mem_deployment.yaml:

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook
  labels:
    app: tf-notebook
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook
  type: LoadBalancer

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl apply -f mem_deployment.yaml
deployment.apps/tf-notebook created
service/tf-notebook created
jumper(⎈ |zjk-gpu:default)➜  ~  kubectl get svc tf-notebook
NAME          TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
tf-notebook   LoadBalancer   172.21.2.50   39.100.193.19   80:32285/TCP   78m

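Open http://${EXTERNAL-IP}/ in a browser to reach the notebook.

Deployment configuration:

aliyun.com/gpu-mem specifies how much GPU memory the Pod requests (4 units, that is, 4 GiB in this example).

The PASSWORD environment variable sets the password for the Jupyter service. The value here is "mypassw0rd"; change it as needed.

To verify that this Jupyter instance can use the GPU, run the following program. It lists all devices available to TensorFlow.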

from tensorflow.python.client import device_lib

def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())

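The output shows that the available device is GPU:0.

Create a new terminal from the Jupyter home page.

Run nvidia-smi.

It shows that the GPU memory limit inside the Pod is 4308MiB.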

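Verify GPU sharing

The section above shows that the new resource "aliyun.com/gpu-mem: 4" can be requested normally and that GPU tasks run with it. Next, let's look at how the GPU is shared.

Check resource usage

First, the current usage as reported by arena top node -s -d: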

jumper(⎈ |zjk-gpu:default)➜  ~ arena top node -s -d

NAME:       cn-zhangjiakou.192.168.3.184
IPADDRESS:  192.168.3.184

NAME                            NAMESPACE  GPU0(Allocated)
tf-notebook-2-7b4d68d8f7-wxlff  default    4
tf-notebook-3-86c48d4c7d-lk9h8  default    4
tf-notebook-7cf4575d78-9gxzd    default    4
Allocated :                     12 (85%)
Total :                         14
--------------------------------------------------------------------------------------------------------------------------------------


Allocated/Total GPU Memory In GPUShare Node:
12/14 (GiB) (85%)

As shown above, the node has 14 units (GiB) of GPU memory, so it can schedule three Pods that each request 4.
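The arithmetic behind "three Pods" is a simple packing by requested memory. The following sketch is a toy first-fit model on a single GPU, not the actual scheduler extender; the function name pack_by_memory is made up for illustration.

```python
from typing import List, Tuple


def pack_by_memory(total_gib: int, requests_gib: List[int]) -> Tuple[List[int], List[int]]:
    """First-fit on one GPU: return (admitted requests, rejected requests)."""
    free = total_gib
    admitted: List[int] = []
    rejected: List[int] = []
    for request in requests_gib:
        if request <= free:
            free -= request
            admitted.append(request)
        else:
            rejected.append(request)
    return admitted, rejected


# Four Pods asking for 4 GiB each on the 14 GiB card from the report above.
admitted, rejected = pack_by_memory(14, [4, 4, 4, 4])
print("admitted:", admitted)
print("rejected:", rejected)
print("allocated:", sum(admitted), "of 14 GiB")
print("allocated percent (truncated):", sum(admitted) * 100 // 14)
```

Three requests fit and use 12 of 14 GiB, which matches the 85% shown by arena; a fourth 4 GiB request would not fit in the remaining 2 GiB.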

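Deploy more services and replicas

So that each notebook has its own entry point, we create three Services that point to three Pods. The YAML files are as follows.
Note: mem_deployment-2.yaml and mem_deployment-3.yaml are almost identical to mem_deployment.yaml; each one just points a different Service at a different Pod.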

mem_deployment-2.yaml

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-2
  labels:
    app: tf-notebook-2
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-2
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-2
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-2
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-2
  type: LoadBalancer

mem_deployment-3.yaml

---
# Define the tensorflow deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tf-notebook-3
  labels:
    app: tf-notebook-3
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tf-notebook-3
  template: # define the pods specifications
    metadata:
      labels:
        app: tf-notebook-3
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: 4
          requests:
            aliyun.com/gpu-mem: 4
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd

# Define the tensorflow service
---
apiVersion: v1
kind: Service
metadata:
  name: tf-notebook-3
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: tf-notebook-3
  type: LoadBalancer
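Since the three manifests differ only in the resource name, they could also be generated from a single template instead of being copied by hand. The sketch below uses Python's string.Template for that; the render helper and the idea of generating the files are assumptions for illustration, and the template simply mirrors the YAML shown above.

```python
from string import Template

# Only the fields that differ between the three files are parameterised.
MANIFEST = Template("""\
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
  labels:
    app: $name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: $name
  template:
    metadata:
      labels:
        app: $name
    spec:
      containers:
      - name: tf-notebook
        image: tensorflow/tensorflow:1.4.1-gpu-py3
        resources:
          limits:
            aliyun.com/gpu-mem: $gpu_mem
          requests:
            aliyun.com/gpu-mem: $gpu_mem
        ports:
        - containerPort: 8888
        env:
          - name: PASSWORD
            value: mypassw0rd
---
apiVersion: v1
kind: Service
metadata:
  name: $name
spec:
  ports:
  - port: 80
    targetPort: 8888
    name: jupyter
  selector:
    app: $name
  type: LoadBalancer
""")


def render(name: str, gpu_mem: int = 4) -> str:
    """Fill in the Deployment/Service name and the GPU memory request."""
    return MANIFEST.substitute(name=name, gpu_mem=gpu_mem)


files = {
    "mem_deployment.yaml": "tf-notebook",
    "mem_deployment-2.yaml": "tf-notebook-2",
    "mem_deployment-3.yaml": "tf-notebook-3",
}
for filename, name in files.items():
    text = render(name)
    print(filename, "->", name, f"({len(text.splitlines())} lines)")
```

Writing each rendered string to its file and running kubectl apply on it would give the same three Deployments and Services as the hand-written files.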

Apply the two YAML files. Together with the Pod and Service deployed earlier, the cluster now runs three Pods and three Services.

jumper(⎈ |zjk-gpu:default)➜  ~ kubectl apply -f mem_deployment-2.yaml
deployment.apps/tf-notebook-2 created
service/tf-notebook-2 created
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl apply -f mem_deployment-3.yaml
deployment.apps/tf-notebook-3 created
service/tf-notebook-3 created
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get svc
NAME            TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
kubernetes      ClusterIP      172.21.0.1    <none>          443/TCP        11d
tf-notebook     LoadBalancer   172.21.2.50   39.100.193.19   80:32285/TCP   7h48m
tf-notebook-2   LoadBalancer   172.21.1.46   39.99.218.255   80:30659/TCP   8m53s
tf-notebook-3   LoadBalancer   172.21.8.56   39.98.242.180   80:31274/TCP   7s
jumper(⎈ |zjk-gpu:default)➜  ~ kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE                           NOMINATED NODE   READINESS GATES
tf-notebook-2-7b4d68d8f7-mb852   1/1     Running   0          9m6s    172.20.64.21   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-3-86c48d4c7d-flz7m   1/1     Running   0          20s     172.20.64.22   cn-zhangjiakou.192.168.3.184   <none>           <none>
tf-notebook-7cf4575d78-sxmfl     1/1     Running   0          7h49m   172.20.64.14   cn-zhangjiakou.192.168.3.184   <none>           <none>
jumper(⎈ |zjk-gpu:default)➜  ~ arena top node -s
NAME                          IPADDRESS      GPU0(Allocated/Total)
cn-zhangjiakou.192.168.3.184  192.168.3.184  12/14
----------------------------------------------------------------------
Allocated/Total GPU Memory In GPUShare Node:
12/14 (GiB) (85%)

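Check the final result

As shown above:
kubectl get pod -o wide shows three Pods running on the node cn-zhangjiakou.192.168.3.184.
arena top node -s shows that 12 of the 14 units of GPU memory on cn-zhangjiakou.192.168.3.184 are allocated.
Opening a terminal in each service and running nvidia-smi shows that every Pod has a limit of 4308MiB.

Run the following command on the node cn-zhangjiakou.192.168.3.184 to check the resources on the node itself: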

[root@iZ8vb4lox93w3mhkqmdrgsZ ~]# nvidia-smi
Wed May 27 12:19:25 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   49C    P0    29W /  70W |   4019MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11563      C   /usr/bin/python3                            4009MiB |
+-----------------------------------------------------------------------------+

This shows that with cGPU, more GPU-consuming Pods can be deployed on the same node, whereas with regular scheduling one GPU node can host only one such Pod.
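As a quick sanity check on the numbers reported by nvidia-smi, the three per-Pod limits can be added up and compared with the card's total. This is plain arithmetic on the figures quoted in this article (4308MiB per Pod, 15079MiB on the Tesla T4); the variable names are made up.

```python
MIB_PER_GIB = 1024

pod_limit_mib = 4308    # limit reported by nvidia-smi inside each Pod
host_total_mib = 15079  # total reported by nvidia-smi on the node
pods = 3


def to_gib(mib: float) -> float:
    """Convert MiB to GiB."""
    return mib / MIB_PER_GIB


combined_mib = pods * pod_limit_mib
print(f"per-Pod limit: {to_gib(pod_limit_mib):.2f} GiB")
print(f"combined limits: {combined_mib} MiB of {host_total_mib} MiB")
print(f"left on the card: {host_total_mib - combined_mib} MiB")
```

The combined limits stay below the physical memory of the card, which is why all three Pods can run side by side.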

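A real program

Below is a piece of code that runs continuously and keeps using GPU resources. The fraction parameter is the share of the available GPU memory that the program requests, 0.7 by default. We run it in Jupyter in all three Pods.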

import argparse

import tensorflow as tf


def train(fraction=1.0):
    # Cap this process at the given fraction of the visible GPU memory.
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = fraction

    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
    # Create a session that uses the memory cap configured above.
    sess = tf.Session(config=config)
    # Run the op in an endless loop to keep the GPU busy.
    while True:
        sess.run(c)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--total', type=float, default=1000,
                        help='Total GPU memory.')
    parser.add_argument('--allocated', type=float, default=1000,
                        help='Allocated GPU memory.')
    # parse_known_args tolerates the extra arguments a Jupyter kernel is started with.
    FLAGS, unparsed = parser.parse_known_args()
    # fraction = FLAGS.allocated / FLAGS.total * 0.85
    fraction = round(FLAGS.allocated * 0.7 / FLAGS.total, 1)

    print(fraction)  # defaults to 0.7: the program uses at most 70% of the total GPU memory
    train(fraction)

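Then observe the actual resource usage with the managed Prometheus service.

The monitoring chart shows that each Pod actually uses 3.266 GB of GPU memory; that is, each Pod's GPU memory usage stays within its limit of 4.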

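Summary

To summarize:

Label a node with cgpu: true to turn it into a GPU sharing node.
In the Pod, use a resource such as aliyun.com/gpu-mem: 4 to request and cap the GPU memory a single Pod may use, which is what makes GPU sharing possible. Each Pod still gets full GPU capability, while the node's single GPU is shared by three Pods, so utilization rises to 300% of the exclusive setup; splitting the memory into smaller shares allows even higher utilization.
Use arena top node and arena top node -s to view how GPU resources are allocated.
The "GPU APP" dashboard of the managed Prometheus service shows the GPU memory, GPU utilization, temperature, power, and other metrics observed at run time.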

References

Managed Prometheus https://help.aliyun.com/document_detail/122123.html
cGPU GPU sharing solution https://help.aliyun.com/document_detail/163994.html
arena https://github.com/kubeflow/arena
