Background
As we know, supporting GPU scheduling in Kubernetes requires the following work:
- Install the NVIDIA driver on each node.
- Install nvidia-docker on each node.
- Deploy the GPU device plugin in the cluster, which assigns GPU devices to pods scheduled onto a node.
In addition, if you want to monitor GPU usage across the cluster, you will probably also need to install the DCGM exporter and combine it with Prometheus to expose GPU metrics.
Installing and managing this many components puts real pressure on operations teams. For this reason, NVIDIA open-sourced a tool called NVIDIA GPU Operator. It is built on the Operator Framework and automates the management of all the components mentioned above.
NVIDIA GPU Operator consists of the following components:
- a component that installs the NVIDIA driver
- a component that installs the NVIDIA Container Toolkit
- a component that installs the NVIDIA Device Plugin
- a component that installs the NVIDIA DCGM exporter
- a component that installs GPU Feature Discovery
Rather than diving straight into NVIDIA GPU Operator, this series first analyzes the installation of each component in detail and installs them by hand; once that is done, analyzing NVIDIA GPU Operator itself becomes much simpler.
In this article, we look at how NVIDIA GPU Operator installs the GPU Feature Discovery component.
Introduction to GPU Feature Discovery
The main job of the GPU Feature Discovery component is to label GPU nodes with attributes of their GPU devices, for example which driver version the node runs and how much GPU memory it has. The result is that the node carries a set of labels prefixed with "nvidia.com":
$ kubectl get nodes cn-beijing.192.168.8.44 -o yaml | grep nvidia.com | grep -v "feature-labels"
nvidia.com/cuda.driver.major: "450"
nvidia.com/cuda.driver.minor: "102"
nvidia.com/cuda.driver.rev: "04"
nvidia.com/cuda.runtime.major: "11"
nvidia.com/cuda.runtime.minor: "0"
nvidia.com/gfd.timestamp: "1616729805"
nvidia.com/gpu.compute.major: "7"
nvidia.com/gpu.compute.minor: "0"
nvidia.com/gpu.count: "1"
nvidia.com/gpu.family: volta
nvidia.com/gpu.machine: Alibaba-Cloud-ECS
nvidia.com/gpu.memory: "16160"
nvidia.com/gpu.present: "true"
nvidia.com/gpu.product: Tesla-V100-SXM2-16GB
nvidia.com/mig.strategy: single
nvidia.com/gpu: "1"
nvidia.com/gpu: "1"
The main purpose of these labels is that, when workloads are scheduled later on, they can be placed onto specific nodes by selecting on these labels.
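For example, here is a minimal sketch of a pod that uses the nvidia.com/gpu.product label to pin itself to the V100 nodes shown above; the pod name and image are placeholders, so adapt them to your own workload.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test                        # placeholder name
spec:
  nodeSelector:
    # only schedule onto nodes whose GPU product label matches
    nvidia.com/gpu.product: Tesla-V100-SXM2-16GB
  containers:
  - name: cuda-test
    image: nvidia/cuda:11.0-base         # placeholder image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                # one GPU via the device plugin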
Deploying the GPU Feature Discovery Component in the Cluster
Next, let's walk through how to deploy the GPU Feature Discovery component in a cluster.
Prerequisites
- The Kubernetes cluster version is greater than 1.8.
- The GPU nodes in the cluster already have the GPU driver installed. If not, see the earlier article in this series, "NVIDIA GPU Operator Analysis (1): Installing the NVIDIA Driver".
- The GPU nodes in the cluster already have the NVIDIA Container Toolkit installed. If not, see "NVIDIA GPU Operator Analysis (2): Installing the NVIDIA Container Toolkit".
- The GPU nodes in the cluster already have the NVIDIA Device Plugin installed. If not, see "NVIDIA GPU Operator Analysis (3): Installing the NVIDIA Device Plugin". A quick way to check this prerequisite is shown right after this list.
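One simple sanity check for the last prerequisite is to confirm that the device plugin is already advertising the nvidia.com/gpu resource on a GPU node; the node name below is the example node used throughout this article, so substitute your own.
$ kubectl describe node cn-beijing.192.168.8.44 | grep nvidia.com/gpu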
Installation Steps
1. Download the gpu-operator source code.
$ git clone -b 1.6.2 https://github.com/NVIDIA/gpu-operator.git
$ cd gpu-operator
$ export GPU_OPERATOR=$(pwd)
2. Confirm that the GPU nodes carry the label nvidia.com/gpu.present=true.
$ kubectl get nodes -L nvidia.com/gpu.present
NAME STATUS ROLES AGE VERSION GPU.PRESENT
cn-beijing.192.168.8.44 Ready <none> 13d v1.16.9-aliyun.1 true
cn-beijing.192.168.8.45 Ready <none> 13d v1.16.9-aliyun.1 true
cn-beijing.192.168.8.46 Ready <none> 13d v1.16.9-aliyun.1 true
cn-beijing.192.168.9.159 Ready master 13d v1.16.9-aliyun.1
cn-beijing.192.168.9.160 Ready master 13d v1.16.9-aliyun.1
cn-beijing.192.168.9.161 Ready master 13d v1.16.9-aliyun.1
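If one of your GPU nodes is missing this label, you can add it by hand before continuing; substitute your own node name (the one below is from the example cluster).
$ kubectl label nodes cn-beijing.192.168.8.44 nvidia.com/gpu.present=true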
3. Delete the yaml files that are not needed.
$ rm -rf assets/gpu-feature-discovery/*openshift*
4. Modify the image in assets/gpu-feature-discovery/0500_daemonset.yaml.
- Change the image of the gpu-feature-discovery container to nvcr.io/nvidia/gpu-feature-discovery:v0.4.1.
spec:
serviceAccount: nvidia-gpu-feature-discovery
containers:
- image: "nvcr.io/nvidia/gpu-feature-discovery:v0.4.1"
name: gpu-feature-discovery
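Besides the image, note that this daemonset only schedules onto GPU nodes, which is why step 2 checked for the nvidia.com/gpu.present=true label. The relevant part of the pod template spec typically looks like the sketch below; verify it against the file in your own checkout.
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"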
5. Deploy the gpu-feature-discovery component.
$ kubectl apply -f assets/gpu-feature-discovery
6. Check whether the component's pods are in the Running state.
$ kubectl get po -n gpu-operator-resources -l app=gpu-feature-discovery
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-49css 1/1 Running 0 96s
gpu-feature-discovery-k4nlp 1/1 Running 0 96s
gpu-feature-discovery-lx46x 1/1 Running 0 96s
7. Check the logs of one of the pods.
$ kubectl logs gpu-feature-discovery-49css -n gpu-operator-resources
gpu-feature-discovery: 2021/03/31 12:58:46 Running gpu-feature-discovery in version v0.4.1
gpu-feature-discovery: 2021/03/31 12:58:46 Loaded configuration:
gpu-feature-discovery: 2021/03/31 12:58:46 Oneshot: false
gpu-feature-discovery: 2021/03/31 12:58:46 FailOnInitError: true
gpu-feature-discovery: 2021/03/31 12:58:46 SleepInterval: 1m0s
gpu-feature-discovery: 2021/03/31 12:58:46 MigStrategy: single
gpu-feature-discovery: 2021/03/31 12:58:46 NoTimestamp: false
gpu-feature-discovery: 2021/03/31 12:58:46 OutputFilePath: /etc/kubernetes/node-feature-discovery/features.d/gfd
gpu-feature-discovery: 2021/03/31 12:58:46 Start running
gpu-feature-discovery: 2021/03/31 12:58:46 Writing labels to output file
gpu-feature-discovery: 2021/03/31 12:58:46 Sleeping for 1m0s
gpu-feature-discovery: 2021/03/31 12:59:46 Writing labels to output file
gpu-feature-discovery: 2021/03/31 12:59:46 Sleeping for 1m0s
gpu-feature-discovery: 2021/03/31 13:00:46 Writing labels to output file
gpu-feature-discovery: 2021/03/31 13:00:46 Sleeping for 1m0s
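The log shows that the component runs continuously (Oneshot: false, SleepInterval: 1m0s) and periodically writes the labels to the OutputFilePath listed above rather than labeling the node in a single pass. If you have shell access to the GPU node, you can inspect that file directly; it should contain one label=value pair per line, matching the labels verified in the next step.
$ cat /etc/kubernetes/node-feature-discovery/features.d/gfd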
8. Check whether the GPU nodes have been given labels prefixed with nvidia.com.
(1) List the cluster nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
cn-beijing.192.168.8.44 Ready <none> 18d v1.16.9-aliyun.1
cn-beijing.192.168.8.45 Ready <none> 18d v1.16.9-aliyun.1
cn-beijing.192.168.8.46 Ready <none> 18d v1.16.9-aliyun.1
cn-beijing.192.168.9.159 Ready master 19d v1.16.9-aliyun.1
cn-beijing.192.168.9.160 Ready master 19d v1.16.9-aliyun.1
cn-beijing.192.168.9.161 Ready master 19d v1.16.9-aliyun.1
(2) Taking cn-beijing.192.168.8.44 as an example, look at its labels:
$ kubectl get nodes cn-beijing.192.168.8.44 -o yaml | grep -v feature-labels | grep nvidia.com
nvidia.com/cuda.driver.major: "450"
nvidia.com/cuda.driver.minor: "102"
nvidia.com/cuda.driver.rev: "04"
nvidia.com/cuda.runtime.major: "11"
nvidia.com/cuda.runtime.minor: "0"
nvidia.com/gfd.timestamp: "1616729805"
nvidia.com/gpu.compute.major: "7"
nvidia.com/gpu.compute.minor: "0"
nvidia.com/gpu.count: "1"
nvidia.com/gpu.family: volta
nvidia.com/gpu.machine: Alibaba-Cloud-ECS
nvidia.com/gpu.memory: "16160"
nvidia.com/gpu.present: "true"
nvidia.com/gpu.product: Tesla-V100-SXM2-16GB
nvidia.com/mig.strategy: single
nvidia.com/gpu: "1"
nvidia.com/gpu: "1"
As you can see, the node now carries labels describing its GPU device attributes.
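With these labels in place, selecting nodes by GPU property is just ordinary label selection. For example, to list only the nodes carrying the V100 product label shown above:
$ kubectl get nodes -l nvidia.com/gpu.product=Tesla-V100-SXM2-16GB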
Summary
This article demonstrated how to manually install the GPU Feature Discovery component in a cluster. With that, we have now manually installed every component that NVIDIA GPU Operator manages. In the next article in this series, we will analyze how NVIDIA GPU Operator is implemented from a source-code perspective.