Alibaba Cloud Product Monthly, April Issue: GPU Device-Plugin Operations (1) https://developer.aliyun.com/article/1554189
View and update the Device-Plugin version
On the target node, open /etc/kubernetes/manifests/nvidia-device-plugin.yml and look at the image tag of the device-plugin container. The version number in that tag is the Device-Plugin version.
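The lookup described above can be sketched in Python. This is a minimal illustration, not an ACK tool: the function names are hypothetical, and it assumes the manifest contains a single `image:` line whose tag follows the last colon.

```python
import re
from typing import Optional

# Matches an "image:" line, with or without a leading list dash,
# and captures the image reference that follows it.
IMAGE_LINE = re.compile(r"^\s*(?:-\s*)?image:\s*(\S+)", re.MULTILINE)

# Path taken from the article; the helpers below are hypothetical.
MANIFEST_PATH = "/etc/kubernetes/manifests/nvidia-device-plugin.yml"


def device_plugin_version(manifest_text: str) -> Optional[str]:
    """Return the image tag of the first image line, or None if absent."""
    match = IMAGE_LINE.search(manifest_text)
    if match is None:
        return None
    image = match.group(1).strip("\"'")
    # The tag follows the last colon. A slash after that colon means the
    # colon belonged to a registry port and the image has no tag.
    _repo, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        return None
    return tag


def read_device_plugin_version(path: str = MANIFEST_PATH) -> Optional[str]:
    """Read the static Pod manifest on the node and extract the version."""
    with open(path, encoding="utf-8") as fh:
        return device_plugin_version(fh.read())
```

Run on a GPU node, `read_device_plugin_version()` would return the tag string, which can then be compared with the latest version that ACK supports.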
The latest version that ACK currently supports is v0.9.1-3f942982-aliyun. To upgrade nvidia-device-plugin on a node to this version, replace the contents of its static Pod YAML file, /etc/kubernetes/manifests/nvidia-device-plugin.yml, with the following:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduler.alpha.kubernetes.io/critical-pod: ""
  labels:
    component: nvidia-device-plugin
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  priorityClassName: system-node-critical
  hostNetwork: true
  containers:
    - image: registry-<REGION-ID>-vpc.ack.aliyuncs.com/acs/k8s-device-plugin:v0.9.1-3f942982-aliyun
      # Replace <REGION-ID> in the image with the ID of the Alibaba Cloud region where your node resides, for example cn-beijing or cn-hangzhou.
      name: nvidia-device-plugin-ctr
      args: ["--fail-on-init-error=false", "--pass-device-specs=true", "--device-id-strategy=index"]
      livenessProbe:
        httpGet:
          path: /health
          port: 30080
        initialDelaySeconds: 10
        timeoutSeconds: 2
        periodSeconds: 5
        failureThreshold: 3
      resources:
        limits:
          memory: "200Mi"
          cpu: "500m"
      env:
        - name: DP_DISABLE_HEALTHCHECKS
          value: all
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: device-plugin-config
          mountPath: /etc/nvidia-device-plugin
  volumes:
    - name: device-plugin
      hostPath:
        path: /var/lib/kubelet/device-plugins
    - name: device-plugin-config
      hostPath:
        path: /etc/nvidia-device-plugin
        type: DirectoryOrCreate
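The image name in the manifest carries a region placeholder that has to be filled in before the file is used. The sketch below shows one way to do that substitution. The helper name `image_for_region` and the region-ID pattern are assumptions made for illustration. The hyphenated image template is reconstructed from the version string and the region examples (cn-beijing, cn-hangzhou) given in the article.

```python
import re

# Image template reconstructed from the article; <REGION-ID> is the placeholder.
IMAGE_TEMPLATE = (
    "registry-<REGION-ID>-vpc.ack.aliyuncs.com/acs/"
    "k8s-device-plugin:v0.9.1-3f942982-aliyun"
)

# Assumption: region IDs are lowercase alphanumeric parts joined by hyphens,
# as in cn-beijing or cn-hangzhou. This is not an official validation rule.
REGION_ID = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)+")


def image_for_region(region_id: str) -> str:
    """Return the device-plugin image name for the given region ID."""
    if not REGION_ID.fullmatch(region_id):
        raise ValueError(f"unexpected region ID format: {region_id!r}")
    return IMAGE_TEMPLATE.replace("<REGION-ID>", region_id)
```

Rejecting values that do not look like a region ID guards against writing the literal placeholder into the manifest, which would leave the kubelet unable to pull the image.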
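Related documentation
If you run into problems with GPU nodes, see Self-service diagnosis of GPU node problems and the GPU FAQ. For information about shared GPU scheduling, see Overview of shared GPU scheduling.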