[Python] GPU Memory Monitoring Script - Alibaba Cloud Developer Community


2024-04-02



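Many of you probably need to monitor GPU usage in your projects. Windows Task Manager shows the GPU's current state, but it cannot save that data for later. So I wrote a Python script that monitors dedicated GPU memory usage and records it. The graphics card I use is an NVIDIA GeForce GTX 1660 Ti.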

I. Background Knowledge

*If this was helpful, please bookmark and follow!*

(Follow "测试开发自动化" by 弓中皓 for more learning material.)

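What do "dedicated GPU memory" and "shared GPU memory" in Task Manager mean?

(1) Dedicated GPU memory

There are two cases: discrete graphics cards and integrated graphics.

Discrete graphics: a standalone GPU on a PCIe card. Dedicated GPU memory is the memory built into that card. Only the GPU can use it, and it offers high bandwidth and low latency.

Integrated graphics: after memory initialization, the BIOS reserves a portion of system memory for the GPU's exclusive use.

(2) Shared GPU memory

This is memory that Windows sets aside from system memory and gives the GPU priority to use. Unlike dedicated memory on a discrete card, it lives in ordinary system RAM.

(3) GPU memory

GPU memory = dedicated GPU memory + shared GPU memory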
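As a quick worked example of the formula in (3), the sketch below adds the two figures together. All numbers are illustrative rather than measured, and the "shared limit is half of system RAM" rule is an assumption about Windows' usual default, not something stated above. `total_gpu_memory_gb` is a hypothetical helper name.

```python
def total_gpu_memory_gb(dedicated_gb, shared_gb):
    """GPU memory as Task Manager reports it: dedicated plus shared."""
    return dedicated_gb + shared_gb


dedicated_gb = 6.0               # illustrative: memory on the graphics card
system_ram_gb = 16.0             # illustrative: installed system RAM
shared_gb = system_ram_gb / 2    # assumption: shared limit is half of system RAM

print(total_gpu_memory_gb(dedicated_gb, shared_gb))  # prints 14.0
```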

II. Python Implementation

#!/usr/bin/python
# -*- coding: utf-8 -*-

import time
import pynvml
from matplotlib import pyplot as plt
import matplotlib.backends.backend_tkagg


class GPUMonitor(object):
    def __init__(self, sleep_time):
        pynvml.nvmlInit()
        self.driver_version = pynvml.nvmlSystemGetDriverVersion()
        self.GPUCounts = pynvml.nvmlDeviceGetCount()
        # One independent list per GPU ([[]] * n would alias a single list n times)
        self.GPU_counts_list = [[] for _ in range(self.GPUCounts)]
        self.time = [[] for _ in range(self.GPUCounts)]
        self.sleep_time = sleep_time  # sampling interval in seconds

    def monitor(self):
        GPUCount = 0  # only the first GPU is monitored
        n = 0  # nominal elapsed seconds (ignores time spent sampling and drawing)
        try:
            # Get the GPU handle
            handle = pynvml.nvmlDeviceGetHandleByIndex(GPUCount)
            while True:
                # Read the GPU memory info (values are in bytes)
                info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                total = info.total / 1024 ** 3
                used = info.used / 1024 ** 3
                free = info.free / 1024 ** 3
                print(self.logfile(">>>>>> Monitoring memory of GPU {} <<<<<<\n"
                                   "Script has been running for {} s\n"
                                   "Dedicated GPU memory: {:.2f} G\n"
                                   "Dedicated GPU memory used: {:.2f} G\n"
                                   "Dedicated GPU memory free: {:.2f} G\n"
                                   .format(GPUCount, n, total, used, free)))
                self.GPU_counts_list[GPUCount].append(round(used, 2))
                self.time[GPUCount].append(n)
                self.paint(self.time[GPUCount], self.GPU_counts_list[GPUCount])
                time.sleep(self.sleep_time)
                n += self.sleep_time
        except KeyboardInterrupt:
            # Ctrl+C stops the monitoring loop
            pass
        finally:
            # Runs on Ctrl+C and on errors: release NVML, then log the average and save the chart
            pynvml.nvmlShutdown()
            self.abnormal()

    def paint(self, x_list, y_list):
        plt.clf()
        plt.plot(x_list, y_list)
        plt.title("GPU Usage Monitoring")
        plt.ylabel("GPU dedicated memory /G")
        plt.xlabel("time/s")
        plt.pause(0.1)  # redraw and let the GUI process events for 0.1 s

    def logfile(self, text):
        # Prefix the message with a timestamp and append it to the log file
        t = time.strftime('%Y-%m-%d %H:%M:%S')
        text = t + " " + text + '\n'
        with open('image.log', 'a', encoding='utf-8') as f:
            f.write(text)
        return text

    def abnormal(self):
        # Log the average usage of the monitored GPU and save the chart
        samples = self.GPU_counts_list[0] if self.GPU_counts_list else []
        if not samples:
            return  # nothing was sampled, so there is nothing to summarize
        average = sum(samples) / len(samples)
        self.logfile("Average dedicated GPU memory used: {:.2f} G".format(average))
        plt.savefig("gpu_memory_usage.png")


if __name__ == "__main__":
    while True:
        times = input("Enter the monitoring interval (whole seconds > 0) and press Enter to start: ")
        if times.isdigit() and int(times) > 0:
            break
    a = GPUMonitor(int(times))
    a.monitor()
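The script above samples only the first GPU (index 0). As a minimal sketch of how one sampling pass could cover every GPU, the helper below loops over all device indices. `bytes_to_gib` and `sample_all_gpus` are hypothetical names that do not appear in the original script. The `nvml` argument is assumed to be the `pynvml` module after `nvmlInit()` has been called; it is passed in as a parameter so the function can also be exercised with a stand-in object on a machine without a GPU.

```python
def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return n_bytes / 1024 ** 3


def sample_all_gpus(nvml):
    """Return one dict of memory readings (in GiB) per GPU.

    `nvml` is assumed to be the pynvml module, already initialised with
    nvmlInit(). Only nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex and
    nvmlDeviceGetMemoryInfo are used.
    """
    readings = []
    for index in range(nvml.nvmlDeviceGetCount()):
        handle = nvml.nvmlDeviceGetHandleByIndex(index)
        info = nvml.nvmlDeviceGetMemoryInfo(handle)  # fields are in bytes
        readings.append({
            "index": index,
            "total": round(bytes_to_gib(info.total), 2),
            "used": round(bytes_to_gib(info.used), 2),
            "free": round(bytes_to_gib(info.free), 2),
        })
    return readings
```

With a real GPU this would presumably be called as `sample_all_gpus(pynvml)` somewhere between `pynvml.nvmlInit()` and `pynvml.nvmlShutdown()`, with each reading appended to the per-GPU lists that `GPUMonitor` already keeps.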

III. Usage

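(1) After you run the Python code, it prompts for the monitoring interval, that is, how many seconds to wait between samples. I use whole seconds here, but you can change this to fractional seconds if you need to. In this example I sample once per second.

(2) Once the script starts, the memory monitor begins running and prints the monitoring information to the terminal.

(3) At the same time, a live line chart of the monitoring data is displayed.

(4) To save the chart, click the save button at the bottom of the chart window from step (3). The script also saves the chart to a PNG file when it is stopped.

(5) A log file, image.log, is generated as well; it stores the monitoring information for later review.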
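Step (1) mentions that the interval could also be a non-integer number of seconds. One possible way to validate such input is sketched below. `parse_interval` is a hypothetical helper, not part of the original script; it relies on the fact that `time.sleep` accepts floating-point values.

```python
import math


def parse_interval(text):
    """Return the interval in seconds as a float, or None if the input is invalid."""
    try:
        value = float(text.strip())
    except ValueError:
        return None
    # Reject zero, negative values, NaN and infinity
    if not math.isfinite(value) or value <= 0:
        return None
    return value
```

In the input loop, the script could keep prompting until `parse_interval` returns something other than `None`, then pass that value to `GPUMonitor`. The elapsed-time counter would then accumulate floating-point values, so the printed time may show small rounding artifacts.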
