CUDA Asynchronous Concurrency (Part 2)


I still felt unable to understand CUDA's asynchronous concurrency in depth, so on an expert's advice I read the CUDA C Programming Guide, which helped me a great deal.
Below I share my takeaways, with the original text attached:
Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task. Using asynchronous calls, many device operations can be queued up together to be executed by the CUDA driver when appropriate device resources are available. This relieves the host thread of much of the responsibility to manage the device, leaving it free for other tasks.
With an asynchronous call, the CPU is not blocked or waiting while the GPU runs; it can keep executing other parts of the program.
The following device operations are asynchronous with respect to the host:
‣ Kernel launches;
‣ Memory copies within a single device’s memory;
‣ Memory copies from host to device of a memory block of 64 KB or less;
‣ Memory copies performed by functions that are suffixed with Async;
‣ Memory set function calls.
In other words, the following device operations return control to the host immediately (for the first of these, see the sketch below):
1. kernel launches;
2. memory copies within a single device's memory;
3. host-to-device copies of memory blocks of 64 KB or less;
4. memory copies performed by functions suffixed with Async;
5. memory set function calls.
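For instance, here is a minimal sketch (not from the guide; the kernel and sizes are invented for illustration) of the first point: the kernel launch returns immediately, so the CPU can do other work while the GPU computes.

#include <cuda_runtime.h>

// Placeholder kernel, assumed for illustration only.
__global__ void MyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* devData;
    cudaMalloc(&devData, n * sizeof(float));

    // The launch is asynchronous: control returns to the host at once.
    MyKernel<<<(n + 255) / 256, 256>>>(devData, n);

    // ... the host thread is free to do unrelated work here ...

    cudaDeviceSynchronize();   // block only when the result is actually needed
    cudaFree(devData);
    return 0;
}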

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
A programmer can globally disable asynchronous kernel launches by setting the CUDA_LAUNCH_BLOCKING environment variable to 1; this is intended for debugging only.

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
In other words, when a profiler (Nsight, Visual Profiler) is collecting hardware counters, kernel launches fall back to synchronous behavior unless concurrent kernel profiling is enabled; likewise, an Async memory copy degrades to a synchronous one if the host memory involved is not page-locked.

Some devices of compute capability 2.x and higher can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property (see Device Enumeration), which is equal to 1 for devices that support it.
Devices of compute capability 2.x and higher can execute multiple kernels concurrently; an application can confirm this by checking the concurrentKernels device property, which equals 1 on devices that support it, as in the sketch below.
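A minimal query sketch (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("concurrentKernels = %d\n", prop.concurrentKernels);
    return 0;
}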

The maximum number of kernel launches that a device can execute concurrently depends on its compute capability and is listed in Table 13.
That is, the maximum number of concurrent kernels is tied to the GPU's compute capability; the guide's Table 13 lists the limit for each architecture.

A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.
A kernel running in one CUDA context cannot execute concurrently with a kernel from a different CUDA context.

Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
That is, a kernel that uses many textures or a lot of local memory is unlikely to run concurrently with other kernels.

Some devices can perform an asynchronous memory copy to or from the GPU concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is greater than zero for devices that support it. If host memory is involved in the copy, it must be page-locked.
Some devices can overlap an asynchronous memory copy to or from the GPU with kernel execution; an application checks the asyncEngineCount device property, which is greater than zero on devices that support it. Whenever host memory is involved in the copy, it must be page-locked for the copy to be asynchronous.
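A minimal sketch of such an overlap, with the kernel and sizes invented for illustration; the copy uses page-locked memory and its own stream so that, on devices with asyncEngineCount > 0, it can run while the kernel executes:

#include <cuda_runtime.h>

__global__ void MyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *devA, *devB, *pinned;
    cudaMalloc(&devA, n * sizeof(float));
    cudaMalloc(&devB, n * sizeof(float));
    cudaMallocHost(&pinned, n * sizeof(float));   // page-locked, required for overlap

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.asyncEngineCount > 0) {              // device can overlap copy and compute
        cudaStream_t copyStream, computeStream;
        cudaStreamCreate(&copyStream);
        cudaStreamCreate(&computeStream);

        // Upload into devA while the kernel works on devB in another stream.
        cudaMemcpyAsync(devA, pinned, n * sizeof(float),
                        cudaMemcpyHostToDevice, copyStream);
        MyKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(devB, n);

        cudaDeviceSynchronize();
        cudaStreamDestroy(copyStream);
        cudaStreamDestroy(computeStream);
    }
    cudaFreeHost(pinned);
    cudaFree(devA);
    cudaFree(devB);
    return 0;
}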

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses residing on the same device.
That is, an intra-device copy (device memory to device memory) can also overlap kernel execution on devices that support concurrentKernels, and/or overlap host <-> device copies on devices that support asyncEngineCount; such copies use the ordinary copy functions with both addresses on the same device, as sketched below.
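A short sketch of an intra-device copy overlapping a kernel; the pointers and streams are assumed to be set up by the caller, and MyKernel is assumed to be defined as in the previous sketch:

__global__ void MyKernel(float* data, int n);  // assumed defined as above

// devSrc, devDst, and devOther are assumed allocations on the same device.
void overlapIntraDeviceCopy(float* devDst, const float* devSrc, float* devOther,
                            int n, cudaStream_t copyStream, cudaStream_t kernelStream) {
    // Intra-device copy: source and destination both live on this device.
    cudaMemcpyAsync(devDst, devSrc, n * sizeof(float),
                    cudaMemcpyDeviceToDevice, copyStream);
    // On devices supporting the properties above, this kernel may run
    // concurrently with the copy.
    MyKernel<<<(n + 255) / 256, 256, 0, kernelStream>>>(devOther, n);
}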

Concurrent Data Transfers
Some devices of compute capability 2.x and higher can overlap copies to and from the device. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is equal to 2 for devices that support it. In order to be overlapped, any host memory involved in the transfers must be page-locked.
That is, some devices of compute capability 2.x and higher can overlap an upload and a download at the same time; an application checks whether asyncEngineCount equals 2. Any host memory involved in such transfers must be page-locked.

Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined).
Applications manage these concurrent operations through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order; commands in different streams may execute out of order relative to one another or concurrently, and since this behavior is not guaranteed it must never be relied on for correctness.

A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.
That is, you create a stream object and pass it as the stream parameter to kernel launches and host <-> device copies. For example:

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);     // two independent streams
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);   // page-locked host buffer; size is defined elsewhere in the guide's sample

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:
Each stream is then given a sequence of three commands: one host-to-device copy, one kernel launch, and one device-to-host copy:

for (int i = 0; i < 2; ++i) {
    // queue chunk i's upload, kernel, and download on stream[i]
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}

Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Overlapping Behavior describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.
Each stream copies its portion of the input array hostPtr to inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result in outputDevPtr back to the same portion of hostPtr. How much the two streams actually overlap depends on the device's capability; note that hostPtr must point to page-locked host memory for any overlap to occur.

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);   // waits for stream i's pending work, then destroys it

cudaStreamDestroy() waits for all preceding commands in the given stream to complete before destroying the stream and returning control to the host thread.
cudaStreamDestroy() waits for all commands previously issued to the given stream to complete, then destroys the stream and returns control to the host thread.

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.
In other words, launches and copies that omit the stream parameter (or pass 0) go to the default stream and execute one after another.

For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and each host thread has its own default stream.
That is, with --default-stream per-thread (or with the CUDA_API_PER_THREAD_DEFAULT_STREAM macro defined before the CUDA headers are included), the default stream behaves like a regular stream, and every host thread gets its own default stream.
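A minimal sketch of the macro route described in the quoted passage (for whole-project builds, the nvcc flag --default-stream per-thread is the usual way to get the same behavior):

#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1   // must come before any CUDA header
#include <cuda_runtime.h>

__global__ void MyKernel(float* data) { data[0] += 1.0f; }

int main() {
    float* dev;
    cudaMalloc(&dev, sizeof(float));
    // With per-thread default streams, this launch (no stream argument)
    // goes to this host thread's own default stream rather than the
    // NULL stream shared by all threads.
    MyKernel<<<1, 1>>>(dev);
    cudaDeviceSynchronize();
    cudaFree(dev);
    return 0;
}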

For code that is compiled using the –default-stream legacy compilation flag, the default stream is a special stream called the NULL stream and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization as described in Implicit Synchronization.
That is, with --default-stream legacy, the default stream is a special stream called the NULL stream: each device has a single NULL stream shared by all host threads. The NULL stream is special because it causes implicit synchronization, as described in Implicit Synchronization.

For code that is compiled without specifying a –default-stream compilation flag, –default-stream legacy is assumed as the default.
That is, if no --default-stream flag is specified, --default-stream legacy is assumed.

There are various ways to explicitly synchronize streams with each other.
cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.
cudaStreamWaitEvent() takes a stream and an event as parameters (see Events for a description of events) and makes all the commands added to the given stream after the call to cudaStreamWaitEvent() delay their execution until the given event has completed. The stream can be 0, in which case all the commands added to any stream after the call to cudaStreamWaitEvent() wait on the event.
cudaStreamQuery() provides applications with a way to know if all preceding commands in a stream have completed.

To avoid unnecessary slowdowns, all these synchronization functions are usually best used for timing purposes or to isolate a launch or memory copy that is failing.
The ways to synchronize streams explicitly (a sketch of the event-based one follows this list):
1. cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
2. cudaStreamSynchronize() takes a stream as a parameter and waits until all preceding commands in that stream have completed; it synchronizes the host with one specific stream while letting other streams continue on the device.
3. cudaStreamWaitEvent() takes a stream and an event as parameters and delays all commands added to that stream after the call until the event has completed; the stream may be 0, in which case every stream waits on the event.
4. cudaStreamQuery() lets the application check whether all preceding commands in a stream have completed, without blocking.
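A minimal sketch of cudaStreamWaitEvent() ordering two streams (the kernels and names are invented for illustration): stream s2 is made to wait for an event recorded in stream s1, so the consumer kernel is guaranteed to see the producer's result.

#include <cuda_runtime.h>

__global__ void Produce(float* data) { data[0] = 42.0f; }
__global__ void Consume(float* data) { data[0] += 1.0f; }

int main() {
    float* dev;
    cudaMalloc(&dev, sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t done;
    cudaEventCreate(&done);

    Produce<<<1, 1, 0, s1>>>(dev);     // runs in stream s1
    cudaEventRecord(done, s1);         // mark the point s2 must wait for

    cudaStreamWaitEvent(s2, done, 0);  // commands added to s2 after this wait on `done`
    Consume<<<1, 1, 0, s2>>>(dev);     // safe: sees Produce's result

    cudaDeviceSynchronize();
    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dev);
    return 0;
}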

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
‣ a page-locked host memory allocation,
‣ a device memory allocation,
‣ a device memory set,
‣ a memory copy between two addresses to the same device memory,
‣ any CUDA command to the NULL stream,
‣ a switch between the L1/shared memory configurations described in Compute Capability 2.x and Compute Capability 3.x.
That is, two commands from different streams cannot run concurrently if the host thread issues any of the following between them:
1. a page-locked host memory allocation;
2. a device memory allocation;
3. a device memory set;
4. a memory copy between two addresses within the same device's memory;
5. any CUDA command issued to the NULL stream;
6. a switch between the L1/shared memory configurations (compute capability 2.x and 3.x).

For devices that support concurrent kernel execution and are of compute capability 3.0 or lower, any operation that requires a dependency check to see if a streamed kernel launch is complete:
‣ Can start executing only when all thread blocks of all prior kernel launches from any stream in the CUDA context have started executing;
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel launch being checked is complete.
That is, on devices that support concurrent kernel execution and have compute capability 3.0 or lower, any operation that needs a dependency check on a streamed kernel launch (1) can start executing only once all thread blocks of all prior kernel launches from any stream in the context have started, and (2) blocks all later kernel launches from any stream until the checked launch completes.

Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
‣ All independent operations should be issued before dependent operations,
‣ Synchronization of any kind should be delayed as long as possible.
That is, a dependency check is triggered by any other command in the same stream as the launch being checked, and by any cudaStreamQuery() call on that stream. To maximize the chance of concurrent kernel execution (see the sketch after this list):
1. issue all independent operations before any dependent ones;
2. delay synchronization of any kind for as long as possible.
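Applying guideline 1 to the earlier two-stream example (a sketch reusing the same names): issue the work breadth-first, so that no stream's dependent call sits between independent calls from the other stream.

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);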

My writing is limited, so there may be mistakes above; corrections are welcome.
