CUDA编程一天入门

简介: 本文介绍了CUDA编程的基础知识,包括环境准备、编程模型、内核设置、示例代码simpleTexture3D,以及相关参考链接。

0 环境准备

1 套路

CUDA 编程模型是一个异构模型,其中同时使用 CPU 和 GPU。在 CUDA 中,主机是指 CPU 及其内存,而设备是指 GPU 及其内存。在主机上运行的代码可以管理主机和设备上的内存,还可以启动内核,内核是在设备上执行的功能。这些内核由许多 GPU 线程并行执行。

鉴于 CUDA 编程模型的异构性质,CUDA C 程序的典型操作顺序是:

  1. 声明并分配主机和设备内存。
  2. 初始化主机数据。
  3. 将数据从主机传输到设备。
  4. 执行一个或多个内核。
  5. 将结果从设备传输到主机。

2 并行执行内核设置

三重 V 形之间的信息是执行配置,它指示并行执行内核的设备线程数。在 CUDA 中,软件中有一个线程层次结构,它模仿线程处理器在 GPU 上的分组方式。在 CUDA 编程模型中,我们谈到启动带有线程块**网格的内核。执行配置中的第一个参数指定网格中的线程块数,第二个参数指定线程块中的线程数。

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);

3 示例代码simpleTexture3D

/* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

/*
  3D texture sample

  This sample loads a 3D volume from disk and displays slices through it
  using 3D texture lookups.
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <helper_gl.h>

#if defined(__APPLE__) || defined(MACOSX)
#pragma clang diagnostic ignored "-Wdeprecated-declarations"
#include <GLUT/glut.h>
#ifndef glutCloseFunc
#define glutCloseFunc glutWMCloseFunc
#endif
#else
#include <GL/freeglut.h>
#endif

// includes, cuda
#include <vector_types.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// CUDA utilities and system includes
#include <helper_cuda.h>
#include <helper_functions.h>
#include <vector_types.h>

typedef unsigned int uint;
typedef unsigned char uchar;

#define MAX_EPSILON_ERROR 5.0f
#define THRESHOLD 0.15f

const char *sSDKsample = "simpleTexture3D";

const char *volumeFilename = "Bucky.raw";
const cudaExtent volumeSize = make_cudaExtent(32, 32, 32);

const uint width = 512, height = 512;
const dim3 blockSize(16, 16, 1);
const dim3 gridSize(width / blockSize.x, height / blockSize.y);

float w = 0.5;  // texture coordinate in z

GLuint pbo;  // OpenGL pixel buffer object
struct cudaGraphicsResource
    *cuda_pbo_resource;  // CUDA Graphics Resource (to transfer PBO)

bool linearFiltering = true;
bool animate = true;

StopWatchInterface *timer = NULL;

uint *d_output = NULL;

// Auto-Verification Code
const int frameCheckNumber = 4;
int fpsCount = 0;  // FPS count for averaging
int fpsLimit = 1;  // FPS limit for sampling
int g_Index = 0;
unsigned int frameCount = 0;
unsigned int g_TotalErrors = 0;
volatile int g_GraphicsMapFlag = 0;

int *pArgc = NULL;
char **pArgv = NULL;

#ifndef MAX
#define MAX(a, b) ((a > b) ? a : b)
#endif

extern "C" void cleanup();
extern "C" void setTextureFilterMode(bool bLinearFilter);
extern "C" void initCuda(const uchar *h_volume, cudaExtent volumeSize);
extern "C" void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output,
                              uint imageW, uint imageH, float w);
extern void cleanupCuda();

void loadVolumeData(char *exec_path);

void computeFPS() {
   
  frameCount++;
  fpsCount++;

  if (fpsCount == fpsLimit) {
   
    char fps[256];
    float ifps = 1.f / (sdkGetAverageTimerValue(&timer) / 1000.f);
    sprintf(fps, "%s: %3.1f fps", sSDKsample, ifps);

    glutSetWindowTitle(fps);
    fpsCount = 0;

    fpsLimit = ftoi(MAX(1.0f, ifps));
    sdkResetTimer(&timer);
  }
}

// render image using CUDA
void render() {
   
  // map PBO to get CUDA device pointer
  g_GraphicsMapFlag++;
  checkCudaErrors(cudaGraphicsMapResources(1, &cuda_pbo_resource, 0));
  size_t num_bytes;
  checkCudaErrors(cudaGraphicsResourceGetMappedPointer(
      (void **)&d_output, &num_bytes, cuda_pbo_resource));
  // printf("CUDA mapped PBO: May access %ld bytes\n", num_bytes);

  // call CUDA kernel, writing results to PBO
  render_kernel(gridSize, blockSize, d_output, width, height, w);

  getLastCudaError("render_kernel failed");

  if (g_GraphicsMapFlag) {
   
    checkCudaErrors(cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0));
    g_GraphicsMapFlag--;
  }
}

// display results using OpenGL (called by GLUT)
void display() {
   
  sdkStartTimer(&timer);

  render();

  // display results
  glClear(GL_COLOR_BUFFER_BIT);

  // draw image from PBO
  glDisable(GL_DEPTH_TEST);
  glRasterPos2i(0, 0);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
  glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

  glutSwapBuffers();
  glutReportErrors();

  sdkStopTimer(&timer);
  computeFPS();
}

void idle() {
   
  if (animate) {
   
    w += 0.01f;
    glutPostRedisplay();
  }
}

void keyboard(unsigned char key, int x, int y) {
   
  switch (key) {
   
    case 27:
#if defined(__APPLE__) || defined(MACOSX)
      exit(EXIT_SUCCESS);
      glutDestroyWindow(glutGetWindow());
      return;
#else
      glutDestroyWindow(glutGetWindow());
      return;
#endif

    case '=':
    case '+':
      w += 0.01f;
      break;

    case '-':
      w -= 0.01f;
      break;

    case 'f':
      linearFiltering = !linearFiltering;
      setTextureFilterMode(linearFiltering);
      break;

    case ' ':
      animate = !animate;
      break;

    default:
      break;
  }

  glutPostRedisplay();
}

void reshape(int x, int y) {
   
  glViewport(0, 0, x, y);

  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();

  glMatrixMode(GL_PROJECTION);
  glLoadIdentity();
  glOrtho(0.0, 1.0, 0.0, 1.0, 0.0, 1.0);
}

void cleanup() {
   
  sdkDeleteTimer(&timer);

  // add extra check to unmap the resource before unregistering it
  if (g_GraphicsMapFlag) {
   
    checkCudaErrors(cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0));
    g_GraphicsMapFlag--;
  }

  // unregister this buffer object from CUDA C
  checkCudaErrors(cudaGraphicsUnregisterResource(cuda_pbo_resource));
  glDeleteBuffers(1, &pbo);
  cleanupCuda();
}

void initGLBuffers() {
   
  // create pixel buffer object
  glGenBuffers(1, &pbo);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo);
  glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, width * height * sizeof(GLubyte) * 4,
               0, GL_STREAM_DRAW_ARB);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0);

  // register this buffer object with CUDA
  checkCudaErrors(cudaGraphicsGLRegisterBuffer(
      &cuda_pbo_resource, pbo, cudaGraphicsMapFlagsWriteDiscard));
}

// Load raw data from disk
uchar *loadRawFile(const char *filename, size_t size) {
   
  FILE *fp = fopen(filename, "rb");

  if (!fp) {
   
    fprintf(stderr, "Error opening file '%s'\n", filename);
    return 0;
  }

  uchar *data = (uchar *)malloc(size);
  size_t read = fread(data, 1, size, fp);
  fclose(fp);

  printf("Read '%s', %zu bytes\n", filename, read);

  return data;
}

void initGL(int *argc, char **argv) {
   
  // initialize GLUT callback functions
  glutInit(argc, argv);
  glutInitDisplayMode(GLUT_RGB | GLUT_DOUBLE);
  glutInitWindowSize(width, height);
  glutCreateWindow("CUDA 3D texture");
  glutDisplayFunc(display);
  glutKeyboardFunc(keyboard);
  glutReshapeFunc(reshape);
  glutIdleFunc(idle);

  if (!isGLVersionSupported(2, 0) ||
      !areGLExtensionsSupported("GL_ARB_pixel_buffer_object")) {
   
    fprintf(stderr, "Required OpenGL extensions are missing.");
    exit(EXIT_FAILURE);
  }
}

void runAutoTest(const char *ref_file, char *exec_path) {
   
  checkCudaErrors(
      cudaMalloc((void **)&d_output, width * height * sizeof(GLubyte) * 4));

  // render the volumeData
  render_kernel(gridSize, blockSize, d_output, width, height, w);

  checkCudaErrors(cudaDeviceSynchronize());
  getLastCudaError("render_kernel failed");

  void *h_output = malloc(width * height * sizeof(GLubyte) * 4);
  checkCudaErrors(cudaMemcpy(h_output, d_output,
                             width * height * sizeof(GLubyte) * 4,
                             cudaMemcpyDeviceToHost));
  sdkDumpBin(h_output, width * height * sizeof(GLubyte) * 4,
             "simpleTexture3D.bin");

  bool bTestResult = sdkCompareBin2BinFloat(
      "simpleTexture3D.bin", sdkFindFilePath(ref_file, exec_path),
      width * height, MAX_EPSILON_ERROR, THRESHOLD, exec_path);

  checkCudaErrors(cudaFree(d_output));
  free(h_output);

  sdkStopTimer(&timer);
  sdkDeleteTimer(&timer);

  exit(bTestResult ? EXIT_SUCCESS : EXIT_FAILURE);
}

void loadVolumeData(char *exec_path) {
   
  // load volume data
  const char *path = sdkFindFilePath(volumeFilename, exec_path);

  if (path == NULL) {
   
    fprintf(stderr, "Error unable to find 3D Volume file: '%s'\n",
            volumeFilename);
    exit(EXIT_FAILURE);
  }

  size_t size = volumeSize.width * volumeSize.height * volumeSize.depth;
  uchar *h_volume = loadRawFile(path, size);

  initCuda(h_volume, volumeSize);
  sdkCreateTimer(&timer);

  free(h_volume);
}


// Program main

int main(int argc, char **argv) {
   
  pArgc = &argc;
  pArgv = argv;

  char *ref_file = NULL;

#if defined(__linux__)
  setenv("DISPLAY", ":0", 0);
#endif

  printf("%s Starting...\n\n", sSDKsample);

  if (checkCmdLineFlag(argc, (const char **)argv, "file")) {
   
    fpsLimit = frameCheckNumber;
    getCmdLineArgumentString(argc, (const char **)argv, "file", &ref_file);
  }

  // use command-line specified CUDA device, otherwise use device with highest
  // Gflops/s
  findCudaDevice(argc, (const char **)argv);

  if (ref_file) {
   
    loadVolumeData(argv[0]);
    runAutoTest(ref_file, argv[0]);
  } else {
   
    initGL(&argc, argv);

    // OpenGL buffers
    initGLBuffers();

    loadVolumeData(argv[0]);
  }

  printf(
      "Press space to toggle animation\n"
      "Press '+' and '-' to change displayed slice\n");

#if defined(__APPLE__) || defined(MACOSX)
  atexit(cleanup);
#else
  glutCloseFunc(cleanup);
#endif

  glutMainLoop();

  exit(EXIT_SUCCESS);
}
/* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *  * Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *  * Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *  * Neither the name of NVIDIA CORPORATION nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
 * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
 * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
 * OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */

#ifndef _SIMPLETEXTURE3D_KERNEL_CU_
#define _SIMPLETEXTURE3D_KERNEL_CU_

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

#include <helper_cuda.h>
#include <helper_math.h>

typedef unsigned int uint;
typedef unsigned char uchar;

cudaArray *d_volumeArray = 0;
cudaTextureObject_t tex;  // 3D texture

__global__ void d_render(uint *d_output, uint imageW, uint imageH, float w,
                         cudaTextureObject_t texObj) {
   
  uint x = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
  uint y = __umul24(blockIdx.y, blockDim.y) + threadIdx.y;

  float u = x / (float)imageW;
  float v = y / (float)imageH;
  // read from 3D texture
  float voxel = tex3D<float>(texObj, u, v, w);

  if ((x < imageW) && (y < imageH)) {
   
    // write output color
    uint i = __umul24(y, imageW) + x;
    d_output[i] = voxel * 255;
  }
}

extern "C" void setTextureFilterMode(bool bLinearFilter) {
   
  if (tex) {
   
    checkCudaErrors(cudaDestroyTextureObject(tex));
  }
  cudaResourceDesc texRes;
  memset(&texRes, 0, sizeof(cudaResourceDesc));

  texRes.resType = cudaResourceTypeArray;
  texRes.res.array.array = d_volumeArray;

  cudaTextureDesc texDescr;
  memset(&texDescr, 0, sizeof(cudaTextureDesc));

  texDescr.normalizedCoords = true;
  texDescr.filterMode =
      bLinearFilter ? cudaFilterModeLinear : cudaFilterModePoint;
  ;
  texDescr.addressMode[0] = cudaAddressModeWrap;
  texDescr.addressMode[1] = cudaAddressModeWrap;
  texDescr.addressMode[2] = cudaAddressModeWrap;
  texDescr.readMode = cudaReadModeNormalizedFloat;

  checkCudaErrors(cudaCreateTextureObject(&tex, &texRes, &texDescr, NULL));
}

extern "C" void initCuda(const uchar *h_volume, cudaExtent volumeSize) {
   
  // create 3D array
  cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<uchar>();
  checkCudaErrors(cudaMalloc3DArray(&d_volumeArray, &channelDesc, volumeSize));

  // copy data to 3D array
  cudaMemcpy3DParms copyParams = {
   0};
  copyParams.srcPtr =
      make_cudaPitchedPtr((void *)h_volume, volumeSize.width * sizeof(uchar),
                          volumeSize.width, volumeSize.height);
  copyParams.dstArray = d_volumeArray;
  copyParams.extent = volumeSize;
  copyParams.kind = cudaMemcpyHostToDevice;
  checkCudaErrors(cudaMemcpy3D(&copyParams));

  cudaResourceDesc texRes;
  memset(&texRes, 0, sizeof(cudaResourceDesc));

  texRes.resType = cudaResourceTypeArray;
  texRes.res.array.array = d_volumeArray;

  cudaTextureDesc texDescr;
  memset(&texDescr, 0, sizeof(cudaTextureDesc));

  // access with normalized texture coordinates
  texDescr.normalizedCoords = true;
  // linear interpolation
  texDescr.filterMode = cudaFilterModeLinear;
  // wrap texture coordinates
  texDescr.addressMode[0] = cudaAddressModeWrap;
  texDescr.addressMode[1] = cudaAddressModeWrap;
  texDescr.addressMode[2] = cudaAddressModeWrap;
  texDescr.readMode = cudaReadModeNormalizedFloat;

  checkCudaErrors(cudaCreateTextureObject(&tex, &texRes, &texDescr, NULL));
}

extern "C" void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output,
                              uint imageW, uint imageH, float w) {
   
  d_render<<<gridSize, blockSize>>>(d_output, imageW, imageH, w, tex);
}

void cleanupCuda() {
   
  if (tex) {
   
    checkCudaErrors(cudaDestroyTextureObject(tex));
  }
  if (d_volumeArray) {
   
    checkCudaErrors(cudaFreeArray(d_volumeArray));
  }
}

#endif  // #ifndef _SIMPLETEXTURE3D_KERNEL_CU_

4 参考链接

GitHub - NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit

CUDA routines are functions that can be executed on the GPU using many threads in parallel1.
There are many CUDA code samples included as part of the CUDA Toolkit to help you get started on writing CUDA C/C++ applications2.
You can also find some easy introductions to CUDA C and C++ on the NVIDIA Technical Blog31.

GitHub - LitLeo/OpenCUDA

CUDA开发环境搭建 - 知乎 (zhihu.com)

(191条消息) VS+CUDA 新建项目里没有CUDA选项(附详细图文步骤)_cuda vs_Xav Zewen的博客-CSDN博客

相关实践学习
在云上部署ChatGLM2-6B大模型(GPU版)
ChatGLM2-6B是由智谱AI及清华KEG实验室于2023年6月发布的中英双语对话开源大模型。通过本实验,可以学习如何配置AIGC开发环境,如何部署ChatGLM2-6B大模型。
相关文章
|
机器学习/深度学习 并行计算 API
【GPU】CUDA是什么?以及学习路线图!
【GPU】CUDA是什么?以及学习路线图!
4901 0
|
6月前
|
存储 缓存 人工智能
Mooncake 最新进展:SGLang 和 LMCache 基于 Mooncake 实现高效 PD 分离框架
Mooncake 的架构设计兼具高性能和灵活性,为未来的扩展性和生态建设奠定了坚实基础。
|
9月前
|
存储 人工智能 缓存
DeepSeek 开源周第三弹!DeepGEMM:FP8矩阵计算神器!JIT编译+Hopper架构优化,MoE性能飙升
DeepGEMM 是 DeepSeek 开源的专为 FP8 矩阵乘法设计的高效库,支持普通和混合专家(MoE)分组的 GEMM 操作,基于即时编译技术,动态优化矩阵运算,显著提升计算性能。
921 3
DeepSeek 开源周第三弹!DeepGEMM:FP8矩阵计算神器!JIT编译+Hopper架构优化,MoE性能飙升
|
11月前
|
机器学习/深度学习 人工智能 PyTorch
【AI系统】计算图原理
本文介绍了AI框架中使用计算图来抽象神经网络计算的必要性和优势,探讨了计算图的基本构成,包括标量、向量、矩阵、张量等数据结构及其操作,并详细解释了计算图如何帮助解决AI工程化中的挑战。此外,文章还通过PyTorch实例展示了动态计算图的特点和实现方法,包括节点(张量或函数)和边(依赖关系)的定义,以及如何通过自定义Function实现正向和反向传播逻辑。
459 7
【AI系统】计算图原理
|
11月前
|
人工智能 并行计算 程序员
【AI系统】SIMD & SIMT 与芯片架构
本文深入解析了SIMD(单指令多数据)与SIMT(单指令多线程)的计算本质及其在AI芯片中的应用,特别是NVIDIA CUDA如何实现这两种计算模式。SIMD通过单指令对多个数据进行操作,提高数据并行处理能力;而SIMT则在GPU上实现了多线程并行,每个线程独立执行相同指令,增强了灵活性和性能。文章详细探讨了两者的硬件结构、编程模型及硬件执行模型的区别与联系,为理解现代AI计算架构提供了理论基础。
1814 12
|
12月前
|
人工智能 并行计算 流计算
【AI系统】GPU 架构与 CUDA 关系
本文介绍了英伟达GPU硬件基础概念,重点解析了A100 GPU架构中的GPC、TPC、SM等组件及其功能。接着深入讲解了CUDA并行计算平台和编程模型,特别是CUDA线程层次结构。最后,文章探讨了如何根据CUDA核心数量、核心频率等因素计算GPU的算力峰值,这对于评估大模型训练的算力需求至关重要。
953 3
|
12月前
|
机器学习/深度学习 人工智能 并行计算
【AI系统】Tensor Core 基本原理
本文深入介绍了英伟达GPU中的Tensor Core,一种专为加速深度学习设计的硬件单元。文章从发展历程、卷积计算、混合精度训练及基本原理等方面,详细解析了Tensor Core的工作机制及其在深度学习中的应用,旨在帮助读者全面理解Tensor Core技术。通过具体代码示例,展示了如何在CUDA编程中利用Tensor Core实现高效的矩阵运算,从而加速模型训练和推理过程。
1882 0
|
机器学习/深度学习 人工智能 监控
一文读懂deepSpeed:深度学习训练的并行化
DeepSpeed 是由微软开发的开源深度学习优化库,旨在提高大规模模型训练的效率和可扩展性。通过创新的并行化策略、内存优化技术(如 ZeRO)及混合精度训练,DeepSpeed 显著提升了训练速度并降低了资源需求。它支持多种并行方法,包括数据并行、模型并行和流水线并行,同时与 PyTorch 等主流框架无缝集成,提供了易用的 API 和丰富的文档支持。DeepSpeed 不仅大幅减少了内存占用,还通过自动混合精度训练提高了计算效率,降低了能耗。其开源特性促进了 AI 行业的整体进步,使得更多研究者和开发者能够利用先进优化技术,推动了 AI 在各个领域的广泛应用。
|
存储 机器学习/深度学习 并行计算
GPU通信互联技术:GPUDirect、NVLink与RDMA
在高性能计算和深度学习领域,GPU已成为关键工具。然而,随着模型复杂度和数据量的增加,单个GPU难以满足需求,多GPU甚至多服务器协同工作成为常态。本文探讨了三种主要的GPU通信互联技术:GPUDirect、NVLink和RDMA。GPUDirect通过绕过CPU实现GPU与设备直接通信;NVLink提供高速点对点连接和支持内存共享;RDMA则在网络层面实现直接内存访问,降低延迟。这些技术各有优势,适用于不同场景,为AI和高性能计算提供了强大支持。
|
机器学习/深度学习 并行计算 PyTorch
PyTorch与CUDA:加速深度学习模型训练的最佳实践
【8月更文第27天】随着深度学习应用的广泛普及,高效利用GPU硬件成为提升模型训练速度的关键。PyTorch 是一个强大的深度学习框架,它支持动态计算图,易于使用且高度灵活。CUDA (Compute Unified Device Architecture) 则是 NVIDIA 开发的一种并行计算平台和编程模型,允许开发者直接访问 GPU 的并行计算能力。本文将详细介绍如何利用 PyTorch 与 CUDA 的集成来加速深度学习模型的训练过程,并提供具体的代码示例。
1360 1