GPU Parallel Computing

2017-11-18 1430

GPU

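GPU stands for Graphics Processing Unit. The GPU is defined relative to the CPU: because graphics processing has become more and more important in modern computers (especially home systems and machines built by gaming enthusiasts), a dedicated core processor for graphics is needed.

Many vendors produce GPUs, just as many produce CPUs, but only three are widely known, so people tend to assume that AMD, NVIDIA, and Intel are the only GPU makers.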

nVidia GPU

AMD GPU

Intel MIC coprocessor

nVidia Tegra 4

AMD ARM server

CUDA C/C++

CUDA Fortran

OpenCL

MIC OpenMP

CUDA

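GPU parallel computing

Works cooperatively with the CPU (the host)
Has its own memory
Can run on the order of 1000 threads at the same time
Single precision: 4.58 TFLOPS; double precision: 1.31 TFLOPS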
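
The properties listed above can be inspected on a real device. The following is a minimal sketch, assuming the CUDA toolkit is installed and at least one CUDA-capable GPU is present. It uses the runtime call cudaGetDeviceProperties to print the size of the device's own memory and its thread limits. The file name device_query.cu is only illustrative, and the values printed depend on the hardware, so they will not necessarily match the figures quoted above.

```cuda
// device_query.cu: print a few properties of GPU 0 (illustrative sketch).
#include <stdio.h>
#include <cuda_runtime.h>

int main(void){
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No CUDA device found\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {   // query device 0
        printf("Could not query device 0\n");
        return 1;
    }

    printf("Name:                  %s\n", prop.name);
    printf("Global memory:         %.0f MB\n", prop.totalGlobalMem / (1024.0 * 1024.0));
    printf("Multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

It can be compiled and run the same way as any other .cu file: nvcc device_query.cu, then ./a.out.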

GPU programming mainly uses the approaches listed earlier: CUDA C/C++, CUDA Fortran, OpenCL, and MIC OpenMP.

When a computation runs on the GPU, the main interactions with the CPU are:

Data exchange between the CPU and the GPU
Data exchange within the GPU

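GPU programming: CUDA

CUDA C/C++: download the CUDA drivers, compilers, and samples (all-in-one package) free from:

http://developer.nvidia.com/cuda/cuda-downloads

Choose the version that suits your system. I downloaded the 5.0 notebook version.

For detailed installation steps, see http://blog.csdn.net/diyoosjtu/article/details/8454253

After installation, open VS -> New, and you will find an nVidia entry that contains a CUDA item.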

Main steps:

Hello World
- Basic syntax, compile & run

GPU memory management
- Malloc/free
- memcpy

Writing parallel kernels
- Threads & blocks
- Memory hierarchy

//hello_world.c:
#include <stdio.h>

// Plain C version: an ordinary function that runs on the CPU.
void hello_world_kernel(void){
    printf("Hello World\n");
}

int main(void){
    hello_world_kernel();
    return 0;
}

Compile & Run:
gcc hello_world.c
./a.out

CUDA:

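//hello_world.cu:
#include <stdio.h>

// __global__ marks a kernel: it runs on the GPU and is launched from the host.
__global__ void hello_world_kernel(){
    printf("Hello World\n");
}

int main(){
    hello_world_kernel<<<1,1>>>();   // launch 1 block of 1 thread
    cudaDeviceSynchronize();         // wait for the kernel so its output is flushed before exit
    return 0;
}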

Compile & Run:
nvcc hello_world.cu
./a.out

Main steps of a GPU computation:

Allocate CPU memory for n integers
Allocate GPU memory for n integers
Initialize GPU memory to 0s
Copy from CPU to GPU
Call the __global__ function to compute (__global__ is the keyword that marks a CUDA kernel)
Copy from GPU to CPU
Print the values
Free the CPU and GPU memory
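
The steps above can be put together in one small program. This is only a sketch under assumptions: the kernel add_one, the array size n = 1024, and the block size of 256 are illustrative choices that do not come from the original article, and error checking is kept to a minimum for brevity.

```cuda
// workflow.cu: the steps listed above, in order (illustrative sketch).
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread adds 1 to one element.
__global__ void add_one(int *a, int n){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] += 1;          // guard: the grid may cover more than n threads
}

int main(void){
    const int n = 1024;
    const size_t nbytes = n * sizeof(int);

    int *h_a = (int*)malloc(nbytes);                        // allocate CPU memory
    if (h_a == NULL) return 1;
    for (int i = 0; i < n; ++i) h_a[i] = i;

    int *d_a = NULL;
    if (cudaMalloc((void**)&d_a, nbytes) != cudaSuccess) {  // allocate GPU memory
        free(h_a);
        return 1;
    }
    cudaMemset(d_a, 0, nbytes);                             // initialize GPU memory to 0s
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);   // copy from CPU to GPU

    const int block = 256;
    const int grid = (n + block - 1) / block;               // enough blocks to cover n
    add_one<<<grid, block>>>(d_a, n);                       // call the __global__ function

    // Copy from GPU to CPU; this call waits for the kernel to finish first.
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);

    printf("h_a[0] = %d, h_a[%d] = %d\n", h_a[0], n - 1, h_a[n - 1]);  // print the values

    cudaFree(d_a);                                          // free GPU memory
    free(h_a);                                              // free CPU memory
    return 0;
}
```

In this sketch the cudaMemset is overwritten by the copy that follows it; it is kept only so that the code mirrors the listed steps one to one.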

Main functions:

//Host (CPU) manages device (GPU) memory:
cudaMalloc (void ** pointer, size_t nbytes)
cudaMemset (void * pointer, int value, size_t count)
cudaFree (void* pointer)

int nbytes = 1024*sizeof(int);
int * d_a = 0;
cudaMalloc( (void**)&d_a,  nbytes );
cudaMemset( d_a, 0, nbytes);
cudaFree(d_a);
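
Each of the calls above returns a cudaError_t, which the snippet ignores. A minimal sketch of how the same three calls might be checked is given below; the macro name CUDA_CHECK is a hypothetical helper introduced here for illustration and is not part of the CUDA API, while cudaGetErrorString is a real runtime call.

```cuda
// checked_alloc.cu: the allocation snippet above with error checking (illustrative sketch).
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

int main(void){
    size_t nbytes = 1024 * sizeof(int);
    int *d_a = NULL;

    CUDA_CHECK(cudaMalloc((void**)&d_a, nbytes));   // allocate device memory
    CUDA_CHECK(cudaMemset(d_a, 0, nbytes));         // fill it with zero bytes
    CUDA_CHECK(cudaFree(d_a));                      // release it

    printf("all calls succeeded\n");
    return 0;
}
```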

cudaMemcpy( void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);
// returns after the copy is complete
/* blocks the CPU thread until all bytes have been copied
   doesn't start copying until previous CUDA calls complete
   enum cudaMemcpyKind:
       cudaMemcpyHostToDevice
       cudaMemcpyDeviceToHost
       cudaMemcpyDeviceToDevice */
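
To make the three copy directions concrete, here is a minimal sketch that sends a small array from the host to the device, copies it between two device buffers, and brings it back. The array size N and the buffer names are illustrative assumptions, not taken from the original article.

```cuda
// memcpy_directions.cu: one round trip through all three copy kinds (illustrative sketch).
#include <stdio.h>
#include <cuda_runtime.h>

#define N 8   // hypothetical array length

int main(void){
    const size_t nbytes = N * sizeof(int);
    int h_src[N], h_dst[N];
    for (int i = 0; i < N; ++i) { h_src[i] = i * i; h_dst[i] = -1; }

    int *d_a = NULL, *d_b = NULL;
    if (cudaMalloc((void**)&d_a, nbytes) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&d_b, nbytes) != cudaSuccess) { cudaFree(d_a); return 1; }

    cudaMemcpy(d_a, h_src, nbytes, cudaMemcpyHostToDevice);    // CPU -> GPU
    cudaMemcpy(d_b, d_a,   nbytes, cudaMemcpyDeviceToDevice);  // GPU -> GPU
    cudaMemcpy(h_dst, d_b, nbytes, cudaMemcpyDeviceToHost);    // GPU -> CPU

    // After the round trip, h_dst should hold the same values as h_src.
    int ok = 1;
    for (int i = 0; i < N; ++i) if (h_dst[i] != h_src[i]) ok = 0;
    printf("%s\n", ok ? "round trip ok" : "mismatch");

    cudaFree(d_a);
    cudaFree(d_b);
    return ok ? 0 : 1;
}
```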

In the launch syntax <<<grid,block>>>:

2-level hierarchy: blocks and grid
- Block = a group of up to 1024 threads
- Grid = all blocks for a given kernel launch
- E.g. total 72 threads
  - blockDim=12, gridDim=6
Threads in a block can:
- Synchronize their execution
- Communicate via shared memory
The sizes of the grid and blocks are specified at kernel launch
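
The two block-level abilities mentioned above, synchronization and communication via shared memory, can be illustrated with a small array reversal. This is a sketch under assumptions: the kernel name reverse_in_block and the size N = 64 are hypothetical, and a single block is used so that every thread can see the same shared array.

```cuda
// block_reverse.cu: threads of one block cooperate through shared memory (illustrative sketch).
#include <stdio.h>
#include <cuda_runtime.h>

#define N 64   // hypothetical size: one block of 64 threads

__global__ void reverse_in_block(int *d){
    __shared__ int s[N];        // visible to every thread of this block
    int t = threadIdx.x;
    s[t] = d[t];                // each thread stages one element
    __syncthreads();            // wait until all threads have written to s
    d[t] = s[N - 1 - t];        // read an element written by another thread
}

int main(void){
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d = NULL;
    if (cudaMalloc((void**)&d, sizeof(h)) != cudaSuccess) return 1;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    reverse_in_block<<<1, N>>>(d);      // grid = 1 block, block = N threads

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    // Element i should now hold the value that started at position N-1-i.
    int ok = 1;
    for (int i = 0; i < N; ++i) if (h[i] != N - 1 - i) ok = 0;
    printf("%s\n", ok ? "reversed" : "mismatch");
    return ok ? 0 : 1;
}
```

Without the __syncthreads() call, a thread could read s[N - 1 - t] before the thread responsible for that slot has written it.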

Thread index computation:

idx = blockIdx.x*blockDim.x + threadIdx.x
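
The index formula can be tried with the 72-thread configuration mentioned earlier (gridDim=6, blockDim=12). In this minimal sketch each thread records which block and which thread within the block it is, encoded as blockIdx.x*100 + threadIdx.x so the two parts are easy to read back. The kernel name record_ids and that encoding are illustrative assumptions.

```cuda
// thread_index.cu: global index from block and thread indices (illustrative sketch).
#include <stdio.h>
#include <cuda_runtime.h>

#define GRID  6    // number of blocks
#define BLOCK 12   // threads per block
#define N (GRID * BLOCK)

__global__ void record_ids(int *out, int n){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (idx < n)
        out[idx] = blockIdx.x * 100 + threadIdx.x;     // encode (block, thread)
}

int main(void){
    int h_out[N];
    int *d_out = NULL;
    if (cudaMalloc((void**)&d_out, sizeof(h_out)) != cudaSuccess) return 1;

    record_ids<<<GRID, BLOCK>>>(d_out, N);             // 6 blocks x 12 threads = 72 threads

    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // idx 13 = 1*12 + 1, so it belongs to block 1, thread 1;
    // idx 71 = 5*12 + 11, so it belongs to block 5, thread 11.
    printf("idx 13 -> block %d, thread %d\n", h_out[13] / 100, h_out[13] % 100);
    printf("idx 71 -> block %d, thread %d\n", h_out[71] / 100, h_out[71] % 100);
    return 0;
}
```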

Applications

High performance math routines for your applications:

cuFFT – Fast Fourier Transforms Library
cuBLAS – Complete BLAS Library
cuSPARSE – Sparse Matrix Library
cuRAND – Random Number Generation (RNG) Library
NPP – Performance Primitives for Image & Video Processing
Thrust – Templated C++ Parallel Algorithms & Data Structures
math.h - C99 floating-point Library
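
As one example of using these libraries instead of writing kernels by hand, the sketch below uses Thrust to fill a device vector with 1..100 and sum it on the GPU. The vector length and the file name are illustrative assumptions; Thrust ships with the CUDA toolkit, so the file is compiled with nvcc like the other examples.

```cuda
// thrust_sum.cu: sum 1..100 on the GPU with Thrust (illustrative sketch).
#include <stdio.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>

int main(void){
    thrust::device_vector<int> d(100);               // storage lives in GPU memory
    thrust::sequence(d.begin(), d.end(), 1);         // fill with 1, 2, ..., 100
    int sum = thrust::reduce(d.begin(), d.end(), 0); // parallel sum, initial value 0
    printf("sum = %d\n", sum);                       // 1 + 2 + ... + 100 = 5050
    return 0;
}
```

No explicit cudaMalloc, cudaMemcpy, or kernel launch appears here: the device_vector manages device memory, and the algorithms launch their own kernels.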

This article is reposted from the cnblogs (博客园) blog of cococo点点. Original link: http://www.cnblogs.com/coder2012/archive/2013/05/06/3056464.html. To republish it, please contact the original author.
