GPU Parallel Computing


       GPU                                                                                                         

    GPU stands for Graphics Processing Unit. The GPU is defined in contrast to the CPU: in modern computers (especially home systems and gaming enthusiasts' machines), graphics processing has become increasingly important, so a dedicated graphics processor core is needed.

    Many vendors manufacture GPUs. As with CPUs, there are quite a few manufacturers, but only three are widely known, to the point that many people assume AMD, NVIDIA, and Intel are the only GPU makers.

    Common accelerator hardware:

    • NVIDIA GPU
    • AMD GPU
    • Intel MIC coprocessor
    • NVIDIA Tegra 4
    • AMD ARM servers

    The corresponding programming models:

    • CUDA C/C++
    • CUDA Fortran
    • OpenCL (and OpenMP for the Intel MIC)

    GPU Parallel Computing

    • Works in cooperation with the CPU/host
    • Has its own memory (the device-query sketch below shows how to read the actual size and limits)
    • Can run 1000 threads at the same time
    • Single precision: 4.58 TFlops; double precision: 1.31 TFlops
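
    As a quick way to check these characteristics on a particular card, the CUDA runtime can query device properties. A minimal sketch (the specific fields printed are only illustrative):

    // query_device.cu : print a few properties of GPU 0
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(){
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);        // properties of device 0

        printf("Name:                  %s\n", prop.name);
        printf("Global memory:         %zu MB\n", prop.totalGlobalMem >> 20);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Multiprocessors:       %d\n", prop.multiProcessorCount);
        return 0;
    }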

    The main approaches to GPU programming are the ones listed above: CUDA C/C++, CUDA Fortran, and OpenCL.


    When computing on the GPU, the main interactions with the CPU are:

    • Data exchange between the CPU and the GPU
    • Data processing on the GPU


    GPU Programming -- CUDA

    CUDA C/C++: download the CUDA drivers, compiler, and samples (all-in-one package) for free from:

        http://developer.nvidia.com/cuda/cuda-downloads

    Choose the version that matches your system; I downloaded the 5.0 notebook version.

    For detailed installation steps, see http://blog.csdn.net/diyoosjtu/article/details/8454253

    After installation, open Visual Studio -> New Project; you will find an NVIDIA category with a CUDA project template inside.

    Main topics:

    • Hello World
      • Basic syntax, compile & run
    • GPU memory management
      • Malloc/free
      • memcpy
    • Writing parallel kernels
      • Threads & blocks
      • Memory hierarchy
    //hello_world.c:
    #include <stdio.h>

    void hello_world_kernel(){
        printf("Hello World\n");
    }
    int main(){    hello_world_kernel();}

    Compile & Run:
    gcc hello_world.c
    ./a.out

    CUDA:

    //hello_world.cu:
    #include <stdio.h>

    __global__ void hello_world_kernel(){
        printf("Hello World\n");
    }

    int main(){
        hello_world_kernel<<<1,1>>>();
        cudaDeviceSynchronize();   // wait for the kernel to finish so its output is flushed
    }

    Compile & Run:
    nvcc hello_world.cu
    ./a.out
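
    Note: printf from device code requires a GPU of compute capability 2.0 or later; with older CUDA toolkits you may need to compile with nvcc -arch=sm_20 hello_world.cu.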

    The main steps of a GPU computation:

    1. Allocate CPU memory for n integers
    2. Allocate GPU memory for n integers
    3. Initialize GPU memory to 0s
    4. Copy from CPU to GPU
    5. Call the __global__ function to compute (__global__ is the keyword that marks a CUDA kernel)
    6. Copy from GPU to CPU
    7. Print the values
    8. Free the memory (a sketch of all eight steps follows)
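
    A minimal sketch of these eight steps, assuming a small array and a simple "add 1 to every element" kernel (the array length, kernel name, and operation are illustrative, not from the original post):

    // steps.cu
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // example kernel: add 1 to every element
    __global__ void increment(int *data, int n){
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] += 1;
    }

    int main(){
        const int n = 16;
        int nbytes = n * sizeof(int);

        int *h_a = (int*)malloc(nbytes);                          // 1. CPU memory
        int *d_a = 0;
        cudaMalloc((void**)&d_a, nbytes);                         // 2. GPU memory
        cudaMemset(d_a, 0, nbytes);                               // 3. set GPU memory to 0s

        for (int i = 0; i < n; i++) h_a[i] = i;
        cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);     // 4. CPU -> GPU

        increment<<<2, 8>>>(d_a, n);                              // 5. call the __global__ kernel

        cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);     // 6. GPU -> CPU

        for (int i = 0; i < n; i++) printf("%d ", h_a[i]);        // 7. print the values
        printf("\n");

        free(h_a);                                                // 8. free
        cudaFree(d_a);
        return 0;
    }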

    Key functions:

    //Host (CPU) manages device (GPU) memory:
    cudaMalloc (void ** pointer, size_t nbytes)
    cudaMemset (void * pointer, int value, size_t count)
    cudaFree (void* pointer)
    
    int nbytes = 1024*sizeof(int);
    int * d_a = 0;
    cudaMalloc( (void**)&d_a,  nbytes );
    cudaMemset( d_a, 0, nbytes);
    cudaFree(d_a);
    
    cudaMemcpy( void *dst,   void *src,   size_t nbytes, enum cudaMemcpyKind direction);
    //returns after the copy is complete
    /*blocks the CPU thread until all bytes have been copied;
    doesn't start copying until previous CUDA calls complete
    enum cudaMemcpyKind:
      cudaMemcpyHostToDevice
      cudaMemcpyDeviceToHost
      cudaMemcpyDeviceToDevice*/
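
    Each of these runtime calls returns a cudaError_t, so checking the result is good practice. A small helper along these lines is a common pattern (the macro name CUDA_CHECK is only an illustrative assumption):

    #include <stdio.h>
    #include <cuda_runtime.h>

    // print a readable message if a CUDA runtime call fails
    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess)                                       \
                printf("CUDA error: %s at %s:%d\n",                       \
                       cudaGetErrorString(err), __FILE__, __LINE__);      \
        } while (0)

    // usage:
    //   CUDA_CHECK( cudaMalloc((void**)&d_a, nbytes) );
    //   CUDA_CHECK( cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice) );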

    Here, <<<grid, block>>> is the kernel launch configuration:

    • 2-level hierarchy: blocks and grid
      • Block = a group of up to 1024 threads
      • Grid = all blocks for a given kernel launch
      • E.g. 72 threads in total: blockDim = 12, gridDim = 6 (see the sketch below)
    • Threads within a block can:
      • Synchronize their execution
      • Communicate via shared memory
    • The sizes of the grid and blocks are specified at kernel launch
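
    A small sketch of the 72-thread configuration above, using shared memory and __syncthreads() for block-level cooperation (reversing each block's elements is just an illustrative operation):

    // launch_config.cu : gridDim = 6 blocks of blockDim = 12 threads (72 threads in total)
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void reverse_in_block(int *data){
        __shared__ int tile[12];                      // shared memory visible to the whole block
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = data[idx];
        __syncthreads();                              // every thread in the block waits here

        // write the block's 12 elements back in reverse order
        data[idx] = tile[blockDim.x - 1 - threadIdx.x];
    }

    int main(){
        const int n = 72;
        int h[72], *d;
        for (int i = 0; i < n; i++) h[i] = i;

        cudaMalloc((void**)&d, n * sizeof(int));
        cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

        reverse_in_block<<<6, 12>>>(d);               // gridDim = 6, blockDim = 12

        cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < 12; i++) printf("%d ", h[i]);   // first block: 11 10 9 ... 0
        printf("\n");
        cudaFree(d);
        return 0;
    }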

    Example:

    #include <stdio.h>

    // kernel: add *b into *a (both pointers refer to device memory)
    __global__ void add(int *a, int *b)
    {
         *a = *a + *b;
    }

    int main()
    {
        int c = 0;
        int a = 1, b = 2;
        int *d_a, *d_b;                    // device pointers
        cudaMalloc((void**)&d_a, sizeof(a));
        cudaMalloc((void**)&d_b, sizeof(b));
        cudaMemset(d_a, 0, sizeof(a));
        cudaMemset(d_b, 0, sizeof(b));

        // copy the inputs from host to device
        cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

        // launch one block containing one thread
        add<<<1,1>>>(d_a, d_b);

        // copy the result back to the host
        cudaMemcpy(&c, d_a, sizeof(int), cudaMemcpyDeviceToHost);

        printf("%d\n", c);                 // prints 3

        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }
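
    Compiled the same way as the hello-world example (nvcc followed by ./a.out), this prints 3: the kernel adds b into a on the device before the result is copied back to c.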

    Thread index computation:

      idx = blockIdx.x * blockDim.x + threadIdx.x
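
    For example, with blockDim.x = 12, the thread with blockIdx.x = 3 and threadIdx.x = 5 gets the global index idx = 3 * 12 + 5 = 41.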


    Applications

    High performance math routines for your applications:

    • cuFFT – Fast Fourier Transforms Library
    • cuBLAS – Complete BLAS Library
    • cuSPARSE – Sparse Matrix Library
    • cuRAND – Random Number Generation (RNG) Library
    • NPP – Performance Primitives for Image & Video Processing
    • Thrust – Templated C++ Parallel Algorithms & Data Structures
    • math.h - C99 floating-point Library
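
    As a quick illustration of Thrust, a minimal sketch that sums a vector on the GPU (the vector size and contents are arbitrary):

    // thrust_sum.cu : parallel reduction with Thrust
    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <stdio.h>

    int main(){
        thrust::device_vector<int> d_vec(1000, 1);                 // 1000 ones, stored on the GPU
        int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);   // parallel sum on the device
        printf("sum = %d\n", sum);                                 // prints 1000
        return 0;
    }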
     
     
Original post: https://www.cnblogs.com/coder2012/p/3056464.html