GPU Basic Concepts Explained

1 multiprocessor <-> 1 instruction unit <-> 8 processors <-> executes one warp at a time <-> 32 threads per warp (these figures describe early CUDA GPUs of compute capability 1.x)

Each processor handles one thread at a time, so the 8 processors of a multiprocessor need 32 / 8 = 4 clock cycles to execute one instruction for a full warp.
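To make the grouping of threads into warps concrete, here is a minimal sketch. The kernel `record_warp_ids` and the host helper `warp_ids_for_block` are hypothetical names introduced for illustration; the code uses the built-in `warpSize` instead of hard-coding 32.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread records the index of the warp it belongs to within its block.
// Warps are formed from consecutive threadIdx.x values.
__global__ void record_warp_ids(int *warp_id)
{
    int t = threadIdx.x;
    warp_id[t] = t / warpSize;   // t % warpSize would give the lane inside the warp
}

// Hypothetical host helper: launches one block and returns the warp index of
// every thread. Returns an empty vector if any CUDA call fails.
std::vector<int> warp_ids_for_block(int threads_per_block)
{
    std::vector<int> host(threads_per_block, -1);
    size_t bytes = host.size() * sizeof(int);
    int *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return {};

    record_warp_ids<<<1, threads_per_block>>>(dev);

    cudaError_t err = cudaGetLastError();   // catches launch errors
    if (err == cudaSuccess)                 // cudaMemcpy waits for the kernel to finish
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}
```

Calling `warp_ids_for_block(64)` yields warp index 0 for threads 0 to 31 and warp index 1 for threads 32 to 63.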

A multiprocessor can host several blocks at once, but each block runs entirely on a single multiprocessor.

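Each block has its own shared memory, which is divided into 16 banks, exactly matching a half-warp of 16 threads (a warp has 32 threads). When several threads of a half-warp request different addresses in the same bank, a bank conflict occurs and the accesses are serialized: the threads queue up and the parallel advantage of the GPU is lost for that access, so this pattern should be avoided wherever possible. The exception is when the threads all read the same address, in which case the bank broadcasts the word and nothing is serialized.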

A single processor handles only one thread at a time; the GPU outperforms a CPU because it has many processors working in parallel.

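Coalescing: global memory transactions are issued per half-warp (16 threads). For the accesses of a half-warp to be combined, the address accessed by the first thread must be a multiple of 64 bytes (for 4-byte words), and the remaining 15 threads must access the following words in order, one word per thread. Only then is the access carried out as a single batched global memory transaction; otherwise each thread issues its own transaction, which is very inefficient.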
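The access pattern described above can be sketched with two kernels that compute the same result. Both kernels and the host helper `scale_on_gpu` are hypothetical names, and the launch configuration of 4 blocks of 64 threads is an arbitrary assumption. In the first kernel neighbouring threads touch neighbouring 4-byte words; in the second each thread walks its own chunk, so neighbouring threads are several words apart.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: in every iteration thread k of the grid touches element
// base + k, so neighbouring threads access neighbouring 4-byte words.
__global__ void scale_coalesced(const float *in, float *out, int n)
{
    int total = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += total)
        out[i] = 2.0f * in[i];
}

// Not coalesced: each thread walks its own contiguous chunk, so at any
// moment neighbouring threads are `chunk` words apart.
__global__ void scale_chunked(const float *in, float *out, int n)
{
    int total = gridDim.x * blockDim.x;
    int chunk = (n + total - 1) / total;
    int first = (blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    for (int i = first; i < first + chunk && i < n; ++i)
        out[i] = 2.0f * in[i];
}

// Hypothetical host helper: doubles every element with the chosen kernel.
// Returns an empty vector if any CUDA call fails.
std::vector<float> scale_on_gpu(const std::vector<float> &in, bool coalesced)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    // cudaMalloc returns base addresses aligned well beyond 64 bytes.
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return {}; }

    cudaError_t err = cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        if (coalesced) scale_coalesced<<<4, 64>>>(d_in, d_out, n);
        else           scale_chunked<<<4, 64>>>(d_in, d_out, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return {};
    return out;
}
```

Both variants return identical data; only the memory traffic differs, which is why the coalesced form is generally preferred.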

- Each active thread is allocated some registers for the entire lifetime of the thread.
- Each active block is allocated shared memory for the entire lifetime of the block. Shared memory:
  - is very fast on-chip memory;
  - can be used to avoid non-coalesced global memory accesses;
  - can be used to reduce global memory accesses.
- Shared memory is organized into 16 banks, where successive 4-byte words are assigned to successive banks.
- A memory load or store of n addresses by a half-warp that spans n distinct memory banks can be serviced simultaneously.
- If multiple addresses map to the same memory bank, the accesses are serialized.
- If multiple threads request the same memory address, a broadcast occurs.
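The bank rules above can be illustrated with a tiled matrix transpose, a common case where column-wise reads of a shared-memory tile would otherwise hit a single bank. This is a sketch under the 16-bank assumption of the text; `transpose_tile` and `transpose_on_gpu` are hypothetical names. Padding each tile row by one word places the elements of a column 17 words apart, so they fall into different banks.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16

// Transposes a rows x cols row-major matrix `in` into the cols x rows matrix `out`.
__global__ void transpose_tile(const float *in, float *out, int rows, int cols)
{
    // Without the +1, tile[0][c], tile[1][c], ... would be 16 words apart and
    // map to the same bank (a 16-way conflict with 16 banks). With the
    // padding they are 17 words apart and map to 16 different banks.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;   // column in `in`
    int y = blockIdx.y * TILE + threadIdx.y;   // row in `in`
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();                           // the whole tile must be loaded first

    x = blockIdx.y * TILE + threadIdx.x;       // column in `out`
    y = blockIdx.x * TILE + threadIdx.y;       // row in `out`
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];   // column-wise read
}

// Hypothetical host helper. Returns an empty vector if any CUDA call fails.
std::vector<float> transpose_on_gpu(const std::vector<float> &in, int rows, int cols)
{
    std::vector<float> out(in.size());
    if (in.empty()) return out;
    size_t bytes = in.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return {}; }

    cudaError_t err = cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
        transpose_tile<<<grid, block>>>(d_in, d_out, rows, cols);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return {};
    return out;
}
```

The global reads and writes stay row-wise in both phases, so the tile also serves the purpose listed earlier of avoiding non-coalesced global memory accesses.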

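Threads in the same block share that block's shared memory, and only threads in the same block can synchronize with one another through __syncthreads(). This block-level cooperation is what shared memory is for. Global memory, by contrast, is visible to every thread but has no notion of a block: it does not organize threads into blocks or let the threads of one block cooperate.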
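As an illustration of threads in one block cooperating through shared memory and `__syncthreads()`, the following sketch sums the elements handled by each block. The names `block_sum` and `block_sums_on_gpu` and the block size of 128 are assumptions made for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 128

// Every block adds up its BLOCK input elements and writes one total.
__global__ void block_sum(const float *in, float *block_totals, int n)
{
    __shared__ float partial[BLOCK];   // visible only to threads of this block

    int t = threadIdx.x;
    int i = blockIdx.x * BLOCK + t;
    partial[t] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                   // all loads done before anyone reads

    // Tree reduction: halve the number of active threads in each step.
    for (int s = BLOCK / 2; s > 0; s /= 2) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();               // reached by every thread of the block
    }
    if (t == 0) block_totals[blockIdx.x] = partial[0];
}

// Hypothetical host helper: returns one sum per block of BLOCK elements.
// Returns an empty vector if any CUDA call fails.
std::vector<float> block_sums_on_gpu(const std::vector<float> &in)
{
    int n = static_cast<int>(in.size());
    int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<float> totals(blocks);
    if (n == 0) return totals;
    float *d_in = nullptr, *d_totals = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return {};
    if (cudaMalloc(&d_totals, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return {};
    }

    cudaError_t err = cudaMemcpy(d_in, in.data(), n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        block_sum<<<blocks, BLOCK>>>(d_in, d_totals, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(totals.data(), d_totals, blocks * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_totals);
    if (err != cudaSuccess) return {};
    return totals;
}
```

Each block produces its total independently; combining the per-block totals is left to the host or a second kernel, because `__syncthreads()` does not synchronize across blocks.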

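In device code, a variable declared without any memory-space qualifier is placed in a register (the compiler may move large arrays or spilled values to local memory).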
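A small sketch of how qualifiers decide where a variable lives. The names `bias`, `qualifiers_demo`, and `qualifiers_on_gpu` are hypothetical, and whether an unqualified variable really ends up in a register is the compiler's decision.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS 64

__device__ float bias = 1.0f;            // __device__: global memory, seen by all blocks

__global__ void qualifiers_demo(const float *in, float *out, int n)
{
    __shared__ float staged[THREADS];    // __shared__: one copy per block

    int t = threadIdx.x;                 // no qualifier: normally held in a register
    int i = blockIdx.x * THREADS + t;    // also a per-thread register value
    float x = (i < n) ? in[i] : 0.0f;    // private to this thread

    staged[t] = x;
    __syncthreads();                     // make every thread's write visible to the block

    // Read the value staged by the next thread of the same block.
    float neighbour = staged[(t + 1) % THREADS];
    if (i < n) out[i] = x + neighbour + bias;
}

// Hypothetical host helper. Returns an empty vector if any CUDA call fails.
std::vector<float> qualifiers_on_gpu(const std::vector<float> &in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;
    size_t bytes = n * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return {}; }

    cudaError_t err = cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        qualifiers_demo<<<(n + THREADS - 1) / THREADS, THREADS>>>(d_in, d_out, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return {};
    return out;
}
```

Here `x` and `neighbour` cannot be seen by other threads, `staged` is shared by the 64 threads of one block, and `bias` is shared by the whole grid.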

Original article: https://www.cnblogs.com/qingsunny/p/3402732.html
