• Mali compute shader optimisation


    https://community.arm.com/developer/tools-software/graphics/b/blog/posts/arm-mali-compute-architecture-fundamentals

    So what do the Midgard architectural features actually mean for optimising compute kernels? I recommend:

    • Having sufficient instruction level parallelism in kernel code to allow for dense packing of instructions into instruction words by the compiler. (This addresses the VLIW-ness of the architecture.)
    • Using vector operations in kernel code to allow for straightforward mapping to vector instructions by the compiler. (I will have much more to say on vectorisation later, as it's one of my favourite topics; a brief sketch follows after this list.)
    • Having a balance between A and LS instruction words. Without cache misses, a 2:1 ratio of A-words to LS-words would be optimal; with cache misses, a higher ratio is desirable. For example, a kernel consisting of 15 A-words and 7 LS-words sits only slightly above 2:1, so it is still likely to be bound by the LS-pipe once cache misses are factored in.
    • Using a sufficient number of concurrently executing (or active) threads per core to hide the execution latency of instructions (which equals the depth of the corresponding pipeline).
    • The maximum number of active threads I is determined by the number of registers R that the kernel code uses: I = 256, if 0 < R ≤ 4; I = 128, if 4 < R ≤ 8; I = 64, if 8 < R ≤ 16.
    • For example, kernel A that uses 5 registers and kernel B that uses 8 registers can both be executed by running no more than 128 threads per core.
    • This means that it may be preferable to split complex, register-heavy kernels into a number of simpler ones (see the second sketch below).
    • (For the compiler folk among us, this also means that the backend may decide to spill a value to memory rather than use an extra register when its heuristics suggest that the number of registers likely to be required is approaching 4 or 8.)
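
    To make the ILP and vectorisation recommendations concrete, here is a minimal OpenCL C sketch (the SAXPY workload and kernel names are my own illustration, not from the original post). The scalar variant fills only one lane of each arithmetic instruction word and issues narrow loads; the float4 variant lets the compiler map the arithmetic straight onto the vector units:

        // Scalar version: one float per work-item. Each A-instruction
        // word uses a single lane, and the LS-pipe issues many narrow
        // 4-byte accesses.
        __kernel void saxpy_scalar(__global const float *x,
                                   __global const float *y,
                                   __global float *out,
                                   const float a)
        {
            size_t i = get_global_id(0);
            out[i] = a * x[i] + y[i];
        }

        // Vectorised version: four floats per work-item. The float4
        // multiply-add maps directly onto a vector A-instruction, and
        // each LS access now moves 16 bytes. Enqueue with a global work
        // size one quarter of the scalar version's.
        __kernel void saxpy_vec4(__global const float4 *x,
                                 __global const float4 *y,
                                 __global float4 *out,
                                 const float a)
        {
            size_t i = get_global_id(0);
            out[i] = a * x[i] + y[i];
        }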
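
    On the register-pressure point, here is a hypothetical before/after of splitting a kernel; the stage bodies are elided placeholders, since the point is how many values are simultaneously live, which drives the register count and hence the active-thread cap:

        // Fused kernel: keeping the temporaries of both stages live at
        // once can push the register count past 8, capping the core at
        // 64 active threads.
        __kernel void fused(__global const float *in, __global float *out)
        {
            size_t i = get_global_id(0);
            float v = in[i];
            /* stage 1: many live temporaries */
            /* stage 2: many more live temporaries */
            out[i] = v;
        }

        // Split alternative: each pass keeps fewer values live, so each
        // can plausibly stay within 8 registers and run with up to 128
        // active threads, at the cost of a round trip through memory for
        // the intermediate buffer tmp.
        __kernel void stage1(__global const float *in, __global float *tmp)
        {
            size_t i = get_global_id(0);
            tmp[i] = in[i];   /* placeholder for stage-1 work */
        }

        __kernel void stage2(__global const float *tmp, __global float *out)
        {
            size_t i = get_global_id(0);
            out[i] = tmp[i];  /* placeholder for stage-2 work */
        }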

    In some respects, writing high-performance code for the Mali GPUs embedded in SoCs is easier than for GPUs found in desktop machines:

    • The global and local OpenCL address spaces get mapped to the same physical memory (the system RAM), backed by caches that are transparent to the programmer. This often removes the need for explicit data copying and the associated barrier synchronisation (see the sketch after this list).
    • Since all threads have individual program counters, branch divergence is less of an issue than for warp-based architectures.
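
    To illustrate the first point, here is a hedged sketch (kernel names and the reduction workload are mine). On a desktop GPU one would typically stage each work-group's slice in __local memory before reducing it; on Mali, that staging copy moves data from cached system RAM to the same cached system RAM, so reading __global memory directly can be both simpler and no slower:

        // Desktop-style reduction: stage the work-group's slice in
        // __local memory, then tree-reduce with barriers. Assumes a
        // power-of-two work-group size.
        __kernel void reduce_local(__global const float *in,
                                   __global float *partial,
                                   __local float *scratch)
        {
            size_t lid = get_local_id(0);
            scratch[lid] = in[get_global_id(0)];
            barrier(CLK_LOCAL_MEM_FENCE);
            for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
                if (lid < s)
                    scratch[lid] += scratch[lid + s];
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            if (lid == 0)
                partial[get_group_id(0)] = scratch[0];
        }

        // Mali-friendly alternative: skip the __local staging entirely.
        // Each work-item accumulates a contiguous chunk of __global
        // memory; no barriers and no __local allocation are needed.
        __kernel void reduce_direct(__global const float *in,
                                    __global float *partial,
                                    const uint chunk)
        {
            size_t gid = get_global_id(0);
            float acc = 0.0f;
            for (uint j = 0; j < chunk; ++j)
                acc += in[gid * chunk + j];
            partial[gid] = acc;
        }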
  • Original source: https://www.cnblogs.com/minggoddess/p/12625085.html