• vulkan asynchronous compute


    https://www.youtube.com/watch?v=XOGIDMJThto

    https://www.khronos.org/assets/uploads/developers/library/2016-vulkan-devday-uk/9-Asynchonous-compute.pdf

    https://docs.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization

    https://gpuopen.com/concurrent-execution-asynchronous-queues/

Use parallelism across queues to increase parallelism within the GPU.

The key property here is concurrency.

    Radeon™ Fury X GPU consists of 64 Compute Units (CUs), each of those containing 4 Single-Instruction-Multiple-Data units (SIMD) and each SIMD executes blocks of 64 threads, which we call a “wavefront”.

    Since latency for memory access can cause significant stalls in shader execution, up to 10 wavefronts can be scheduled on each SIMD simultaneously to hide this latency.

The GPU has 64 CUs.

Each CU has 4 SIMDs.

Each SIMD runs wavefronts of 64 threads each (up to 10 resident per SIMD).

Pixel shader (PS) work executes inside these wavefronts.

Raising concurrency on the GPU reduces GPU idle time.

    async compute

    • Copy Queue(DirectX 12) / Transfer Queue (Vulkan): DMA transfers of data over the PCIe bus
    • Compute queue (DirectX 12 and Vulkan): execute compute shaders or copy data, preferably within local memory
    • Direct Queue (DirectX 12) / Graphics Queue (Vulkan):  this queue can do anything, so it is similar to the main device in legacy APIs

These three queue types correspond to the three encoder types in Metal; they exist to increase the concurrency described above.

This ability to drive the GPU at a low level is exposed through these queues.

Vulkan limits how many queues each queue family supports; the counts can be queried.

DX12 has no such limit on the number of queues.

Move more of the frame into compute shaders (CS) so it can run as async compute.

See the diagram (a skill point I haven't unlocked yet).

Troubleshooting

• If resources are located in system memory, accessing them from the Graphics or Compute queues will have an impact on DMA queue performance, and vice versa.
• Graphics and Compute queues accessing local memory (e.g. fetching texture data, writing to UAVs or performing rasterization-heavy tasks) can affect each other due to bandwidth limitations. (Note: bandwidth is the limit, so keep data on-chip where possible.)
• Threads sharing the same CU will share GPRs and LDS, so tasks that use all available resources may prevent asynchronous workloads from executing on the same CU
    • Different queues share their caches. If multiple queues utilize the same caches this can result in more cache thrashing and reduce performance

    Due to the reasons above it is recommended to determine bottlenecks for each pass and place passes with complementary bottlenecks next to each other:

    • Compute shaders which make heavy use of LDS and ALU are usually good candidates for the asynchronous compute queue
    • Depth only rendering passes are usually good candidates to have some compute tasks run next to it
    • A common solution for efficient asynchronous compute usage can be to overlap the post processing of frame N with shadow map rendering of frame N+1
• Porting as much of the frame as possible to compute results in more flexibility when experimenting with which tasks can be scheduled next to each other
    • Splitting tasks into sub-tasks and interleaving them can reduce barriers and create opportunities for efficient async compute usage (e.g. instead of “for each light clear shadow map, render shadow, compute VSM” do “clear all shadow maps, render all shadow maps, compute VSM for all shadow maps”)

Then add a switch so the async compute feature can be toggled on and off.

Judging from Vulkan's design, it seems to lack Metal 2's persistent thread groups, which keep data on-tile while it is passed between the compute (CS) and pixel (PS) stages.

  • 原文地址:https://www.cnblogs.com/minggoddess/p/11636422.html