• vulkan asynchronous compute


    https://www.youtube.com/watch?v=XOGIDMJThto

    https://www.khronos.org/assets/uploads/developers/library/2016-vulkan-devday-uk/9-Asynchonous-compute.pdf

    https://docs.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization

    https://gpuopen.com/concurrent-execution-asynchronous-queues/

Use parallelism across queues to increase parallelism within the GPU.

The key property here is concurrency.

    Radeon™ Fury X GPU consists of 64 Compute Units (CUs), each of those containing 4 Single-Instruction-Multiple-Data units (SIMD) and each SIMD executes blocks of 64 threads, which we call a “wavefront”.

    Since latency for memory access can cause significant stalls in shader execution, up to 10 wavefronts can be scheduled on each SIMD simultaneously to hide this latency.

The GPU has 64 CUs.

Each CU has 4 SIMDs.

Each SIMD runs wavefronts of 64 threads each (up to 10 resident per SIMD).

Pixel shader (PS) work executes inside these wavefronts.

Raising concurrency on the GPU reduces GPU idle time.

    async compute

    • Copy Queue(DirectX 12) / Transfer Queue (Vulkan): DMA transfers of data over the PCIe bus
    • Compute queue (DirectX 12 and Vulkan): execute compute shaders or copy data, preferably within local memory
    • Direct Queue (DirectX 12) / Graphics Queue (Vulkan):  this queue can do anything, so it is similar to the main device in legacy APIs

These three queue types correspond to the three encoder types in Metal; they exist to increase the concurrency described above.

This ability to drive the GPU at a low level is exposed through these queues.

Vulkan limits how many queues each queue family supports; the counts can be queried.

DX12 has no such limit on the number of queues.

Move more of the frame into compute shaders (CS) so it can run as async compute.

See the diagram (a skill point I haven't unlocked yet).

Troubleshooting

• If resources are located in system memory, accessing them from the Graphics or Compute queues will have an impact on DMA queue performance, and vice versa.
• Graphics and Compute queues accessing local memory (e.g. fetching texture data, writing to UAVs or performing rasterization-heavy tasks) can affect each other due to bandwidth limitations. (Note: bandwidth is the limit, so keep data on-chip where possible.)
• Threads sharing the same CU will share GPRs and LDS, so tasks that use all available resources may prevent asynchronous workloads from executing on the same CU
    • Different queues share their caches. If multiple queues utilize the same caches this can result in more cache thrashing and reduce performance

    Due to the reasons above it is recommended to determine bottlenecks for each pass and place passes with complementary bottlenecks next to each other:

    • Compute shaders which make heavy use of LDS and ALU are usually good candidates for the asynchronous compute queue
    • Depth only rendering passes are usually good candidates to have some compute tasks run next to it
    • A common solution for efficient asynchronous compute usage can be to overlap the post processing of frame N with shadow map rendering of frame N+1
• Porting as much of the frame as possible to compute results in more flexibility when experimenting with which tasks can be scheduled next to each other
    • Splitting tasks into sub-tasks and interleaving them can reduce barriers and create opportunities for efficient async compute usage (e.g. instead of “for each light clear shadow map, render shadow, compute VSM” do “clear all shadow maps, render all shadow maps, compute VSM for all shadow maps”)

Then add a switch so the async compute feature can be toggled on and off.

Judging from Vulkan's design, it seems to lack Metal 2's persistent thread groups, which keep data on-tile while it is passed between the compute (CS) and pixel (PS) stages.

  • 原文地址:https://www.cnblogs.com/minggoddess/p/11636422.html