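Recently a colleague posted in our company chat about a UE4 optimization for Mask materials that can improve performance considerably when a scene contains large areas of grass and trees. It relies on a GPU feature called Early Z, but the approach differed somewhat from my initial understanding, because Early Z is implemented in GPU hardware and each vendor implements it differently. After reading some references and running an experiment, let us unveil Early Z. We first explain what Early Z is, then how UE4 uses it to address the overdraw caused by grass and trees, then how Early Z has evolved, and finally we use experimental data to verify how Early Z actually works.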
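What Is Early Z
In the traditional rendering pipeline, the depth test happens after the Pixel/Fragment Shader, as shown in the figure below:
On reflection, though, the depth of every fragment is already known at rasterization time. If the test could be done at that point, the expensive Pixel/Fragment Shader work that follows could be skipped for fragments that fail. Hardware vendors naturally thought of this too, and each implemented Early Z in its own hardware. Here is a brief look at some of the documentation I found online.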
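To make the saving concrete, here is a minimal Python sketch that counts shader runs for a single pixel with the depth test placed after the shader (late Z) and before it (early Z). It is a toy model, not how any GPU is implemented: the function name `render`, the LESS depth test, and the sample depths are all assumptions chosen for illustration.

```python
def render(fragments, early_z):
    """Process fragments covering ONE pixel, in submission order.

    fragments: list of depths, smaller = closer (LESS depth test assumed).
    Returns (shader_runs, final_depth).
    """
    depth_buffer = 1.0  # cleared to the far plane
    shader_runs = 0
    for z in fragments:
        if early_z:
            # Early Z: test first, so occluded fragments never reach the shader.
            if not z < depth_buffer:
                continue
            shader_runs += 1
            depth_buffer = z
        else:
            # Late Z: the shader always runs, the test happens afterwards.
            shader_runs += 1
            if z < depth_buffer:
                depth_buffer = z
    return shader_runs, depth_buffer


fragments = [0.2, 0.5, 0.9, 0.1]  # hypothetical depths
late_result = render(fragments, early_z=False)
early_result = render(fragments, early_z=True)
print("late Z :", late_result)
print("early Z:", early_result)
```

Both orderings end with the same depth value, but in this example the early test skips the shader for the two fragments that are already occluded when they arrive.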
nVidia
nVidia's GPU Programming Guide includes a section on Early Z optimization that covers some details of how to use it.
From the GPU Programming Guide Version 2.5.0 (GeForce 7 and earlier GPUs):
Early-z optimization (sometimes called "z-cull") improves performance by avoiding the rendering of occluded surfaces. If the occluded surfaces have expensive shaders applied to them, z-cull can save a large amount of computation time. To take advantage of z-cull, follow these guidelines:
- Don't create triangles with holes in them (that is, avoid alpha test or texkill)
- Don't modify depth (that is, allow the GPU to use the interpolated depth value)
Violating these rules can invalidate the data the GPU uses for early optimization, and can disable z-cull until the depth buffer is cleared again.
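In other words: do not use alpha test or texkill (clip/discard), and do not modify depth; use only the depth interpolated by the rasterizer. Violating these rules defeats the GPU's Early Z optimization until the depth buffer is next cleared. Those restrictions reflected the hardware of the time. Do current GPUs still have them? The experiment later in this article addresses that question.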
ZCULL and EarlyZ: Coarse and Fine-grained Z and Stencil Culling
NVIDIA GeForce 6 series and later GPUs can perform a coarse level Z and Stencil culling. Thanks to this optimization large blocks of pixels will not be scheduled for pixel shading if they are determined to be definitely occluded. In addition, GeForce 8 series and later GPUs can also perform fine-grained Z and Stencil culling, which allow the GPU to skip the shading of occluded pixels. These hardware optimizations are automatically enabled when possible, so they are mostly transparent to developers. However, it is good to know when they cannot be enabled or when they can underperform to ensure that you are taking advantage of them.
Coarse Z/Stencil culling (also known as ZCULL) will not be able to cull any pixels in the following cases:
1. If you don't use Clears (instead of fullscreen quads that write depth) to clear the depth-stencil buffer.
2. If the pixel shader writes depth.
3. If you change the direction of the depth test while writing depth. ZCULL will not cull any pixels until the next depth buffer Clear.
4. If stencil writes are enabled while doing stencil testing (no stencil culling)
5. On GeForce 8 series, if the DepthStencilView has Texture2D[MS]Array dimension
Also note that ZCULL will perform less efficiently in the following circumstances
1. If the depth buffer was written using a different depth test direction than that used for testing
2. If the depth of the scene contains a lot of high frequency information (i.e.: the depth varies a lot within a few pixels)
3. If you allocate too many large depth buffers.
4. If using DXGI_FORMAT_D32_FLOAT format
Similarly, fine-grained Z/Stencil culling (also known as EarlyZ) is disabled in the following cases:
1. If the pixel shader outputs depth
2. If the pixel shader uses the .z component of an input attribute with the SV_Position semantic (only on GeForce 8 series in D3D10)
3. If Depth or Stencil writes are enabled, or Occlusion Queries are enabled, and one of the following is true:
• Alpha-test is enabled
• Pixel Shader kills pixels (clip(), texkil, discard)
• Alpha To Coverage is enabled
• SampleMask is not 0xFFFFFFFF (SampleMask is set in D3D10 using OMSetBlendState and in D3D9 setting the D3DRS_MULTISAMPLEMASK renderstate)
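The above is from the GPU Programming Guide for the GeForce 8 and 9 Series. It adds ZCULL (that is, Hierarchical Z), which comes with its own caveats, but it does not say clearly whether enabling alpha test also disables Early Z for all subsequent draws.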
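The fine-grained EarlyZ conditions quoted above can be restated as a small predicate. The Python sketch below is only a paraphrase of that list for the GeForce 8 and 9 generation, not a driver API: `DrawState`, its field names, and `early_z_allowed` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class DrawState:
    """Hypothetical snapshot of the state relevant to one draw call."""
    ps_writes_depth: bool = False
    ps_reads_position_z: bool = False  # .z of an SV_Position input
    depth_write: bool = True
    stencil_write: bool = False
    occlusion_query: bool = False
    alpha_test: bool = False
    ps_discards: bool = False          # clip(), texkill, discard
    alpha_to_coverage: bool = False
    sample_mask: int = 0xFFFFFFFF


def early_z_allowed(s, geforce8_d3d10=False):
    """Return True if none of the quoted disabling conditions applies."""
    if s.ps_writes_depth:
        return False
    if geforce8_d3d10 and s.ps_reads_position_z:
        return False
    # Condition 3: the draw updates depth/stencil or feeds an occlusion query...
    has_side_effects = s.depth_write or s.stencil_write or s.occlusion_query
    # ...and some mechanism may kill a fragment after the early test.
    can_kill = (s.alpha_test or s.ps_discards or s.alpha_to_coverage
                or s.sample_mask != 0xFFFFFFFF)
    return not (has_side_effects and can_kill)


masked_with_writes = DrawState(ps_discards=True, depth_write=True)
masked_no_writes = DrawState(ps_discards=True, depth_write=False)
print(early_z_allowed(masked_with_writes))  # False
print(early_z_allowed(masked_no_writes))    # True
```

Under these rules a shader that discards pixels loses EarlyZ only while depth or stencil writes (or occlusion queries) are active; with writes off the same shader keeps it.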
AMD
Emil Persson's Depth in Depth gives a fairly in-depth explanation of Early Z.
Hierarchical Z, or HiZ for short, allows tiles of pixels to be rejected in a hierarchical fashion. This allows for faster rejection of occluded pixels and offers some bandwidth saving by doing a rough depth test using lower resolution buffers first instead of reading individual depth samples. Tiles that can safely be discarded are eliminated and thus the fragment shader will not be executed for those pixels. Tiles that cannot safely be discarded are passed on to the Early Z stage, which will be discussed later on.
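As an aside, the tile-level rejection described here can be sketched in a few lines of Python. This is a toy model under stated assumptions: a LESS depth test, a hypothetical 4x4 tile size, and a low-resolution buffer that stores the farthest depth per tile. Real hardware formats are vendor-specific and are not described by the source; `build_hiz` and `tile_rejected` are illustrative names.

```python
TILE = 4  # hypothetical tile size in pixels


def build_hiz(depth, tile=TILE):
    """Per-tile maximum (farthest) depth of a square buffer given as rows."""
    n = len(depth)
    return [[max(depth[y][x]
                 for y in range(ty, ty + tile)
                 for x in range(tx, tx + tile))
             for tx in range(0, n, tile)]
            for ty in range(0, n, tile)]


def tile_rejected(hiz, tile_x, tile_y, incoming_min_z):
    """With a LESS test, incoming geometry whose nearest depth is not closer
    than the farthest stored depth of the tile cannot pass at any pixel."""
    return incoming_min_z >= hiz[tile_y][tile_x]


# 8x8 depth buffer: an occluder at depth 0.3 covers the left half,
# the right half is still at the cleared value 1.0.
depth = [[0.3] * 4 + [1.0] * 4 for _ in range(8)]
hiz = build_hiz(depth)

# Incoming geometry whose nearest point is at depth 0.6.
print(tile_rejected(hiz, 0, 0, 0.6))  # left tile: whole tile culled
print(tile_rejected(hiz, 1, 0, 0.6))  # right tile: passed on to Early Z
```

One comparison per tile replaces sixteen per-pixel depth reads in this model, which is the bandwidth saving the quoted text refers to.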
The Early Z component operates on a pixel level and allows fragments to be rejected before executing the fragment shader. This means that if a certain fragment is found to be occluded by the current contents of the depth buffer, the fragment shader doesn't have to run for that pixel. Early Z can also reject fragments before shading based on the stencil test. On hardware prior to the Radeon HD 2000 series, early Z was a monolithic top-of-the-pipe operation, which means that the entire read-modify-write cycle is executed before the fragment shader. As a result this impacts other functionality that kills fragments such as alpha test and texkill (called "clip" in HLSL and "discard" in GLSL). If Early Z would be left on and the alpha test kills a fragment, the depth- and/or stencil-buffer would have been incorrectly updated for the killed fragments. Therefore, Early Z is disabled for these cases. However, if depth and stencil writes are disabled there are no updates to the depth-stencil buffer anyway, so in this case Early Z will be enabled. On the Radeon HD 2000 series, Early Z works in all cases.
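The incorrect update described above is easy to reproduce in a toy model. The Python sketch below assumes a monolithic early Z (test and write both happen before the shader), a LESS depth test, and an alpha threshold of 0.5 for the kill. The `draw` function and the two-fragment scene are hypothetical.

```python
def draw(depth_buffer, fragments, early_z, depth_write=True):
    """fragments: list of (pixel, z, alpha); alpha < 0.5 is killed in the shader.
    Returns the pixels whose color was written."""
    written = []
    for pixel, z, alpha in fragments:
        if early_z:
            # Monolithic early Z: test AND write happen before the shader.
            if not z < depth_buffer[pixel]:
                continue
            if depth_write:
                depth_buffer[pixel] = z  # written before we know the fragment survives
            if alpha < 0.5:
                continue                 # killed, but depth may already be updated
            written.append(pixel)
        else:
            # Late Z: the kill happens first, so dead fragments never touch depth.
            if alpha < 0.5:
                continue
            if z < depth_buffer[pixel]:
                if depth_write:
                    depth_buffer[pixel] = z
                written.append(pixel)
    return written


# One pixel: a hole in a leaf card at z=0.3, then a solid wall at z=0.6.
scene = [(0, 0.3, 0.0), (0, 0.6, 1.0)]

late_buf = [1.0]
late_written = draw(late_buf, scene, early_z=False)

early_buf = [1.0]
early_written = draw(early_buf, scene, early_z=True)

print(late_written, late_buf)    # wall visible through the hole
print(early_written, early_buf)  # killed leaf fragment blocks the wall
```

With depth writes disabled the early path never modifies the buffer, so only the second fragment is written, which is why the quoted text says Early Z can stay enabled in that case.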
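At the end, the author also provides a reference table listing the cases in which Early Z is disabled, as shown in the figure below:
Summary
Both documents above are fairly old, and after reading them it may still be unclear exactly when Early Z is disabled. These restrictions also change as hardware evolves, so later in this article we run an experiment to check.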
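UE4's Early Z Optimization for Mask Materials
Having covered what Early Z is, let us look at how UE4 handles the overdraw caused by Mask materials.
It requires enabling a setting called Mask Material Only in Early-Z pass.
That is only the switch; how is it implemented in code? We will not paste code here and only describe the steps. For details, see the UE4 Pre-Pass code.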
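- First, UE4 renders all Opaque and Mask materials in a Pre-Pass that writes only depth and no color, which makes the writes fast. Opaque objects are rendered first, then Mask objects, with clip enabled for the Mask objects.
- After the Pre-Pass, Opaque objects are rendered with the depth test set to Equal and depth writes off. Mask objects are then rendered the same way, with depth writes off and the depth test set to Equal, but this time without clip: the Pre-Pass has already written the depth, so only pixels that pass the Equal test need to be written. This is where the name Mask Material only in early Z-pass comes from.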
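The two steps above can be sketched as a toy per-pixel model in Python. This is not UE4 code: the function names, the alpha threshold of 0.5, and the assumption that a masked draw with clip and depth writes gets no early Z rejection in the single-pass case are all illustrative.

```python
def single_pass(layers):
    """layers: (z, alpha) fragments covering one pixel, in draw order.
    Masked material with clip + depth write, early Z assumed unavailable:
    the full material shader runs for every fragment."""
    depth, full_runs = 1.0, 0
    for z, alpha in layers:
        full_runs += 1                    # expensive shader, includes clip
        if alpha >= 0.5 and z < depth:
            depth = z
    return full_runs, depth


def prepass_then_equal(layers):
    """Pre-Pass with clip, then base pass with depth test Equal and no clip."""
    depth, cheap_runs, full_runs = 1.0, 0, 0
    # Pass 1: depth only, clip on, no color output.
    for z, alpha in layers:
        cheap_runs += 1                   # cheap shader: sample mask and clip
        if alpha >= 0.5 and z < depth:
            depth = z
    # Pass 2: depth test Equal, depth write off, no clip, so early Z can
    # reject every fragment that is not the visible one.
    for z, alpha in layers:
        if z == depth:
            full_runs += 1                # expensive shader, visible fragment only
    return cheap_runs, full_runs, depth


# Five overlapping grass cards over one pixel (hypothetical values).
layers = [(0.8, 1.0), (0.6, 0.0), (0.4, 1.0), (0.2, 0.0), (0.5, 1.0)]
print(single_pass(layers))         # (5, 0.4)
print(prepass_then_equal(layers))  # (5, 1, 0.4)
```

In this model the expensive shader runs once instead of five times, at the cost of five cheap Pre-Pass invocations, so the trade pays off only when the full material is much more expensive than the depth-only shader.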
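That is how UE4 speeds up Mask materials, with the caveat that the gain is large only when the Mask materials in your scene are expensive. But wait: this implementation contradicts some of the articles we just read, and other documents are unclear on the point. Since UE4 ships this feature and it does improve performance, the earlier articles presumably applied only to the GPUs of their time, and hardware has since become smarter and handles more cases. To check, let us run an experiment.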
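Unveiling Early Z
To test these assumptions I ran a simple experiment. The demo is based on the Texturing tutorial from rastertek's Direct3D 11 series, which renders a single textured triangle on screen, as shown in the figure below:
I modified the code to draw four triangles at the same position. The first triangle is rendered as Mask; the second modifies depth in the PS; the third is rendered as Mask; the fourth is also rendered as Mask but, as in UE4, with depth writes off, the depth test set to Equal, and clip disabled. The test GPU was an nVidia GTX 570. Using GPA (Intel Graphics Performance Analyzers), I recorded the number of PS invocations and the number of pixels actually rendered for each draw, as shown in the table below: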
| Draw | Depth | Clip | PS Invocations | Pixels Rendered |
|---|---|---|---|---|
| 1 Mask | Less, Write | Yes | 10.4k | 6548 |
| 2 Modify depth | Less, Write | No | 10.4k | 3820 |
| 3 Mask | Less, Write | Yes | 10.4k | 0 |
| 4 Mask | Equal, no Write (whether depth is written does not affect the result because the test is Equal, but writes are disabled to save bandwidth) | No | 6548 | 6548 |
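Draws 1 to 3 each ran the PS about 10.4k times, whereas draw 4 ran it only 6548 times, exactly the number of pixels it rendered. So modifying depth or using clip affects Early Z only for the current draw call and does not affect Early Z for later draws. As hardware has evolved, Early Z (including Hierarchical Z) has become smarter and handles more cases.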
Summary
From this brief analysis of Early Z and the experiment, we reach some useful conclusions:
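- Early Z is implemented in hardware. As hardware evolves, its capabilities keep improving and it handles more cases.
- Alpha test and depth modification both disable Early Z for the current draw, but later draws can still use Early Z (Hierarchical Z).
- The rendering API lets you force Early Z by marking the shader with earlydepthstencil (D3D) or layout(early_fragment_tests) in; (OpenGL).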
As hardware evolves, many of its old restrictions are lifted, so we need to keep learning in order to optimize our engines and games correctly.