My final conclusions are:
1. Each thread should have its own caches (texture cache, geometry cache, GPIT cache, file reader cache, ...) so that cache access can be lock-free.
2. The shading grid's data structure is bad for CPU prefetching; its triangles should be stored closer together in memory.
3. We should collect statistics on the BSP.
4. Clip each triangle against its BSP leaf to get a tighter bound.
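
For conclusion 1, here is a minimal sketch of a per-thread cache using C++ `thread_local`. The names `Tile`, `TileCache`, and `threadCache` are made up for illustration and are not taken from the renderer; the `load` function is a placeholder for real file I/O. The only point is that a cache owned by exactly one thread needs no mutex.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical cached item; stands in for a texture tile, geometry chunk, etc.
struct Tile {
    std::vector<float> texels;
};

// Hypothetical cache with no locking at all. This is only safe because
// each instance is touched by a single thread (see threadCache below).
class TileCache {
public:
    // Return the cached tile for `key`, loading it on the first request.
    const Tile& get(const std::string& key) {
        auto it = tiles_.find(key);
        if (it == tiles_.end()) {
            ++misses_;
            it = tiles_.emplace(key, load(key)).first;
        }
        return it->second;
    }

    std::size_t misses() const { return misses_; }

private:
    // Placeholder loader; a real one would read from disk.
    static Tile load(const std::string& key) {
        return Tile{std::vector<float>(16, static_cast<float>(key.size()))};
    }

    std::unordered_map<std::string, Tile> tiles_;
    std::size_t misses_ = 0;
};

// Each thread that calls this gets its own TileCache, created on first use.
TileCache& threadCache() {
    thread_local TileCache cache;
    return cache;
}
```

The trade-off is memory: the same tile may be loaded once per thread, so this only pays off when lock contention costs more than the duplicated data.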
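
For conclusion 4, the following sketch shows one way to get the tighter bound, assuming the BSP leaf is an axis-aligned box. It clips the triangle against the six leaf planes with Sutherland-Hodgman polygon clipping and bounds what is left. The types `Vec3` and `Box` and the function names are hypothetical, not the renderer's own.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vec3 { float v[3]; };
struct Box  { Vec3 lo, hi; };

// One Sutherland-Hodgman pass: keep the part of a convex polygon where
// p[axis] >= plane (keepGreater) or p[axis] <= plane (!keepGreater).
static std::vector<Vec3> clipAgainstPlane(const std::vector<Vec3>& poly,
                                          int axis, float plane,
                                          bool keepGreater) {
    std::vector<Vec3> out;
    const std::size_t n = poly.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Vec3& a = poly[i];
        const Vec3& b = poly[(i + 1) % n];
        // Signed distances; non-negative means "inside".
        const float da = keepGreater ? a.v[axis] - plane : plane - a.v[axis];
        const float db = keepGreater ? b.v[axis] - plane : plane - b.v[axis];
        if (da >= 0.0f) out.push_back(a);
        // Emit the intersection point when the edge strictly crosses the plane.
        if ((da > 0.0f && db < 0.0f) || (da < 0.0f && db > 0.0f)) {
            const float t = da / (da - db);
            Vec3 p;
            for (int k = 0; k < 3; ++k) p.v[k] = a.v[k] + t * (b.v[k] - a.v[k]);
            p.v[axis] = plane;  // snap onto the plane to avoid round-off drift
            out.push_back(p);
        }
    }
    return out;
}

// Returns false if the triangle misses the leaf box. Otherwise writes the
// bound of (triangle ∩ leaf) to `out`, which is never larger than the leaf.
bool clippedBounds(const Vec3 tri[3], const Box& leaf, Box& out) {
    std::vector<Vec3> poly(tri, tri + 3);
    for (int axis = 0; axis < 3 && !poly.empty(); ++axis) {
        poly = clipAgainstPlane(poly, axis, leaf.lo.v[axis], true);
        poly = clipAgainstPlane(poly, axis, leaf.hi.v[axis], false);
    }
    if (poly.empty()) return false;
    out.lo = out.hi = poly[0];
    for (const Vec3& p : poly) {
        for (int k = 0; k < 3; ++k) {
            out.lo.v[k] = std::min(out.lo.v[k], p.v[k]);
            out.hi.v[k] = std::max(out.hi.v[k], p.v[k]);
        }
    }
    return true;
}
```

A large triangle that only grazes a leaf would otherwise contribute its whole bounding box; with clipping, only the part actually inside the leaf counts.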