Improved Spotlight Culling for Clustered Forward Shading
I've been fiddling a lot with tiled and clustered forward shading implementations in my work both professionally and personally. If you're not familiar, there has been some work in the past few years to exploit modern GPU's ability to handle dynamic loops to allow for a larger number of dynamic lights than frequently afforded by a standard forward rendering implementation. Some strategies have been to read back the depth buffer in a compute shader to identify the min/max range of a sub-frustum, frequently referred to as "Forward +" and another set that does not and instead uses fixed partitions in the frustum to identify lights may be touching with regards to scene depth. Really if you're not familiar, there's a good set of links discussing it over on Aras Pranckevičius's blog that will bring you up to speed on a lot of ideas. I'm mostly interested in techniques that do not require reading back depth information, due to the flexibility it offers on scheduling, and you can decide to implement it on either the CPU or GPU depending on what works best for a given project. Emil Persson of Avalanche has a great presentation on their research into that approach, check out the details on his site.
Besides figuring out how to efficiently test lights against a large number of clusters (since more clusters can equate to tighter culling), another huge impact on these techiques is how good the math is for testing a primitive against a given cluster. Persson discusses this at length in his slides, since there are difficulties in testing small clusters against large primitives. The traditional method of testing six planes of a camera frustum against a primitive is ineffective, leaving you with large numbers of false positives. The primitive will tend to look like more of a box than its actual round shape. I've put together a little test bed in Unity just to help me visualize different culling techniques as I've played around with this, and you can see here how badly a point light does when doing six plane/sphere intersection tests for each cluster:
Emil Persson proposed refining the sphere size smaller along each axis direction to get tighter culling. There's also another approach presented by Marc Fauconneau of Intel at Siggraph 2014 and has source code included in their clustered shading sample. The sample is somewhat focused on leveraging Intel's ISPC to compute the culling results, however the math is certainly interesting outside of that context for clustered light culling in general. His approach is to find a plane between the point light and the center of a cluster, and test each point against that plane as what I like to think of as a sort of "poor man's separating axis theorem," in that it would likely be more accurate to test each corner against a newly calculated SAT plane, but that would add a lot more overhead to the test. Here's what the previous point light test looks like changed over to use the Intel approach:
So much better! But as you may have gathered from the title of this post, I'm actually interested in spotlight culling (so I don't have an implementation or results for Persson's approach in this case). For spotlights, Persson states that they approximate the cluster with a bounding sphere and test against that, however his slides are very light on the actual specifics of the math that they use. There's obviously some issues with approximating a cluster with a sphere, especially if they start to become stretched out in depth (the density in the previous screenshots is awfully high to get clusters that are more cube shaped). On the other hand, the Intel approach just leaves spotlights to the reader's imagination, which isn't very useful either. So before I get to how I've gone about filling in the blanks on how the Intel approach might get adapted to handle spotlights, here's a quick screenshot of my own implementation of testing a spotlight using bounding spheres to approximate each cluster:
I have no idea how well my implementation matches Avalanche, and I'm not super familiar with intersection tests for spheres and cones, but my rough logic is as follows: start with two simple tests of sphere/sphere with the spotlight position+radius against the cluster sphere (early out with failure) and a point/sphere test against the sphere using the origin of the point light (early out with success if this passes). Then for the actually interesting bit, I find the "maximum angle" that would cause the sphere to intersect the cone by taking the arcsine of the sphere radius over the distance between the spotlight origin and the sphere center. I sum the resulting angle with the angle of the spotlight (or half of the angle in my sample code since the "angle" in unity is for the full spotlight width, i.e. think radius vs. diameter). This gives me an angle I can test against with the dot product of the spotlight direction. I suspect I'm struggling to describe this super clearly, so here's the source code for my test function in C#/Unity:
I'm not super jazzed that my spotlight code includes trig and inverse trig functions, and the nagging feeling that someone much smarter than me probably has a much better way of doing that test. So it's not what I would necessarily refer to as "production ready," but it does do a good job of approximating the conical shape of the spotlight! Fellow Iron Galaxy'er Nate Mefford made the useful observation that there's likely cheaper approximations of the ASin that can be used, and of the trig in general. Possibly one strategy would be to use a rough approximation as an initial early out and then refine with the full arcsine for the tightest level of culling.
On a similar note, I would also call out that theoretically the set of clusters has already been reduced at this point, especially if the code executing on the CPU where that sort of thing is easier to architect. In the code given, you start by rejecting based off of the full sphere first, and then worry about the more expensive bits. If I'm reading the slides from Persson correctly they sweep across their clusters with fasts tests against the planes of the clusters in each major axis of the view frustum (i.e. left to right, top to bottom) to find the boxy-looking culling as shown with the point light at the top of the post, and only *the* do the more expensive culling on the reduced set, which may only be doing 10% of your total cluster count. In my experience doing something like that is a lot harder in a compute shader, where I've found throwing a separate thread at each cluster to work well. Almost so well I'd use the phrase "embarrassingly parallel" but then you'd have to put me down out back for using buzzwords, but the idea is that all clusters are totally independent of each other and are just reading from a shared buffer of total lights to cull. I think possibly the approach to get some hierarchical culling on the GPU would be to do a small amount of coarse culling on the CPU. For example: split the frustum into 8 octants, build light lists for each octant, and then dispatch separate compute jobs for each one.
I'm much happier with how my approach to adapting Intel's point light test to handle spotlights has turned out as far as the raw math goes. I drew inspiration for the idea from Christer Ericson's description of testing a cone against a plane in Real-Time Collision Detection, where he uses a couple of cross products to figure out the vector on the cones cap that points towards the plane. My idea was this: start by testing the spotlight a full sphere, then orient a plane on the side of the cone and repeat the test against that plane. That oriented plane will have the vector from the spotlight position to the closest point on the cap as one of it's tangents, and the second tangent (and then the normal), can be found with the cross product of that tangent and the vector from the spotlight position to the cluster center. Here's the result of my test with the spotlight from before:
I chose this spot in particular because there's some clusters misbehaving behind the origin of the spotlight, which could be improved upon by testing yet another plane oriented with the spotlight itself (a hemisphere shape would result if you combined that with just the initial pointlight test), but I haven't decided if practice that's actually worth doing. In the given example, only 2 additional clusters would be culled. Furthermore, the previous test resulted in 588 clusters being lit, this test results in 506. There's already an improvement over the sphere approximation in this example, so it's not like we need to play catch-up by throwing even more math at the problem. The code is a bit longer than the previous, but at least avoids anything more expensive than the cross products. Here's my source code for that function, I should note that I use Fauconneau's optimized code for testing the vertices against the plane:
In conclusion, I wanted to give a little discussion on culling spot lights against clusters along with some code samples, since it seems to be a bit glossed over in some of the resources floating around out there, but the resources *are* excellent, so definitely go read those if you are not familiar with the work in culling lights with frustum clusters. Personally, I like the projected plane approach because it's very clean to have the math for the spotlights built directly out of the point light culling logic, and in my particular implementation, I'm currently seeing the best quality results with it. As I said before though, I'm not confident that I have the best approach to doing a sphere/cone test, especially with regards to performance.
Finally, I leave you with one final shot that I didn't find a reason to include earlier, of how poorly the results are by using the cone/plane math from Real-Time Collision Detection to test against six plane similar to the point light shown at the beginning, clocking in at 1408 lit clusters, for a further reminder that testing against the six planes of the cluster frustum is usually a bad idea: