Friday, February 17, 2017

Stingray Renderer Walkthrough #5: RenderDevice

Stingray Renderer Walkthrough #5: RenderDevice

Overview

The RenderDevice is essentially our abstraction layer for platform specific rendering APIs. It is implemented as an abstract base class that various rendering back-ends (D3D11, D3D12, OGL, Metal, GNM, etc.) implement.

The RenderDevice has a bunch of helper functions for initializing/shutting down the graphics APIs, creating/destroying swap chains, etc. All of which are fairly straightforward so I won’t cover them in this post, instead I will put my focus on the two dispatch functions consuming RenderResourceContexts and RenderContexts:


class RenderDevice {
public: 
    virtual void dispatch(uint32_t n_contexts, RenderResourceContext **rrc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;

    virtual void dispatch(uint32_t n_contexts, RenderContext **rc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};

Resource Management

As covered in the post about RenderResourceContexts, they provide a free-threaded interface for allocating and deallocating GPU resources. However, it is not until the user has called RenderDevice::dispatch() handing over the RenderResourceContexts as their representation gets created on the RenderDevice side.

All implementations of a RenderDevice have some form of resource management that deals with creating, updating and destroying of the graphics API specific representations of resources. Typically we track the state of all various types of resources in a single struct, here’s a stripped down example from the DX12 RenderDevice implementation called D3D12ResourceContext:


struct D3D12VertexBuffer
{
    D3D12_VERTEX_BUFFER_VIEW view;
    uint32_t allocation_index;
    int32_t size;
};

struct D3D12IndexBuffer
{
    D3D12_INDEX_BUFFER_VIEW view;
    uint32_t allocation_index;
    int32_t size;
};

struct D3D12ResourceContext 
{
    Array<D3D12VertexBuffer> vertex_buffers;
    Array<uint32_t> unused_vertex_buffers;

    Array<D3D12IndexBuffer> index_buffers;
    Array<uint32_t> unused_index_buffers;

    // .. lots of other resources

    Array<uint32_t> resource_lut;
};

As you might remember, the linking between the engine representation and the RenderDevice representation is done using the RenderResource::render_resource_handle. It encodes both the type of the resource as well as a handle. The resource_lut is an indirection to go from the engine handle to a local index for a specific type (e.g vertex_buffers or index_buffers in the sample above). We also track freed indices for each type (e.g. unused_vertex_buffers) to simplify recycling of slots.

The implementation of the dispatch function is fairly straight forward. We simply iterate over all the RenderResourceContexts and for each context iterate over its commands and either allocate or deallocate resources in the D3D12ResourceContext. It is important to note that this is a synchronous operation, nothing else is peeking or poking on the D3D12ResourceContext when the dispatch of RenderResourceContexts is happening, which makes our life a lot easier.

Unfortunately that isn’t the case when we dispatch RenderContexts as in that case we want to go wide (i.e. forking the workload and process it using multiple worker threads) when translating the commands into API calls. While we don’t allow allocating and deallocating new resources from the RenderContexts we do allow updating them which mutates the state of the RenderDevice representations (e.g. a D3D12VertexBuffer).

At the moment our solution for this isn’t very nice, basically we don’t allow asynchronous updates for anything else than DYNAMIC buffers. UPDATABLE buffers are always updated serially before we kick the worker threads no matter what their sort_key is. All worker threads access resources through their own copy of something we call a ResourceAccessor, it is responsible for tracking the worker threads state of dynamic buffers (among other things). In the future I think we probably should generalize this and treat UPDATABLE buffers in a similar way.

(Note: this limitation doesn’t mean you can’t update an UPDATABLE buffer more than once per frame, it simply means you cannot update it more than once per dispatch).

Shaders

Resources in the D3D12ResourceContext are typically buffers. One exception that stands out is the RenderDevice representation of a “shader”. A “shader” on the RenderDevice side maps to a ShaderTemplate::Context on the engine side, or what I guess we could call a multi-pass shader. Here’s some pseudo code:


struct ShaderPass
{
    struct ShaderProgram
    {
        Array<uint8_t> bytecode;
        struct ConstantBufferBindInfo;
        struct ResourceBindInfo;
        struct SamplerBindInfo;
    };
    ShaderProgram vertex_shader;
    ShaderProgram domain_shader;
    ShaderProgram hull_shader;
    ShaderProgram geometry_shader;
    ShaderProgram pixel_shader;
    ShaderProgram compute_shader;

    struct RenderStates;
};

struct Shader
{
    Vector<ShaderPass> passes;
    enum SortMode { IMMADIATE, DEFERRED };
    uint32_t sort_mode;
};

The pseudo code above is essentially the RenderDevice representation of a shader that we serialize to disk during data compilation. From that we can create all the necessary graphics API specific objects expressing an executable shader together with its various state blocks (Rasterizer, Depth Stencil, Blend, etc.).

As discussed in the last post the sort_key encodes the shader pass index. Using Shader::sort_mode, we know which bit range to extract from the sort_key as pass index, which we then use to look up the ShaderPass from Shader::passes. A ShaderPass contains one ShaderProgram per active shader stage and each ShaderProgram contains the byte code for the shader to compile as well as “bind info” for various resources that the shader wants as input.

We will look at this in a bit more detail in the post about “Shaders & Materials”, for now I just wanted to familiarize you with the concept.

Render Context translation

Let’s move on and look at the dispatch for translating RenderContexts into graphics API calls:

class RenderDevice {
public: 
    virtual void dispatch(uint32_t n_contexts, RenderContext **rc, 
        uint32_t gpu_affinity_mask = RenderContext::GPU_DEFAULT) = 0;
};

The first thing all RenderDevice implementation do when receiving a bunch of RenderContexts is to merge and sort their Commands. All implementations share the same code for doing this:

void prepare_command_list(RenderContext::Commands &output, unsigned n_contexts, RenderContext **contexts);

This function basically just takes the RenderContext::Commands from all RenderContexts and merges them into a new array, runs a stable radix sort, and returns the sorted commands in output. To avoid memory allocations the RenderDevice implementation owns the memory of the output buffer.

Now we have all the commands nicely sorted based on their sort_key. Next step is to do the actual translation of the data referenced by the commands into graphics API calls. I will explain this process with the assumption that we are running on a graphics API that allows us to build graphics API command lists in parallel (e.g. DX12, GNM, Vulkan, Metal), as that feels most relevant in 2017.

Before we start figuring out our per thread workloads for going wide, we have one more thing to do; “instance merging”.

Instance Merging

I’ve mentioned the idea behind instance merging before [1,2], basically we want to try to reduce the number of RenderJobPackages (i.e. draw calls) by identifying packages that are similar enough to be merged. In Stingray “similar enough” basically means that they must have identical inputs to the input assembler as well as identical resources bound to all shader stages, the only thing that is allowed to differ are constant buffer variables. (Note: by todays standards this can be considered a bit old school, new graphics APIs and hardware allows to tackle this problem more aggressively using “bindless” concepts. )

The way it works is by filtering out ranges of RenderContexts::Commands where the “instance bit” of the sort_key is set and all bits above the instance bit are identical. Then for each of those ranges we fork and go wide to analyze the actual RenderJobPackage data to see if the instance_hash and the shader are the same, and if so we know its safe to merge them.

The actual merge is done by extracting the instance specific constants (these are tagged by the shader author) from the constant buffers and propagating them into a dynamic RawBuffer that gets bound as input to the vertex shader.

Depending on how the scene is constructed, instance merging can significantly reduce the number of draw calls needed to render the final scene. The instance merger in itself is not graphics API specific and is isolated in its own system, it just happens to be the responsibility of the RenderDevice to call it. The interface looks like this:

namespace instance_merger {

struct ProcessMergedCommandsResult
{
    uint32_t n_instances;
    uint32_t instanced_batches;
    uint32_t instance_buffer_size;
};

ProcessMergedCommandsResult process_merged_commands(Merger &instance_merger, 
    RenderContext::Commands &merged_commands);

}

Pass in a reference to the sorted RenderContext::Commands in merged_commands and after the instance merger is done running you hopefully have fewer commands in the array. :)

You could argue that merging, sorting and instance merging should all happen before we enter the world of the RenderDevice. I wouldn’t argue against that.

Prepare workloads

Last step before we can start translating our commands into state / draw / dispatch calls is to split the workload into reasonable chunks and prepare the execution contexts for our worker threads.

Typically we just divide the number of RenderContext::Commands we have to process with the number of worker threads we have available. We don’t care about the type of different commands we will be processing and trying to load balance differently. The reasoning behind this is that we anticipate that draw calls will always represent the bulk of the commands and the rest of the commands can be considered as unavoidable “noise”. We do, however, make sure that we don’t do less than x-number of commands per worker threads, where x can differ a bit depending on platform but is usually ~128.

For each execution context we create a ResourceAccessors (described above) as well as make sure we have the correct state setup in terms of bound render targets and similar. To do this we are stuck with having to do a synchronous serial sweep over all the commands to find bigger state changing commands (such as RenderContext::set_render_target).

This is where the Command::command_flags bit-flag comes into play, instead of having to jump around in memory to figure out what type of command the Command::head points to, we put some hinting about the type in the Command::command_flags, like for example if it is a “state command”. This way the serial sweep doesn’t become very costly even when dealing with large number of commands. During this sweep we also deal with updating of UPDATABLE resources, and on newer graphics APIs we track fences (discussed in the post about Render Contexts).

The last thing we do is to set up the execution contexts with create graphics API specific representations of command lists (e.g. ID3D12GraphicsCommandList in DX12),

Translation

When getting to this point doing the actual translation is fairly straight forward. Within each worker thread we simply loop over its dedicated range of commands, fetch its data from Command::head and generate any number of API specific commands necessary based on the type of command.

For a RenderJobPackage representing a draw call it involves:

  • Look up the correct shader pass and, unless already bound, bind all active shader stages
  • Look up the state blocks (Rasterizer, Depth stencil, Blending, etc.) from the shader and bind them unless already bound
  • Look up and bind the resources for each shader stage using the RenderResource::render_resource_handle translated through the D3D12ResourceAccessor
  • Setup the input assembler by looping over the RenderResource::render_resource_handles pointed to by the RenderJobPackage::resource_offset and translated through the D3D12ResourceAccessor
  • Bind and potentially update constant buffers
  • Issue the draw call

The execution contexts also holds most-recently-used caches to avoid unnecessary binds of resources/shaders/states etc.

Note: In DX12 we also track where resource barriers are needed during this stage. After all worker threads are done we might also end up having to inject further resource barriers between the command lists generated by the worker threads. We have ideas on how to improve on this by doing at least parts of this tracking when building the RenderContexts but haven’t gotten around looking into it yet.

Execute

When the translation is done we pass the resulting command lists to the correct queues for execution.

Note: In DX12 this is a bit more complicated as we have to interleave signaling / waiting on fences between command list execution (ExecuteCommandList).

Next up

I’ve deliberately not dived into too much details in this post to make it a bit easier to digest. I think I’ve manage to cover the overall design of a RenderDevice though, enough to make it easier for people diving into the code for the first time.

With this post we’ve reached half-way through this series, we have covered the “low-level” aspects of the Stingray rendering architecture. As of next post we will start looking at more high-level stuff, starting with the RenderInterface which is the main interface for other threads to talk with the renderer.

4 comments: