Heterogeneous AoS instance encoding for a GPU-driven renderer
How to manage various instance data requirements inside a GPU-driven renderer with a unified instance representation
In a game engine, instances can originate from different systems such as:
A particle system
A foliage or mass-instancing system
A GPU-driven instance placement system
Entities placed by artists or controlled by gameplay code
Each system produces its own specific type of instance, which may or may not support certain features available to others, such as per-instance parameters, custom colors, or other data.
In my GPU-driven renderer, almost all instances are unified into the same buffer in an AoS fashion to make the renderer as universal as possible. The idea is that one should not have to rewrite culling, rasterization, raytracing management and so on for each system, unless a system needs explicit control over its rendering.
However, as scenes become more complex, the memory bandwidth and cache footprint required to process the whole GPU scene increase A LOT. The unified AoS approach I use for instances means that each instance has the same size. Consider the following instance structure:
struct Instance {
f32x4x3 local_to_world;
f32x4x3 prev_local_to_world;
f32x4 color;
u32 mesh_id;
u32 flags;
u32x4 custom_parameters;
};
That's 136 bytes, quite large; it doesn't fit in an NV GPU L1 cache line (128 bytes). We could try to pack it to squeeze out some bytes, and some fields seem easily packable:
color: can be packed into a u32 (RGBA8)
local_to_world: we can extract the location, rotation & scale to quantize them separately.
Rotation could be encoded as a 48-bit quaternion, or even in 30 bits if we don't need a lot of precision. The worst-case error is around ~0.2 degrees, with the mean being ~0.1 degrees, which is acceptable for a lot of use cases.
We could enforce uniform scaling to reduce the scale to a single f32, and even go further and encode it as an f16.
prev_local_to_world: could do the same as local_to_world
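As a sketch of the packings above, here is a CPU-side version of the RGBA8 color packing and a "smallest three" 32-bit quaternion encoding (a 2-bit index of the largest component plus three 10-bit components); the helper names and exact bit layout are my own illustration, not from any particular engine:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Pack an f32x4 color into a single u32 (RGBA8).
uint32_t pack_color_rgba8(float r, float g, float b, float a) {
    auto q = [](float v) {
        return (uint32_t)std::lround(std::fmin(std::fmax(v, 0.f), 1.f) * 255.f);
    };
    return q(r) | (q(g) << 8) | (q(b) << 16) | (q(a) << 24);
}

// "Smallest three": drop the largest-magnitude component of a unit quaternion
// (the decoder reconstructs it from unit length) and store the other three
// quantized to 10 bits each, plus a 2-bit index of the dropped component.
uint32_t pack_quat_32(const float q[4]) {
    int largest = 0;
    for (int i = 1; i < 4; ++i)
        if (std::fabs(q[i]) > std::fabs(q[largest])) largest = i;
    // q and -q encode the same rotation, so force the dropped component positive.
    float sign = q[largest] < 0.f ? -1.f : 1.f;
    const float kScale = 1.41421356f; // remaining components lie in [-1/sqrt2, 1/sqrt2]
    uint32_t packed = (uint32_t)largest << 30;
    int shift = 0;
    for (int i = 0; i < 4; ++i) {
        if (i == largest) continue;
        // Map [-1/sqrt2, 1/sqrt2] to [0, 1023].
        float v = sign * q[i] * kScale * 0.5f + 0.5f;
        v = std::fmin(std::fmax(v, 0.f), 1.f);
        packed |= (uint32_t)std::lround(v * 1023.f) << shift;
        shift += 10;
    }
    return packed;
}

void unpack_quat_32(uint32_t packed, float out[4]) {
    int largest = (int)(packed >> 30);
    const float kScale = 1.41421356f;
    float sum = 0.f;
    int shift = 0;
    for (int i = 0; i < 4; ++i) {
        if (i == largest) continue;
        float v = ((packed >> shift) & 1023u) / 1023.f;
        out[i] = (v - 0.5f) * 2.f / kScale;
        sum += out[i] * out[i];
        shift += 10;
    }
    // Reconstruct the dropped component from unit length.
    out[largest] = std::sqrt(std::fmax(0.f, 1.f - sum));
}
```

The 48-bit variant from the text is the same scheme with more bits per component; 10 bits each is what gets you into the ~0.1 degree average error territory.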
For flags, mesh_id, and custom_parameters, things get a lot harder. The mesh_id could be packed if you know the maximum possible ID, but flags and custom_parameters are trickier to pack. Some instance types may never use a custom parameter, while others might require all four…
We also must be careful with packing and quantization, as this will affect every instance and could result in a loss of quality or features for artists. And it doesn't help with instances that simply don't use some fields: there is a lot of data we could drop if we knew the instance type in advance.
For example, consider a tree instance generated from a GPU procedural instancing system:
Do we really want artists to color vegetation? If yes, couldn't this simply be spatial hashing performed in the vertex or fragment shader to add variety?
Do we really need custom parameters for procedurally generated instances? This sort of system generally trades flexibility for simplicity and performance. If we do, we could expose only some parameters, maybe as u16/f16 or even less.
Do we really need a full-blown matrix? Can't we just store a 32-bit rotation, a vec3 position, and an f32 scale?
Simple packing can get you quite far, but in the case of a heterogeneous instance buffer, it has its limits. This can be acceptable depending on the scene and the instance types, to be honest. However, if you have a lot of vegetation, like in the example I described, where you could drop a lot of unused data, it can still result in significant bandwidth and memory waste.
SoA encoding
One way we could pack our instance structure is by culling unused fields, moving them into separate arrays, and encoding only the offset or ID to those fields inside the main structure. This isn’t a bad approach, but I quickly refrained from going further because…
SoA can be slower, especially if you always need those fields at the same time; you should always benchmark your memory accesses. In my case, passes such as rasterization need the whole structure and are far more sensitive to dependent random loads than the culling pass, plus there is some cache thrashing.
We still need to encode the index of the field in the structure. To be fair, we could drop it, if we don't mind wasting memory, by reusing the instance ID (if you have one) to index into the SoA arrays.
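To make the trade-off concrete, here is a minimal CPU-side sketch of the SoA variant, assuming my own names: rarely-used fields move to a side array, and the main struct stores only an index (with a sentinel for instances that don't use them):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

static const uint32_t kNoExtras = 0xFFFFFFFFu; // sentinel: no custom parameters

struct InstanceCore {
    float local_to_world[12]; // fields every pass touches stay in the hot array
    uint32_t mesh_id;
    uint32_t flags;
    uint32_t extras_index;    // index into Scene::custom_params, or kNoExtras
};

struct CustomParams {
    uint32_t values[4];
};

struct Scene {
    std::vector<InstanceCore> instances;     // dense, read by every pass
    std::vector<CustomParams> custom_params; // sparse, only for instances that need it
};

// Fetching the extras is now a dependent load: read the index from the core
// struct, then read the side array. Instances without extras get defaults.
CustomParams get_params(const Scene& s, uint32_t instance) {
    uint32_t e = s.instances[instance].extras_index;
    if (e == kNoExtras) return CustomParams{{0, 0, 0, 0}};
    return s.custom_params[e];
}
```

The dependent load in `get_params` is exactly what hurts in passes that need the whole structure anyway, which is why I didn't push this approach further.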
AoS per-instance encoding
To solve the pitfalls I had with the SoA layout, I introduced the concept of per-instance encoding. The idea is very simple:
Instances don't all follow the same in-memory layout
Each instance is preceded by a discriminant indicating the instance type
The data that follows the discriminant is instance-type dependent; you cannot assume its layout without fetching the type (header uint) first
To reference instances, you no longer pass instance IDs but instance addresses in bytes or words; IDs would require you to align instances to the maximum instance size boundary
Every instance type can be decoded into a common instance structure
If an instance type doesn’t encode a field from the common instance type, it is filled with a default value
In pseudo-code, it roughly looks like that:
enum InstanceType : u32 {
Entity,
Vegetation
};
template<typename T>
struct PackedInstance {
InstanceType type;
T data;
};
// In this example the unpacked instance is the same as an Entity instance
using UnpackedInstance = EntityInstanceData;
struct EntityInstanceData {
const TYPE: InstanceType = InstanceType::Entity;
f32x4x3 local_to_world;
f32x4x3 prev_local_to_world;
f32x4 color;
u32 mesh_id;
u32 flags;
u32x4 custom_parameters;
UnpackedInstance unpack();
};
struct VegetationInstanceData {
const TYPE: InstanceType = InstanceType::Vegetation;
float3 position;
float scale;
uint rotation;
u32 mesh_id;
UnpackedInstance unpack();
};With this base, I use a raw buffer to store and load instances, and that’s it!
There is a small memory overhead from storing the instance type, though it is heavily compensated by not storing useless data. In my tests, the ALU/bit-operation cost of performing the encoding / decoding is largely compensated by the memory bandwidth & footprint gains on large scenes; not too surprising, as GPUs, like CPUs, are heavily memory bound.
What I like about this implementation is that it is very straightforward: a unified instance type that multiple instance types can be encoded to and decoded from.
I think one pitfall of my implementation is that, on the decoding side, you don't really know which fields were actually decoded versus just defaulted, though IMO you should never depend on that and should just use a sentinel value / have a valid default value.
