
Minimal engine file formats of Conspiracy intros

by BoyC / Conspiracy

The final executable for a 64k intro is nothing more than the engine packaged together with a datafile and its loader. The loader parses the datafile and builds up the structures and resources required by the engine to run the release the same way it does in the tool. Separating the code and data this way is one of the key insights in creating a modern 64k intro - the compressibility gains are insane. This article is a summary of our experiences with minimal file formats, as we've gone through a couple of iterations.

Problem Definition

Like any data driven piece of code, a 64k engine works by having certain data structures and resources in memory that it evaluates during execution. While working in the tool these structures usually have some additional metadata that helps the editing process, but isn't actually needed for demo playback (think of things like names, graph node location, organizational folders, etc.).

The goal of a minimal format is to

  • store only those things that are required to play back the release
  • in a highly compressible format
  • in a way that minimizes the parser code required to build up the in-memory structures on load

A well designed minimal format for a 64k engine is tightly integrated with the engine itself (to avoid costly conversions of the data on load while building up the needed engine structures), which means it has to be kept in mind for every decision made in the engine and is thus a major driver of architecture. Another thing to keep in mind is how complicated the parser code will be. It can get complex, and it needs to function perfectly. Our biggest source of headaches has historically been the parser code for apEx, especially before deadlines at parties. A single bit placed or parsed wrong somewhere will get you hours of debugging fun in some of the most optimized-to-death code you'll ever write. In our latest tools we found a solution to this issue, and it was one of the reasons for the complete rewrite of our toolchain. We'll get to that.

Example Resources

I'll be using a couple example resource types throughout this article to demonstrate some of the concepts. These are not representative of what we have in the engine, but are useful to get a feel for what sort of data we're talking about.

SplineKey
    byte position
    float value

Spline
    byte interpolation
    SplineKey[] keys
    Color toolDisplayColor

Model
    byte[] meshGeneratorParameters
    Matrix transformation
    string toolName

Object
    byte type
    ModelRef model
    ObjectRef parent
    Spline[] splines
    string toolName

Scene
    Object[] objects
    string toolName

Event
    byte type
    short startFrame
    short endFrame
    SceneRef scene
    ObjectRef camera
    byte toolPassIndex

The resources listed above are all you'd need for rudimentary 3d flyby releases with no textures and no materials; this will be the scope of this article.

V1: Naive Implementation (2002-2014: aDDict era)

The naive implementation is to just write everything out as it is, leaving out all the tool data and cutting the information down to its smallest representation. In pseudocode the exporter for the resources above would look something like the following. Note that the WriteByte/WriteFloat/etc. functions are assumed to do what they say and write the data out into a contiguous file in memory or on disk to create the result.

WriteSpline( Spline s )
    WriteByte( s.keys.count )
    WriteByte( s.interpolation )
    foreach ( key in s.keys )
        WriteByte( key.position )
        WriteFloat( key.value )

WriteModel( Model m )
    WriteArray( m.meshGeneratorParameters )
    WriteMatrix( m.transformation )

WriteObject( Object o )
    WriteByte( o.type )
    WriteByte( o.parent.index )
    if ( o.type == Model )
        WriteByte( o.model.index )
    WriteByte( o.splines.count )
    foreach ( spline in o.splines )
        WriteSpline( spline )

WriteScene( Scene s )
    WriteByte( s.objects.count )
    foreach ( object in s.objects )
        WriteObject( object )

WriteEvent( Event e )   
    WriteByte( e.type )
    WriteShort( e.startFrame )
    WriteShort( e.endFrame )
    if ( e.type == SceneRender )
        WriteByte( e.scene.index )
        WriteByte( e.camera.index ) 

WriteProject( Project p )
    WriteByte( p.models.count )
    foreach ( model in p.models )
        WriteModel( model )
    WriteByte( p.scenes.count )
    foreach( scene in p.scenes )
        WriteScene( scene )
    WriteByte( p.events.count )
    foreach( event in p.events )
        WriteEvent( event )

This works fine for a start; now let's see what the importer code would need to do:

Spline ReadSpline()
    Spline s = new Spline
    s.keys = new SplineKey[ ReadByte() ]
    s.interpolation = ReadByte()
    foreach ( key in s.keys )
        key.position = ReadByte()
        key.value = ReadFloat()
    return s

Model ReadModel()
    Model m = new Model
    ReadArray( m.meshGeneratorParameters )
    m.transformation = ReadMatrix()
    return m

Object ReadObject()
    Object o = new Object
    o.type = ReadByte()
    o.parent.index = ReadByte()
    if ( o.type == Model )
        o.model.index = ReadByte()
    byte count = ReadByte()
    for ( x = 0 to count )
        o.splines.Add( ReadSpline() )
    return o

Scene ReadScene()
    Scene s = new Scene
    byte count = ReadByte()
    for ( x = 0 to count )
        s.objects.Add( ReadObject() )
    return s

Event ReadEvent()
    Event e
    byte eventType = ReadByte()
    if ( eventType == EndDemo )
        e = new Event_EndDemo
    if ( eventType == SceneRender )
        e = new Event_SceneRender
    e.type = eventType
    e.startFrame = ReadShort()
    e.endFrame = ReadShort()
    if ( e.type == SceneRender )
        e.sceneIndex = ReadByte()
        e.cameraIndex = ReadByte()
    return e

Project ReadProject()
    Project p = new Project
    byte count = ReadByte()
    for ( x = 0 to count )
        p.models.Add( ReadModel() )
        p.models.last.Generate()
    count = ReadByte()
    for ( x = 0 to count )
        p.scenes.Add( ReadScene() )
    count = ReadByte()
    for ( x = 0 to count )
        p.events.Add( ReadEvent() )
    return p

As you can see, the importer code actually has to do more work than the exporter. It needs to create all the resource objects, fill them with data appropriately, and run the generator code for the models - and this simple example even omits the proper referencing of object parents for the sake of clarity (turning indices into object pointers is left as an exercise for the reader).
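
To give an idea of what that exercise involves, here's a minimal sketch of a fixup pass that runs after ReadProject(), assuming the reader kept the raw indices around while loading (the parentIndex/modelIndex/sceneIndex fields and the type names are hypothetical, merely mirroring the example resources above):

// turn the indices read from the file into actual pointers
void ResolveReferences( Project& p )
{
  for ( Scene* scene : p.scenes )
    for ( Object* object : scene->objects )
    {
      object->parent = scene->objects[ object->parentIndex ];
      if ( object->type == ObjectType::Model )
        object->model = p.models[ object->modelIndex ];
    }

  for ( Event* event : p.events )
    if ( event->type == EventType::SceneRender )
    {
      Event_SceneRender* e = (Event_SceneRender*)event;
      e->scene = p.scenes[ e->sceneIndex ];
      e->camera = e->scene->objects[ e->cameraIndex ];
    }
}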

The resulting datafile is a raw stream of well defined data.


V2: Streams (2014-2023: apEx era)

The main issue with the system above is not the parser code - it's compression. Different types of data follow each other in an essentially random order: an event type here, a frame index there, splines alternating bytes and floats - all of this hurts compression. The solution: data streams.

The idea is simple: put similar data next to each other. To do this we create several streams to host similar data instead of having just one output for everything. Good real-life examples are a stream for matrices, a stream for text data, a stream for timeline frame values, etc. Looking at the code for our example resources, let's define the following streams:

  • Main - for misc data
  • Parameters - for model parameters
  • Matrices - for matrices, obviously
  • SplineKeyPositions - for the x position of spline keys
  • SplineKeyValues - for the y position of spline keys
  • ReferenceIndices - for all types of references
  • EventTimings - for event start and end frames

This means all our low-level writing functions like WriteByte, WriteFloat, etc. need to have an additional parameter: which stream to write to. Here's the updated exporter pseudocode:

WriteSpline( Spline s )
    WriteStreamByte( s.keys.count, Streams::Main )
    WriteStreamByte( s.interpolation, Streams::Main )
    foreach ( key in s.keys )
        WriteStreamByte( key.position, Streams::SplineKeyPositions )
        WriteStreamFloat( key.value, Streams::SplineKeyValues )

WriteModel( Model m )
    WriteStreamArray( m.meshGeneratorParameters, Streams::Parameters )
    WriteStreamMatrix( m.transformation, Streams::Matrices )

WriteObject( Object o )
    WriteStreamByte( o.type, Streams::Main )
    WriteStreamByte( o.parent.index, Streams::ReferenceIndices )
    if ( o.type == Model )
        WriteStreamByte( o.model.index, Streams::ReferenceIndices )
    WriteStreamByte( o.splines.count, Streams::Main )
    foreach ( spline in o.splines )
        WriteSpline( spline )

WriteScene( Scene s )
    WriteStreamByte( s.objects.count, Streams::Main )
    foreach ( object in s.objects )
        WriteObject( object )

WriteEvent( Event e )   
    WriteStreamByte( e.type, Streams::Main )
    WriteStreamShort( e.startFrame, Streams::EventTimings )
    WriteStreamShort( e.endFrame, Streams::EventTimings )
    if ( e.type == SceneRender )
        WriteStreamByte( e.scene.index, Streams::ReferenceIndices )
        WriteStreamByte( e.camera.index, Streams::ReferenceIndices )    

WriteProjectStreams( Project p )
    WriteStreamByte( p.models.count, Streams::Main )
    foreach ( model in p.models )
        WriteModel( model )
    WriteStreamByte( p.scenes.count, Streams::Main )
    foreach( scene in p.scenes )
        WriteScene( scene )
    WriteStreamByte( p.events.count, Streams::Main )
    foreach( event in p.events )
        WriteEvent( event )
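
Under the hood the stream-aware write helpers don't need to be anything fancy. A minimal sketch, assuming each stream is simply a growable byte buffer (the names here are made up for this article, not our actual code):

#include <vector>

enum class Streams { Main, Parameters, Matrices, SplineKeyPositions,
                     SplineKeyValues, ReferenceIndices, EventTimings, MAX };

static std::vector<unsigned char> streamData[ (int)Streams::MAX ];

void WriteStreamByte( unsigned char value, Streams stream )
{
  streamData[ (int)stream ].push_back( value );
}

void WriteStreamShort( unsigned short value, Streams stream )
{
  // written little-endian, byte by byte - the parser reads it back the same way
  WriteStreamByte( value & 0xff, stream );
  WriteStreamByte( value >> 8, stream );
}

void WriteStreamFloat( float value, Streams stream )
{
  const unsigned char* bytes = (const unsigned char*)&value;
  for ( int x = 0; x < (int)sizeof( float ); x++ )
    WriteStreamByte( bytes[ x ], stream );
}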

After we're done writing the streams all we need to do is to put the various streams one after the other and create a header that tells us which stream begins where:

FinalizeProject()
    int startIndex = 0
    foreach ( stream in streams )
        WriteInt( startIndex )
        startIndex += stream.length
    foreach ( stream in streams )
        Write( stream, stream.length )

This function shows the core idea of writing out the stream start header followed by the streams themselves. It doesn't need to support a variable number of streams because the exporter and parser always have to agree on what is placed where, which means the number of streams is hardcoded in both.
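
The parser side is the mirror image of this. A minimal sketch, assuming the whole datafile has already been loaded into memory and STREAM_COUNT is the same hardcoded stream count the exporter used:

const int STREAM_COUNT = 7; // must match the exporter exactly

// fills in one read pointer per stream, pointing at the start of that
// stream's data inside the loaded datafile
void InitStreams( unsigned char* dataFile, unsigned char* streams[ STREAM_COUNT ] )
{
  int* startOffsets = (int*)dataFile;                                   // the header written by FinalizeProject()
  unsigned char* streamData = dataFile + STREAM_COUNT * sizeof( int );  // streams follow right after the header

  for ( int x = 0; x < STREAM_COUNT; x++ )
    streams[ x ] = streamData + startOffsets[ x ];
}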

The resulting file will have a small header section and the exporter/parser code will be a tiny bit more complex, but doing this well allows the executable compressor to work a lot better. Deciding on what streams to create and what to put where is a bit of black magic; we've always done it by feel, and there's no perfect recipe. You need to experiment, and in this sense it's a bit similar to how Crinkler tries to rearrange functions to see if it can spare a couple more bytes. For a 64k, once you have a good baseline it quickly goes into diminishing-returns territory.

One interesting thing about this is that streams can be reused for various purposes during the export process. In the example above we could easily reuse the Parameters stream to act as the EventTimings stream, for example, because all models are exported before the events, so placing the event timings into the Parameters stream has the same effect we want: it separates the two types of data into continuous blocks.


Issues

By the time we did Clean Slate we'd been using the systems described above for almost 18 years, and while they worked, they had started becoming limiting in various ways:

  • The parser code for the Clean Slate project file was 4489 bytes compressed, or about 6.9% of the release - fourth, only behind the synth, the mesh generator and the complete rendering pipeline
  • Debugging the exporter->parser pipeline was a nightmare. Several non-Conspiracy projects have died at various parties because of tool vs engine rendering issues, all coming from either an exporter or an importer bug that was not found in time
  • Validation on export was almost non-existent and would have been very hard to implement bug-free, if it was possible at all. Thus any issue that popped up would only do so when executing the release and watching it to the end. If there was a bug somewhere, tracing it was a nightmare (see above).
  • Identifying resources inside the parser code for debugging purposes was an afterthought, which made things even harder to debug. ("Ok, so WHICH model is it trying to load now exactly?!?")

V3: Data-driven data (2023-present: Archon era)

By January 2021, when we started working on the new generation of our demotool, these issues were crystal clear, and they were among the core problems to be solved (along with things like reduced engine size, GPU particles and undo support for the tool). We knew we wanted to keep the basic layout of the data file with all the streams that were optimal for the exe packer, but there just had to be a better way to load the projects.

Thus, the idea of the infamous Józsi (Joe) algorithm was born (ask me at a party, I'll tell you the story behind the nickname).

The idea was that if separating code and data worked so well for compression when moving from hardcoded releases to the project file style, maybe it would be possible to separate the "data" (the project file structure) in the parser from the actual code as well. We thought that it should be possible for each resource to determine, in a data driven manner, how much of each stream belongs to it, skip ahead that much, and move on to the next resource.

In pseudocode this looks something like this:

ParseResource( ResourceLocation r, Stream streams[] )
    memcpy( r.streams, streams, sizeof( streams ) )
    foreach ( stream in streams )
        stream += STREAM_SIZE_MAGIC( stream, r )

ParseProject()
    Streams streams = InitStreams()
    ResourceLocation resourceLocations[] = new ResourceLocation[ ReadInt( streams[ Streams::Main ] ) ]
    foreach ( location in resourceLocations )
        ParseResource( location, streams )

So basically: take the current location of each stream, copy it to some object that represents the raw data for each resource, skip the data SOMEHOW, rinse, repeat. The thing about this arrangement is that it naturally gives rise to a resource representation that could even be used directly, without the need to create additional objects and all the object-creation mess in the parser above. This was a very unexpected but welcome idea that ended up shaping the architecture of the whole engine, along with major parts of our tooling.

Let's get back to that later though and first figure out the STREAM_SIZE_MAGIC part. The examples above give us a good idea of what sort of data we're dealing with. For example, when reading a Model resource, we need to skip sizeof( model.meshGeneratorParameters ) bytes in the Parameters stream and sizeof( Matrix ) bytes in the Matrices stream. For splines we need to skip 2 bytes in the Main stream, MainStream[ 0 ] bytes in the SplineKeyPositions stream and MainStream[ 0 ] * sizeof( float ) bytes in the SplineKeyValues stream. If we had a text stream we could have a case where we'd need to skip until the next 0 for an ASCIIZ string. Various combinations of such small operations can easily be encoded for each resource, for each stream that it actually utilizes.
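
To make this concrete, here's one possible way such operations could be expressed. The op set and encoding below are made up for this article; a real implementation would pick whatever fits its resources best:

enum class SkipOp : unsigned char
{
  SkipFixed,           // skip a constant number of bytes
  SkipCountFromMain,   // skip MainStream[ offset ] bytes
  SkipCountTimesSize,  // skip MainStream[ offset ] * elementSize bytes
  SkipUntilZero,       // skip up to and including the next 0 byte (ASCIIZ strings)
};

struct StreamSkipCommand
{
  unsigned char stream;      // which stream this command applies to
  SkipOp op;                 // what to do
  unsigned char param;       // constant byte count, or offset into the resource's Main data
  unsigned char elementSize; // element size used by SkipCountTimesSize
};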

Our example resources above also show that some resources can contain other resources (for example Objects contain Splines). The way we solved this is to have all the resources that are contained in another resource be represented as a continuous block of resources in the project resource array. Each MinimalResource then has a pointer to this continuous block (its subresources) and a counter for how many subresources it has. During parsing the stream descriptors are also used to enumerate how many subresources are contained in a given resource.

Let's see how the stream descriptors would look for the example resources:

Spline
    Streams::Main
        Read 2 bytes (key count + interpolation)
    Streams::SplineKeyPositions
        Read x bytes where x is this.streams[ Streams::Main ][ 0 ]
    Streams::SplineKeyValues
        Read x * sizeof( float ) bytes, where x is this.streams[ Streams::Main ][ 0 ]

Model
    Streams::Parameters
        Read sizeof( Model.meshGeneratorParameters ) bytes
    Streams::Matrices
        Read sizeof( Matrix ) bytes

Object_Camera
    Streams::Main
        Read 1 byte (object type)
        Read 1 byte and add it to the subresource count (Spline count)
    Streams::ReferenceIndices
        Read 1 byte (parent object index, may not be loaded yet so no reference storage here)

Object_Model
    Streams::Main
        Read 1 byte (object type)
        Read 1 byte and add it to the subresource count (Spline count)
    Streams::ReferenceIndices
        Read 1 byte (parent object index, may not be loaded yet so no reference storage here)
        Read 1 byte, find the model resource it refers to and store a pointer (model reference)

Scene
    Streams::Main
        Read 1 byte and add it to the subresource count (Object count)

Event_EndDemo
    Streams::EventTimings
        Read 4 bytes (start+end frames)

Event_RenderScene
    Streams::EventTimings
        Read 4 bytes (start+end frames)
    Streams::ReferenceIndices
        Read 2 bytes, find the resources they refer to and store the pointers (scene+camera references)

This information can be encoded in a couple of bytes per resource per stream, and the parser only needs to execute these commands to figure out how much of each stream belongs to the resource. All this of course requires that we already know what type of resource we're trying to read, which is stored as a separate stream with one byte per resource (this can be optimized with container resources that tell the system that they contain a certain number of the same resource type).
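
As an illustration, using the hypothetical SkipOp encoding sketched earlier, the Spline descriptor boils down to three small commands, and the STREAM_SIZE_MAGIC part becomes a tiny interpreter over them (again, a sketch, not our actual encoding):

// Spline: 2 fixed bytes in Main, count bytes in SplineKeyPositions,
// count * sizeof( float ) bytes in SplineKeyValues
static const StreamSkipCommand splineDescriptor[] =
{
  { (unsigned char)Streams::Main,               SkipOp::SkipFixed,          2, 0 },
  { (unsigned char)Streams::SplineKeyPositions, SkipOp::SkipCountFromMain,  0, 0 },
  { (unsigned char)Streams::SplineKeyValues,    SkipOp::SkipCountTimesSize, 0, sizeof( float ) },
};

// returns how many bytes of the command's stream belong to the resource whose
// (already copied) stream pointers are passed in resourceStreams[]
int ExecuteSkipCommand( const StreamSkipCommand& cmd, unsigned char* const resourceStreams[] )
{
  switch ( cmd.op )
  {
  case SkipOp::SkipFixed:
    return cmd.param;
  case SkipOp::SkipCountFromMain:
    return resourceStreams[ (int)Streams::Main ][ cmd.param ];
  case SkipOp::SkipCountTimesSize:
    return resourceStreams[ (int)Streams::Main ][ cmd.param ] * cmd.elementSize;
  case SkipOp::SkipUntilZero:
    {
      const unsigned char* data = resourceStreams[ cmd.stream ];
      int length = 0;
      while ( data[ length++ ] ) {} // include the terminating 0
      return length;
    }
  }
  return 0;
}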

So the datafile is expanded with a new section at the beginning: a stream-parser descriptor for each of the resource types contained in the project, thereby making the project file self-describing. There are some smaller complications around how resources reference each other that we're still working on for optimal data storage, but the key objectives of reducing the parser code's complexity and size are achieved perfectly with this system. The new parser code is around 10% of the size of the one in Clean Slate, at around 460 bytes.

With a semi-breadth-first exporting scheme the resources can be exported in an order that will allow the use of a single pointer per resource to build the resource hierarchy as well. For example when exporting a scene resource you need to reserve enough resources in the output array for all the objects inside the scene. Then when you export an actual object you reserve another block for the splines contained in the object after the currently reserved resource blocks. This way the layout of the project will be something like S,O,O,O,O,O,s,s,s,s,s,s,s,s,s,s,s where capital S stands for a scene resource, O stands for an object resource and s stands for a spline resource. In this example the subresources pointer of the scene object would point to the first O and the subresource count would say 5 because there are 5 objects in the scene. Similarly each object would point to somewhere in the list of splines to find its own splines.
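
A minimal sketch of that reservation scheme, assuming a flat, pre-allocated output array and hypothetical ReserveBlock/ExportObject helpers working off the tool-side example resources:

const int MAX_RESOURCES = 4096; // hypothetical upper bound, for the sketch only

// flat output array plus a bump allocator over it: every reservation hands out
// the next contiguous block, which is exactly the subresource layout the player expects
static MinimalResource resourcePool[ MAX_RESOURCES ];
static int reservedCount = 0;

MinimalResource* ReserveBlock( int count )
{
  MinimalResource* block = resourcePool + reservedCount;
  reservedCount += count;
  return block;
}

void ExportObject( const Object& object, MinimalResource* target ); // same reserve-then-export dance for the object's splines

void ExportScene( const Scene& scene, MinimalResource* target )
{
  // reserve a contiguous block for the scene's objects first...
  int objectCount = (int)scene.objects.size();
  target->subResourceCount = objectCount;
  target->subResources = ReserveBlock( objectCount );

  // ...then export each object; the spline blocks they reserve land after
  // everything reserved so far, giving the S,O,O,...,s,s,... layout
  for ( int x = 0; x < objectCount; x++ )
    ExportObject( scene.objects[ x ], target->subResources + x );
}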

Resulting engine coding changes

The idea described above completely shaped the development of the new engine. We noticed a while back that our earlier code, which was closer to C than C++, tended to compile smaller, so the idea of throwing out all the virtual function stuff was already on the table. Exploring this memcpy-skip idea just kicked it to the next level. What if we only had a single generic MinimalResource class that had a type field and a list of pointers to the streams where its data started? Add some generic variables to use as a working area, a list of references to other resources (which can have special store operations in the stream descriptors) and a list of pointers for outputs, and we have a universal building block that can describe all of the example resources above:

struct MinimalResource
{
  ResourceType type;
  unsigned char* streams[ (unsigned char)ResourceStream::MAX ];

  int subResourceCount;
  MinimalResource* subResources;

  MinimalResource* references[ 256 ]; // automatically resolved on parse
  float intermediateParameters[ 64 ]; // working area for temp values calculated from streams
  void* resultArray[ 16 ]; // resource generation output where needed
};

With this comes a radical rethinking of how we write engine code: now we have to work from the raw stream data everywhere. There's a whole lot of puzzle solving involved in getting this to work, because the data in the streams is not as nicely packed as proper C structs would be, but once you manage to switch to the new way of thinking it's a lot less daunting.
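
To give a feel for what "working from the raw stream data" means in practice, here's roughly what evaluating one of the example Splines could look like when its data only lives behind the resource's stream pointers (a sketch assuming the ResourceStream enum mirrors the Streams list from earlier, and ignoring the interpolation byte in favor of plain linear interpolation):

// evaluate a Spline resource straight from its raw streams:
// no SplineKey structs anywhere, just the bytes the parser skipped over
float EvaluateSpline( const MinimalResource& spline, unsigned char time )
{
  const unsigned char keyCount = spline.streams[ (int)ResourceStream::Main ][ 0 ];
  const unsigned char* positions = spline.streams[ (int)ResourceStream::SplineKeyPositions ];
  const float* values = (const float*)spline.streams[ (int)ResourceStream::SplineKeyValues ];

  // clamp before the first and after the last key
  if ( !keyCount )
    return 0;
  if ( time <= positions[ 0 ] )
    return values[ 0 ];
  if ( time >= positions[ keyCount - 1 ] )
    return values[ keyCount - 1 ];

  // find the segment the requested time falls into and interpolate linearly
  int key = 0;
  while ( positions[ key + 1 ] < time )
    key++;

  float t = ( time - positions[ key ] ) / (float)( positions[ key + 1 ] - positions[ key ] );
  return values[ key ] + ( values[ key + 1 ] - values[ key ] ) * t;
}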

This also gets rid of all the C++ stuff we wanted to get rid of. A project can now be represented as an array of MinimalResources. For precalculation we call a Prepare() function on each resource in the array. The Prepare() function has a switch() based on the resource type and does all the necessary things like allocating buffers, rendering textures, running mesh generation, you name it. We also have a Process() function that does much the same but is aimed at in-flight use; for example, in the case of the Scene Render event the Process() function renders the appropriate scene. Coding like this is very much anti-OOP, but the resulting code is smaller for various reasons:

First, the complex parser code is GONE.

Second, all the object creation code is GONE.

Third, all the individual object initialization code is GONE.

Also, it is my suspicion that accessing data directly from the streams is similar in size to accessing member variables in an object, but I haven't verified this. If that's the case, this is a net win all over.

In the engine.

The tooling however... Well.

Fallout

So while all this sounds great from an engine perspective, it means that if we want to use the same engine in any WYSIWYG tool (WE DO), we need to work directly with the final binary format that will go into the data file (since the engine expects that data to be in the resource streams). We solved this by creating a translation layer between the tool-side fat resource descriptions and the engine-side minimal resource descriptions. When a change occurs in the tool, the affected minimal resources are rebuilt. The engine can then work the same way it will in the final release, with the same data. The main issue we faced was that in the final executable the project forms a single continuous block of MinimalResource objects, while in the tool these objects are scattered throughout memory. There's a tool-only code path to get around this, where the MinimalResource* references[ 256 ]; array is replaced with a MinimalResource** references[ 256 ]; array and a getter function changes between the engine and tool builds to always fetch the appropriate resource.
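
A rough sketch of that getter indirection (the TOOL_BUILD define and the names are made up for this article):

// engine build: MinimalResource*  references[ 256 ];  - direct pointers into the contiguous block
// tool build:   MinimalResource** references[ 256 ];  - pointers to tool-owned slots, so resources can be rebuilt and moved
// all engine code goes through this getter and never has to care which build it's in
MinimalResource* GetReference( const MinimalResource& r, int index )
{
#ifdef TOOL_BUILD
  return *r.references[ index ];
#else
  return r.references[ index ];
#endif
}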

In short, all this would have been pretty much impossible to do retroactively on a 14-year-old codebase. Being in the middle of a full rewrite anyway gave us the perfect opportunity to figure it out.

From a code organization standpoint all this helped in a way, because we now have all the tool->engine bridge functionality in one place. We also have all the individual resource stream skip descriptors built in the same file on startup for later use on export.

However, the most unexpected side effect came after 2.5 years of development, when the very first trials were done to see if this system could work at all in practice. Since the resource stream descriptors are available in the tool during export, and the MinimalResources already hold the exact data we need to export, exporting a resource is easy:

ExportResource( MinimalResource r )
    for ( x = 0 to ResourceStream::MAX )
        WriteArray( r.streams[x], STREAM_SIZE_MAGIC( r.streams[x], r ), x )

Which is to say, for each stream we check how large the current resource's data should be via the same parser skip code we'll be using in the engine, and write out exactly that many bytes to the appropriate stream.

If this wasn't cool enough, the way we can do the export is to first build an array of the MinimalResources required in the project and just go through that array one by one to do the actual export. Remember how the project in the player is just an array of MinimalResources now? It's the same thing. This means we just built a reference project we can compare against. If we could only figure out how the parser code in the player will load the project... Which we can! Just run the ParseProject() function on the newly created minimal binary blob to get a second array of MinimalResources. This new array and the original reference can then be compared to each other resource by resource, stream by stream, using the stream descriptors to determine how many bytes need to be compared. If any of the streams don't match up, there's either an export or an import failure. ON-EXPORT VALIDATION, basically for free. This was huge, and it gets rid of essentially ALL player-side import debugging, which was the second big thing we wanted to achieve.
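
In sketch form the validation is just this, reusing the same descriptor-driven size logic (STREAM_SIZE_MAGIC as in the earlier pseudocode) that the player uses on load:

#include <string.h> // memcmp

// compare the tool-side reference project against what the player-side parser
// rebuilt from the binary blob, resource by resource, stream by stream
bool ValidateExport( const MinimalResource* reference, const MinimalResource* parsed, int resourceCount )
{
  for ( int x = 0; x < resourceCount; x++ )
    for ( int stream = 0; stream < (int)ResourceStream::MAX; stream++ )
    {
      int size = STREAM_SIZE_MAGIC( reference[ x ].streams[ stream ], reference[ x ] );
      if ( size && memcmp( reference[ x ].streams[ stream ], parsed[ x ].streams[ stream ], size ) )
        return false; // either the exporter or the importer mangled this resource's data
    }
  return true;
}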

But there's more: since the project is just an array of MinimalResources, we can export a text file listing each resource with its resource index, name, guid, type - whatever information we need. This solves the resource identification problem where there's no way of knowing what each resource is in the player. For the occasional debugging session while getting the exporter in tune this is invaluable, and it allowed us to track down an issue during the development of Empires in about 10 minutes instead of hours.

So. That's the story of how an idea for a better minimal file format changed the complete architecture of our engine and some of our tools. It was quite a journey and it's still not over, but man, was it worth it.

A note on floating point values

Didn't know where to squeeze this in, so here goes. We use 16-bit floating point values everywhere in our minimal format. This used to be easy with D3DX, but with that gone from the compo rules we needed a fix. The original solution was some bulky code from the internet, but there's a better way:

#include <immintrin.h> // F16C conversion intrinsics (_mm_cvtps_ph / _mm_cvtph_ps)

struct Float16
{
  unsigned short value;
};

// convert a single float to an IEEE 754 half using the F16C instruction set
__declspec( noinline ) Float16 FloatToHalf( const float& value )
{
  __m128 float_vector = _mm_set_ss( value );
  __m128i half_vector = _mm_cvtps_ph( float_vector, _MM_FROUND_CUR_DIRECTION );
  int resultInt = _mm_extract_epi16( half_vector, 0 );
  return *(Float16*)&resultInt;
}

// convert a stored half back to a float
__declspec( noinline ) float HalfToFloat( const unsigned short& half_value )
{
  __m128i half_vector = _mm_set1_epi16( half_value );
  __m128 float_vector = _mm_cvtph_ps( half_vector );
  float result;
  _mm_store_ss( &result, float_vector );
  return result;
}
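
Usage is then just a round-trip through these helpers, for example when packing a value for the datafile and expanding it again on load:

Float16 packed = FloatToHalf( 3.14159f );
float restored = HalfToFloat( packed.value ); // ~3.1406 - close enough for animation data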

Note that if you're storing half values you'll need to make sure the tooling also uses them in the appropriate places; otherwise you'll get discrepancies between the engine and the release, which will probably surface at the worst time before a deadline and manifest as cameras clipping into objects, objects being slightly in the wrong location and other fun stuff like that :)