GSoC Week 10: Serialization and Deserialization

This blog post is related to my GSoC 2022 project.

In the previous week, I was able to get a little bit of work done on the serialization and deserialization process over the weekend. I implemented functions to serialize and deserialize integral and floating types, and variant.

There are two kinds of serialization: in-place and allocating. The in-place functions directly take a byte stream to write to, and assume that there is enough space in the stream beginning from the provided pointer to store the serialized form of the data. The allocating functions make a std::vector<std::byte> and after resizing to the required size, pass it to the in-place functions then return it.

The deserialization functions do not have to deal with in-place or allocating functions, they simply start reading the byte stream from the provided pointer and return the result by value. All of the work is neatly encapsulated behind functions and a set of macros (ENIGMA_INTERNAL_OBJECT_*, which assume code with particular variable declarations).

There is some type magic going on behind the scenes to uniformly represent serialization for these three types of types. At compile time, depending on the type of the provided argument, the correct routine is chosen using the if constexpr construct and the std::is_same_v type trait (to test equivalence of types). These routines then do type-specific work.

For serialization of types whose size is known statically, there is a base type used which the given value is bitwise cast to, i.e. a memcpy from the value to an object of the base type. This base type is chosen such that bitwise operations can be done on it, and they do not introduce any problems. For example, for signed integers, their unsigned form is chosen, because trying to shift right with signed integers can cause sign extension, which is not the case with unsigned integers. For floating point numbers, no bitwise operations are possible, so the equally-sized unsigned integer types are used. For variant, the type tag is written before the data, to know how to deserialize it. As it stands currently, after each value is converted to their base type, the base type is just written LSB-first to the byte stream.

There are three helper functions provided, resize_buffer_for_*, to automate the process of resizing the buffer to the correct size before writing data to it. Together with the serialization routines, these form the basis for serialization of any data using the macros given earlier.

These methods are used in the object heirarchy which represents the game entities in terms of using inheritance for “components”, the heirarchy being:

  1. struct OBJ_base (the type defined for the user by the compiler depending on the object name)

  2. struct object_locals

  3. struct event_parent

  4. struct object_collisions

  5. struct object_transform

  6. struct object_graphics

  7. struct object_timelines

  8. struct object_planar

  9. struct object_basic

Most of these nine classes have data which needs to be serialized and the lower six classes (4 - 9) are quite trivial. This is because these are statically defined within ENIGMAsystem/SHELL/Universal_System/Object_Tiers/, whereas the top three classes are dynamically generated for the game in IDE_EDIT_objectdeclarations.h and IDE_EDIT_evparent.h. Of these, the top two are the hardest, because they require keeping track of user defined variables and potentially undeclared variables which must be instantiated at runtime. object_locals keeps track of access to undeclared variables, and OBJ_base stores the user-defined variables defined within the event.

I have begun work on generating the code for OBJ_base, which currently uses the dectrip system to represent variable types, which makes analysis harder to do. This is because dectrip represents the type as three parts: the type specifier (referred to as type, the part of the declarator which occcurs before the variable name (referred to as prefix), and the part of the declarator which appears after the variable name (referred to as suffix). This means that it does not store the structure of the declaration itself, only the string representation of it. Thus, it is not as easy to tell when a value is a pointer or a reference, or neither.

While this system will likely get ported to the new enigma::parsing::FullType, there is not enough time left in the project to finish the conversion. Thus, I will likely finish the serialization and deserialization work using the old system, and look at porting it over after GSoC ends. For now, the focus is on getting the generated code to work for the basic types, and also implement the serialization and deserialization for the main var, which is likely used more than any other.

In this week, I will probably spend the next 2-3 days working on the code generation if I get an idea of the implementation, after which I will have my exams (from August 28 - September 3). Thus, I will only be available after this period, and at that point only 9 days will be remaining in the GSoC timeline for my project.