GSoC Week 3: Master Of Buffers

04 Jul 2022

This blog post is related to my GSoC 2022 project.

This week, I worked on Part I of my GSoC project, i.e. the buffer functions. During this time, I refactored most of the functions and worked out how to make them conform to the way they behave in GameMaker Studio (GMS).

Firstly, to make serialization and deserialization of data more consistent and easy to refactor, I implemented two functions to do the job: serialize_to_type() and deserialize_from_type().

These two functions are pretty simple:

serialize_to_type(): takes a variant, the type it is storing, and serializes it as a vector of bytes. For integers, the serialization is done in big endian form, where the most significant byte of the integer comes first in the stream, and the least significant byte comes last.

There is also an interesting issue here: as all numeric values in ENIGMA are stored as doubles, large signed and unsigned integer values have to be handled specially. Before commit #2e8303486, the value stored in rval.d was always converted to a signed integer. The problem is that a signed integer reserves its top bit for the sign, so unsigned values greater than the signed maximum value (which have the topmost bit set) cannot be represented, and they ended up halved when converted to a signed integer.

The fix is fairly straightforward: check if rval.d is greater than the signed maximum value, and if so, convert to unsigned instead. This does not cause any issues, as rval.d will only be greater than the signed maximum if an unsigned value is stored; a signed integer would have wrapped around to a negative value, so rval.d would be less than the maximum.
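The check can be sketched roughly as follows. This is a simplified illustration rather than ENIGMA's actual code: double_to_bits is a hypothetical helper, a plain double stands in for rval.d, and a 64-bit integer is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of ENIGMA): turn the double held by a variant
// into the 64-bit pattern that would then be serialized.
std::uint64_t double_to_bits(double d) {
  // 2^63: the first value that no longer fits in a signed 64-bit integer.
  constexpr double signed_limit = 9223372036854775808.0;
  // Assumption: the value fits in either int64_t or uint64_t.
  assert(d >= -signed_limit && d < 2.0 * signed_limit);
  if (d >= signed_limit) {
    // Only an unsigned value can be this large, so convert as unsigned.
    return static_cast<std::uint64_t>(d);
  }
  // Everything else, including negative values, goes through the signed type.
  return static_cast<std::uint64_t>(static_cast<std::int64_t>(d));
}
```

With this split, a negative value keeps its two's complement bit pattern, while a large unsigned value keeps its top bit instead of being forced through the signed type.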

Unlike GMS, ENIGMA treats buffer_string and buffer_text as equivalent: both consist of a stream of characters terminated by a null terminator. In GMS, only buffer_string is null-terminated.

deserialize_from_type(): takes a pair of std::vector<std::byte>::iterators and the type being converted to, and returns a variant containing the deserialized value. It checks that the iterator range exactly matches the size of the type being deserialized.
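To make the byte order concrete, here is a minimal sketch of a big-endian round trip for integers. The names to_big_endian and from_big_endian are made up for this example, and the real functions work on variants and buffer types rather than on raw integers and sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: serialize the low `size` bytes of `value` in
// big-endian order, i.e. most significant byte first.
std::vector<std::byte> to_big_endian(std::uint64_t value, std::size_t size) {
  assert(size >= 1 && size <= 8);
  std::vector<std::byte> bytes(size);
  for (std::size_t i = 0; i < size; i++) {
    const std::size_t shift = 8 * (size - 1 - i);
    bytes[i] = static_cast<std::byte>((value >> shift) & 0xFF);
  }
  return bytes;
}

// Hypothetical helper: rebuild the integer from an iterator range. The range
// must be exactly `size` bytes long, similar to the check described above.
std::uint64_t from_big_endian(std::vector<std::byte>::const_iterator first,
                              std::vector<std::byte>::const_iterator last,
                              std::size_t size) {
  assert(static_cast<std::size_t>(last - first) == size);
  std::uint64_t value = 0;
  for (auto it = first; it != last; ++it) {
    value = (value << 8) | std::to_integer<std::uint64_t>(*it);
  }
  return value;
}
```

For example, serializing 0x12345678 as a 4-byte value yields the bytes 12 34 56 78 in that order, and reading them back gives the original number.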

After this, I did some refactoring, where I renamed the macros get_buffer and get_bufferr to GET_BUFFER and GET_BUFFER_R respectively. These are the macros used to access buffers, and in debug mode, they are responsible for checking if the buffer being accessed actually exists. Then, I changed the global enigma::buffers variable to be an AssetArray<BinaryBufferAsset> instead of a std::vector<BinaryBuffer*>.

AssetArray is effectively to ENIGMA what std::vector is to the C++ standard library. It allows making a container of assets, which require three methods to be implemented:

getAssetTypeName()
isDestroyed()
destroy()

These methods make it possible to delete elements from the sequence without having to use the erase-remove idiom, which can be quite slow as it is O(n) in terms of time complexity. To implement these methods without changing the BinaryBuffer class, I created the BinaryBufferAsset class. This class simply wraps a BinaryBuffer inside a std::unique_ptr, and the methods manipulate the smart pointer. A null pointer is taken to mean that the BinaryBuffer within has been “destroyed”.
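The idea can be sketched with a toy container. Everything below is an assumption made for illustration: ToyBuffer, ToyBufferAsset and ToyAssetArray are invented stand-ins, and the real AssetArray and BinaryBufferAsset have a richer interface.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for BinaryBuffer; the real class carries far more state.
struct ToyBuffer {
  std::vector<std::byte> data;
};

// Wrapper in the spirit of BinaryBufferAsset: a null pointer means "destroyed".
class ToyBufferAsset {
 public:
  explicit ToyBufferAsset(std::unique_ptr<ToyBuffer> buffer)
      : buffer_(std::move(buffer)) {}

  static const char* getAssetTypeName() { return "buffer"; }
  bool isDestroyed() const { return buffer_ == nullptr; }
  void destroy() { buffer_.reset(); }

 private:
  std::unique_ptr<ToyBuffer> buffer_;
};

// Toy container, NOT ENIGMA's AssetArray: deleting an asset only marks its
// slot as destroyed, so no elements are shifted and other ids stay valid.
template <typename T>
class ToyAssetArray {
 public:
  std::size_t add(T asset) {
    assets_.push_back(std::move(asset));
    return assets_.size() - 1;
  }
  bool exists(std::size_t id) const {
    return id < assets_.size() && !assets_[id].isDestroyed();
  }
  void destroy(std::size_t id) {
    assert(id < assets_.size());
    assets_[id].destroy();
  }

 private:
  std::vector<T> assets_;
};
```

Destroying a buffer here is a constant-time reset of one smart pointer, and every other buffer keeps its id.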

I also gave types to three of the enums which are used for the buffer constants:

buffer_data_t: this stores the various data types which the buffer can store, like buffer_u8, buffer_u16, buffer_string and so on.
buffer_seek_t: this stores the three ways in which a buffer can seek to a position, namely buffer_seek_start, buffer_seek_relative and buffer_seek_end.
buffer_type_t: this stores the type of the buffer itself, i.e. how it behaves when it comes to accessing data and resizing itself. This enum consists of buffer_grow, buffer_wrap, buffer_fixed and buffer_fast. The constant buffer_vbuffer is currently not stored in this enum.

Previously, all these values were just passed around as standard integers. The problem with this was that you could pass the wrong type of value to a function and there would be no error as the enum value would be implicitly cast to an int. In fact, it was possible to not even realize that there was an error, as the erroneous value could have had the same integer value as the correct type of value.
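A small sketch of why typed parameters help. The enumerator names come from the lists above, but the numeric values, the use of plain (unscoped) enums and the seek_base_name function are assumptions made for this example.

```cpp
#include <cassert>
#include <type_traits>

// Illustrative values only; two constants deliberately share the value 1.
enum buffer_data_t { buffer_u8 = 1, buffer_u16 = 2 };
enum buffer_seek_t { buffer_seek_start = 0, buffer_seek_relative = 1, buffer_seek_end = 2 };

// Hypothetical function: with a typed parameter, only a buffer_seek_t is accepted.
const char* seek_base_name(buffer_seek_t base) {
  switch (base) {
    case buffer_seek_start:    return "start";
    case buffer_seek_relative: return "relative";
    case buffer_seek_end:      return "end";
  }
  assert(false && "unknown buffer_seek_t value");
  return "";
}

// A named enum still converts to int implicitly...
static_assert(std::is_convertible_v<buffer_seek_t, int>);
// ...but neither an int nor a different enum converts to it, so passing
// buffer_u8 where buffer_seek_relative was meant is now a compile error,
// even though both have the value 1.
static_assert(!std::is_convertible_v<int, buffer_seek_t>);
static_assert(!std::is_convertible_v<buffer_data_t, buffer_seek_t>);
```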

After that, I added the ability to have sub-directories of tests within the SimpleTests/ directory. Previously, the test harness would only check for directories ending with .sog in the top level directory. However, as I have decided to give each buffer function its own separate test, having them all in the main tests directory would crowd it and make it harder to navigate through the tests. Therefore, when the test harness finds a directory ending with the .multi extension, it considers all the directories within that directory as tests too. This means that I can group all the buffer tests, each of which begins with the “buffer_” prefix, into the buffers.multi sub-directory.
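The discovery rule can be sketched with std::filesystem. This is only an illustration of the rule described above, not the harness's real code; find_tests is a made-up name.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical sketch: collect every ".sog" directory at the top level, plus
// every directory found directly inside a ".multi" directory.
std::vector<fs::path> find_tests(const fs::path& root) {
  assert(fs::is_directory(root));
  std::vector<fs::path> tests;
  for (const auto& entry : fs::directory_iterator(root)) {
    if (!entry.is_directory()) continue;
    const std::string ext = entry.path().extension().string();
    if (ext == ".sog") {
      tests.push_back(entry.path());
    } else if (ext == ".multi") {
      // Every directory inside a group counts as a test of its own.
      for (const auto& sub : fs::directory_iterator(entry.path())) {
        if (sub.is_directory()) tests.push_back(sub.path());
      }
    }
  }
  return tests;
}
```

Directories without either extension are ignored, so unrelated folders in the tests directory do not get picked up.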

Finally, after having set all this up, I began work on the buffer functions themselves. The problems here did not come from the implementation being complex (in fact, the functions are quite trivial in terms of functionality). They came from wanting to conform with how GMS works, so that transitioning from it to ENIGMA would be easier.

So, a quick changelog of each function:

buffer_exists(): no noteworthy changes
buffer_create(): refactor into a call to make_new_buffer, which deals with AssetArray
buffer_delete(): use AssetArray’s “destroy()” method
buffer_read(): align to buffer’s alignment before reading
buffer_write(): align to buffer’s alignment before writing
buffer_fill(): make it not completely broken, align to buffer’s alignment before writing
buffer_seek(): seek to GetSize() - 1 + offset when using buffer_seek_end
buffer_tell(): no noteworthy changes
buffer_peek(): use deserialize_from_type, read integral types as double instead of long
buffer_poke(): add flag to control automatic resizing of buffer_grow, refactor into a call to write_to_buffer()
buffer_load(): make it not completely broken, switch to a call to std::transform instead of std::ifstream::read
buffer_load_ext(): switch to a call to std::transform instead of std::ifstream::read, refactor into a call to write_to_buffer()
buffer_get_type(): no noteworthy changes
buffer_get_alignment(): no noteworthy changes
buffer_get_address(): no noteworthy changes
buffer_get_size(): no noteworthy changes
buffer_resize(): no noteworthy changes
buffer_sizeof(): no noteworthy changes
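As one example from the changelog, the buffer_seek() rule can be sketched like this. Only the buffer_seek_end formula comes from the list above; the other two cases, the seek_target name and the lack of bounds handling are assumptions made for this sketch.

```cpp
#include <cassert>

enum buffer_seek_t { buffer_seek_start, buffer_seek_relative, buffer_seek_end };

// Hypothetical helper: compute the position a seek would land on.
// `size` plays the role of GetSize(), `position` is the current position.
// Clamping or wrapping of out-of-range results is left out on purpose.
long seek_target(buffer_seek_t base, long offset, long position, long size) {
  assert(size > 0);
  switch (base) {
    case buffer_seek_start:    return offset;
    case buffer_seek_relative: return position + offset;
    case buffer_seek_end:      return size - 1 + offset;
  }
  return position;
}
```

With a 16-byte buffer, seeking to buffer_seek_end with an offset of 0 lands on position 15, the last byte, rather than one past the end.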

Out of all of these, buffer_fill() has caused the most pain so far. While not really difficult to implement, getting it to conform (mostly) with GMS’s implementation was not fun. In fact, I have made 6 commits so far, updating it whenever I find a new edge case and test it in GMS to make sure my implementation conforms properly.

I think the reason that I have been having these issues is mainly because of buffer alignment and how I think about it versus how GMS thinks about it. Initially, I implemented buffer alignment by adding padding bytes after the written data, so that the next element being written was always aligned.

What is interesting is that GMS does the opposite: it writes padding bytes until the data being written would be aligned, and only then does it write the data itself. This means that instead of writing padding for the next element, GMS writes padding for the current element. Logically, this makes more sense as it uses less space. However, it was confusing when I first encountered it, as I saw buffer_fill() not write zeroes after it wrote data to a buffer.
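The GMS-style behaviour, as I understand it from testing, can be sketched with a toy writer. AlignedWriter is a made-up type for this example and not ENIGMA's BinaryBuffer; it only models padding before a write.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy buffer: pads *before* each write so that the value being written
// starts on a multiple of `alignment`.
struct AlignedWriter {
  std::size_t alignment;
  std::vector<std::byte> data;

  void write(const std::vector<std::byte>& value) {
    assert(alignment > 0);
    // Zero padding for the current element, only if the position is unaligned.
    while (data.size() % alignment != 0) data.push_back(std::byte{0});
    data.insert(data.end(), value.begin(), value.end());
  }
};
```

With an alignment of 4, writing one byte and then two bytes gives a 6-byte buffer: one data byte, three padding bytes, then two data bytes. Padding after each write, as in my first implementation, would have produced 8 bytes instead.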

I have tried to validate the way these functions work as much as I can, by running GMS inside a Windows VM to test how the reference implementation works, and also by writing as many tests as I can under the buffers.multi group. Hopefully, these tests cover enough edge cases so that the buffer functions work properly in normal use. The functions currently remaining for refactoring are:

buffer_save()
buffer_save_ext()
buffer_copy()
buffer_get_surface()
buffer_set_surface()

The functions currently not implemented at all are:

buffer_create_from_vertex_buffer()
buffer_create_from_vertex_buffer_ext()
buffer_save_async()
buffer_load_async()
buffer_load_partial()
buffer_compress()
buffer_decompress()
buffer_async_group_begin()
buffer_async_group_option()
buffer_async_group_end()
buffer_copy_from_vertex_buffer()
buffer_md5()
buffer_sha1()
buffer_crc32()
buffer_base64_encode()
buffer_base64_decode()
buffer_base64_decode_ext()
buffer_set_used_size()

Of these, the ones I will definitely implement are:

buffer_compress()
buffer_decompress()
buffer_md5()
buffer_sha1()
buffer_crc32()
buffer_base64_encode()
buffer_base64_decode()
buffer_base64_decode_ext()

I am also interested in buffer_load_partial() and buffer_set_used_size(). Hopefully, I can finish these functions this week, and get back to working on the EDL parser, so that I can have as much time as possible to work on Part II of my GSoC project before college begins.

Incompatibilities with GMS:

buffer_text: not null-terminated in GMS, whereas ENIGMA considers it equivalent to buffer_string
buffer_grow: GMS resizes by multiplying the previous size by 2, ENIGMA resizes by adding 1 to the previous size
buffer_get_size(): GMS does not store any extra bytes, ENIGMA always has an extra byte at the end where the next byte will be written because of how BinaryBuffer::Seek() works
buffer_poke(): ENIGMA resizes a buffer_grow buffer when the resize flag is passed
buffer_save(): ENIGMA saves the whole buffer, whereas GMS seems to only save the buffer up to the point where the last byte was written

dc03
