GSoC Week 4: Buffer, buffer on the wall, who's the fairest of them all?

This blog post is related to my GSoC 2022 project.

In the previous week, I worked on finishing up the functions I covered in last week’s writeup. These included functions for saving buffers, copying them, calculating checksums, compressing / decompressing them, and encoding them.

buffer_copy, buffer_save and buffer_save_ext are pretty mundane functions, and they work intuitively.

Both of the buffer save functions truncate the file on writing, i.e. the previous contents of the file are cleared before writing to it. One interesting optimization that comes into play here is using the native std::ofstream::write function versus using std::transform with a std::ostreambuf_iterator. The problem lies in the buffers being a std::vector<std::byte>, whereas the file writing functions expect raw const char * data.

With std::ofstream::write, the code looks like this:

file.write(reinterpret_cast<const char *>(bytes.data()), size);

With std::transform, the code looks like this:

std::transform(bytes.begin(), bytes.end(), std::ostreambuf_iterator<char>(file),
               [](std::byte b) { return static_cast<char>(b); });

where file is a std::ofstream and bytes is a std::vector<std::byte>.

With std::transform, which I would say is the idiomatic way to do things, the bytes are written one by one through the stream iterator. This makes fast batch writes harder: instead of a single bulk write, many small writes are issued at a slower overall speed. Therefore, I made two functions, write_to_file and read_from_file, which do a reinterpret_cast before calling std::ofstream::write and std::ifstream::read respectively.

While I have not benchmarked this, it should allow the filesystem or the kernel to schedule writes and reads more efficiently instead of handling them one byte at a time. The only problem is that these functions are pretty unsafe: they take a raw pointer and a number of bytes, so there is no way to verify that accessing that memory is well-defined. It is up to the caller to ensure that everything stays within the correct bounds.
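
To make the intent concrete, here is a minimal sketch of what such helpers might look like; the exact signatures are my assumption, not necessarily the actual ENIGMA code:

#include <cstddef>
#include <fstream>

// Sketch: bulk-write `size` bytes starting at `data` to an output stream.
// The caller must guarantee that [data, data + size) is valid memory.
void write_to_file(std::ofstream &file, const std::byte *data, std::size_t size) {
  file.write(reinterpret_cast<const char *>(data),
             static_cast<std::streamsize>(size));
}

// Sketch: bulk-read `size` bytes from an input stream into `data`.
void read_from_file(std::ifstream &file, std::byte *data, std::size_t size) {
  file.read(reinterpret_cast<char *>(data),
            static_cast<std::streamsize>(size));
}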

With these three functions done, I moved on to buffer_base64_encode and buffer_base64_decode, which (again, pretty intuitively) handle encoding and decoding a buffer in the base64 format. For these functions, I wrote two helpers to abstract the logic, internal_buffer_base64_encode and internal_buffer_base64_decode, though I do wonder how useful they really are given that they could just be inlined into the main functions. Anyway, base64 encoding is quite interesting. It is not meant for encrypting or compressing data; rather, it ensures that data can be transmitted between machines whose only shared guarantee is that they understand the ASCII encoding.

To do this, base64 considers the input sequence as triplets of bytes (i.e. 24 bits) and converts each into a quadruplet of 6-bit units. The reason 3 bytes are used is that 24 is the lowest common multiple of 8 bits and 6 bits, and the reason 6-bit units are used is that they can store values from 0 to 63, which means each possible value can be mapped conveniently onto an alphabet of 26 uppercase ASCII letters, 26 lowercase ASCII letters, 10 ASCII digits and 2 extra characters (usually + and /). These characters are then written to a string and transmitted.
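
As a rough sketch of the core idea (my own illustration, not the ENIGMA implementation), encoding one triplet looks something like this:

#include <cstdint>
#include <string>

// The 64-character base64 alphabet, in the order described above.
static const char kAlphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Sketch: pack three bytes into 24 bits, then emit four 6-bit units.
std::string encode_triplet(std::uint8_t a, std::uint8_t b, std::uint8_t c) {
  const std::uint32_t n =
      (std::uint32_t(a) << 16) | (std::uint32_t(b) << 8) | c;
  std::string out;
  out += kAlphabet[(n >> 18) & 0x3F];  // bits 23..18
  out += kAlphabet[(n >> 12) & 0x3F];  // bits 17..12
  out += kAlphabet[(n >> 6) & 0x3F];   // bits 11..6
  out += kAlphabet[n & 0x3F];          // bits 5..0
  return out;
}

For example, encode_triplet(115, 116, 114), i.e. "str", yields "c3Ry", matching the first example below.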

The mapping from 6-bit values to ASCII follows exactly the order described above: values 0 to 25 map to A to Z, 26 to 51 map to a to z, 52 to 61 map to 0 to 9, 62 maps to + and 63 maps to /.

In case the size of the input is not a multiple of 3, two situations can occur:

- One byte is left over: it is padded with zero bits to fill two 6-bit units, producing two output characters followed by two = padding characters.
- Two bytes are left over: they are padded with zero bits to fill three 6-bit units, producing three output characters followed by one = padding character.

Here are examples for the unpadded case and for each padded case:

Example 1: input "str" (length a multiple of 3, no padding)

    input           s        t        r
    byte values     115      116      114
    bits            011100  110111  010001  110010
    6-bit values    28      55      17      50
    output          c       3       R       y        -> "c3Ry"

Example 2: input "st" (one padding character)

    input           s        t
    byte values     115      116
    bits            011100  110111  010000  000000
    6-bit values    28      55      16      0
    output          c       3       Q       =        -> "c3Q="

Example 3: input "s" (two padding characters)

    input           s
    byte values     115
    bits            011100  110000  000000  000000
    6-bit values    28      48      0       0
    output          c       w       =       =        -> "cw=="

During decoding, the exact reverse is done: each group of four 6-bit units is converted back into a triplet of bytes, using the inverse of the mapping described above. Note that the padding character (=) is considered to map to the value 0.

It is also possible that the stream is not properly terminated with padding characters, in which case the string may have 2 or 3 extra characters at the end. It can never have exactly one extra character, as each byte maps to at least 2 base64 characters (8 bits require at least two 6-bit units, i.e. 12 bits).
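
A sketch of the inverse mapping (again my own illustration), with = and any unrecognized character decoding to 0 as described above:

#include <array>
#include <cstdint>

// Sketch: build the inverse of the base64 alphabet once. Every entry
// defaults to 0, so '=' (and any junk byte) decodes to the value 0.
std::array<std::uint8_t, 256> make_inverse_table() {
  std::array<std::uint8_t, 256> inv{};
  const char *alphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  for (int i = 0; i < 64; ++i)
    inv[static_cast<unsigned char>(alphabet[i])] = static_cast<std::uint8_t>(i);
  return inv;
}

// Sketch: decode four characters back into 24 bits, i.e. three bytes.
void decode_quadruplet(const char in[4], std::uint8_t out[3],
                       const std::array<std::uint8_t, 256> &inv) {
  const std::uint32_t n =
      (std::uint32_t(inv[static_cast<unsigned char>(in[0])]) << 18) |
      (std::uint32_t(inv[static_cast<unsigned char>(in[1])]) << 12) |
      (std::uint32_t(inv[static_cast<unsigned char>(in[2])]) << 6) |
       std::uint32_t(inv[static_cast<unsigned char>(in[3])]);
  out[0] = (n >> 16) & 0xFF;
  out[1] = (n >> 8) & 0xFF;
  out[2] = n & 0xFF;
}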

With these done, I moved on to CRC-32. This one is very straightforward: it just uses the crc32() function from the zlib library, which is already a dependency of ENIGMA. One problem, however, is that the generator polynomial zlib uses for its CRC-32 calculation is different from the one used by GameMaker Studio, so I have no way of knowing exactly how the GMS version behaves. I have tried to approximate what I think makes sense: wrapping the offset with buffer_wrap and truncating when the range goes past the end of the buffer.
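
For reference, computing a checksum with zlib looks roughly like this (a sketch of the library usage, not the actual ENIGMA code):

#include <cstddef>
#include <vector>
#include <zlib.h>

// Sketch: CRC-32 of a whole byte buffer using zlib. zlib implements the
// standard CRC-32 polynomial, which per the above is not the one GMS uses.
unsigned long buffer_checksum(const std::vector<std::byte> &bytes) {
  unsigned long crc = crc32(0L, Z_NULL, 0);  // obtain the initial value
  return crc32(crc, reinterpret_cast<const Bytef *>(bytes.data()),
               static_cast<uInt>(bytes.size()));
}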

Also using zlib, I implemented buffer_compress and buffer_decompress, using the standard Z_DEFAULT_COMPRESSION level which is also used by GMS. Both of these functions process the data in 16 KiB chunks.
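
A condensed sketch of the compression side, assuming the whole input buffer is available up front (illustrative only; error handling omitted):

#include <cstddef>
#include <vector>
#include <zlib.h>

// Sketch: deflate a buffer in 16 KiB output chunks at the default level.
std::vector<std::byte> compress_bytes(const std::vector<std::byte> &in) {
  constexpr std::size_t kChunk = 16384;  // 16 KiB per deflate() call
  z_stream strm{};                       // zalloc/zfree = nullptr -> defaults
  deflateInit(&strm, Z_DEFAULT_COMPRESSION);

  strm.next_in = reinterpret_cast<Bytef *>(const_cast<std::byte *>(in.data()));
  strm.avail_in = static_cast<uInt>(in.size());

  std::vector<std::byte> out;
  std::byte chunk[kChunk];
  int ret;
  do {
    strm.next_out = reinterpret_cast<Bytef *>(chunk);
    strm.avail_out = static_cast<uInt>(kChunk);
    ret = deflate(&strm, Z_FINISH);  // all input was provided above
    out.insert(out.end(), chunk, chunk + (kChunk - strm.avail_out));
  } while (ret != Z_STREAM_END);

  deflateEnd(&strm);
  return out;
}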

I then implemented the MD5 and SHA1 checksum functions, buffer_md5 and buffer_sha1. Unlike the functions before, these do not wrap the offset when used with buffer_wrap; instead, they return the checksum of an empty stream. The problem with these functions lay in deciding whether or not to add a dependency. OpenSSL is too large to pull in for just 2 functions, and while the Boost libraries do have these algorithms, they are part of Boost's internal workings, so their API is not stable. The same issue arose with other libraries: along with adding another dependency, there would also have to be licence considerations.

Therefore, I decided to use the reference implementations from the RFCs for the MD5 and SHA1 algorithms (RFC 1321 and RFC 3174 respectively). This way, there would be no issues with having to add additional licences for third party code. While the SHA1 code was pretty easy to integrate, the MD5 code was so old that it used K&R-style (pre-ANSI) function declarations, where the types of the parameters are specified separately from their names, after the parameter list itself (e.g. a definition like int f (a, b) int a; char *b; { ... }). It also had a macro, PROTO_LIST, which omits the parameters from function prototypes if the compiler being used does not support prototypes. So, I modernized the code a little, using fixed-width integer types in both the MD5 and SHA1 code, along with modern function declarations.

buffer_get_surface and buffer_set_surface are pretty simple. The former copies a surface's texture data to a buffer, and the latter does the opposite, writing a buffer to a surface's texture. Simply put, a surface is an area that can be drawn on, like the viewport the user sees on their monitor. Each pixel's colour value is stored in ENIGMA in the BGRA format with 1 byte per component, which means 4 bytes are required per pixel.
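
In concrete terms (a hypothetical helper, just to illustrate the layout), a w x h surface needs w * h * 4 bytes, and a pixel is addressed like this:

#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: one 8-bit-per-channel BGRA pixel.
struct Pixel { std::uint8_t b, g, r, a; };

// Sketch: fetch the pixel at (x, y) from a w-pixels-wide BGRA buffer.
Pixel get_pixel(const std::vector<std::uint8_t> &data, std::size_t w,
                std::size_t x, std::size_t y) {
  const std::size_t off = (y * w + x) * 4;  // 4 bytes per pixel
  return {data[off], data[off + 1], data[off + 2], data[off + 3]};
}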

buffer_load_partial is also quite straightforward: abort if the file offset is greater than the file size, truncate if the bytes being read would go beyond the end of the file, grow the destination buffer if it is a buffer_grow buffer, and wrap the offset for buffer_wrap.

Finally, I implemented some string functions (at the request of @R0bert) using the MD5 and SHA1 implementations I had integrated. These are md5_string_utf8, md5_string_unicode and md5_file, along with their SHA1 counterparts sha1_string_utf8, sha1_string_unicode and sha1_file.

These functions are very straightforward too, just calculating the checksums of the strings or files passed to them. With the _unicode functions, the string is treated as UTF-16LE (little endian) and is first converted into a byte stream before its checksum is calculated.
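
A sketch of that conversion, assuming the string arrives as UTF-16 code units (the helper name is mine):

#include <cstdint>
#include <string>
#include <vector>

// Sketch: serialize UTF-16 code units as UTF-16LE bytes, low byte first,
// so the result can be fed to the MD5/SHA1 routines as a byte stream.
std::vector<std::uint8_t> utf16le_bytes(const std::u16string &s) {
  std::vector<std::uint8_t> bytes;
  bytes.reserve(s.size() * 2);
  for (const char16_t u : s) {
    bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));         // low byte
    bytes.push_back(static_cast<std::uint8_t>((u >> 8) & 0xFF));  // high byte
  }
  return bytes;
}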

Now that the only thing remaining in Part I of my project is documenting the buffer functions, I will shift my focus back to the EDL compiler. I have already worked on finishing the BNF grammar for EDL, and am coordinating with @Josh on finishing up the parser implementation. Therefore, I will work on the EDL parser this week, and will also try to finish the buffer functions documentation using Doxygen comments in the buffers.h header file, as the ENIGMA wiki seems to be in a non-functional state at the moment.

Incompatibilities with GMS:

- buffer_crc32: zlib's generator polynomial differs from the one used by GameMaker Studio, so the resulting checksums will not match; the buffer_wrap wrapping and truncation behaviour is also my best guess.