GSoC Week 2: Parse Me Like One Of Your French Buffers

This blog post is related to my GSoC 2022 project.

As one can probably guess from the title, I spent the previous week working mainly on the EDL compiler, along with starting work on the binary buffer system.

First, @Josh and I worked on setting up a BNF grammar for EDL. This grammar contains a mostly complete expression and statement grammar. Currently, type casting and declarators are not finished, as they are the most complex part of the grammar. I will need to spend more time reading the C++ declarator grammar to fully understand how it works before I can implement it in the EDL spec.

Hopefully, once the BNF is done, it can be put in the wiki as a reference for anyone interested in making a parser for EDL, and also as a way to verify that the implementation is working properly. Speaking of which, I worked on implementing the general expression parser and the various statement parsers this week.

The EDL parser follows a standard RDP (Recursive Descent Parsing) strategy, with an interesting technique for handling token consumption. The caller is responsible for setting the token to the next token in the sequence, and the callee can decide whether or not it wants to consume that token. This state is handled through the member variable token, which stores the next token in the sequence. If the callee wants to consume a token, it simply updates token with token = lexer->ReadToken(), which loads the next token in the sequence for consideration.

This system is quite different from what I have learned mainly from reading Crafting Interpreters and some other literature, where the callee is responsible for both checking if it wants to consume the token and advancing from the current token. For instance, instead of something like

void callee() {
    // Acceptance style: the callee looks ahead at the upcoming token
    // and is responsible for advancing past it itself.
    if (peek() == SOME_TYPE) {
        advance();
    }
}

void caller() {
    callee();
}

the following would be done:

void callee() {
    // Rejection style: token already holds the token to consider;
    // consuming it means loading the next token into token.
    if (token == SOME_TYPE) {
        token = advance();
    }
}

void caller() {
    token = advance();  // prime token before handing off to the callee
    callee();
}

While these two pieces of code do the same thing (check the next token and consume if required), they do so very differently. In the first case, callee() is responsible for checking the upcoming token and moving it forward from the current token if required. In the second case, it only needs to check if the current token in the sequence needs to be consumed.

Explained simply, what I had learned before was a system based on acceptance, where the parser would accept the token only if it matched what it wanted. This new system is based on rejection, where the parser can reject the next token and return to the caller if it is not what it wanted.

The second system is simpler, because any parser only has to do one thing: move token forward. With the first system, it has to consider both the current token and the next token, which is why there is a separation between peek() and advance(). Keeping those two straight is an easy place to introduce bugs: swapping a peek() for an advance() (or vice versa) silently changes which token gets consumed.

Therefore, using this system, I implemented the general expression parser.

I also worked on implementing most of the statement parsers; however, they are pretty trivial and self-explanatory just from reading the code.

I then started work on refactoring and implementing the binary buffer functions. First, I fixed the bug I described in my project, where the decimal digits of rval.d were being truncated by the variant class's bitwise operators. Then, I updated the BinaryBuffer class to use std::size_t and std::byte, making it more idiomatic standard C++. I also created a serializer and deserializer for converting between a variant's data and raw bytes, and used them in the buffer_fill() and buffer_peek() functions to write and read data in the binary buffers.

Currently, the issue with the variant class lies in its accuracy with larger numbers, as it uses the double data type as its backing store for all numbers. The problem is that while integers of up to 53 bits can be represented exactly in a double (52 mantissa bits plus an implicit leading bit), 64-bit integers cannot be represented exactly, as there simply isn't enough space in a double's mantissa.

There are two solutions to this problem:

1. Introduce dedicated integer types alongside variant, which users switch to whenever they need exact 64-bit values.
2. Turn variant into a tagged union that stores integers natively, tracking which type is currently active.

With the first solution, the variant class keeps its speed; however, users have to remember to switch to an integer type whenever required, and variables holding integers lose their dynamic typing, which can be confusing.

With the second solution, dynamic typing is retained; however, variant loses quite a lot of its speed, as a type lookup needs to be done each time the stored value is accessed, to avoid the undefined behaviour of reading a non-active member of a union.

Neither of these solutions is very pretty; however, as ENIGMA is focused on running GML code as fast as possible, it makes sense to accept slightly less consistent code in exchange for speed. Thus, the first solution makes the most sense.

In the coming week, I’ll continue working on the binary buffer functions, cleaning them up wherever required, along with making more tests for them. Hopefully, this work can be concluded within a few days, after which I can implement the remaining buffer functions for encrypting and decrypting buffers, which will conclude Part 1 of my GSoC project.