GSoC Week 2: Parse Me Like One Of Your French Buffers
27 Jun 2022This blog post is related to my GSoC 2022 project.
As one can probably guess from the title, in the previous week I worked mainly on the EDL compiler, along with starting work on the binary buffer system.
First, me and @Josh
worked on setting up a BNF grammar for EDL. This grammar contains a mostly complete expression and statement grammar. Currently, type casting and declarators
are not finished as they are the most complex part of the grammar. I will need to spend more time reading the
C++ declarator grammar to fully understand how it works
and to implement it in the EDL spec.
Hopefully, once the BNF is done, it can be put in the wiki as a reference to anyone interested in making a parser for EDL, and also as a way to make sure the implementation is working properly. Speaking of which, I worked on implementing the general expression parser and the various statement parsers this week.
The EDL parser follows a standard RDP (Recursive Descent Parsing) strategy, with an interesting technique to handle consuming
tokens. The caller is responsible for setting the token to the next token in the sequence, and the callee can decide if it wants
to consume that token or not. This state is handled using the member variable token
, which stores the next token in the sequence.
If the callee wants to consume a token, it simply updates token
like token = lexer->ReadToken()
, which
sets token
to the next token in the sequence to be considered again.
This system is quite different from what I have learned mainly from reading Crafting Interpreters and some other literature, where the callee is responsible for both checking if it wants to consume the token and advancing from the current token. For instance, instead of something like
void callee() {
if (peek() == SOME_TYPE) {
advance();
}
}
void caller() {
callee();
}
the following would be done:
void callee() {
if (token == SOME_TYPE) {
token = advance();
}
}
void caller() {
token = advance();
callee();
}
While these two pieces of code do the same thing (check the next token and consume if required), they do so very differently.
In the first case, callee()
is responsible for checking the upcoming token and moving it forward from the current
token if required. In the second case, it only needs to check if the current token in the sequence needs to be consumed.
Explained simply, what I had learned before was a system based on acceptance, where the parser would accept the token only if it matched what it wanted. This new system is based on rejection, where the parser can reject the next token and return to the caller if it is not what it wanted.
The second system is simpler, because any parser only has to do one thing: move token
forward. With the first system, it has to consider both the current token and the next token, which is why there is a separation
between peek()
and advance()
. This can lead to issues as it’s possible to confuse peek()
and advance()
, and end up with a bug where they are interchanged.
Therefore, using this system, I implemented the following parsers:
-
TryParseExpression()
: This is the entry point for expression parsing, and is responsible for dispatching to the correct sub-expression parser. It does so in a Pratt-parser format, where it tries to consume a sub-expression while the next token is one which can be an operator with the correct precedence. -
TryParseBinaryExpression()
: This is the binary expression parser. It also works in a Pratt-parser format, to reduce the number of recursive calls thatTryParseExpression()
would have to do to consume the whole expression. -
TryParseUnaryExpression()
: This is the unary postfix expression parser. It works basically the same as the binary expression parser, as unary postfix expressions can be considered to be binary expressions without the RHS operand. It also follows a Pratt-parser format to reduce the number of recursive calls. -
TryParseTernaryExpression()
: This is the ternary expression parser. It parses the conditional operator (?:
), as that has been the only real ternary expression in use for the past ~50 years that I am aware of. Unfortunately, due to the Right-to-Left associativity of the conditional operator, it cannot be arranged in a Pratt-parser format as it requires recursion to store the current expression and get the right hand side expression “from the future”, i.e. by waiting for the rest of the ternary expressions which follow to be parsed. -
TryParseSubscriptExpression()
: This is the array subscript parser. It has to be handled separately from binary expressions as unlike them, the subscript operator ([]) has an extra ‘]’ at the end, which is not handled by the generic infix parser. -
TryParseFunctionCallExpression()
: This is the function call parser. While this expression can be handled as a “binary” expression due to the ‘(‘ being present between the function being called and its arguments, it requires special logic to read in the arguments passed to the function, and also to read the closing ‘)’.
I also worked on implementing most of the statement parsers, however they are pretty trivial and self-explanatory just from reading the code.
I then started work on refactoring and implementing the
binary buffer functions, firstly by fixing the bug I described in my project about decimal digits of rval.d
being truncated in the variant
class’s bitwise operators. Then, I updated the BinaryBuffer
class
to use std::size_t
and std::byte
to be more idiomatic with standard C++ code. I also created a
serializer and deserializer for converting to and from a variant’s data and bytes, using it in the buffer_fill()
and buffer_peek()
functions to read and write data from the binary buffers.
Currently, the issue with the variant
class lies in the accuracy with larger numbers, as it uses the double
data type as its
backing store for all numbers. The problem with that is that while numbers upto 32 bits can be represented exactly in a double
, 64-bit numbers
cannot be represented with exact magnitude as there simply isn’t enough space in a double
to store a 64-bit mantissa.
There are two solutions to this problem:
-
Do not change the
variant
class, simply tell the user to use the integer types directly. -
Change the
variant
class to store bothstd::int64_t
anddouble
, using thetype
member to pick the one in use for arithmetic operations.
With the first solution, the variant
class keeps its speed, however users have to remember to switch to an
integer type whenever required, and it means that user variables when integers lose their dynamic typing, which can be confusing.
With the second solution, the dynamic typing is retained, however variant
loses quite a lot of its speed, as
a type lookup needs to be done each time the value stored in the variant is accessed to avoid undefined behaviour accessing
a non-active member of a union.
Neither of these solutions are very pretty, however as ENIGMA is focused on running GML code as fast as possible, it makes sense to make the code a little less consistent and retain speed. Thus, the first solution makes the most sense.
In the coming week, I’ll continue working on the binary buffer functions, cleaning them up wherever required, along with making more tests for them. Hopefully, this work can be concluded within a few days, after which I can implement the remaining buffer functions for encrypting and decrypting buffers, which will conclude Part 1 of my GSoC project.