GSoC Week 7: 'Temporary Initialization' more like 'Permanent Pain'

02 Aug 2022

This blog post is related to my GSoC 2022 project.

In the previous week, I worked on getting a new system for storing types into place and then using that system to implement temporary object initialization.

The class I created for this purpose is enigma::parsing::FullType, which stores all the data necessary to represent a C++-style declaration, including the type specifier, the declarator, and any flags attached to the declaration. This class is mostly identical to the previous jdi::full_type, except that the way declarators are stored has been redone.

jdi::full_type uses the jdi::ref_stack class, which I found lacking for my purposes involving const/volatile pointers and nested declarators (i.e. *(*x)). In the case of the latter, the way ref_stack works is that the components (i.e. the *s, &s etc.) making up the nested declarator (the ** in *(**x)[10]) are pushed onto the stack after the outer components are parsed. The design has been done this way so that dereferencing the stack involves just popping off the top component from the stack. Therefore, if we consider *(**x)[10], the stack will look like the following:

Location	Type	Meaning	Component
TOP ->	RT_POINTERTO	pointer to object	`*`
	RT_POINTERTO	pointer to object	`*`
	RT_ARRAYBOUND	array bound, size=10	`*`
BOTTOM ->	RT_POINTERTO	pointer to object	`*`

While this system makes processing and generating the declarator easier, it does not preserve the previous structure of the declarator so there is no easy way to convert a declarator to an expression, which comes into play later on. In the new system that I designed, declarators instead of being a stack like before are instead an array consisting of “nodes” i.e. components. These components are stored under one main type (DeclaratorNode), and consist of the following types:

PointerNode (representing * (pointer to object) and ::* (pointer to member))
ReferenceNode (representing & (l-value reference) and && (r-value or forwarding reference))
ArrayBoundNode (representing [x] (array bound size) where x is an arbitrary expression)
FunctionParameterNode (representing (int y, ...) (function parameters along with default values))
NestedNode (representing the *x in *(*x))

In this system, instead of storing nested declarators after the outer components, they are stored before them under the special NestedNode class which stores a declarator inside itself. This also means that a NestedNode can store a NestedNode within itself, which means that arbitrarily nested declarators can be represented this way. Therefore, the structure of the declarator is preserved. For example, the previous *(**x)[10] will look like this:

Index	Type	Meaning	Component
0	Kind::POINTER_TO	pointer to object	`*`
1	Kind::NESTED (Kind::POINTER_TO -> Kind::POINTER_TO)	nested node (containing: pointer to object, pointer to object)	`*`
2	Kind::ARRAY_BOUND	array bound, size=10	`*`

While this makes processing the declarator harder, it preserves the structure of it exactly so that reproducing it later on is easier. If the jdi::ref_stack semantics are required, there is a to_jdi_refstack() method defined so that conversion between the two is easier. Another reason this is required is to make storing default values of function parameters easier: the jdi::ref_stack class requires a jdi::AST to be stored for default values whereas EDL uses enigma::parsing::AST, so conversion functions would be required to convert the EDL AST into a JDI AST.

However, the main reason for making this class, as noted above, was converting declarators to expressions. This has to be done as the parser I am writing tries to parse a declarator first, before bailing out when encountering an expression. For example, when parsing int(*(*x + 4)), upto the + operator, the input is treated as a declarator consisting of *(*x). On encountering the +, it bails and tries to convert the previously parsed declarator to an expresion (in this case *x) then pass it as an argument to TryParseExpression which takes care of the rest. It then has to somehow signal to the outer (caller) declarator parser that this inner declarator is actually an expression, which requires storing expressions within the DeclaratorNode class created to store the declarator components.

Note, though, that this only requires changing two inner classes and not DeclaratorNode as a whole, these two being NestedNode and FunctionParameterNode, as these are the only two places in the declarator where expressions can occur. This is because the places where expressions can occur are:

Before the declarator itself, like in int(++x)
Within a parenthesised nested declarator, like in int(*x + 4)
After a parenthesised nested declarator, like in int((*x) + 5)
After a declarator, like in int(foo::bar + 5)
Within function arguments, like in foo(int(*x + 5))

Therefore, the parser needs special cases to handle all of these possibilities. The first four cases are handled within TryParseNoPtrDeclarator and the last case is handled within TryParseParametersAndQualifiers. There is a little more code to handle what should happen if an expression is found when the inner declarator is not an expression however that is all the code needed to handle functional casts.

The major downfall of this system is the inflicted complexity: it makes the code path for the <noptr-declarator> rule more complex and harder to read because the declarator and expression parsing code is combined into one. Before implementing this system, @Josh had talked about doing things differently: parse everything as an expression, and filter out whatever declarators appear within them.

At the time, I did not really understand this system, as I did not really understand that declarators themselves are generally expressions and I wanted to stick by the rules to interpret the BNF grammar literally. With his system, it would require two passes: one to parse the expression and another to verify if the expression can be a declarator. This would be a lot easier with this system, as it would just involve checking if the expression is either unary-prefix (*, & and probably &&) or binary (with the previous operators, and [] and ()) with a type as the left hand operand, and recursively apply that to the nested expressions. It would also give functional casts by default, switching to declarators wherever needed. In hindsight, this system would give cleaner code, but would introduce a second pass into parsing, which would incur additional complexity elsewhere. It would also involve a fair bit of AST rewriting when transforming the expression into the declaration node.

So, would it be better than the current system I made? It is hard to say. The current system is still not too hard to read, as it is only the one rule that is affected. It is also still a simple, one-pass system which does not rely on any AST rewriting. This week, I will plumb this new code into TryParseOperand() and TryReadStatement(), and after a few changes the parser will basically be done and I will be ready to start Part II of my GSoC project. I will not be handling non-type template arguments for now as they involve converting the EDL AST to a JDI one and array bounds expressions which require implementing the AST::Node::eval() function for all the various nodes.

dc03

GSoC Week 7: 'Temporary Initialization' more like 'Permanent Pain'

Related Posts

GSoC 2023 Week 19 and 20 - That's All, Folks! 29 Sep 2023

GSoC 2023 Week 17 and 18 - Same Old, Same Old 10 Sep 2023

GSoC 2023 Week 15 and 16 29 Aug 2023

GSoC 2023 Week 19 and 20 - That's All, Folks!
29 Sep 2023

GSoC 2023 Week 17 and 18 - Same Old, Same Old
10 Sep 2023

GSoC 2023 Week 15 and 16
29 Aug 2023