GSoC 2023 Week 5 - SetVector Speedup
09 Jun 2023
This blog post is related to my GSoC 2023 Project.
This week, I was able to get some pretty good results: 0.4% geomean, just by changing the type of PruningList. This also led to finding major performance improvements in SetVector.
While looking through the DAGCombiner code, I noticed one thing: the pruning list stores pointers, but it doesn't use the standard pointer container, SmallPtrSet. At the time, I didn't know that the insertion order into the pruning list matters just as much as it does for the worklist: when a node gets pruned, its operands that still have uses are put back onto the worklist to be processed. Thus, if the iteration order over the pruning list is not well-defined, the order of combines done by the DAGCombiner also becomes undefined, which is obviously not good. This is the issue caused by using SmallPtrSet. I still think the solution I came up with is quite interesting, so I'll explain it a little.
When I switched to using only one SmallPtrSet, a problem arose: entries may be added to the pruning list while pruning is being done. The solution I had come up with involved going over the whole pruning list without deleting anything, then clearing it at the end. However, since nodes could be added to it mid-iteration, the iterators would be invalidated, causing a crash. So, I had to somehow prevent the iterators from being invalidated while I iterated over them.
The solution that I came up with was this: create two pruning lists, one which stores the nodes currently being pruned, and one which stores the nodes that get added by calls to recursivelyDeleteUnusedNodes. Then, a pointer is used to decide which list is currently being modified by calls to RemoveFromWorklist. I like to compare this system to a double buffer on a GPU, where one buffer stores the frame currently being shown (the front buffer, i.e. the pruning list which is currently being cleared), and one buffer where the actual rendering work is done (the back buffer, i.e. the pruning list where nodes are being added).
Then, once the back buffer is ready to be shown, the pointers are flipped: the back buffer becomes the front buffer and vice versa. I did the same thing with the pointer: once one list was empty, the other was set to be the one to be emptied, and it kept flip-flopping this way until both lists were empty. This gave a pretty damn good result: 0.21% geomean.
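The flip-flopping scheme looks roughly like this (a toy sketch, not the real DAGCombiner code: Node, addToPruningList, and prune are made-up stand-ins, and std::unordered_set stands in for SmallPtrSet):

```cpp
#include <cassert>
#include <unordered_set>
#include <utility>

// Toy stand-in for an SDNode; the real code deals in SDNode*.
struct Node { int id; bool hasUses; };

// Two pruning lists, double-buffered like a GPU's front/back buffers.
std::unordered_set<Node *> listA, listB;
std::unordered_set<Node *> *front = &listA; // currently being drained
std::unordered_set<Node *> *back = &listB;  // receives new entries

// Nodes discovered during pruning go into the back buffer,
// so iterators over the front buffer stay valid.
void addToPruningList(Node *N) { back->insert(N); }

// Drain both buffers, flipping front/back until both are empty.
int prune() {
  int numPruned = 0;
  while (!front->empty() || !back->empty()) {
    for (Node *N : *front)
      if (!N->hasUses)
        ++numPruned; // the real code would delete N here
    front->clear();
    std::swap(front, back); // flip: the back buffer becomes the front
  }
  return numPruned;
}
```

The flip in prune() is exactly the buffer swap in the GPU analogy: new entries never land in the container being iterated.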
Of course, my mentor came up with a much better solution: why not just not modify the pruning list while it is being cleared? This made a lot of sense: the only nodes which get added to the pruning list while it is being cleared are those which have uses, so they won't be pruned anyway. By stopping that from happening, the iterators are never invalidated and the whole flip-flopping system isn't needed at all. This gave a further performance boost: 0.08% geomean.
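The simpler guard can be sketched like so (again a toy model, not the actual patch; the names and the flag are stand-ins):

```cpp
#include <cassert>
#include <vector>

struct Node { int id; bool hasUses; };

std::vector<Node *> pruningList;
bool isPruning = false; // set while the list is being cleared

// While pruning is in progress, any node we would add still has uses,
// so it would not be pruned anyway -- just skip the insertion.
void addToPruningList(Node *N) {
  if (isPruning)
    return;
  pruningList.push_back(N);
}

int prune() {
  isPruning = true;
  int numPruned = 0;
  for (Node *N : pruningList)
    if (!N->hasUses)
      ++numPruned; // the real code would recursively delete N here
  pruningList.clear();
  isPruning = false;
  return numPruned;
}
```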
However, there was still the problem with using SmallPtrSet: it has an undefined iteration order. This means that if an issue occurs in some build where combines are not done in the right order, there is a chance it won't be reproducible, which is really not fun. So, my mentor suggested changing the type of the pruning list to SmallVector, which has a defined iteration order. To stop the iterators from getting invalidated, my mentor advised me to simply get rid of the code which deleted from the pruning list. This worked pretty well, and also gave a really good speed boost: 0.42% geomean.
This was a great result, but it came with one problem: deleted nodes (which other combines can produce when removing unneeded or temporary nodes) can be reused by SelectionDAG::getNode. So, while iterating over the vector, a DELETED_NODE that had been added to the pruning list but never removed from it could be reused to create a new node, and if this new node had no uses, it would be automatically deleted.
The fix for this is simple: just do a linear scan over the vector to remove deleted nodes from it when it is not being cleared. This involves adding one boolean to the DAGCombiner class to track when the pruning list is being cleared. This gave a very slight performance regression: 0.02% geomean.
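A sketch of what that fix looks like, with made-up names (the real code lives in DAGCombiner; std::vector stands in for SmallVector, and the erase-remove idiom does the linear scan):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Node { int id; };

struct PruningListSketch {
  std::vector<Node *> list;
  bool amClearingPruningList = false; // the extra boolean on the combiner

  void removeDeleted(Node *N) {
    // While the list is being cleared we must not touch it (that would
    // invalidate the iterators); otherwise, linearly scan the deleted
    // node out so a later getNode() reuse of its allocation can't put a
    // stale entry in the list.
    if (amClearingPruningList)
      return;
    list.erase(std::remove(list.begin(), list.end(), N), list.end());
  }
};
```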
However, this solution requires doing a linear scan over the vector, which may lead to compile-time explosion if the vector gets unexpectedly large. While this didn't happen for us, as with issue #33910, it is possible for a minor edge case to blow everything up. So, my mentor had an even better idea: just make SmallSetVector faster, since it does a set lookup instead of a linear scan.
My mentor suggested that SmallSetVector could be made much better if it actually worked like the other Small-prefixed containers, which do a linear scan for small sizes and a set lookup for bigger sizes.
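The idea can be sketched with the standard library (this is not the real llvm::SmallSetVector; HybridSetVector and its members are made up, and std::unordered_set stands in for the DenseSet):

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

// Keep a vector for iteration order, and only consult a hash set once
// the container outgrows the small size N. Below that, membership
// checks are a plain linear scan, which is faster for tiny sizes.
template <typename T, unsigned N>
class HybridSetVector {
  std::vector<T> vec;        // defines the iteration order
  std::unordered_set<T> set; // only populated once size() > N
public:
  bool contains(const T &V) const {
    if (vec.size() <= N)
      return std::find(vec.begin(), vec.end(), V) != vec.end();
    return set.count(V) != 0;
  }
  bool insert(const T &V) {
    if (contains(V))
      return false;
    vec.push_back(V);
    if (vec.size() == N + 1)                  // crossed the threshold:
      set.insert(vec.begin(), vec.end());     // build the set lazily
    else if (vec.size() > N + 1)
      set.insert(V);
    return true;
  }
  std::size_t size() const { return vec.size(); }
};
```

Building the set lazily at the threshold means purely small uses never pay for hashing at all.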
Implementing this was not that hard, to be honest; the problem came from the compile times: because this header is included pretty much everywhere, every little change required recompiling EVERYTHING. This was slightly irritating, to say the least. Another major hurdle (the only one, to be honest) was that some types did not implement operator==, only getHashValue for the DenseSet, so they could not be used in the small representation, as this broke the linear scan.
So, to fix this, I wrote a TMP monstrosity to detect whether a type had operator== implemented for it, and used it to decide if a type could use the small representation. This change gave a really, really good performance boost: 0.36% geomean.
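A sketch of the kind of detection trait involved, written with std::void_t (the names here are made up; this is not the actual patch):

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Detect whether operator== exists for a type, so types without it can
// be excluded from the small (linear-scan) representation.
template <typename T, typename = void>
struct has_equality : std::false_type {};

template <typename T>
struct has_equality<T, std::void_t<decltype(std::declval<const T &>() ==
                                            std::declval<const T &>())>>
    : std::true_type {};

struct Comparable {
  int x;
  bool operator==(const Comparable &O) const { return x == O.x; }
};
struct NotComparable { int x; }; // no operator==, only hashable

static_assert(has_equality<int>::value, "int is comparable");
static_assert(has_equality<Comparable>::value, "has operator==");
static_assert(!has_equality<NotComparable>::value, "no operator==");
```

The partial specialization is only viable when the == expression compiles, which is exactly the SFINAE trick the "monstrosity" relies on.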
However, it involved a lot of reimplemented code shared between SmallSetVector and SetVector. To clean this up, my mentor suggested passing the parameter N to SetVector itself, which would then allow reusing all the code in the base class. This gave a really clean diff: it's just a few if constexpr blocks on top of the already existing code. It also gave a much better way of deciding whether a type could use the small representation: just check if the small size N is not equal to 0, where 0 is the default small size, meaning the small representation is not used at all. This gave an even greater performance boost: 0.42% geomean.
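A sketch of how the N parameter and if constexpr can fold both behaviors into one class (made-up names again, not the real llvm::SetVector; N == 0 disables the small representation, mirroring the default):

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

template <typename T, unsigned N = 0>
class SetVectorSketch {
  std::vector<T> vec;
  std::unordered_set<T> set;

  bool contains(const T &V) const {
    if constexpr (N != 0)           // small representation enabled:
      if (vec.size() <= N)          // linear scan while we are small
        return std::find(vec.begin(), vec.end(), V) != vec.end();
    return set.count(V) != 0;       // otherwise, set lookup
  }

public:
  bool insert(const T &V) {
    if (contains(V))
      return false;
    vec.push_back(V);
    if constexpr (N != 0) {
      if (vec.size() == N + 1)               // just outgrew N:
        set.insert(vec.begin(), vec.end());  // build the set now
      else if (vec.size() > N + 1)
        set.insert(V);
    } else {
      set.insert(V); // N == 0: always maintain the set
    }
    return true;
  }
  std::size_t size() const { return vec.size(); }
};
```

With N == 0 the if constexpr branches compile away entirely, so the default behaves exactly like the old set-only SetVector.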
I still feel that it would be better to decide the value of N ourselves when it is not specified, but this works as well. Next, my mentor found uses of SetVector where SmallSetVector would be better now that the small representation of the set type isn't used, so I went and updated some places that used SetVector to use SmallSetVector. This gave a good performance boost at -O0: 0.18% geomean.
This was much more fruitful than I could’ve asked for, and I’m really happy I was able to find this issue with the help of my mentor because it was getting pretty demoralizing not getting improvements for a week straight haha.