GSoC 2023 Week 5 - SetVector Speedup

This blog post is related to my GSoC 2023 Project.

This week, I was able to get some pretty good results: 0.4% geomean, just by changing the type of PruningList. This also led to finding major performance improvements in SetVector.

While looking through the DAGCombiner code, I noticed one thing: the pruning list stores pointers, but it doesn’t use the standard pointer container: SmallPtrSet. At the time, I didn’t know that the insertion order into pruning list matters as much as the worklist, this is because operands that do have uses of nodes which get pruned get put into the worklist to be processed.

Thus, if the iteration order over the pruning list is not well-defined, the order of combinations done by the DAGCombiner also become undefined, which is obviously not good. This is the issue that is caused by using SmallPtrSet. I still think the solution I came up with is quite interesting, so I’ll explain it a little.

When I switched to using only one SmallPtrSet, the problem arose that entries may be added to the pruning list while pruning is being done. The solution that I had come up with involved going over the whole pruning list without deleting anything, then clearing it at the end. However, as nodes could be added to it while iterating, this would invalidate the iterators and cause a crash when iterating. So, I had to somehow prevent the iterators from being invalidated while I iterated over them.

The solution that I came up with for this was: create two pruning lists, one which stores the nodes which are currently being pruned, and one which stores the nodes which get added by calls to recursivelyDeleteUnusedNodes. Then, a pointer is used to decided which list is being modified currently by calls to RemoveFromWorklist. I like to compare this system to a double buffer on a GPU, where one buffer stores the frame currently being shown (front buffer, i.e. the pruning list which is currently being cleared), and one buffer where the actual rendering work is done (back buffer, i.e. the pruning list where nodes are being added).

Then, once the back buffer is ready to be shown, the pointers are flipped and the back buffer becomes the front buffer and vice versa. I did the same thing with the pointer, once one list was empty the other was set to be the one which was to be emptied, and it kept flip-flopping this way until both lists were empty. This gave a pretty damn good result: 0.21% geomean.

Of course, my mentor came up with a much better solution: why not just not modify the pruning list while it is being cleared? This made a lot of sense, the only nodes which get added to the pruning list while it is being cleared are those which have uses, so they won’t be pruned anyways. So, by stopping that from happening, the iterators are not invalidated and the whole flip flopping system isn’t even needed at all. This gave a further performance boost: 0.08% geomean.

However, there was still the problem with using SmallPtrSet: it has an undefined iteration order. This means that if issues occur in some build where combines are not done in the right order, there may be a chance that it is non-reproducible, which is really not fun. So, my mentor suggested changing the type of the pruning list to SmallVector, which would have a defined iteration order. To stop the iterators from getting invalidated, my mentor advised me to just get rid of the code which deleted from the pruning list: this did work pretty well, and also gave a really good speed boost: 0.42% geomean.

This was a great result, but it came with one problem: because deleted nodes (which could be caused by other combines deleting unneeded/temporary nodes) could be reused by SelectionDAG::getNode, what could happen is that while iterating over the vector, a DELETED_NODE added to the pruning list but not deleted could be reused to create a new node, and if this new node had no uses, it would be automatically deleted.

The fix for this is simple: just do a linear scan over the vector to remove deleted nodes from it when it is not being cleared. This involves adding one boolean to the DAGCombiner class to track when the pruning list is being cleared. This gives a very slight performance regression: 0.02% geomean.

However, this solution requires doing a linear scan over the vector, which may lead to an issue with compile time explosion when the vector gets unexpectedly large. While this didn’t happen for us, like with issue #33910 it is possible that a minor edge case can blow everything up. So, my mentor had an even better idea: just make SmallSetVector faster, as it does a set lookup instead of a linear scan.

My mentor suggested that SmallSetVector can be made much better if it actually worked like the other Small- containers, where they do a linear scan for small sizes and a set lookup for bigger sizes. Implementing this was not that hard to be honest, the problem came from the compile times: because this header is included pretty much everywhere, every little change required recompiling EVERYTHING. This was slightly irritating, to say the least. Another major hurdle (the only one to be honest) was that there were some types which did not implement operator== for their types, only getHashValue for the DenseSet, so they could not be used in the small representation as this made the linear scan break.

So, to fix this, I wrote a TMP monstrosity to detect if a type had operator== implemented for it, and used it to detect if a type could be used for the small representation. This change gave a really really good performance boost: 0.36% geomean.

However, it involved a lot of code reimplementation between SmallSetVector and SetVector. To clean this up, my mentor suggested sending the parameter N to SetVector itself, which would then allow reusing all the code in the base class. This gave a really clean diff: its just a few if constexpr blocks on top of the already existing code. This also gave a much better solution to deciding if a type could be used in the small representation or not: just check if the small size N is not equal to 0, where 0 is the default small size which means the small representation is not used at all.

This gave an even greater performance boost: 0.42% geomean.

I still feel that it is better to decide the value for N ourselves if it is not used, however this works as well. Next, my mentor found uses of SetVector where SmallSetVector would be better now that the small representation of the set type isn’t used, so I went and updated some places that used SetVector to use SmallSetVector. This gave a good performance boost on -O0: 0.18% geomean.

This was much more fruitful than I could’ve asked for, and I’m really happy I was able to find this issue with the help of my mentor because it was getting pretty demoralizing not getting improvements for a week straight haha.