GSoC 2023 Week 2 and 3 - The KnownBits Rabbit Hole

This blog post is related to my GSoC 2023 Project.

These past two weeks, I worked on the following two issues:

The reason I did not write a post for the previous week, and have instead folded it into this one for week 3, is that there really was not much to write about on its own. While digging into the SimplifyDemandedBits function, I spent a lot of time trying to understand what was actually going wrong and why the known bits it produced did not match those computed by the DAGCombiner.

So, why the difference? As it turns out, the implementation of computeKnownBits (which the DAGCombiner uses) does not match SimplifyDemandedBits. In theory, the two functions should produce exactly the same known bits at every Depth value, but in practice they do not. Most of the divergence seems to come from how the two handle the following three instructions:

Unfortunately, I do not have enough experience with the code to know exactly what the differences are; this is just what I noticed while comparing the two implementations. In SimplifyDemandedBits, the code appears to simply bail out of computing the known bits for the above cases whenever getValidShiftAmountConstant returns nullptr.
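To make that bail-out concrete, here is a small self-contained sketch of the idea with invented types (this is not LLVM's KnownBits or getValidShiftAmountConstant, just a toy model): when the shift amount of a left shift is a known constant, the known bits can be computed precisely; when it is not, the conservative version reports nothing as known at all.

```cpp
#include <cstdint>
#include <optional>

// Toy model: masks of bits known to be zero / known to be one,
// loosely mirroring llvm::KnownBits but over a plain 64-bit value.
struct Known {
  uint64_t Zero = 0; // bits known to be 0
  uint64_t One = 0;  // bits known to be 1
};

// Known bits of (Val << Amt). Amt is nullopt when the shift amount is not a
// known constant -- the analogue of getValidShiftAmountConstant returning
// nullptr. Assumes Amt < 64 when present.
Known knownBitsForShl(const Known &Val, std::optional<unsigned> Amt) {
  Known Result;
  if (!Amt)
    return Result; // bail out: report nothing as known

  Result.One = Val.One << *Amt;
  // The vacated low bits are guaranteed to be zero.
  Result.Zero = (Val.Zero << *Amt) | ((1ULL << *Amt) - 1);
  return Result;
}
```

A smarter implementation could still recover some information in the unknown-amount case, for example by intersecting the results over the possible shift amounts, and that is exactly the kind of asymmetry that makes two supposedly equivalent functions disagree.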

I did attempt a fix by adding an else branch for that condition and calling SimplifyDemandedBits from within it to compute the known bits, but that only caused more test cases to fail. What I also found strange were the cases where target-dependent instructions did not get the same known bits from both functions; for example, the x86 PSHUFD instruction caused a check comparing the outputs of computeKnownBits and SimplifyDemandedBits to fail.

This one really perplexed me. The issue may have been caused by one of the operands not having its known bits computed correctly, but I did not look into it any further. In any case, there was not much to gain from all this: I pushed the (incorrect) code to the compile-time tracker, and the improvement was only 0.10%, so I decided to give up on the issue.

I then worked on another issue, also in the DAGCombiner: my mentor noticed that insertion into the PruningList was eating up a noticeable amount of time in the backend (sometimes up to 2% of total compile time), which presented a good opportunity for improvement. The pruning list accumulates DAG nodes for cleanup after the various transforms may have optimized them: if a node on the pruning list has no uses, it is deleted from the DAG. That can happen either because the node never had any uses to begin with or because all of the nodes that used it were themselves deleted.
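The mechanics look roughly like this self-contained sketch (invented types, not the real DAGCombiner, which works on SDNode and has its own worklist machinery): deleting a use-free node drops the use counts of its operands, which can in turn make them prunable on a later sweep.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Toy model of a DAG node: the nodes it uses and how many nodes use it.
struct Node {
  std::string Name;
  std::vector<Node *> Operands;
  unsigned NumUses = 0;
};

// Delete every listed node that has no uses. Removing a node decrements the
// use counts of its operands, so keep sweeping until nothing changes.
void prune(std::vector<Node *> &PruningList, std::vector<Node *> &Dag) {
  bool Deleted = true;
  while (Deleted) {
    Deleted = false;
    for (Node *N : std::vector<Node *>(PruningList)) {
      if (N->NumUses != 0)
        continue;
      for (Node *Op : N->Operands)
        --Op->NumUses; // Op may become dead and get picked up next sweep
      auto Remove = [N](std::vector<Node *> &V) {
        V.erase(std::remove(V.begin(), V.end(), N), V.end());
      };
      Remove(PruningList);
      Remove(Dag);
      Deleted = true;
    }
  }
}
```

In this toy the cascade only works because the operands are already sitting on the pruning list, which is exactly what the next two paragraphs are about: how many nodes actually need to be put there.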

When the initial worklist is formed, every node is added to the pruning list as well. Intuitively this is wasteful: before any optimizations have run, only the nodes that already have no uses can possibly be deleted. It therefore makes more sense to push only the use-free nodes onto the pruning list when forming the initial worklist, and then proceed as before, where every node added to the worklist later also goes onto the pruning list regardless of its number of uses.

That "regardless of number of uses" part matters because a node can be added to the pruning list when one of its parent nodes is deleted, and then all of the remaining nodes that use it may also get deleted before it is processed. If nodes were only inserted into the pruning list when their use count drops to zero, this situation would not be accounted for, and that may be one of the reasons some test cases failed when I tried inserting nodes into the pruning list only once their use count reached zero.

So, the following patch was created: https://reviews.llvm.org/D151416. It adds an IsCandidateForPruning parameter to AddToWorklist, which controls whether or not the node is also added to the pruning list. To avoid changing the many existing call sites, the parameter defaults to true. The change nets a pretty good compile-time gain: 0.49% geomean.
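Based on that description, the shape of the change is roughly the following sketch (apart from AddToWorklist, IsCandidateForPruning, and PruningList, the names and container types here are placeholders of my own, not the actual DAGCombiner members):

```cpp
#include <set>

// Stand-in for llvm::SDNode; only the use count matters for this sketch.
struct Node {
  unsigned Uses = 0;
  bool use_empty() const { return Uses == 0; }
};

class Combiner {
  std::set<Node *> Worklist;
  std::set<Node *> PruningList;

public:
  // Defaulting to true keeps the behaviour of the many existing call sites:
  // every node they add is still registered as a pruning candidate.
  void AddToWorklist(Node *N, bool IsCandidateForPruning = true) {
    if (IsCandidateForPruning)
      PruningList.insert(N);
    Worklist.insert(N);
  }

  // Initial worklist: only nodes that are already dead are worth considering
  // for pruning, since nothing has been optimized away yet.
  void buildInitialWorklist(const std::set<Node *> &AllNodes) {
    for (Node *N : AllNodes)
      AddToWorklist(N, /*IsCandidateForPruning=*/N->use_empty());
  }
};
```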

There is more that could be done here, since the pruning list still accounts for a fair amount of time: 0.42% in the profile I have right now. However, the law of diminishing returns kicks in at this point, and there are plenty of other places in the compiler that are good candidates for optimization, so I will go looking for new ones next.