GSoC 2023 Week 2 and 3 - The KnownBits Rabbit Hole
25 May 2023
This blog post is related to my GSoC 2023 Project.
These past two weeks, I worked on the following two issues:
- Moving a transform of an ADD to an OR from the DAGCombiner to the TargetLowering::SimplifyDemandedBits function.
- Reducing the number of entries inserted into the PruningList of the DAGCombiner when the worklist is initially created.
The reason that I did not write a post for the previous week, and have instead combined it into this one for week 3, is that there really was not anything to write about. While digging into the SimplifyDemandedBits function, I spent a lot of time trying to understand what really went wrong and why the output from that function did not match what the DAGCombiner computed.
So, why the difference? As it turns out, the implementation of computeKnownBits (which the DAGCombiner uses) does not match SimplifyDemandedBits. At every Depth value the two functions are supposed to produce exactly the same known bits, but they do not. The mismatch mostly seems to come from differences in how the two handle the following three opcodes:
- ISD::SHL
- ISD::SRL
- ISD::SRA
However, I unfortunately do not have enough experience with the code to know exactly what the differences are; this is just what I noted from comparing the code of the two functions. It seems that SimplifyDemandedBits simply bails out of computing the known bits for the above cases when getValidShiftAmountConstant returns nullptr. A simplified sketch of the kind of information that gets given up in such a bail-out is shown below.
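To make that concrete, here is a minimal, self-contained model of known-bits propagation through a left shift by a constant amount. This is not LLVM's implementation (the real code works on KnownBits/APInt inside computeKnownBits and SimplifyDemandedBits), and KnownBits64 and knownBitsForShl are illustrative names of my own; the point is only that when the shift amount is a known constant there is useful information to propagate, which a bail-out discards.

```cpp
// Toy model, NOT LLVM code: known-bits propagation through (X << Amt)
// when Amt is a constant.
#include <cstdint>
#include <cstdio>

struct KnownBits64 {
  std::uint64_t Zero = 0; // bits known to be 0
  std::uint64_t One = 0;  // bits known to be 1
};

// Known bits of (X << Amt) for a constant Amt: both masks shift left, and
// the low Amt bits of the result become known zero.
KnownBits64 knownBitsForShl(KnownBits64 X, unsigned Amt) {
  KnownBits64 R;
  R.Zero = (X.Zero << Amt) | ((1ULL << Amt) - 1);
  R.One = X.One << Amt;
  return R;
}

int main() {
  KnownBits64 X;
  X.Zero = 0xFFFFFFFF00000000ULL; // upper 32 bits of X are known zero
  KnownBits64 R = knownBitsForShl(X, 4);
  // An implementation that bails out would report nothing known here.
  std::printf("Zero=%016llx One=%016llx\n",
              (unsigned long long)R.Zero, (unsigned long long)R.One);
}
```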
I did attempt a fix for this by adding an else case for that condition and calling SimplifyDemandedBits within that block to compute the known bits, but that just caused more test cases to fail. What I also found strange were the cases where target-dependent instructions did not get the same known bits from both functions; for example, the PSHUFD node was causing a check comparing the outputs of computeKnownBits and SimplifyDemandedBits to fail.
This one really perplexed me. The issue may have been caused by one of the operands not having its known bits correctly calculated, but I did not really look into it. Anyway, there was not a lot to gain from this: I pushed the (incorrect) code to the compile-time tracker, and the gain was only 0.10%. So, I decided to give up on the issue.
I then worked on another issue, also in the DAGCombiner: my mentor noticed that inserting entries into the PruningList was using up too much time in the backend (sometimes up to 2% of total time), which presented a great opportunity for improvement. The pruning list is used to accumulate DAG nodes for cleanup after they have possibly been optimized by the various transforms. Basically, if a node in the pruning list has no uses, it is deleted from the DAG. This can happen either when the node had no uses to begin with or when all of the nodes that were using it get deleted as well.
When forming the initial worklist, every node is added to the pruning list as well. Intuitively, this does not make sense: no optimizations have run yet, so only the nodes that already have no uses can be deleted at that point. It therefore makes more sense to push only the use-free nodes to the pruning list while forming the initial worklist, and then proceed as normal afterwards, where every node is pushed to the pruning list (regardless of its number of uses) whenever it is added to the worklist.
Keeping the normal behaviour after that point still matters, because a node can be added to the pruning list when one of the nodes using it is deleted, while it still has other users; those remaining users may then get deleted as well before the node itself is processed. If nodes were only inserted into the pruning list at the moment their use count drops to zero, this situation would not be covered, which may be one of the reasons some test cases failed when I tried inserting nodes only at that point. The toy model below illustrates this interaction.
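This is a toy model, not LLVM code; Node, Combiner, deleteNode and prune are illustrative names of my own. The key point it demonstrates is that a node can be queued for pruning while it still has users, and is only deleted later, once cascading deletions have removed those users.

```cpp
// Toy model of the pruning-list behaviour described above (NOT LLVM code).
#include <cstdio>
#include <set>
#include <vector>

struct Node {
  std::vector<Node *> Operands; // the nodes this node uses
  int UseCount = 0;             // how many nodes use this node
};

struct Combiner {
  std::set<Node *> PruningList;
  std::set<Node *> Alive;

  void deleteNode(Node *N) {
    Alive.erase(N);
    PruningList.erase(N);
    for (Node *Op : N->Operands) {
      --Op->UseCount;
      // Op may still have other users at this point, but it is queued for
      // pruning anyway; by the time it is inspected they may be gone too.
      PruningList.insert(Op);
    }
  }

  void prune() {
    while (!PruningList.empty()) {
      Node *N = *PruningList.begin();
      PruningList.erase(PruningList.begin());
      if (Alive.count(N) && N->UseCount == 0)
        deleteNode(N); // may queue N's operands, cascading the cleanup
    }
  }
};

int main() {
  Node A, B, C;
  B.Operands = {&A};
  C.Operands = {&A};
  A.UseCount = 2;

  Combiner DC;
  DC.Alive = {&A, &B, &C};

  DC.deleteNode(&B); // A enters the pruning list while C still uses it
  DC.deleteNode(&C); // now A has no users left
  DC.prune();        // only here is A actually deleted

  std::printf("A still alive? %d\n", (int)DC.Alive.count(&A)); // prints 0
}
```

Note that in this tiny model, inserting a node only at the moment its use count reaches zero would also have worked; the real DAGCombiner has many more paths that add and delete nodes, which is presumably where my stricter variant fell over.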
So, the following patch was created: https://reviews.llvm.org/D151416.
It adds a parameter IsCandidateForPruning to the function AddToWorklist, which controls whether or not the node is added to the pruning list. To avoid changing the many existing call sites, the parameter is given a default value of true. This change nets a pretty good compile-time gain: 0.49% geomean. A rough sketch of the shape of the change is below.
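In terms of a toy model like the one above, the change looks roughly like this. The AddToWorklist name, the IsCandidateForPruning parameter, and its default of true come from the patch description; everything else here (seedWorklist, hasNoUses, the container choices) is an illustrative stand-in of mine rather than the actual DAGCombiner code.

```cpp
// Sketch of the interface change in toy form; see the review linked above
// for the real patch. Only AddToWorklist, IsCandidateForPruning and its
// default value reflect the actual change.
#include <cstdio>
#include <deque>
#include <set>
#include <vector>

struct Node {
  int UseCount = 0;
  bool hasNoUses() const { return UseCount == 0; }
};

struct Combiner {
  std::deque<Node *> Worklist;
  std::set<Node *> PruningList;

  // Defaulting to true keeps the many existing call sites unchanged; only
  // the initial-worklist loop below ever passes false.
  void AddToWorklist(Node *N, bool IsCandidateForPruning = true) {
    if (IsCandidateForPruning)
      PruningList.insert(N);
    Worklist.push_back(N);
  }

  // Seeding the initial worklist: every node is queued for combining, but
  // only nodes that already have no uses are pruning candidates, since
  // nothing else can be deleted before any transform has run.
  void seedWorklist(const std::vector<Node *> &AllNodes) {
    for (Node *N : AllNodes)
      AddToWorklist(N, /*IsCandidateForPruning=*/N->hasNoUses());
  }
};

int main() {
  Node Used{/*UseCount=*/2}, Dead{/*UseCount=*/0};
  std::vector<Node *> All = {&Used, &Dead};

  Combiner DC;
  DC.seedWorklist(All);
  // Both nodes are on the worklist, but only Dead is a pruning candidate.
  std::printf("worklist=%zu pruning=%zu\n", DC.Worklist.size(),
              DC.PruningList.size()); // worklist=2 pruning=1
}
```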
There is further work that can be done here, as the pruning list still uses a fair amount of time: 0.42% in the profile I have right now. However, the law of diminishing returns kicks in at this point, and there are many other places in the compiler that are better candidates for optimization. So, I will try to find new places to work on next.