GSoC 2023 Week 2 and 3 - The KnownBits Rabbit Hole
25 May 2023
This blog post is related to my GSoC 2023 Project.
These past two weeks, I worked on the following two issues:
- Moving a transform of an ADD to an OR from the DAGCombiner to the TargetLowering::SimplifyDemandedBits function.
- Reducing the number of entries inserted into the PruningList of the DAGCombiner when the worklist is initially created.
The reason that I did not write a post for the previous week, and have instead combined it into this one for week 3, is that there really was not anything to write about. While digging into the SimplifyDemandedBits function, I spent a lot of time trying to understand what really went wrong and why the output from that function did not match what the DAGCombiner computed.
So, why the difference? As it turns out, the implementation of computeKnownBits (which the DAGCombiner uses) does not match SimplifyDemandedBits. At every Depth value the two functions are supposed to produce exactly the same known bits, but they do not. The mismatch mostly seems to come from differences in how the two handle the following three opcodes:
- ISD::SHL
- ISD::SRL
- ISD::SRA
However, I unfortunately do not have enough experience with the code to know exactly what the differences are; this is just what I noted from comparing the code of the two functions. It seems that SimplifyDemandedBits simply bails out of computing the known bits for the above cases when getValidShiftAmountConstant returns nullptr. A simplified sketch of the kind of information that gets given up in such a bail-out is shown below.
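To make that concrete, here is a minimal, self-contained model of known-bits propagation through a left shift by a constant amount. This is not LLVM's implementation (the real code works on KnownBits/APInt inside computeKnownBits and SimplifyDemandedBits), and KnownBits64 and knownBitsForShl are illustrative names of my own; the point is only that when the shift amount is a known constant there is useful information to propagate, which a bail-out discards.

```cpp
// Toy model, NOT LLVM code: known-bits propagation through (X << Amt)
// when Amt is a constant.
#include <cstdint>
#include <cstdio>

struct KnownBits64 {
  std::uint64_t Zero = 0; // bits known to be 0
  std::uint64_t One = 0;  // bits known to be 1
};

// Known bits of (X << Amt) for a constant Amt: both masks shift left, and
// the low Amt bits of the result become known zero.
KnownBits64 knownBitsForShl(KnownBits64 X, unsigned Amt) {
  KnownBits64 R;
  R.Zero = (X.Zero << Amt) | ((1ULL << Amt) - 1);
  R.One = X.One << Amt;
  return R;
}

int main() {
  KnownBits64 X;
  X.Zero = 0xFFFFFFFF00000000ULL; // upper 32 bits of X are known zero
  KnownBits64 R = knownBitsForShl(X, 4);
  // An implementation that bails out would report nothing known here.
  std::printf("Zero=%016llx One=%016llx\n",
              (unsigned long long)R.Zero, (unsigned long long)R.One);
}
```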
I did attempt a fix for this by adding an else case for that condition and calling SimplifyDemandedBits within that block to compute the known bits, but that just caused more test cases to fail. What I also found strange were the cases where target-dependent instructions did not get the same known bits from both functions; for example, the PSHUFD node was causing a check comparing the outputs of computeKnownBits and SimplifyDemandedBits to fail.
This one really perplexed me. The issue may have been caused by one of the operands not having its known bits correctly calculated, but I did not really look into it. Anyway, there was not a lot to gain from this: I pushed the (incorrect) code to the compile-time tracker, and the gain was only 0.10%. So, I decided to give up on the issue.
I then worked on another issue, also in the DAGCombiner: my mentor noticed that inserting entries into the PruningList was using up too much time in the backend (sometimes up to 2% of total time), which presented a great opportunity for improvement. The pruning list is used to accumulate DAG nodes for cleanup after they have possibly been optimized by the various transforms. Basically, if a node in the pruning list has no uses, it is deleted from the DAG. This can happen either when the node had no uses to begin with or when all of the nodes that were using it get deleted as well.
When forming the initial worklist, every node is added to the pruning list as well. Intuitively, this does not make sense: no optimizations have run yet, so only the nodes that already have no uses can be deleted at that point. It therefore makes more sense to push only the use-free nodes to the pruning list while forming the initial worklist, and then proceed as normal afterwards, where every node is pushed to the pruning list (regardless of its number of uses) whenever it is added to the worklist.
Keeping the normal behaviour after that point still matters, because a node can be added to the pruning list when one of the nodes using it is deleted, while it still has other users; those remaining users may then get deleted as well before the node itself is processed. If nodes were only inserted into the pruning list at the moment their use count drops to zero, this situation would not be covered, which may be one of the reasons some test cases failed when I tried inserting nodes only at that point. The toy model below illustrates this interaction.
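This is a toy model, not LLVM code; Node, Combiner, deleteNode and prune are illustrative names of my own. The key point it demonstrates is that a node can be queued for pruning while it still has users, and is only deleted later, once cascading deletions have removed those users.

```cpp
// Toy model of the pruning-list behaviour described above (NOT LLVM code).
#include <cstdio>
#include <set>
#include <vector>

struct Node {
  std::vector<Node *> Operands; // the nodes this node uses
  int UseCount = 0;             // how many nodes use this node
};

struct Combiner {
  std::set<Node *> PruningList;
  std::set<Node *> Alive;

  void deleteNode(Node *N) {
    Alive.erase(N);
    PruningList.erase(N);
    for (Node *Op : N->Operands) {
      --Op->UseCount;
      // Op may still have other users at this point, but it is queued for
      // pruning anyway; by the time it is inspected they may be gone too.
      PruningList.insert(Op);
    }
  }

  void prune() {
    while (!PruningList.empty()) {
      Node *N = *PruningList.begin();
      PruningList.erase(PruningList.begin());
      if (Alive.count(N) && N->UseCount == 0)
        deleteNode(N); // may queue N's operands, cascading the cleanup
    }
  }
};

int main() {
  Node A, B, C;
  B.Operands = {&A};
  C.Operands = {&A};
  A.UseCount = 2;

  Combiner DC;
  DC.Alive = {&A, &B, &C};

  DC.deleteNode(&B); // A enters the pruning list while C still uses it
  DC.deleteNode(&C); // now A has no users left
  DC.prune();        // only here is A actually deleted

  std::printf("A still alive? %d\n", (int)DC.Alive.count(&A)); // prints 0
}
```

Note that in this tiny model, inserting a node only at the moment its use count reaches zero would also have worked; the real DAGCombiner has many more paths that add and delete nodes, which is presumably where my stricter variant fell over.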
So, the following patch was created: https://reviews.llvm.org/D151416.
It adds a parameter IsCandidateForPruning to the function AddToWorklist, which controls whether or not the node is added to the pruning list. To avoid changing the many existing call sites, the parameter is given a default value of true. This change nets a pretty good compile-time gain: 0.49% geomean. A rough sketch of the shape of the change is below.
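In terms of a toy model like the one above, the change looks roughly like this. The AddToWorklist name, the IsCandidateForPruning parameter, and its default of true come from the patch description; everything else here (seedWorklist, hasNoUses, the container choices) is an illustrative stand-in of mine rather than the actual DAGCombiner code.

```cpp
// Sketch of the interface change in toy form; see the review linked above
// for the real patch. Only AddToWorklist, IsCandidateForPruning and its
// default value reflect the actual change.
#include <cstdio>
#include <deque>
#include <set>
#include <vector>

struct Node {
  int UseCount = 0;
  bool hasNoUses() const { return UseCount == 0; }
};

struct Combiner {
  std::deque<Node *> Worklist;
  std::set<Node *> PruningList;

  // Defaulting to true keeps the many existing call sites unchanged; only
  // the initial-worklist loop below ever passes false.
  void AddToWorklist(Node *N, bool IsCandidateForPruning = true) {
    if (IsCandidateForPruning)
      PruningList.insert(N);
    Worklist.push_back(N);
  }

  // Seeding the initial worklist: every node is queued for combining, but
  // only nodes that already have no uses are pruning candidates, since
  // nothing else can be deleted before any transform has run.
  void seedWorklist(const std::vector<Node *> &AllNodes) {
    for (Node *N : AllNodes)
      AddToWorklist(N, /*IsCandidateForPruning=*/N->hasNoUses());
  }
};

int main() {
  Node Used{/*UseCount=*/2}, Dead{/*UseCount=*/0};
  std::vector<Node *> All = {&Used, &Dead};

  Combiner DC;
  DC.seedWorklist(All);
  // Both nodes are on the worklist, but only Dead is a pruning candidate.
  std::printf("worklist=%zu pruning=%zu\n", DC.Worklist.size(),
              DC.PruningList.size()); // worklist=2 pruning=1
}
```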
There is further work that can be done here, as the pruning list still uses a fair amount of time: 0.42% in the profile I have right now. However, the law of diminishing returns kicks in at this point, and there are many other places in the compiler that are better candidates for optimization. So, I will try to find new places to work on next.