GSoC 2023 Week 13 and 14 - Small Experiments
14 Aug 2023This blog post is related to my GSoC 2023 Project.
These two weeks, I worked on some random experiments, of which 3 had useful results. These were mostly done on the
branches perf/instcombine4
and perf/instcombine5
. While the commits on
both branches are functionally the same, the perf/instcombine5
branch (naturally) contains the better / more updated
code which is far more likely to be made into patches.
On the perf/instcombine4
branch, I made the following commits:
-
Replacing a slow O(n) lookup with a
DenseMap
: Within the functionTargetLibraryInfo::getLibFunc
, there is a lookup to the static, compile-time constant arrayStandardNames
to find the index of the queried library function. These can be functions likegetc
,putc
,printf
etc. The problem with the old approach was that it was calling the functionstd::lower_bound
to find the element, which would be much slower than just using a map lookup.So, I updated the code to use a
DenseMap
and that gave a decent 0.24% speedup. -
Remove calls to
computeKnownBits
for non-intrinsicCallInst
instructions: This was a minor thing, a continuation of the work done on moving handling forLoadInst
intoisKnownNonZeroFromOperator
to directly check the presence of range metadata. Similarly, this commit moves the handling for “getReturnedArgOperand
” fromcomputeKnownBits
intoisKnownNonZeroFromOperator
, which gives a slight 0.06% speedup. -
Reduce calls to
computeKnownBits
invisitAdd
forInstCombine
: This commit aimed to reduce thecomputeKnownBits
calls incomputeConstantRangeIncludingKnownBits
by passing theKnownBits
information computed inhaveNoCommonBitsSet
through towillNotOverflowSignedAdd
. I implemented this by simply adding overloads for the required functions that takeKnownBits
parameters wherever required.This gave me a better speedup than I expected, 0.15%.
On the perf/instcombine5
branch, I made a few changes to the last commit on feedback from my mentor, that being the
creation of a new class which would lazily compute the KnownBits
information instead of piping it through the various
functions so that no new overloads would be needed. This class ended up taking on the name CachedBitsValue
, which is
still a work in progress. It is a fairly simple implementation, which stores a Value *
and the
KnownBits
information associated with it. This information is computed as-required by calling getKnownBits
, and the
computed value is cached within the class.
Unfortunately, to allow implicit creation from a Value *
, any parameter that accepted either a Value *
or a
const Value *
needs to be updated to use a const CachedBitsConstValue &
or a const CachedBitsNonConstValue &
, to
allow implicit creation of a temporary. What this means is that both the KnownBits
member and the boolean member that
signifies its presence need to be made mutable
so that they can be modified from const objects. This is quite an ugly
solution, but it comes from the fact that C++ doesn’t allow you to implicitly create an object that binds to a non-const
l-value reference.
With this in place however, the code for passing through KnownBits
information becomes quite clean. This is shown by
the first commit below, where both of these commits are on the perf/instcombine5
branch:
-
Reduce
computeKnownBits
calls: This is a rebase of the previous commit, with the overloads replaced by usage ofCachedBitsValue
and some extra work done toisKnownNonZero
. This gives a slower speedup of 0.1%. This is most likely because of theisKnownNonZero
code causing a regression, however I will have to investigate further to know for sure. -
Reduce
getLibFunc
calls: This commit aimed to reduce duplicate calls togetLibFunc
that I found withinFortifiedLibCallSimplifier
. This builds upon the first commit from the previous branch, and I think further work can be done here to reduce the number of calls even more. However, with the optimization introduced by usingDenseMap
, this doesn’t really do much and gives a negligible difference.
I also tried reducing the number of computeKnownBits
calls in simplifyICmpInst
. I noticed that visitICmpInst
was
calling foldICmpUsingKnownBits
, which computes known bits information. Before that however, simplifyICmpInst
calls
simplifyICmpWithZero
which ends up calling computeKnownBits
a few times, a redundant call which can be handled by
reusing the information computed before in foldICmpUsingKnownBits
. This is incidentally where the change to
simplifyICmpWithZero
comes from in the above commit using CachedBitsValue
, however it unfortunately ended
up not giving a measurable improvement at all. It also led to three test regressions, which led to me abandoning the
patch.