GSoC 2023 Week 13 and 14 - Small Experiments
14 Aug 2023This blog post is related to my GSoC 2023 Project.
These two weeks, I worked on some random experiments, of which 3 had useful results. These were mostly done on the
branches perf/instcombine4 and perf/instcombine5. While the commits on
both branches are functionally the same, the perf/instcombine5 branch (naturally) contains the better / more updated
code which is far more likely to be made into patches.
On the perf/instcombine4 branch, I made the following commits:
-
Replacing a slow O(n) lookup with a
DenseMap: Within the functionTargetLibraryInfo::getLibFunc, there is a lookup to the static, compile-time constant arrayStandardNamesto find the index of the queried library function. These can be functions likegetc,putc,printfetc. The problem with the old approach was that it was calling the functionstd::lower_boundto find the element, which would be much slower than just using a map lookup.So, I updated the code to use a
DenseMapand that gave a decent 0.24% speedup. -
Remove calls to
computeKnownBitsfor non-intrinsicCallInstinstructions: This was a minor thing, a continuation of the work done on moving handling forLoadInstintoisKnownNonZeroFromOperatorto directly check the presence of range metadata. Similarly, this commit moves the handling for “getReturnedArgOperand” fromcomputeKnownBitsintoisKnownNonZeroFromOperator, which gives a slight 0.06% speedup. -
Reduce calls to
computeKnownBitsinvisitAddforInstCombine: This commit aimed to reduce thecomputeKnownBitscalls incomputeConstantRangeIncludingKnownBitsby passing theKnownBitsinformation computed inhaveNoCommonBitsSetthrough towillNotOverflowSignedAdd. I implemented this by simply adding overloads for the required functions that takeKnownBitsparameters wherever required.This gave me a better speedup than I expected, 0.15%.
On the perf/instcombine5 branch, I made a few changes to the last commit on feedback from my mentor, that being the
creation of a new class which would lazily compute the KnownBits information instead of piping it through the various
functions so that no new overloads would be needed. This class ended up taking on the name CachedBitsValue, which is
still a work in progress. It is a fairly simple implementation, which stores a Value * and the
KnownBits information associated with it. This information is computed as-required by calling getKnownBits, and the
computed value is cached within the class.
Unfortunately, to allow implicit creation from a Value *, any parameter that accepted either a Value * or a
const Value * needs to be updated to use a const CachedBitsConstValue & or a const CachedBitsNonConstValue &, to
allow implicit creation of a temporary. What this means is that both the KnownBits member and the boolean member that
signifies its presence need to be made mutable so that they can be modified from const objects. This is quite an ugly
solution, but it comes from the fact that C++ doesn’t allow you to implicitly create an object that binds to a non-const
l-value reference.
With this in place however, the code for passing through KnownBits information becomes quite clean. This is shown by
the first commit below, where both of these commits are on the perf/instcombine5 branch:
-
Reduce
computeKnownBitscalls: This is a rebase of the previous commit, with the overloads replaced by usage ofCachedBitsValueand some extra work done toisKnownNonZero. This gives a slower speedup of 0.1%. This is most likely because of theisKnownNonZerocode causing a regression, however I will have to investigate further to know for sure. -
Reduce
getLibFunccalls: This commit aimed to reduce duplicate calls togetLibFuncthat I found withinFortifiedLibCallSimplifier. This builds upon the first commit from the previous branch, and I think further work can be done here to reduce the number of calls even more. However, with the optimization introduced by usingDenseMap, this doesn’t really do much and gives a negligible difference.
I also tried reducing the number of computeKnownBits calls in simplifyICmpInst. I noticed that visitICmpInst was
calling foldICmpUsingKnownBits, which computes known bits information. Before that however, simplifyICmpInst calls
simplifyICmpWithZero which ends up calling computeKnownBits a few times, a redundant call which can be handled by
reusing the information computed before in foldICmpUsingKnownBits. This is incidentally where the change to
simplifyICmpWithZero comes from in the above commit using CachedBitsValue, however it unfortunately ended
up not giving a measurable improvement at all. It also led to three test regressions, which led to me abandoning the
patch.