GSoC 2023 Week 17 and 18 - Same Old, Same Old
10 Sep 2023This blog post is related to my GSoC 2023 Project.
These two weeks, I worked on finding some more random optimization patches and
also fixed #88580
.
Regarding #88580
, the fix was quite simple: the problem was being caused by
the fact that SROA was creating too many partitions for the alloca. This meant
that for a single alloca, it was possible that tens of thousands of loads or
stores could be generated, in which case a compile time explosion would occur in
passes like DeadStoreElimination
and MemCpyOpt
.
The way that SROA works is that it takes as its input an alloca instruction, and then tries to split it into multiple (hopefully smaller) allocas. The terminology is that an alloca is split into “splices”, and these splices further consist of “partitions”. From what I understand, a partition is basically an indivisible unit of the alloca, for example a single (i.e. scalar) element.
So, to fix this, it was quite simple to just add a CLI option to limit the number of slices that alloca is allowed to generate (calculating partitions was harder and not really worth the effort, though it would be a more accurate measurement). This change landed here.
After this, I have gone back to trying to find more performance improvements. I have found some more, though progress has been really slow:
-
Manually unroll a loop involving
MCRegUnitRootIterator
: 0.15% speedupEDIT
I got this completely wrong. This code was optimizing perfectly fine, my changes completely broke the way the code worked. Earlier, the code looked like this:
void foo() { for (...) for ( MCRegUnitRootIterator RootReg(U, TRI); RootReg.isValid(); ++RootReg ) if (...) { Units.reset(U); break; } }
After my changes, the code looked like this:
void foo() { for (...) { MCRegUnitRootIterator RootReg(U, TRI); if (RootReg.isValid()) { if (...) { Units.reset(U); break; } ++RootReg; } if (RootReg.isValid()) { if (...) { Units.reset(U); break; } } } }
This is a very nasty bug. What I had done here is removed the inner
for
- loop, but not thebreak
statements within it. What this in turn meant was that every time theif
-statement condition would be true, it would break out of the outer loop instead of continuing to execute it.So, it would break from the loop on the first regunit that met the condition instead of checking all of them. This is what actually gave the performance “improvement”, instead of any real optimization.
This is something I discovered when I was going through profiles for the
ReachingDefAnalysis
pass and I noticedLiveRegUnits::stepBackward
was taking quite a lot of time in that. After digging through it a bit, I found out that the culprit wasLiveRegUnits::removeRegsNotPreserved
, a simple function that just has two loops within it. I initially assumed that the issue was being caused by the usage of aBitVector
, but interestingly after digging even further it turned out to be caused byMCRegUnitRootIterator
.MCRegUnitRootIterator
is a simple iterator type that contains two integers to represent the root registers of a regunit. Only two integers are used because a regunit is only allowed to have two root registers. The type tries to be fancy and “iterates” over the registers by shuffling the value from the second register to the first on increment and setting the second register to an invalid value. Basically,struct MCRegUnitRootIterator { int Reg0, Reg1; bool isValid() { return Reg0 != INVALID; } void operator++() { Reg0 = Reg1; Reg1 = INVALID; } }
What this means is that only two iterations will ever occur because on the first,
Reg0
s value will be used and on the second,Reg1
s value will be used (throughReg0
), following whichReg0
will be set toINVALID
and iteration will stop.However, the compiler is not smart enough to see through this completely and ends up not optimizing this data structure very well. By manually unrolling the loop, the compiler knows that the iterator will only advance once when valid and will then advance into an invalid state. So, it is able to optimize it much better.
-
Change a
TinyPtrVector
to aSmallVector
: 0.1% speedupThis one is quite simple, I noticed that the
ReachingDefAnalysis
pass was spending quite some time in appending to vectors, so I was playing around with changing the types and landed on this patch. Unfortunately, this gives a disproportionately large change in the memory usage: an increase of 0.7%, which means the patch is not really usable at all. -
Replace an O(n) lookup with a
DenseSet
: 0.1% speedup on LTOThis one is another case of linear-time lookup on a static table, much the same as in D157951. In this case, it occurs in the IR symbol table builder, when it checks for the presence of “preserved symbols”. These are stored in a table of
const char *
, which is then looked up throughllvm::is_contained
. The patch for this is the exact same asD157951
, just using aDenseSet
instead of aDenseMap
(because only existence needs to be checked). -
Inline
getNumBytesForUTF8
intoConvertUTF.h
: 0.1% regression on -O0 -gThis is a function I have been seeing for a while in
X86AsmPrinter
using up a lot of time, where it is just a simple one-liner that looks up a table and adds one to the result. Because it is present in a.cpp
file, it cannot be inlined, and ends up wasting a lot of time oncall
/ret
instructions (in non-LTO builds).I noticed quite a huge uplift when profiling with
llc
, specifically on compiling bitcode files at-O0
, however this was strangely not reproduced on the compile time tracker. Given that this is used inAsmPrinter
, I would expect it to affect any build configuration, so some deeper inspection is required here.This patch is basically the same as D152781.
The InferAlignment
patches should land soon, only the review for the last
patch is left now. In the meantime, I will continue trying to find perf patches
like these to hopefully get a cumulative speedup of 0.6% on -O3
, which should
(again, hopefully) get my project to a 3% overall speedup. I am also
concurrently working on my final report at this point, and it has been very fun
to go all-out designer mode on what is basically a word document :)