GSoC 2023 Week 17 and 18 - Same Old, Same Old

10 Sep 2023

This blog post is related to my GSoC 2023 Project.

These two weeks, I worked on finding some more random optimization patches and also fixed #88580.

Regarding #88580, the fix was quite simple: the problem was being caused by the fact that SROA was creating too many partitions for the alloca. This meant that for a single alloca, it was possible that tens of thousands of loads or stores could be generated, in which case a compile time explosion would occur in passes like DeadStoreElimination and MemCpyOpt.

The way that SROA works is that it takes as its input an alloca instruction, and then tries to split it into multiple (hopefully smaller) allocas. The terminology is that an alloca is split into “splices”, and these splices further consist of “partitions”. From what I understand, a partition is basically an indivisible unit of the alloca, for example a single (i.e. scalar) element.

So, to fix this, it was quite simple to just add a CLI option to limit the number of slices that alloca is allowed to generate (calculating partitions was harder and not really worth the effort, though it would be a more accurate measurement). This change landed here.

After this, I have gone back to trying to find more performance improvements. I have found some more, though progress has been really slow:

Manually unroll a loop involving MCRegUnitRootIterator: 0.15% speedup
EDIT

I got this completely wrong. This code was optimizing perfectly fine, my changes completely broke the way the code worked. Earlier, the code looked like this:
```
void foo() {
  for (...)
    for (
       MCRegUnitRootIterator RootReg(U, TRI);
       RootReg.isValid();
       ++RootReg
    )
      if (...) {
        Units.reset(U);
        break;
      }
}
```
After my changes, the code looked like this:
```
void foo() {
  for (...) {
    MCRegUnitRootIterator RootReg(U, TRI);
    if (RootReg.isValid()) {
      if (...) {
        Units.reset(U);
        break;
      }
      ++RootReg;
    }
    if (RootReg.isValid()) {
      if (...) {
        Units.reset(U);
        break;
      }
    }
  }
}
```
This is a very nasty bug. What I had done here is removed the inner for- loop, but not the break statements within it. What this in turn meant was that every time the if-statement condition would be true, it would break out of the outer loop instead of continuing to execute it.

So, it would break from the loop on the first regunit that met the condition instead of checking all of them. This is what actually gave the performance “improvement”, instead of any real optimization.
This is something I discovered when I was going through profiles for the ReachingDefAnalysis pass and I noticed LiveRegUnits::stepBackward was taking quite a lot of time in that. After digging through it a bit, I found out that the culprit was LiveRegUnits::removeRegsNotPreserved, a simple function that just has two loops within it. I initially assumed that the issue was being caused by the usage of a BitVector, but interestingly after digging even further it turned out to be caused by MCRegUnitRootIterator.

MCRegUnitRootIterator is a simple iterator type that contains two integers to represent the root registers of a regunit. Only two integers are used because a regunit is only allowed to have two root registers. The type tries to be fancy and “iterates” over the registers by shuffling the value from the second register to the first on increment and setting the second register to an invalid value. Basically,
```
struct MCRegUnitRootIterator {
  int Reg0, Reg1;

  bool isValid()    { return Reg0 != INVALID; }
  void operator++() { Reg0 = Reg1; Reg1 = INVALID; }
}
```
What this means is that only two iterations will ever occur because on the first, Reg0s value will be used and on the second, Reg1s value will be used (through Reg0), following which Reg0 will be set to INVALID and iteration will stop.

However, the compiler is not smart enough to see through this completely and ends up not optimizing this data structure very well. By manually unrolling the loop, the compiler knows that the iterator will only advance once when valid and will then advance into an invalid state. So, it is able to optimize it much better.
Change a TinyPtrVector to a SmallVector: 0.1% speedup

This one is quite simple, I noticed that the ReachingDefAnalysis pass was spending quite some time in appending to vectors, so I was playing around with changing the types and landed on this patch. Unfortunately, this gives a disproportionately large change in the memory usage: an increase of 0.7%, which means the patch is not really usable at all.
Replace an O(n) lookup with a DenseSet: 0.1% speedup on LTO

This one is another case of linear-time lookup on a static table, much the same as in D157951. In this case, it occurs in the IR symbol table builder, when it checks for the presence of “preserved symbols”. These are stored in a table of const char *, which is then looked up through llvm::is_contained. The patch for this is the exact same as D157951, just using a DenseSet instead of a DenseMap (because only existence needs to be checked).
Inline getNumBytesForUTF8 into ConvertUTF.h: 0.1% regression on -O0 -g

This is a function I have been seeing for a while in X86AsmPrinter using up a lot of time, where it is just a simple one-liner that looks up a table and adds one to the result. Because it is present in a .cpp file, it cannot be inlined, and ends up wasting a lot of time on call/ret instructions (in non-LTO builds).

I noticed quite a huge uplift when profiling with llc, specifically on compiling bitcode files at -O0, however this was strangely not reproduced on the compile time tracker. Given that this is used in AsmPrinter, I would expect it to affect any build configuration, so some deeper inspection is required here.

This patch is basically the same as D152781.

The InferAlignment patches should land soon, only the review for the last patch is left now. In the meantime, I will continue trying to find perf patches like these to hopefully get a cumulative speedup of 0.6% on -O3, which should (again, hopefully) get my project to a 3% overall speedup. I am also concurrently working on my final report at this point, and it has been very fun to go all-out designer mode on what is basically a word document :)

dc03

GSoC 2023 Week 17 and 18 - Same Old, Same Old

Related Posts

GSoC 2023 Week 19 and 20 - That's All, Folks! 29 Sep 2023

GSoC 2023 Week 15 and 16 29 Aug 2023

GSoC 2023 Week 13 and 14 - Small Experiments 14 Aug 2023

GSoC 2023 Week 19 and 20 - That's All, Folks!
29 Sep 2023

GSoC 2023 Week 15 and 16
29 Aug 2023

GSoC 2023 Week 13 and 14 - Small Experiments
14 Aug 2023