#aoco2025

[ #Compiler ] Day 3 of #AoCO2025 Study Notes: You can’t fool the optimiser

My notes focus on reproducing and verifying Matt Godbolt’s teaching within a local development environment.

This post specifically compares Tail Recursion vs. Standard Recursion.

Read more here: https://gapry.github.io/2026/01/31/Advent-of-Compiler-Optimisations-Study-Notes-03.html

[ #Compiler ] Day 2 of #AoCO2025 Study Notes

My notes focus on reproducing and verifying Matt Godbolt’s teaching within a local development environment.

Additionally, I have extended the discussion by implementing a manual PoC in assembly.

Read more here: https://gapry.github.io/2026/01/31/Advent-of-Compiler-Optimisations-Study-Notes-02.html

[#Compiler] Day 1 of #AoCO2025 Study Notes

While the original uses #CompilerExplorer, I wanted to replicate the analysis locally.

In this post, I used #gcc, #clang, llvm-objdump and #LLDB for the analysis.

Read more here: https://gapry.github.io/2026/01/01/Advent-of-Compiler-Optimisations-Study-Notes-01.html

Day 25 of Advent of Compiler Optimisations!

We've reached the end of this journey through compiler magic—from simple arithmetic tricks to mind-bending loop transformations. Thank you for following along! Whether you celebrate Christmas or just enjoy a good compiler optimisation, I hope you've discovered something that made you see your code differently.

Read more: https://xania.org/202512/25-thank-you
Watch: https://youtu.be/N1sRfYwzmso

Day 24 of Advent of Compiler Optimisations!

A simple loop that sums integers from 0 to n. GCC cleverly unrolls it to process two numbers at once. But clang? The loop completely disappears—replaced by a few multiplies and shifts that compute the answer directly. How does it recognise this pattern and transform O(n) code into O(1)?

Read more: https://xania.org/202512/24-cunning-clang
Watch: https://youtu.be/V9dy34slaxA
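
To make the O(n) to O(1) idea concrete, here is a minimal sketch of the kind of rewrite the post describes. The function names are made up for illustration, and the closed form shown is the textbook identity n(n+1)/2, not necessarily the exact instruction sequence clang emits.

```cpp
#include <cassert>
#include <cstdint>

// O(n): adds 0, 1, ..., n one at a time.
// The counter is 64-bit so that i <= n terminates even for the largest n.
std::uint64_t sum_loop(std::uint32_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i <= n; ++i) {
        total += i;
    }
    return total;
}

// O(1): the closed form n * (n + 1) / 2.
// With a 32-bit n the product always fits in 64 bits.
std::uint64_t sum_closed_form(std::uint32_t n) {
    const std::uint64_t wide = n;
    return wide * (wide + 1) / 2;
}
```

A quick check such as `assert(sum_loop(10) == sum_closed_form(10));` confirms both agree (the value is 55).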

Day 23 of Advent of Compiler Optimisations!

Switch statements compile to jump tables, right? Well... sometimes. But what happens when your five-case switch becomes pure arithmetic? Or when checking for whitespace turns into a single mysterious constant and some bit manipulation? Turns out compilers have a whole bag of tricks beyond the textbook answer.

Read more: https://xania.org/202512/23-switching-it-up
Watch: https://youtu.be/aSljdPafBAw
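
The "mysterious constant" idea can be sketched by hand. This assumes the whitespace set is space, tab, newline and carriage return (the post does not list the characters), and the function names are illustrative only. Each character value below 64 gets one bit in a 64-bit mask, so the whole switch collapses into a range check, a shift and a bit test.

```cpp
#include <cassert>
#include <cstdint>

// Textbook version: one case label per whitespace character (assumed set).
bool is_space_switch(char c) {
    switch (c) {
        case ' ':
        case '\t':
        case '\n':
        case '\r':
            return true;
        default:
            return false;
    }
}

// Bitmask version: bit k of the mask is set when character value k is whitespace.
bool is_space_bitmask(char c) {
    constexpr std::uint64_t mask =
        (1ULL << ' ') | (1ULL << '\t') | (1ULL << '\n') | (1ULL << '\r');
    const unsigned char u = static_cast<unsigned char>(c);
    // The range check keeps the shift amount below 64 (a larger shift is undefined).
    return u < 64 && ((mask >> u) & 1ULL) != 0;
}
```

Both functions give the same answer for every possible `char` value, for example `assert(is_space_bitmask('\t'));`.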

Day 22 of Advent of Compiler Optimisations!

Comparing a string_view against "ABCDEFG" should call memcmp, right? Watch what Clang actually generates — no function call at all, just a handful of inline instructions using some rather cunning tricks. How does it compare 7 bytes so efficiently when they don't fit in a single register?

Read more: https://xania.org/202512/22-memory-cunningness
Watch: https://youtu.be/kXmqwJoaapg
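
One well-known way to compare 7 bytes without a call is two overlapping 4-byte loads. The sketch below writes that trick out by hand. It is an assumption that this matches what the post shows, and `equals_abcdefg` is a made-up name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Compares s against "ABCDEFG" using two overlapping 32-bit loads:
// bytes 0..3 and bytes 3..6 together cover all 7 bytes.
bool equals_abcdefg(std::string_view s) {
    if (s.size() != 7) {
        return false;
    }
    std::uint32_t head, tail, want_head, want_tail;
    std::memcpy(&head, s.data(), 4);      // bytes 0..3
    std::memcpy(&tail, s.data() + 3, 4);  // bytes 3..6, overlapping byte 3
    std::memcpy(&want_head, "ABCD", 4);
    std::memcpy(&want_tail, "DEFG", 4);
    // XOR is zero only where the bits match; OR merges both checks into one test.
    return ((head ^ want_head) | (tail ^ want_tail)) == 0;
}
```

Because the expected values are loaded the same way as the input, the sketch does not depend on byte order. For example, `assert(equals_abcdefg("ABCDEFG"));` holds.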

Day 21 of Advent of Compiler Optimisations!

Summing an array of integers? The compiler vectorises it beautifully, processing 8 at a time with SIMD. Switch to floats and... the compiler refuses to vectorise, doing each add one by one. Same loop, same code structure — why does the compiler treat floats so differently?

Read more: https://xania.org/202512/21-vectorising-floats
Watch: https://youtu.be/lUTvi_96-D8
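
The usual reason is that floating-point addition is not associative, so reordering the adds can change the result. The sketch below imitates a 2-lane vectorised sum by hand. The function names and the input values are chosen for illustration, and the behaviour described assumes IEEE-754 single precision without fast-math style flags.

```cpp
#include <cassert>

// Strict left-to-right sum, as the source code is written.
float sum_in_order(const float* v, int n) {
    float total = 0.0f;
    for (int i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

// Imitates a 2-lane SIMD sum: even and odd elements accumulate
// separately and are combined at the end. n must be even.
float sum_two_lanes(const float* v, int n) {
    assert(n % 2 == 0);
    float lane0 = 0.0f;
    float lane1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    return lane0 + lane1;
}
```

With the input `{1e8f, 1.0f, -1e8f, 1.0f}` the in-order sum loses the first `1.0f` to rounding, while the two-lane sum keeps both, so the two functions return different values. An integer sum has no such problem, which is why the compiler is free to reorder it.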

Day 20 of Advent of Compiler Optimisations!

Loop over 65,536 integers doing comparisons — that's 65,536 iterations, right? Wrong! With the right flags, the compiler processes 8 integers per iteration using SIMD instructions. Same number of assembly instructions, 8× the throughput. What's the trick that makes this possible?

Read more: https://xania.org/202512/20-simd-city
Watch: https://youtu.be/d68x8TF7XJs

Day 19 of Advent of Compiler Optimisations!

Recursive functions need to call themselves over and over — that must mean unbounded stack growth, right? Wrong! When a function ends by calling another function (even itself), the compiler can replace the call with a simple jump. Recursion becomes iteration, no stack overhead at all. How does this transformation work?

Read more: https://xania.org/202512/19-tail-call-optimisation
Watch: https://youtu.be/J1vtP0QDLLU
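
As a rough illustration, here is a tail-recursive function next to the loop it can be turned into. The names are hypothetical. C++ does not guarantee this transformation, so without optimisation the recursive version still consumes one stack frame per call.

```cpp
#include <cassert>
#include <cstdint>

// Tail-recursive: the recursive call is the last thing the function does,
// so nothing in the current frame is needed after it returns.
std::uint64_t sum_to_tail(std::uint64_t n, std::uint64_t acc = 0) {
    if (n == 0) {
        return acc;
    }
    return sum_to_tail(n - 1, acc + n);
}

// The shape the optimiser can reduce it to: update the arguments in place
// and jump back to the top instead of calling.
std::uint64_t sum_to_loop(std::uint64_t n) {
    std::uint64_t acc = 0;
    while (n != 0) {
        acc += n;
        n -= 1;
    }
    return acc;
}
```

For small inputs both agree, for example `assert(sum_to_tail(10) == sum_to_loop(10));` (the value is 55).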

Day 18 of Advent of Compiler Optimisations!

You have a function with a fast path and a slow path. Inline it everywhere? Massive code bloat. Don't inline? You miss the fast path performance gains. It's an impossible choice—or is it? The compiler finds a way to get the performance benefits of inlining without paying the full code size cost. But how?

Read more: https://xania.org/202512/18-partial-inlining
Watch: https://youtu.be/STZb5K5sPDs

Day 17 of Advent of Compiler Optimisations!

A function that handles both upper and lower case conversion. Call it with upper=true and the compiler inlines it — but something remarkable happens. The inlined code doesn't just avoid the function call overhead. Half the function completely vanishes! How does copy-pasting code make it disappear?

Read more: https://xania.org/202512/17-inlining-the-ultimate-optimisation
Watch: https://youtu.be/JFHfFTvMPp0
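
A sketch of why inlining lets code vanish: once the call is inlined, `upper` is a known constant and the untaken branch is dead. The helper below is a hypothetical stand-in for the function in the post, assuming ASCII letters only.

```cpp
#include <cassert>

// Hypothetical general-purpose helper: converts in either direction.
inline char change_case(char c, bool upper) {
    if (upper) {
        return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
    }
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Call site with a constant argument.
char shout(char c) {
    return change_case(c, true);
}

// What remains after inlining and constant propagation, written by hand:
// the whole lower-case half of change_case is gone.
char shout_specialised(char c) {
    return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}
```

The two call-site functions behave identically, for example `assert(shout('q') == shout_specialised('q'));`.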

Day 16 of Advent of Compiler Optimisations!

Pass a function two separate arguments, or pack them in a struct — which is faster? The answer might surprise you: sometimes the struct version is MORE efficient! Eight char arguments as separate parameters spill to the stack, but pack them in a struct and they fit in a single register. How does the compiler pull this off?

Read more: https://xania.org/202512/16-calling-conventions
Watch: https://youtu.be/Yaw8AMoP4sI

Day 15 of Advent of Compiler Optimisations!

Two nearly identical loops: one accumulates ints into an int, the other accumulates ints into a long. You'd expect similar assembly—just different register sizes, right? Wrong! One loop writes to memory on every iteration, the other keeps everything in registers. Same algorithm, wildly different performance. What's going on?

Read more: https://xania.org/202512/15-aliasing-in-general
Watch: https://youtu.be/PPJtJzT2U04

Day 14 of Advent of Compiler Optimisations! 🎄

Yesterday we saw the compiler beautifully hoist strlen() out of our loop. Today? Add a single global counter and watch that optimisation vanish—strlen gets called EVERY iteration! But why would incrementing an unrelated variable break loop-invariant code motion? The answer involves a surprising rule about char* in the C++ standard.

Read more: https://xania.org/202512/14-licm-when-it-doesnt
Watch: https://youtu.be/OwFNblEEAXo

Day 13 of Advent of Compiler Optimisations!

You're calling a function inside a loop, but its result never changes between iterations. Does the compiler spot this and hoist it out? Turns out the answer depends on which compiler you use! Clang pulls off the optimisation beautifully, but gcc stumbles—even with explicit hints. What's going on?

Read more: https://xania.org/202512/13-licking-licm
Watch: https://youtu.be/dIwaqJG0WDo
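
The hoisting in question can be written out by hand. This is a minimal sketch using `strlen` in the loop condition, with made-up function names. The second version is what loop-invariant code motion aims to produce when the compiler can prove the string does not change inside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// As written: strlen is (notionally) re-evaluated before every iteration.
std::size_t count_spaces_naive(const char* text) {
    std::size_t spaces = 0;
    for (std::size_t i = 0; i < std::strlen(text); ++i) {
        if (text[i] == ' ') {
            ++spaces;
        }
    }
    return spaces;
}

// Hoisted by hand: the invariant length is computed once, before the loop.
std::size_t count_spaces_hoisted(const char* text) {
    const std::size_t length = std::strlen(text);
    std::size_t spaces = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (text[i] == ' ') {
            ++spaces;
        }
    }
    return spaces;
}
```

Both return the same count, for example `assert(count_spaces_naive("a b c") == count_spaces_hoisted("a b c"));`.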

Day 12 of Advent of Compiler Optimisations!

Your loop checks the same condition every iteration, even though it never changes. Seems wasteful, right? The compiler thinks so too—and its solution is something that sounds completely backwards. Making your code bigger to make it faster? What's the trick?

Read more: https://xania.org/202512/12-loop-unswitching
Watch: https://youtu.be/-VCrYshE7iQ
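
Loop unswitching, written by hand, looks roughly like this. The functions are illustrative rather than taken from the post. The invariant test moves outside the loop at the cost of duplicating the loop body.

```cpp
#include <cassert>
#include <vector>

// Original shape: the flag never changes, yet it is tested every iteration.
int accumulate_branchy(const std::vector<int>& values, bool squared) {
    int total = 0;
    for (int x : values) {
        if (squared) {
            total += x * x;
        } else {
            total += x;
        }
    }
    return total;
}

// Unswitched shape: test once, then run one of two specialised loops.
int accumulate_unswitched(const std::vector<int>& values, bool squared) {
    int total = 0;
    if (squared) {
        for (int x : values) {
            total += x * x;
        }
    } else {
        for (int x : values) {
            total += x;
        }
    }
    return total;
}
```

The code is larger, but each loop body is branch-free, which also makes it easier for later passes to vectorise.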

Day 11 of Advent of Compiler Optimisations!

A clever loop that counts set bits using the "clear bottom bit" trick: value &= value - 1. Works great, generates tight assembly. But change one compiler flag to target a slightly newer CPU and something extraordinary happens to your loop. The compiler spots a pattern you didn't even know was there. What replaces your careful bit manipulation?

Read more: https://xania.org/202512/11-pop-goes-the-weasel-er-count
Watch: https://youtu.be/Hu0vu1tpZnc
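
For reference, a minimal version of the loop being described. The 64-bit operand type and the function name are assumptions. The link's slug hints that the pattern the compiler recognises is a population count, which C++20 also exposes directly as `std::popcount`.

```cpp
#include <cassert>
#include <cstdint>

// Counts set bits: each iteration clears the lowest set bit,
// so the loop runs once per set bit rather than once per bit position.
unsigned count_set_bits(std::uint64_t value) {
    unsigned count = 0;
    while (value != 0) {
        value &= value - 1;  // clear the bottom set bit
        ++count;
    }
    return count;
}
```

For example, `assert(count_set_bits(0xFF) == 8);` holds.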

Day 10 of Advent of Compiler Optimisations!

A simple loop summing 8 values — but tell the compiler the count is fixed at compile time and watch what happens. The code transforms in surprising ways. What does "unrolling" actually look like in assembly, and when does the compiler decide it's worth it? Try changing the count and see how the strategy changes!

Read more: https://xania.org/202512/10-loop-unrolling
Watch: https://youtu.be/HvF3tF2efEA
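
A hand-written picture of full unrolling for a fixed count of 8. The function names are illustrative, and a real compiler may pick a different strategy, such as partial unrolling or SIMD.

```cpp
#include <cassert>
#include <cstddef>

// Rolled: one add plus a counter update and a branch per element.
int sum_rolled(const int* values, std::size_t count) {
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// Fully unrolled for a count known to be 8: no counter, no branch.
int sum_unrolled8(const int* values) {
    return values[0] + values[1] + values[2] + values[3] +
           values[4] + values[5] + values[6] + values[7];
}
```

With eight elements both give the same total, for example 36 for the values 1 through 8.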

Another fascinating article by Matt #Godbolt in his Advent of Compiler Optimisations series. This one is about induction variables and loops:

https://xania.org/202512/09-induction-variables

Here, Matt uses the llvm-mca tool to visualize the x86 Haswell CPU pipeline to show loop-carried dependencies.

If you want to see this for RISC-V (SiFive U74) for one of the given examples, try this link (llvm-mca -march=riscv64 -mcpu=sifive-u74 -timeline):

https://aoco.compiler-explorer.com/z/9srhEGhsG

#compilers #optimization #risc_v #aoco2025
