It did not come from nowhere. @xiyang set the tone, passing the torch to the current students. He epitomised technical excellence + intellectual generosity. Thank you, Xi.
@wingo Thanks Andy.
What's in the water? Just a wonderfully talented and intellectually generous bunch of students. In particular there's a great culture of generosity in the lab that @caizixian and other students have built. That culture encourages talented undergrads like Claire to join and flourish. I'm very grateful.
Lots of fun details in Huang's thesis: static vs dynamic prefetching (static is fine), computation of how much one could gain if cache-miss latency were eliminated, what the GC time would be if hardware prefetchers were disabled (20-80% slower; see attached figure); mutator time without prefetchers (sometimes it's better??!?); how to use "perf mem"; all good stuff!
I don't know what's in the water at ANU but they have been doing lots of great work at all levels recently
Claire Huang wrote an undergraduate honours thesis, supervised by @steveblackburn and @caizixian https://www.steveblackburn.org/pubs/theses/huang-2025.pdf
She uses sampling PEBS counters and Data Linear Address (DLA) records on Intel chips to attempt to understand the structure and attribution of load latencies in MMTk.
After identifying L1 misses in the trace loop as a significant overhead, she adds prefetching and reduces GC time by 10% or so across a range of benchmarks, and more on Zen4.
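Her change is in MMTk itself; purely for flavour, here is a minimal Rust sketch of what static software prefetching over a mark queue looks like. `ObjRef`, `PREFETCH_DISTANCE`, and `scan` are illustrative names under my own assumptions, not MMTk's API:

```rust
use std::collections::VecDeque;

/// Illustrative object reference: a raw address into the GC heap.
#[derive(Clone, Copy)]
struct ObjRef(*const u8);

/// How far ahead of the current object to prefetch (a tuning knob;
/// not the value the thesis uses).
const PREFETCH_DISTANCE: usize = 8;

/// A transitive-closure (trace) loop with static software prefetching:
/// while scanning the object at the head of the mark queue, issue a
/// prefetch for an entry a fixed distance ahead, so its header is
/// likely to be in L1 by the time it is popped.
fn trace(queue: &mut VecDeque<ObjRef>) {
    while let Some(obj) = queue.pop_front() {
        if let Some(ahead) = queue.get(PREFETCH_DISTANCE) {
            #[cfg(target_arch = "x86_64")]
            unsafe {
                use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
                _mm_prefetch::<_MM_HINT_T0>(ahead.0 as *const i8);
            }
        }
        scan(obj, queue);
    }
}

/// Placeholder: a real collector would mark `obj` and push each of
/// its unmarked children onto the queue.
fn scan(_obj: ObjRef, _queue: &mut VecDeque<ObjRef>) {}
```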
We did a deep dive on this ten years ago…
https://www.steveblackburn.org/pubs/papers/yieldpoint-ismm-2015.pdf
@mgaudet I'm glad I could help.
Following up on the start of this thread on microarchitectural performance debugging, you might find Claire Huang's undergraduate thesis interesting (co-supervised by @steveblackburn). She took a very detailed look at GC in terms of cache behaviors, and measured optimization headroom. https://www.steveblackburn.org/pubs/theses/huang-2025.pdf
She also previously led the work on eBPF tracing for GC (MPLR'23).
@samth @wingo @sayrer
I think you made that very same confusion in your response: "heaps from 100g up".
What do you mean?
If you mean to say that these algorithms are designed only for environments where there's 64X memory *headroom* (as in the example you gave), then fine, but we need to call that out explicitly.
If you mean to say that these algorithms are designed for *live heaps* of 100G, then fine, but you need to be clear about that and point out what these applications are. For example, are they pointer-rich? Or are they filled with vast arrays of non-pointer data? Who uses these workloads?
My take is that it is absolutely not okay to hide behind this ambiguous language.
If your algorithm needs 64X headroom, say that clearly.
If you've built for apps with 100GB of live data, tell us what they are and, getting back to the paper, contribute benchmarks.
@steveblackburn @wingo @sayrer Right. In the case of the presentation linked above by @samth, the minimum heap size required to run SPECjbb is less than 2GB (https://www.spec.org/jbb2015/docs/userguide.pdf), so benchmarking with a 128GB heap gives at least **64x** the minimum, which is extremely generous.
Wait up. Let's not confuse "large heaps" with "generous heaps". This is the number one fallacy I hear in these discussions.
When someone says they ran "in a 100GB heap", they're not saying anything about how generous that heap is, how much DRAM they're burning to get that number. Perhaps G1 could have run in 5GB with similar performance. Unless the min heap is stated, such comments are meaningless or downright misleading.
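To make the arithmetic explicit, the number that needs stating is the headroom factor, not the raw heap size (the SPECjbb figures below are from the reply quoted above):

$$\mathrm{headroom} = \frac{\text{heap size}}{\text{min heap size}}, \qquad \text{e.g. } \frac{128\,\mathrm{GB}}{2\,\mathrm{GB}} = 64\times$$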
Collectors trade time for space, and this family of collectors generally needs massive headroom to get okay performance. You can see that plain as day in our paper.
What people sometimes allude to is that because G1 does not copy concurrently and is strictly copying, it will have very costly pauses when the live heap is very large. This is where C4 et al are better. Though LXR also addresses this, without the massive space overhead they require.
I think you're asking why the industry made such inefficient collectors, given they're not fools.
In short, there was little incentive to do better. There are two high-level issues: 1. Methodology. 2. Lack of innovation.
The costs we expose simply were not being measured. Two broad failings: not understanding overhead (addressed by LBO), and not measuring user-experienced latency.
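(An aside on LBO for those following along; this is my shorthand, not the paper's notation. Since a true GC-free baseline can't be run, the lower-bound-overhead methodology takes the cheapest observed configuration as the baseline:

$$\mathrm{LBO}_{g,h} = \frac{c_{g,h}}{\min_{g',h'} c_{g',h'}} - 1$$

where $c_{g,h}$ is the measured cost, e.g. cycles, of collector $g$ at heap size $h$. Even the cheapest configuration still does some GC work, so the denominator overstates the GC-free cost and the computed overhead can only err low.)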
You can "solve" the first one by discounting resources (making out that they're free). Gil Tene has often pointed out that most cloud VMs are priced so that they come vastly over-provisioned with memory, so use it!! The fallacy here is that ultimately someone is paying for that memory and the power used to keep it on. This is why those collectors are not used by data-center-scale applications, where the cost of those overheads is measured in $10Ms. Second, most GC vendors focus on "GC pause time," which is a proxy metric that doesn't translate to user-experienced pauses, which is where the rubber hits the road.
With that backdrop, and those incentives, the lack of innovation is unsurprising. G1, C4, Shenandoah, and ZGC all share the same basic design, which is fundamentally limited by its complete dependence on tracing (slow reclamation, high drag) and on strict copying, which is brutally expensive when made concurrent. (See our LXR paper for a completely different approach.)
So it should be no surprise that companies like Google are so dependent on C++: those collector costs are untenable for many, if not most, data-center-scale applications.
@wingo @pervognsen You called? :-)
Always happy to engage with GC discussion
@shriramk Nice. I don't know of any, but we use this for the dozen or so PLDI organising committee roles, and it is gold. Highly recommend the pattern. Not sure why we don't do it with PCC.
@wingo we mirror the multithreaded allocator used by the mutator. So each thread has its own thread-local allocator. This is true for both bump allocators and freelist allocators.
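Not MMTk's actual code, but here's a minimal Rust sketch of the pattern under my own assumptions (names and the chunk size are illustrative): only chunk acquisition from the shared space is synchronised, the per-object fast path is a thread-local bump, and the same allocator type serves mutator allocation and collector copying.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// A shared region of memory that threads carve chunks from.
struct Space {
    cursor: AtomicUsize, // next unallocated address in the space
    limit: usize,        // end of the space
}

/// Size of each thread-local buffer (illustrative).
const CHUNK: usize = 32 * 1024;

impl Space {
    /// The only synchronised operation: hand out a fresh chunk.
    fn acquire_chunk(&self) -> Option<(usize, usize)> {
        let start = self.cursor.fetch_add(CHUNK, Ordering::Relaxed);
        if start + CHUNK <= self.limit {
            Some((start, start + CHUNK))
        } else {
            None // space exhausted; time to trigger a GC
        }
    }
}

/// Per-thread bump allocator. Mutator threads use one to allocate;
/// collector threads use an identical one to copy survivors.
struct BumpAllocator<'s> {
    space: &'s Space,
    cursor: usize,
    limit: usize,
}

impl<'s> BumpAllocator<'s> {
    fn new(space: &'s Space) -> Self {
        BumpAllocator { space, cursor: 0, limit: 0 }
    }

    /// Fast path: bump the thread-local cursor, no synchronisation.
    /// Assumes `bytes <= CHUNK`; a real allocator has a large-object path.
    fn alloc(&mut self, bytes: usize) -> Option<usize> {
        if self.cursor + bytes > self.limit {
            let (start, end) = self.space.acquire_chunk()?;
            self.cursor = start;
            self.limit = end;
        }
        let addr = self.cursor;
        self.cursor += bytes;
        Some(addr)
    }
}
```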
@wingo yep, all the generational collectors I've built have done parallel nursery collection. The principal overhead is the atomic operation on the mark/forward to avoid the race among collector threads.
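That atomic, sketched in Rust (illustrative, not any particular VM's header layout): each copying thread CASes a "busy" value into the object's forwarding word; the winner copies and publishes the to-space address, and losers wait for it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const UNFORWARDED: usize = 0;
const BUSY: usize = 1; // winner is mid-copy; real addresses are aligned,
                       // so they can never collide with these two values

/// The per-object header word holding the forwarding state.
struct Header {
    forward: AtomicUsize,
}

/// Resolve the to-space address of the object, copying it if this
/// collector thread wins the race. `copy` does the actual move and
/// returns the new (aligned, nonzero) address.
fn forward(obj: &Header, copy: impl FnOnce() -> usize) -> usize {
    match obj
        .forward
        .compare_exchange(UNFORWARDED, BUSY, Ordering::AcqRel, Ordering::Acquire)
    {
        // We won: copy the object, then publish the forwarding pointer.
        Ok(_) => {
            let new_addr = copy();
            obj.forward.store(new_addr, Ordering::Release);
            new_addr
        }
        // Another thread won; spin until it publishes the new address.
        Err(_) => loop {
            let w = obj.forward.load(Ordering::Acquire);
            if w != BUSY {
                break w;
            }
            std::hint::spin_loop();
        },
    }
}
```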
@wingo Parallel copying GC is pretty standard. One of the first GCs I wrote was a parallel semi-space collector for Jalapeño (JikesRVM), as a postdoc. That was > 20 years ago. All of the GCs I've written since are parallel.
PLDI SRC submission deadline is in 2 days! https://pldi23.sigplan.org/track/pldi-2023-src
PLDI SRC has a two-track model that supports both in-person and remote presentations.
This year's PLDI is also part of FCRC (https://fcrc.acm.org/), which co-locates many different and interesting CS conferences.
@adrian @ZeusHammer likewise. I'm afraid I have nothing to offer on that. It's a great question.
Whoopsie…
@ZeusHammer @csgordon Industry.
The SPEC benchmarks existed because industry agreed they needed a "level playing field" for comparing their products. The fee is presumably covering some of the development costs.
TBH it all seems very 1990s to me :-)
Also, developing good benchmarks is incredibly expensive. :-/
@csgordon @ZeusHammer I don't know, Emma, it's been a long time since I've used SPEC CPU. But I'm not aware of it setting anything, just recording the details of your hardware as part of the reporting process.