#LockFree

Joe Seigh (jwseigh)
2025-06-17

threadnought.wordpress.com/202
There are about three ways of doing it, one of which has been published already. It seems to run about 50% faster in the async memory barrier version, about 0.7 nsec vs 1.0 nsec, probably due to not having a conditional branch. No plans to implement it myself.

Felix Palmen :freebsd: :c64: (zirias@bsd.cafe)
2025-06-16

Next #swad release will still be a while. 😞

I *thought* I had the version with multiple #reactor #eventloop threads and quite some #lockfree stuff using #atomics finally crash free. I found that, while #valgrind doesn't help much, #clang's #thread #sanitizer is a very helpful debugging tool.

But I tested without #TLS (to be able to handle "massive load" which seemed necessary to trigger some of the more obscure data races). Also without the credential checkers that use child processes. Now I deployed the current state to my prod environment ... and saw a crash there (only after running a load test).

So, back to debugging. I hope the difference is not #TLS. This just doesn't work (for whatever reason) when enabling the address sanitizer, but I didn't check the thread sanitizer yet...
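For reference, the usual shape of the fix TSan suggests for this kind of race, as a minimal C11 sketch (names are illustrative, not from swad's code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* With a plain `bool ready;` TSan reports a data race as soon as one
 * thread stores while another loads. Making the flag atomic with a
 * release/acquire pair is the standard fix. */
static atomic_bool ready = false;
static int payload;

void publish(int value)
{
    payload = value;   /* ordinary write, ordered by the release below */
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Once this returns true, payload is safe to read. */
bool is_ready(void)
{
    return atomic_load_explicit(&ready, memory_order_acquire);
}
```

Building the threaded version with `clang -fsanitize=thread` is what flags the non-atomic variant.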

Joe Seigh (jwseigh)
2025-06-16


I did a lock-free queue implementation, which is kind of off piste for me since I work with deferred reclamation, and a queue is not something you want to use deferred reclamation for or test it with. So basically you want to use a ring buffer, so you don't need reclamation to avoid the ABA problem. Most of the implementations I've seen aren't actually lock-free.

Anyway, I can get about 5 million enqueues/dequeues per second using eventcounts for synchronization.
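A minimal sketch of the ring-buffer idea in C11 atomics (single-producer/single-consumer for brevity; the eventcounts are omitted): slots are reused by index, so nothing is ever freed and ABA cannot arise.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Bounded SPSC ring buffer: capacity must be a power of two so the
 * index wrap is a cheap mask. No node is ever allocated or freed. */
#define RING_SIZE 1024

static void *slots[RING_SIZE];
static atomic_size_t head;  /* next slot to read  (consumer side) */
static atomic_size_t tail;  /* next slot to write (producer side) */

bool ring_enqueue(void *item)
{
    size_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&head, memory_order_acquire);
    if (t - h == RING_SIZE) return false;          /* full */
    slots[t & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return true;
}

bool ring_dequeue(void **item)
{
    size_t h = atomic_load_explicit(&head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h == t) return false;                      /* empty */
    *item = slots[h & (RING_SIZE - 1)];
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return true;
}
```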

Felix Palmen :freebsd: :c64: (zirias@bsd.cafe)
2025-06-13

The #lockfree command #queue in #poser (for #swad) is finally fixed!

The original algorithm from [MS96] works fine *only* if the "free" function has some "magic" in place to defer freeing the object until no thread holds a reference any more ... and that magic is, well, left as an exercise to the reader. πŸ™ˆ

Doing more research, I found a few suggestions for how to do that "magic", including for example #hazardpointers ... but they're known to cause quite some runtime overhead, so not really an option. I decided to implement a "shared object manager" based on the ideas from [WICBS18], which in the end is kind of a "manually triggered garbage collector". And hey, it works! πŸ₯³
github.com/Zirias/poser/blob/m

[MS96] dl.acm.org/doi/10.1145/248052.
[WICBS18] cs.rochester.edu/u/scott/paper

#coding #c #c11 #atomics
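A hedged sketch of the retire/reclaim idea behind such a "manually triggered garbage collector" (not poser's actual code; names are made up): instead of free()ing a node other threads may still dereference, push it onto a retired list and reclaim the whole list later, at a point where no thread can hold a reference any more.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct retired {
    struct retired *next;
    void *obj;
};

static _Atomic(struct retired *) retired_head;

/* Defer freeing: lock-free push onto the retired list. */
void retire(void *obj)
{
    struct retired *r = malloc(sizeof *r);
    r->obj = obj;
    r->next = atomic_load_explicit(&retired_head, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(&retired_head,
            &r->next, r, memory_order_release, memory_order_relaxed))
        ;  /* CAS failure refreshed r->next; just retry */
}

/* The "manually triggered GC" step: call only when the caller can
 * guarantee no thread still holds a reference to any retired object. */
void reclaim_all(void)
{
    struct retired *r = atomic_exchange_explicit(&retired_head, NULL,
                                                 memory_order_acquire);
    while (r) {
        struct retired *next = r->next;
        free(r->obj);
        free(r);
        r = next;
    }
}
```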

Felix Palmen :freebsd: :c64: (zirias@bsd.cafe)
2025-06-11

This redesign of #poser (for #swad) to offer a "multi-reactor" (with multiple #threads running each their own event loop) starts to give me severe headaches.

There is *still* a very rare data #race in the #lockfree #queue. I *think* I can spot it in the pseudo code from the paper I used[1], see screenshot. Have a look at lines E7 and E8. Suppose the thread executing this is suspended after E7 for a "very long time". Eventually, some dequeue operation in another thread will dequeue whatever "Q->Tail" was pointing to, and free it after consumption. Our poor thread resumes, successfully checks the pointer already read in E6 for NULL, and then attempts a CAS on tail->next in E9, which unfortunately lives inside an object that doesn't exist any more ... If the CAS succeeds because the memory at that location happens to contain zero bytes, we corrupt some random other object that might now reside there. 🀯

Please tell me whether I have an error in my thinking here. Can it be ....? πŸ€”
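To make the suspected window concrete, here is a plain C11 rendition of the paper's enqueue with the E-line numbers as comments. This sketches the reading of the pseudo-code, not a fix; MS96 itself assumes nodes are never returned to the general allocator (freelist plus counted pointers), which is what makes the window harmless in the paper's setting.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    void *value;
    _Atomic(struct node *) next;
};

struct queue {
    _Atomic(struct node *) head;   /* points at a dummy node */
    _Atomic(struct node *) tail;
};

void queue_init(struct queue *q)
{
    struct node *dummy = malloc(sizeof *dummy);
    atomic_store(&dummy->next, NULL);
    atomic_store(&q->head, dummy);
    atomic_store(&q->tail, dummy);
}

void enqueue(struct queue *q, void *value)
{
    struct node *n = malloc(sizeof *n);
    n->value = value;
    atomic_store(&n->next, NULL);
    for (;;) {
        struct node *tail = atomic_load(&q->tail);       /* E5/E6 */
        struct node *next = atomic_load(&tail->next);    /* E7    */
        /* <-- suspended here: if `tail` is dequeued AND freed by
         *     another thread, the accesses below touch dead memory */
        if (tail != atomic_load(&q->tail)) continue;     /* E8    */
        if (next == NULL) {
            if (atomic_compare_exchange_weak(&tail->next,/* E9    */
                                             &next, n)) {
                /* swing tail; harmless if someone else already did */
                atomic_compare_exchange_strong(&q->tail, &tail, n);
                return;
            }
        } else {
            /* tail is lagging: help advance it, then retry */
            atomic_compare_exchange_strong(&q->tail, &tail, next);
        }
    }
}
```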

Meanwhile, after fixing and improving lots of things, I checked the alternative implementation using #mutexes again, and surprise: although it's still a bit slower, the difference is now very, very small. And it has the clear advantage that it never crashes. πŸ™ˆ I'm seriously considering dropping all the lock-free #atomics stuff again and just going with mutexes.

[1] dl.acm.org/doi/10.1145/248052.

(Screenshot: pseudo-code of a lock-free enqueue operation)
Felix Palmen :freebsd: :c64: (zirias@bsd.cafe)
2025-06-11

I guess this funny looking graph showing response time percentiles is exactly the result of one of 8 service worker threads having a lot more to do than all others. I wonder whether this could be a behavioral artifact of the #lockfree #queue used to distribute the accepted connections. πŸ€”

Felix Palmen :freebsd: :c64: (zirias@bsd.cafe)
2025-06-06

More interesting progress trying to make #swad suitable for very busy sites!

I realized that #TLS (both with #OpenSSL and #LibreSSL) is a *major* bottleneck. With TLS enabled, I couldn't cross 3000 requests per second, with somewhat acceptable response times (most below 500ms). Disabling TLS, I could really see the impact of a #lockfree queue as opposed to one protected by a #mutex. With the mutex, up to around 8000 req/s could be reached on the same hardware. And with a lockfree design, that quickly went beyond 10k req/s, but crashed. πŸ˜†

So I read some scientific papers πŸ™ˆ ... and redesigned a lot (*). And now it finally seems to work. My latest test reached a throughput of almost 25k req/s, with response times below 10ms for most requests! I really didn't expect to see *this* happen. 🀩 Maybe it could do even more, didn't try yet.

Open issue: Can I do something about TLS? There *must* be some way to make it perform at least a *bit* better...

(*) edit: Here's the design I finally used, with a much simplified "dequeue" because the queues in question are guaranteed to have only a single consumer: dl.acm.org/doi/10.1145/248052.
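For illustration, the single-consumer simplification might look roughly like this (a hypothetical sketch, not the actual poser code): since no other thread ever races the consumer for the head, dequeue degrades to a plain load-and-advance with no CAS at all, and the node being retired is owned by the one consumer.

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    void *value;
    _Atomic(struct node *) next;
};

struct queue {
    struct node *head;             /* consumer-private dummy node */
    _Atomic(struct node *) tail;   /* still contended by producers */
};

/* Single consumer: nobody else touches q->head, so no CAS needed. */
void *dequeue_single_consumer(struct queue *q)
{
    struct node *dummy = q->head;
    struct node *next = atomic_load_explicit(&dummy->next,
                                             memory_order_acquire);
    if (next == NULL) return NULL;          /* empty */
    void *value = next->value;
    q->head = next;                         /* next becomes the dummy */
    /* the old dummy could be freed right here: only we could see it */
    return value;
}
```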

(Screenshots: throughput curve of my latest stress test of swad, with ramp-up and ramp-down phases; response times in percentiles, 97% stay below 10ms!)
Felix Palmen :freebsd: :c64: (zirias@bsd.cafe)
2025-06-05

Getting somewhat closer to releasing a new version of #swad. I now improved the functionality to execute something on a different worker thread: it uses an in-memory queue, with a #lockfree version provided. This gives me a consistent, reliable throughput of 3000 requests/s (with outliers up to 4500 r/s) at an average response time of 350 - 400 ms (with TLS enabled). For waking up worker threads, I implemented different backends as well: kqueue, eventfd and event ports; the fallback is still a self-pipe.

So, #portability here really means implementing lots of different flavors of the same thing.

Looking at these startup logs, you can see that #kqueue (#FreeBSD and other BSDs) is really a "jack of all trades", being used for "everything" if available (and that's pretty awesome, it means one single #syscall per event loop iteration in the generic case). #illumos' (#Solaris) #eventports come somewhat close (but need a lot more syscalls as there's no "batch registering" and certain event types need to be re-registered every time they fired), they just can't do signals, but illumos offers Linux-compatible signalfd. Looking at #Linux, there's a "special case fd" for everything. πŸ™ˆ Plus #epoll also needs one syscall for each event to be registered. The "generic #POSIX" case without any of these interfaces is just added for completeness πŸ˜†
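As a concrete example of one backend, the Linux eventfd flavor of such a wakeup can be sketched like this (Linux-only; function names are illustrative, not poser's API). One fd replaces the classic self-pipe pair: any thread "writes a wakeup", the worker polls and drains it.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

int make_wakeup_fd(void)
{
    /* non-blocking so draining an empty counter just returns EAGAIN */
    return eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
}

int send_wakeup(int fd)
{
    uint64_t one = 1;   /* eventfd adds the written value to a counter */
    return write(fd, &one, sizeof one) == sizeof one ? 0 : -1;
}

/* Returns the number of pending wakeups consumed, 0 if none. */
uint64_t drain_wakeups(int fd)
{
    uint64_t n;
    if (read(fd, &n, sizeof n) != sizeof n) return 0;
    return n;           /* the read atomically resets the counter */
}
```

In an event loop, the worker would register the fd with epoll/kqueue-style polling and call `drain_wakeups()` when it becomes readable.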

(Screenshots: swad startup on generic POSIX, on Linux, on illumos, and on FreeBSD)
Felix Palmen :freebsd: :c64: (zirias@bsd.cafe)
2025-06-04

I now experimented with different ideas for how to implement the #lockfree #queue for multiple producers and multiple consumers. Unsurprisingly, some ideas just didn't work. One deadlocked (okaaay ... so it wasn't lockfree), and I eventually gave up trying to understand why.

The "winner" so far is only "almost lockfree", but at least it slightly improves performance. Throughput is the same as with the simple locked variant, but average response times are 10 to 20% quicker (although they vary more, for whatever reason). Well, that's committed for now:

github.com/Zirias/poser/commit

#C11 #atomics

Felix Palmen :freebsd: :c64: (zirias@bsd.cafe)
2025-06-04

I now added a #lockfree version of that MPMC job queue which is picked when the system headers claim that pointers are lockfree. Doesn't give any measurable performance gain 😞. Of course the #semaphore needs to stay, the pool threads need something to wait on. But I think the reason I can't get more than 3000 requests per second with my #jmeter stress test for #swad is that the machine's CPU is now completely busy πŸ™ˆ.

Need to look into actually saving CPU cycles for further optimizations I guess...
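The "system headers claim that pointers are lockfree" check presumably boils down to a C11 probe along these lines (a sketch, not poser's actual code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* ATOMIC_POINTER_LOCK_FREE is 2 when atomic pointer ops are always
 * lock-free, 1 when it depends on the object, 0 when they never are. */
bool pointers_are_lock_free(void)
{
#if ATOMIC_POINTER_LOCK_FREE == 2
    return true;
#else
    /* value 1: fall back to the per-object runtime query */
    static void *_Atomic probe;
    return atomic_is_lock_free(&probe);
#endif
}
```

A build could select the lock-free queue only when this returns true, keeping the mutex-based variant as the portable fallback.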

2025-03-04

ΠžΠ±ΡΡƒΠΆΠ΄Π°Π΅ΠΌ измСнСния Π² Go 1.24, ΠΌΡŒΡŽΡ‚Π΅ΠΊΡΡ‹ ΠΈ ΠΏΠ°ΠΊΠ΅Ρ‚ unsafe β€” ΠΎΡ‚ΠΊΡ€Ρ‹Ρ‚ΠΈΠ΅ сСзона ΠΌΠΈΡ‚Π°ΠΏΠΎΠ² для Π³ΠΎΡ„Π΅Ρ€ΠΎΠ² Π² МосквС

Π‘ ΠΏΡ€ΠΈΡ…ΠΎΠ΄ΠΎΠΌ вСсны ΠΈΠ·-ΠΏΠΎΠ΄ сугробов снова Π½Π°Ρ‡ΠΈΠ½Π°ΡŽΡ‚ ΠΏΡ€ΠΎΡ€Π°ΡΡ‚Π°Ρ‚ΡŒ ΠΏΠΎΠ»Π΅Π·Π½Ρ‹Π΅ ΠΌΠΈΡ‚Π°ΠΏΡ‹. На ΠΏΠ΅Ρ€Π²ΠΎΠΉ Π² сСзонС Go-сходкС ΠΎΡ‚ YADRO ΠΏΡ€Π΅Π΄Π»Π°Π³Π°Π΅ΠΌ ΠΏΡ€ΠΈΡΠΎΠ΅Π΄ΠΈΠ½ΠΈΡ‚ΡŒΡΡ ΠΊ ΠΎΠ±ΡΡƒΠΆΠ΄Π΅Π½ΠΈΡŽ ΠΈΠ·ΠΌΠ΅Π½Π΅Π½ΠΈΠΉ Go 1.24. ЭкспСрты ΠΈΠ· AvitoTech, Yandex ΠΈ YADRO ΠΏΠΎΠ΄ΠΈΡΠΊΡƒΡ‚ΠΈΡ€ΡƒΡŽΡ‚, ΠΊΠ°ΠΊ обновлСния ΠΏΠΎΠ²Π»ΠΈΡΡŽΡ‚ Π½Π° ΠΊΠΎΠ΄ Ρ€Π°Π·Ρ€Π°Π±ΠΎΡ‚Ρ‡ΠΈΠΊΠΎΠ². Π’Π°ΠΊΠΆΠ΅ Π²Ρ‹ ΡƒΠ·Π½Π°Π΅Ρ‚Π΅, ΠΊΠ°ΠΊ ΠΎΠ±Π΅ΡΠΏΠ΅Ρ‡ΠΈΡ‚ΡŒ Π²Ρ‹ΡΠΎΠΊΠΎΠΏΡ€ΠΎΠΈΠ·Π²ΠΎΠ΄ΠΈΡ‚Π΅Π»ΡŒΠ½ΡƒΡŽ ΠΊΠΎΠ½ΠΊΡƒΡ€Π΅Π½Ρ‚Π½ΠΎΡΡ‚ΡŒ Π² Go ΠΈ с ΡƒΠΌΠΎΠΌ ΠΏΡ€ΠΈΠΌΠ΅Π½ΡΡ‚ΡŒ ΠΏΠ°ΠΊΠ΅Ρ‚ unsafe. ΠžΡ„Π»Π°ΠΉΠ½-участников ΠΆΠ΄Π΅Ρ‚ Π΄Π΅ΠΌΠΎΠ·ΠΎΠ½Π° с ΠΎΠ±ΠΎΡ€ΡƒΠ΄ΠΎΠ²Π°Π½ΠΈΠ΅ΠΌ для Π¦ΠžΠ” ΠΈ Ρ‚Π΅Π»Π΅ΠΊΠΎΠΌ-ΠΎΠΏΠ΅Ρ€Π°Ρ‚ΠΎΡ€ΠΎΠ², тСхничСскиС ΠΈΠ½Ρ‚Π΅Ρ€Π°ΠΊΡ‚ΠΈΠ²Ρ‹ ΠΈ ΠΏΠΎΠ΄Π°Ρ€ΠΊΠΈ.

habr.com/ru/companies/yadro/ar

#go #meetup #lockfree #unsafe #itconference

CppCon
2024-08-12

CppCon 2024 SESSION ANNOUNCEMENT: When Lock-Free Still Isn't Enough: An Introduction to Wait-Free Programming and Concurrency Techniques by Daniel Anderson

cppcon2024.sched.com/event/1gZ

Register now: cppcon.org/registration/

CppCon
2024-08-07

CppCon 2024 SESSION ANNOUNCEMENT: Multi Producer, Multi Consumer, Lock Free, Atomic Queue - User API and Implementation by Erez Strauss

cppcon2024.sched.com/event/1gZ

Register now: cppcon.org/registration/

2024-06-28

Volatile, Lock-free, Immutable, Atomic in Java: how to understand and start using them

Hi, my name is Denis Agapitov, and I lead the Platform Core group at Bercut. Today I want to talk about one flavor of lock-free algorithms in Java, and look at how the volatile keyword and the immutable pattern relate to it.

habr.com/ru/companies/bercut/a

#java #atomic #volatile #multithreading #immutable #lockfree

ACCUConf
2024-03-25

ACCU 2024 SESSION ANNOUNCEMENT: Introduction to Lock/Wait Free Algorithms: Defining and Understanding the Terms by Jeffrey Mendelsohn

accuconference.org/session/int

Register now at accuconference.org/pricing/

2024-03-01

[Translation] Xv6: a teaching Unix-like OS. Chapter 6. Locks

The OS kernel runs programs in parallel and switches threads on timer interrupts. Each CPU executes its thread independently of the others. The CPUs share RAM, so it is important to protect data structures from concurrent access. Threads will corrupt data if a CPU switches to another thread before the first thread has finished a write. Threads compete for access to data structures, and the kernel teems with structures that threads share. Locks protect data under concurrent access. This chapter explains why locks are needed and how xv6 implements and uses them.

habr.com/ru/articles/797557/

#xv6 #locks #interrupts #deadlocks #threads #parallel_programming #multithreading #pthreads #lockfree
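The essence of the xv6 acquire/release spinlock can be sketched in portable C11 (xv6 itself uses RISC-V compiler builtins, and additionally disables interrupts around the critical section):

```c
#include <stdatomic.h>

typedef struct { atomic_flag locked; } spinlock;

/* Spin on an atomic test-and-set until we win the flag. */
void spin_acquire(spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire))
        ;  /* busy-wait; xv6's acquire() also pushes off interrupts */
}

/* Clearing the flag with release ordering publishes our writes. */
void spin_release(spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```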

Gareth Lloyd (He/him) (glloyd@fosstodon.org)
2023-11-06

Code generation around atomics is strange.

What is desired is `lock cmpxchg16b`, but it is hard work to make the compiler produce it.

clang has the best codegen IMO. It took me a while to realize I needed `alignas(16)`.

gcc.godbolt.org/z/vTcP9ohsY
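The alignment point can be shown without any atomics at all (an illustrative C11 snippet; the same `alignas` exists in C++): the `cmpxchg16b` operand must be 16-byte aligned, but a bare struct of two 8-byte members only gets 8-byte alignment. Compilers also typically want `-mcx16` before emitting the instruction.

```c
#include <stdalign.h>
#include <stdint.h>

typedef struct {
    void     *ptr;
    uintptr_t tag;
} pair8;                  /* natural alignment 8: not cmpxchg16b-able */

typedef struct {
    alignas(16) void *ptr;  /* forces 16-byte alignment of the struct */
    uintptr_t         tag;
} pair16;                 /* now a valid double-width CAS operand */
```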

#cplusplus #cpp #lockfree

🎹 Tim Janik βœ… (timj@social.tchncs.de)
2022-11-03

Wrote a #LockFree (and Obstruction-Free) memory #allocator for use in low-latency real-time threads.

Blocks are 64-byte aligned to avoid false sharing, alloc+free calls of the underlying Bucket- and Bump-Allocators are O(1).

A separate thread is notified when the Bump-Allocator reaches a watermark. It will then extend the pool of pre-allocated (and pre-faulted) memory, so the Bump-Allocator can continue to serve requests concurrently without syscalls.

github.com/tim-janik/anklang/b
#Anklang
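The bump-allocator part can be sketched in a few lines of C11 (illustrative only; Anklang's actual allocator is more involved): carve 64-byte-aligned blocks out of a pre-allocated arena with a single fetch_add, so concurrent allocations are O(1) and never syscall.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (64 * 1024)
#define BLOCK_ALIGN 64   /* cache-line aligned to avoid false sharing */

static _Alignas(BLOCK_ALIGN) unsigned char arena[ARENA_SIZE];
static atomic_size_t bump;  /* next free offset into the arena */

void *bump_alloc(size_t size)
{
    /* round up to a whole 64-byte block */
    size_t need = (size + BLOCK_ALIGN - 1) & ~(size_t)(BLOCK_ALIGN - 1);
    size_t off = atomic_fetch_add_explicit(&bump, need,
                                           memory_order_relaxed);
    if (off + need > ARENA_SIZE)
        return NULL;   /* this is where the watermark/refill would kick in */
    return arena + off;
}
```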
