#swad

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-20

For two days straight, I just can't reproduce #swad #crashing with *anything* in place (#clang #sanitizer instrumentation, attached #debugger like #lldb) that could give me the slightest hint as to what's going wrong. 😡

But it *does* crash when "unobserved". And it looks like this is happening a lot sooner (or, more often?) when using #LibreSSL ... but I also suspect this could be a red herring in the end.

The situation reminds me of my physics teacher back at school, who used to say something in German I just can't ever forget:

"Wer misst, misst Mist."

A feeble attempt in English would be "the one who measures, measures crap"; it was his humorous way of summing up one consequence of #Heisenberg's uncertainty principle. And indeed, #debugging computer programs always suffers from similar problems...

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-18

I need help. First the question: On #FreeBSD, with all ports built with #LibreSSL, can I somehow use the #clang #thread #sanitizer on a binary actually using LibreSSL and get sane output?

What I now observe debugging #swad:

- A version built with #OpenSSL (from base) doesn't crash. At least I tried very hard, really stressing it with #jmeter, to no avail. Built with LibreSSL, it does crash.
- Less relevant: the OpenSSL version also performs slightly better, but needs almost twice the RAM
- The thread sanitizer finds nothing to complain about when built with OpenSSL
- It complains a lot with LibreSSL, but the reports look "fishy", e.g. it seems to intercept some OpenSSL API functions (like SHA384_Final)
- It even complains when running with a single-thread event loop.
- I use a single SSL_CTX per listening socket, creating SSL objects from it per connection ... also with multithreading; according to a few sources, this should be supported and safe (see the sketch after this list)
- I can't imagine that doing that on a *single* thread could break with LibreSSL, I mean, this would make SSL_CTX pretty much pointless
- I *could* imagine that sharing the SSL_CTX with multiple threads to create their SSL objects from *might* not be safe with LibreSSL, but I have no idea how to verify that as long as the thread sanitizer gives me "delusional" output 😳
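For reference, a minimal sketch of that pattern, assuming the standard OpenSSL/LibreSSL API (the helper names are made up for illustration, this is not the actual swad code):

#include <openssl/ssl.h>

/* one SSL_CTX per listening socket, set up once at startup */
SSL_CTX *create_server_ctx(const char *cert, const char *key)
{
    SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());
    if (!ctx) return NULL;
    if (SSL_CTX_use_certificate_chain_file(ctx, cert) != 1
            || SSL_CTX_use_PrivateKey_file(ctx, key, SSL_FILETYPE_PEM) != 1)
    {
        SSL_CTX_free(ctx);
        return NULL;
    }
    return ctx;
}

/* one SSL object per accepted connection, possibly created from
 * different threads that share the same ctx */
SSL *wrap_connection(SSL_CTX *ctx, int connfd)
{
    SSL *ssl = SSL_new(ctx);
    if (!ssl) return NULL;
    SSL_set_fd(ssl, connfd);
    if (SSL_accept(ssl) != 1)   /* blocking handshake, for brevity only */
    {
        SSL_free(ssl);
        return NULL;
    }
    return ssl;
}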

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-18

Fixed *that* issue by making sure each instance of the Process class has an owning thread, but forks the child on the main thread and receives exit events from there, delegating that info back to the owning thread. Seems to work.
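Roughly, the delegation looks like this (all names here are hypothetical, just to illustrate the idea, not the actual poser API):

#include <sys/types.h>
#include <sys/wait.h>

typedef struct Process {
    pid_t pid;
    int owner;                             /* id of the owning thread */
    int status;                            /* filled in by the main thread */
    void (*exited)(struct Process *);      /* must run on the owning thread */
} Process;

/* hypothetical: schedule a callback on a specific thread's event loop */
extern void runOnThread(int thread, void (*job)(void *), void *arg);

static void deliverExit(void *arg)
{
    Process *proc = arg;
    proc->exited(proc);                    /* now on the owning thread */
}

/* called on the MAIN thread only, from its SIGCHLD handling */
void onChildExited(Process *proc)
{
    int status;
    if (waitpid(proc->pid, &status, WNOHANG) == proc->pid)
    {
        proc->status = status;
        /* don't fire the exit event here: hand it to the owning thread */
        runOnThread(proc->owner, deliverExit, proc);
    }
}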

Now, I can still make #swad crash. But no matter what I tried so far, as soon as I build it with both #debugging symbols and the #thread #sanitizer, I just can't reproduce a crash.

Now what? 😞🤷

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-18

Yep, there's a second bug. The #clang #thread #sanitizer had nothing to complain about, and the output from #assert doesn't help much. So, first step: "pimp your assert" 😂 --- #FreeBSD, like some other systems, provides functions to collect and print rudimentary stacktraces, so use these if available:
github.com/Zirias/poser/commit
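The linked commit isn't reproduced here, but the idea might look roughly like this, assuming the execinfo backtrace functions (FreeBSD has them via libexecinfo, glibc ships them too) -- a sketch, not the actual poser code:

#include <stdio.h>
#include <stdlib.h>

#ifdef HAVE_BACKTRACE
#  include <execinfo.h>       /* link with -lexecinfo on FreeBSD */
#endif

void assert_fail(const char *expr, const char *file, int line)
{
    fprintf(stderr, "Assertion failed: %s (%s:%d)\n", expr, file, line);
#ifdef HAVE_BACKTRACE
    void *frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, 2);   /* dump a raw stacktrace to stderr */
#endif
    abort();
}

#define ASSERT(x) ((x) ? (void)0 : assert_fail(#x, __FILE__, __LINE__))

The raw addresses can then be run through addr2line to get file and line information.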

Now I got closer, see screenshot. That's enough to understand the issue: the global event firing when a #child #process exits was used from multiple threads. OK, it obviously doesn't work that way, so, back to the drawing board regarding my handling of child processes... 🤔

Next #swad release: Soon, so I hope 🙈

swad printing a stacktrace for a failed assert, filtering that with "addr2line" to obtain more useful information.
Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-18

Finally, a lead on what could still cause my development version of #swad to crash!

Ok, this looks really weird, the failed assertion at the bottom means a thread ends up fiddling with an event that's owned by a different thread.

But hey, at least now I have stacktraces of what's happening.

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-16

Next #swad release will still be a while. 😞

I *thought* I had the version with multiple #reactor #eventloop threads and quite some #lockfree stuff using #atomics finally crash free. I found that, while #valgrind doesn't help much, #clang's #thread #sanitizer is a very helpful debugging tool.

But I tested without #TLS (to be able to handle "massive load" which seemed necessary to trigger some of the more obscure data races). Also without the credential checkers that use child processes. Now I deployed the current state to my prod environment ... and saw a crash there (only after running a load test).

So, back to debugging. I hope the difference is not #TLS. This just doesn't work (for whatever reason) when enabling the address sanitizer, but I didn't check the thread sanitizer yet...

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-16

Feels like every time I try to reduce #memory usage, I accidentally improve #throughput instead. At least THIS time, I also see reduced memory usage, nice!

#swad #coding #c #unusual #issue

swad now reaching over 31k req/s on the same hardware.
Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-16

Slow and steady progress making #swad fit for heavy traffic: adding thread-specific "object pools" for the connection objects representing the clients of one server seems to have reduced the growth of the resident set! 🥳 It may sound counter-intuitive to *save* memory by *not* returning any ... but that's what I observe 🙈 And it also improved throughput further!

I'll apply that principle to even more objects.

Meanwhile, the silly scheduling behavior got even sillier. For a while, I was observing one service worker thread being twice as busy as all the others ... now that picture was completed by one being especially lazy. What the ....?

Threads of swad during load test. The uppermost thread is the "main" thread, handling signals and accepting connections. The 8 threads at the bottom are the service workers, each running its own event loop, handling the socket connections of clients and doing basic HTTP parsing. The threads in between are the pool threads, used for executing individual request pipelines.
Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-13

The #lockfree command #queue in #poser (for #swad) is finally fixed!

The original algorithm from [MS96] works fine *only* if the "free" function has some "magic" in place to defer freeing the object until no thread holds a reference any more ... and that magic is, well, left as an exercise to the reader. 🙈

Doing more research, I found a few suggestions for how to do that "magic", including for example #hazardpointers ... but they're known to cause quite a bit of runtime overhead, so not really an option. I decided to implement a "shared object manager" based on the ideas from [WICBS18], which is kind of a "manually triggered garbage collector" in the end. And hey, it works! 🥳
github.com/Zirias/poser/blob/m
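To give a rough idea of the concept, here's a heavily reduced sketch in the spirit of quiescent-state-based reclamation (not the actual poser code; fixed thread count, one global mutex): threads never free a shared node directly but retire it, and they announce points where they hold no shared references; only after every thread has announced such a point can nodes retired before the previous collection be freed.

#include <pthread.h>
#include <stdlib.h>

#define MAXTHREADS 16

typedef struct Retired {
    struct Retired *next;
    void *obj;
} Retired;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static Retired *current;            /* retired since the last collection */
static Retired *previous;           /* retired before the last collection */
static int nthreads;                /* set before the workers start */
static int quiesced[MAXTHREADS];

/* call instead of free(); the node must already be unlinked, so no new
 * references to it can be obtained */
void retire(void *obj)
{
    Retired *r = malloc(sizeof *r);
    r->obj = obj;
    pthread_mutex_lock(&lock);
    r->next = current;
    current = r;
    pthread_mutex_unlock(&lock);
}

/* each worker calls this at a point where it holds no shared references */
void quiescent(int tid)
{
    Retired *gone = NULL;
    pthread_mutex_lock(&lock);
    quiesced[tid] = 1;
    int all = 1;
    for (int i = 0; i < nthreads; ++i) if (!quiesced[i]) all = 0;
    if (all)
    {
        /* a full grace period has passed: nothing retired before the
         * previous collection can still be referenced, so reclaim it */
        gone = previous;
        previous = current;
        current = NULL;
        for (int i = 0; i < nthreads; ++i) quiesced[i] = 0;
    }
    pthread_mutex_unlock(&lock);
    while (gone)
    {
        Retired *next = gone->next;
        free(gone->obj);
        free(gone);
        gone = next;
    }
}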

[MS96] dl.acm.org/doi/10.1145/248052.
[WICBS18] cs.rochester.edu/u/scott/paper

#coding #c #c11 #atomics

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-11

This redesign of #poser (for #swad) to offer a "multi-reactor" (with multiple #threads running each their own event loop) starts to give me severe headaches.

There is *still* a very rare data #race in the #lockfree #queue. I *think* I can spot it in the pseudo code from the paper I used[1], see screenshot. Have a look at lines E7 and E8. Suppose the thread executing this is suspended after E7 for a "very long time". Now, some dequeue operation from some other thread will eventually dequeue whatever "Q->Tail" was pointing to, and then free it after consumption. Our poor thread resumes, checks the pointer already read in E6 for NULL successfully, and then tries a CAS on tail->next in E9, which unfortunately lives inside an object that doesn't exist any more .... And if the CAS succeeds because the memory at this location happens to contain "zero" bytes, we corrupt some random other object that might now reside there. 🤯

Please tell me whether I have an error in my thinking here. Can it be ....? 🤔
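To make the suspected window concrete, here's a rough C11 re-statement of the paper's enqueue (the tag/version counters are left out for brevity; an illustration only, not the poser code), with the hazard marked in comments:

#include <stdatomic.h>
#include <stddef.h>

typedef struct Node {
    void *value;
    _Atomic(struct Node *) next;
} Node;

typedef struct Queue {
    _Atomic(Node *) head;
    _Atomic(Node *) tail;
} Queue;

void enqueue(Queue *q, Node *node)
{
    atomic_store(&node->next, NULL);
    for (;;)
    {
        Node *tail = atomic_load(&q->tail);            /* E5 */
        Node *next = atomic_load(&tail->next);         /* E6 */
        if (tail != atomic_load(&q->tail)) continue;   /* E7 */
        /* ... suppose we're suspended right here for a long time: another
         * thread can dequeue *tail, consume it and free() it ... */
        if (next == NULL)                              /* E8 */
        {
            /* E9: this CAS now writes into freed memory if *tail is gone;
             * if that memory was reused and happens to contain NULL where
             * 'next' used to live, the CAS even "succeeds" and corrupts
             * whatever object resides there now */
            if (atomic_compare_exchange_weak(&tail->next, &next, node))
            {
                atomic_compare_exchange_strong(&q->tail, &tail, node);
                return;
            }
        }
        else
        {
            atomic_compare_exchange_strong(&q->tail, &tail, next);
        }
    }
}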

Meanwhile, after fixing and improving lots of things, I checked the alternative implementation using #mutexes again, and surprise: although it's still a bit slower, the difference is now very, very small. And it has the clear advantage that it never crashes. 🙈 I'm seriously considering dropping all the lock-free #atomics stuff again and just going with mutexes.

[1] dl.acm.org/doi/10.1145/248052.

Pseudo-code of a lockfree enqueue operation
Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-10

I now implemented a per-thread #pool to reuse #timer objects in #poser (my lib I use for #swad).

The great news is: this improved performance, which is an unintended side effect (my goal was to reduce RAM usage 🙈😆). I tested with the #kqueue backend on #FreeBSD and sure, this makes sense: so far, I needed to keep a list of destroyed timers that's always checked, to work around an interesting issue: by the time I cancel a timer with #kevent, the expiry event might already be queued, but not yet read by my event loop. Trying to fire events from a timer that doesn't exist any more would segfault, of course. That's not necessary any more with the pool approach: the timer WILL exist and I can just check whether it's "alive".
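A sketch of that "alive" check, assuming the kqueue EVFILT_TIMER backend (names and structure are made up for illustration, not the actual poser API):

#include <sys/event.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Timer {
    struct Timer *next;          /* free-list link while pooled */
    int alive;                   /* set on start, cleared on cancel */
    void (*expired)(void *arg);
    void *arg;
} Timer;

static _Thread_local Timer *timerpool;

Timer *timer_get(void)           /* take from the pool, grow it if empty */
{
    Timer *t = timerpool;
    if (t) timerpool = t->next;
    else t = malloc(sizeof *t);
    return t;
}

void timer_start(int kq, Timer *t, int ms)
{
    struct kevent kev;
    t->alive = 1;
    EV_SET(&kev, (uintptr_t)t, EVFILT_TIMER, EV_ADD | EV_ONESHOT, 0, ms, t);
    kevent(kq, &kev, 1, NULL, 0, NULL);
}

void timer_cancel(Timer *t)
{
    t->alive = 0;                /* the expiry may already be queued: just flag it */
    t->next = timerpool;         /* back to the pool, never free()d */
    timerpool = t;
}

/* in the event loop, when an EVFILT_TIMER event is read: */
void on_timer_event(const struct kevent *kev)
{
    Timer *t = kev->udata;
    if (t->alive)                /* ignore stale events of cancelled timers */
    {
        t->alive = 0;
        t->expired(t->arg);
    }
}

/* note: a real implementation must also ensure a pooled Timer isn't handed
 * out again before a possible stale event was drained, e.g. with a
 * generation counter -- omitted here for brevity */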

The result? Same hardware as always, and now swad reaches a throughput of 26000 requests per second with (almost) perfect response times. 🥳

I'm still not happy with memory usage. It's better, but I have no explanation for what I observed now:

Ran the same test 3 times: 1000 #jmeter threads, each simulating a distinct client looping 2000 times doing one GET and one POST, for a total of 4 million requests. After the first time, the resident set was at 178 MiB. After the second time, 245 MiB. And after the third time, well, 245 MiB. How ...? 🤯

Also, there's another weird observation I have no explanation for. My main thread delegates accepted connections to worker threads simply "round robin". And each time I run the jmeter test, all these worker threads show increasing CPU usage at a similar rate, until suddenly one single thread seems to do "more work", which stabilizes with this thread utilizing almost double the CPU of all the other worker threads. And when I run the jmeter test again (NOT restarting swad), the same happens again, but this time it's a *different* thread that "works" a lot more than all the others.

I wonder whether I should just accept that scheduling, memory management etc. are all "black magic" and swad is probably "good enough" as is right now. 😆

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-10

Currently trying to tackle the #RAM consumption issue I see in #swad by introducing thread-local #pools (using simple linked lists) for frequently created and destroyed #objects like event-handler entries, timers, maybe also the connections served by a specific server ... instead of allocating a new one, the pool is asked for one, and instead of (fully) destroying one, it's returned to that pool.

It's a shot in the dark and effectively ensures memory for these is (almost) never freed, but I hope it will avoid #heap #fragmentation. We'll see how it goes 🙈
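The core of such a pool can be tiny; a minimal sketch of the idea (hypothetical names, assuming all objects taken from one pool have the same size):

#include <stdlib.h>

typedef struct PoolObj { struct PoolObj *next; } PoolObj;

static _Thread_local PoolObj *pool;   /* one list per thread, no locking needed */

void *pool_get(size_t size)
{
    PoolObj *o = pool;
    if (o) { pool = o->next; return o; }                /* reuse */
    return malloc(size < sizeof *o ? sizeof *o : size); /* grow lazily */
}

void pool_put(void *p)                /* "destroy": return to the pool instead */
{
    PoolObj *o = p;
    o->next = pool;
    pool = o;
}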

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-08

I just stress-tested the current dev state of #swad on #Linux. The first attempt failed miserably; I got a lot of errors accepting a connection. Well, this led to another little improvement: I added another static method to my logging interface that mimics #perror, i.e. it also prints the description of the system errno. With that in place, I could see the issue was "too many open files". Checking #ulimit -n gave me 1024. Seriously? 🤯 On my #FreeBSD machine, as a regular user, it's 226755. Ok, bumped that up to 8192 and then the stress test ran through without issues.
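Such a helper might look roughly like this (hypothetical name and signature, not the actual poser API):

#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

void log_err(const char *fmt, ...)
{
    int err = errno;              /* save it before anything can clobber it */
    char msg[1024];
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(msg, sizeof msg, fmt, ap);
    va_end(ap);
    fprintf(stderr, "[error] %s: %s\n", msg, strerror(err));
}

/* usage: if (accept(...) < 0) log_err("accepting a connection failed"); */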

On a side note, this also made creating new timers (using #timerfd on Linux) fail, which ultimately made swad crash. I have to redesign my timer interface so that creating a timer may explicitly fail and I can react to that, aborting whatever would need that timer.

Anyway, the same test gave somewhat acceptable results: a throughput of roughly 3000 req/s, response times around 500 ms. Not great, but okayish, and not directly comparable because this test ran in a #bhyve vm and the requests had to pass through the virtual networking.

One major issue is still the #RAM consumption. The test left swad with a resident set of > 540 MiB. I have no idea what to do about that. 😞 The code makes heavy use of "allocated objects" (every connection object with metadata and buffers, every event handler registered, every timer, and so on), so it uses the #heap a lot, but according to #valgrind, it correctly frees everything. Still the resident set just keeps growing. I guess it's the classic #fragmentation issue...

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-07

@fanf Sure, that does make sense. I'll try to verify that jmeter indeed doesn't reuse connections (I already have debug logging in place that should tell me).

If that's really the reason, I guess the sane thing to do is to add a hint to the docs to just disable TLS for very busy sites. The intended use case for #swad is operation behind #nginx to serve its "auth_request". I don't intend to implement HTTP/2 or beyond, but it would be pretty pointless here anyways: nginx defaults to HTTP/1.0 for proxy requests and can be configured to use HTTP/1.1 instead, but it *still* doesn't reuse connections by default, and my experiments so far to enable that weren't successful; maybe I didn't fully understand it yet. Using TLS behind nginx would make sense from a "defense in depth" point of view, but it's probably impractical once your load exceeds a certain threshold.

For background on how I arrived here: I observed stupid #AI #scraper #bots clogging my DSL connection by downloading gigabytes of build logs produced by my #poudriere. They're not secret in any way, and having a simple way to share them is great for community bug hunting, but this had to stop. I had a simple C library doing a fully portable reactor event loop on top of select (so, not really scalable), and some very limited HTTP/1.1 server code from experiments with TOR hidden services ... so I put that together to add some web-form + cookie auth to my private nginx to lock out the bots. Later, I added a "guest login" doing the same "proof of work" stuff known from #anubis, and then I suddenly had the idea of making my little service (which already solved the problem perfectly for myself) suitable for large-scale installations. So I added kqueue, epoll etc. support, added a "multi-reactor with acceptor-connector" design, and so on ... and now I'm a bit frustrated that enabling TLS spoils all the performance 🙈

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-06

More interesting progress trying to make #swad suitable for very busy sites!

I realized that #TLS (both with #OpenSSL and #LibreSSL) is a *major* bottleneck. With TLS enabled, I couldn't cross 3000 requests per second, with somewhat acceptable response times (most below 500ms). Disabling TLS, I could really see the impact of a #lockfree queue as opposed to one protected by a #mutex. With the mutex, up to around 8000 req/s could be reached on the same hardware. And with a lockfree design, that quickly went beyond 10k req/s, but crashed. 😆

So I read some scientific papers 🙈 ... and redesigned a lot (*). And now it finally seems to work. My latest test reached a throughput of almost 25k req/s, with response times below 10ms for most requests! I really didn't expect to see *this* happen. 🤩 Maybe it could do even more; I didn't try yet.

Open issue: Can I do something about TLS? There *must* be some way to make it perform at least a *bit* better...

(*) edit: Here's the design I finally used, with a much simplified "dequeue" because the queues in question are guaranteed to have only a single consumer: dl.acm.org/doi/10.1145/248052.

Throughput curve of my latest stress test of swad (with ramp-up and ramp-down phase). Response times in percentiles: 97% stay below 10ms!
Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-05

Getting somewhat closer to releasing a new version of #swad. I now improved the functionality for executing something on a different worker thread: use an in-memory queue, with a #lockfree version provided. This gives me a consistent, reliable throughput of 3000 requests/s (with outliers up to 4500 req/s) at an average response time of 350 - 400 ms (with TLS enabled). For waking up worker threads, I implemented different backends as well: kqueue, eventfd and event ports; the fallback is still a self-pipe.

So, #portability here really means implementing lots of different flavors of the same thing.
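Two of those flavors side by side, as a simplified sketch (hypothetical names, error handling mostly omitted): an eventfd where available, a self-pipe as the portable fallback.

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#ifdef HAVE_EVENTFD
#  include <sys/eventfd.h>
#endif

typedef struct Wakeup {
    int readfd;     /* the event loop polls this for readability */
    int writefd;    /* any thread writes here to wake the loop up */
} Wakeup;

int wakeup_init(Wakeup *w)
{
#ifdef HAVE_EVENTFD
    int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (fd < 0) return -1;
    w->readfd = w->writefd = fd;      /* one fd does both jobs */
#else
    int p[2];
    if (pipe(p) < 0) return -1;
    fcntl(p[0], F_SETFL, O_NONBLOCK); /* so draining never blocks */
    w->readfd = p[0];
    w->writefd = p[1];
#endif
    return 0;
}

void wakeup_post(Wakeup *w)           /* called from any thread */
{
    uint64_t one = 1;
    write(w->writefd, &one, sizeof one);
}

void wakeup_drain(Wakeup *w)          /* called by the woken event loop */
{
    uint64_t buf;
    while (read(w->readfd, &buf, sizeof buf) > 0) continue;
}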

Looking at these startup logs, you can see that #kqueue (#FreeBSD and other BSDs) is really a "jack of all trades", being used for "everything" if available (and that's pretty awesome, it means one single #syscall per event loop iteration in the generic case). #illumos' (#Solaris) #eventports come somewhat close (but need a lot more syscalls, as there's no "batch registering" and certain event types need to be re-registered every time they fire); they just can't do signals, but illumos offers a Linux-compatible signalfd. Looking at #Linux, there's a "special case fd" for everything. 🙈 Plus, #epoll also needs one syscall for each event to be registered. The "generic #POSIX" case without any of these interfaces is just added for completeness 😆

swad startup on generic POSIX, on Linux, on illumos, and on FreeBSD
Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-06-04

I now added a #lockfree version of that MPMC job queue which is picked when the system headers claim that pointers are lockfree. Doesn't give any measurable performance gain 😞. Of course the #semaphore needs to stay; the pool threads need something to wait on. But I think the reason I can't get more than 3000 requests per second with my #jmeter stress test for #swad is that the machine's CPU is now completely busy 🙈.
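The "pointers are lockfree" check is presumably something along these lines (a sketch of the standard C11 mechanism, not necessarily the exact condition used in poser):

#include <stdatomic.h>

/* ATOMIC_POINTER_LOCK_FREE == 2 means atomic pointers are *always*
 * lock-free on this implementation; only then compile in the lock-free
 * queue, otherwise fall back to the mutex-based variant. */
#if defined(ATOMIC_POINTER_LOCK_FREE) && ATOMIC_POINTER_LOCK_FREE == 2
#  define WITH_LOCKFREE_QUEUE 1
#else
#  define WITH_LOCKFREE_QUEUE 0
#endif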

Need to look into actually saving CPU cycles for further optimizations I guess...

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-05-30

Finally getting somewhere working on the next evolution step for #swad. I have a first version that (normally 🙈) doesn't crash quickly (so, no release yet, but it's available on the master branch).

The good news: It's indeed an improvement to have *multiple* parallel #reactor (event-loop) threads. It now handles 3000 requests per second on the same hardware, with overall good response times and without any errors. I uploaded the results of the stress test here:

zirias.github.io/swad/stress/

The bad news ... well, there are multiple.

1. It got even more memory hungry. The new stress test still simulates 1000 distinct clients (trying to do more fails on my machine as #jmeter can't create new threads any more...), but with delays reduced to 1/3 and doing 100 iterations each. This now leaves it with a resident set of almost 270 MiB ... tuning #jemalloc on #FreeBSD to return memory more promptly reduces this to 187 MiB (which is still a lot) and reduces performance a bit (some requests run into 429, overall response times are worse). I have no idea yet where to start trying to improve *this*.

2. It requires tuning to manage that load without errors, mainly using more threads for the thread pool, although *these* threads stay almost idle ... which probably means I have to find ways to make putting work on and off these threads more efficient. At least I have some ideas.

3. I've seen a crash which only happened once so far, no idea as of now how to reproduce. *sigh*. Massively parallel code in C really is a PITA.

Seems the more I improve here, the more I find that *should* also be improved. 🤪

#C #coding #performance

Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-05-29

Midway through implementing a "multi-threaded #reactor" (multiple event loops running in parallel) in #poser for usage in #swad, and it's going as expected: debugging heavily multi-threaded stuff is really a PITA. 🤯

Somehow, data is currently corrupted on finished thread jobs...
Felix Palmen :freebsd: :c64:zirias@bsd.cafe
2025-05-27

I just fixed a "horrible" bug in #swad:

github.com/Zirias/poser/commit

In case you don't understand C: this potentially dereferenced "wild" and null pointers before the (copy-and-pasted 🙈) typo was fixed, which means "undefined behavior", so it might do surprising things, but more likely crash.

It affects the #epoll (on #Linux) and #eventports (on #Solaris / #illumos) backends. A quick smoke test on these platforms was done for swad 0.11 and didn't show any unexpected behavior. Only after preparing for the next release (which hopefully has multiple parallel event loops) by moving some static service data to thread-local storage did it suddenly fail on illumos; that's how I tracked down that embarrassing crap. 😞

I hope to complete a new version soon enough, so I don't have to do a "bugfix release" for it.
