J. Kalyana Sundaram

Love to demystify topics related to Reliability, Perf, and Observability of cloud systems.

Principal Software Engineer in Azure @ Microsoft, working on building Observability platforms. Previous contributions include Azure Event Grid, Windows 10 Search, BizTalk RFID Server, Windows Update, etc.

I also serve as the co-chair for the W3C Distributed Tracing working group, where we work on the TraceContext and Baggage specifications.

Love hiking, singing, board games, travel.

All opinions my own.

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2023-04-22

Used an app to measure the sound level of my electric razor: it is 80+ dB. I had always felt that it was very loud but hadn't done anything about it. Now, I have started using it with regular earplugs and it is much better!

The blender in my kitchen is likely 90+ dB - planning to use earplugs for it.

I realized only recently that the decibel scale is logarithmic. So, compared to 60 dB, 80 dB is 100 times more intense and roughly 4 times louder (perceived loudness approximately doubles for every 10 dB increase).
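
A quick back-of-the-envelope in Python, using the standard rules of thumb (intensity grows 10x per 10 dB; perceived loudness roughly doubles per 10 dB):

```python
def intensity_ratio(db_delta):
    # Sound intensity grows by a factor of 10 for every 10 dB.
    return 10 ** (db_delta / 10)

def perceived_loudness_ratio(db_delta):
    # Perceived loudness roughly doubles for every 10 dB (a common approximation).
    return 2 ** (db_delta / 10)

print(intensity_ratio(20))           # 80 dB vs 60 dB -> 100.0
print(perceived_loudness_ratio(20))  # -> 4.0
```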

Musical instruments that my kids play are at 90 - 95 dB - dangerous sound levels. Ordered some "music earplugs" that won't muffle the sound...

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2023-03-15

The good news is that this SDK was instrumented using OpenTelemetry. So, it emitted spans for various key operations.

Due to the extensible nature of OpenTelemetry, all the spans (whether they be spans from this SDK, or spans emitted by our service) can be exported to a backend of choice, where the trace can be reconstructed.
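
As a rough illustration (not our exact setup), wiring a Python service's OpenTelemetry SDK so that every span in the process flows to one backend looks roughly like this; the console exporter below is a stand-in for whatever backend you choose:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# One provider for the whole process: spans created by our own code AND by
# any OpenTelemetry-instrumented SDK we call flow through the same pipeline.
provider = TracerProvider()

# Swap ConsoleSpanExporter for an OTLP (or vendor-specific) exporter to send
# spans to the backend of your choice, where the trace is reconstructed.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```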

By viewing such a representative trace, we were able to pinpoint the exact sub-operation within the SDK that was causing the delay.

As more such libraries instrument using OpenTelemetry, Distributed Tracing can improve Observability even WITHIN the context of a single application/service, and save several hours of back-and-forth investigations...

This is the beauty of OpenTelemetry's mission "to enable effective observability by making high-quality, portable telemetry ubiquitous."

(Thread 3/3)

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2023-03-15

Typically this would involve lengthy back-and-forth interactions between the owners of the service, the owners of the SDK, the owners of the various services that the SDK is calling, etc.

Adding more logs or metrics to *our* service wouldn't have helped, of course.

So, how did DT help here?

(Thread 2/3)

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2023-03-15

Many people think of "Distributed Tracing" as being valuable only when MANY apps/services in a system participate in it.

But even if you enable it for just a SINGLE service in your system, it can save you hours...

Here is WHY (based on a real issue):

We have one service that uses an open-source SDK to achieve a part of its functionality.

One day, we saw an issue where the calls made to this SDK were taking several seconds to complete.

Our service itself is instrumented using #OpenTelemetry Tracing APIs. But how do we diagnose this problem happening in the library it uses?
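
To make the mechanics concrete, here is a toy, stdlib-only model (not the real OpenTelemetry API) of the key idea: each span records its parent, which is what lets a backend reconstruct the trace tree, including an SDK's internal sub-operations (all names below are made up):

```python
import contextvars
import time

# The "current span" travels implicitly with the flow of control,
# so nested spans can discover their parent automatically.
_current = contextvars.ContextVar("current_span", default=None)
finished = []

class Span:
    def __init__(self, name):
        self.name = name
        self.parent = _current.get()  # link to the enclosing span, if any

    def __enter__(self):
        self._token = _current.set(self)
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.duration = time.monotonic() - self.start
        _current.reset(self._token)
        finished.append(self)

# Our service wraps the SDK call in its own span...
with Span("handle_request"):
    # ...and an instrumented SDK emits child spans for its sub-operations,
    # so the slow one can be pinpointed in the reconstructed trace.
    with Span("sdk.do_work"):
        with Span("sdk.slow_suboperation"):
            pass

for s in finished:
    print(s.name, "parent:", s.parent.name if s.parent else None)
```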

(Thread 1/3)

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2023-02-19

A short hike at Seward Park, Seattle…

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2023-02-13

@adrianco Thanks for your feedback, and thanks for sharing the additional context.

Yes, that sounds similar to the hedged requests approach, will check out the Ribbon client-side load balancer!

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2023-02-12

Earlier this week, I read "The Tail at Scale" paper by Jeffrey Dean and Luiz André Barroso.

I really liked the intuitive techniques described in it.

I wrote the blog post below to try to draw an analogy with a physical-world example, and to summarize my main takeaways on:

- What is tail latency
- Why should we care about it
- Why reducing component level variability is not sufficient
- Two classes of patterns to become tail-tolerant

blog.techlanika.com/reducing-t
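
One of the tail-tolerant techniques from the paper, hedged requests, can be sketched in a few lines: issue a backup copy of a request if the first hasn't returned within a small delay, and take whichever reply arrives first. A minimal Python sketch (the function names and thresholds are illustrative, not from the paper):

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

pool = ThreadPoolExecutor(max_workers=4)

def hedged_call(fn, hedge_after=0.05):
    # Send the primary request.
    first = pool.submit(fn)
    done, _ = wait([first], timeout=hedge_after)
    if done:
        return first.result()  # fast path: no hedge needed
    # Primary is slow: issue a backup request and take whichever wins.
    second = pool.submit(fn)
    done, _ = wait([first, second], return_when=FIRST_COMPLETED)
    return done.pop().result()

def flaky_backend():
    # Mostly fast, occasionally stuck in the latency tail.
    time.sleep(random.choice([0.001, 0.001, 0.001, 0.5]))
    return "ok"

print(hedged_call(flaky_backend))
```

A common refinement (also discussed in the paper) is to hedge only after the ~95th-percentile latency, which caps the extra load at a few percent of requests.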

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-29

The recent problems with Southwest Airlines are a good example of a Metastable failure at scale in the physical world:

TRIGGERS: Capacity-reducing triggers (reduced staff capacity due to sickness; snowstorms in Denver, Chicago, and across the rest of the country).

AMPLIFICATION: Capacity degradation amplification caused by a combination of factors such as:
- The point-to-point business model meant the crews were not in the right places.
- The scheduling software broke down, resulting in manual matching of flights to crews (can’t even imagine how tedious this would have been… kudos to the manual schedulers).
- Crews were not able to communicate with the airline (!) due to phone systems being down, likely a metastable failure of the phone system itself, caused by overload from customers trying to reach the airline for rescheduling.

So, even if a flight was matched to a crew, the crew might not have been aware of that assignment! As a result, even as “system capacity” (airports, flights, crews) started becoming available, it couldn’t be used effectively…

MITIGATION: As with many metastable failures, the mitigation was load shedding: they temporarily reduced the number of flights to 1/3rd of the usual number…

Looks like the airline was running the system in an extremely vulnerable state (optimizing for quick turnarounds to improve efficiency, and packing the schedule without any headroom to absorb overloads caused by capacity degradation).

Hope they do a thorough incident analysis using the metastable failure framework and make improvements…

References:

cnn.com/2022/12/27/business/so

cnn.com/2022/12/29/business/so

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-23

@dan Thanks for taking the time to share the above great information, including the various links. Will check it out!

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-23

Just published this blog post on the beauty of the 4 positive feedback loops that writing brings to the table:

blog.techlanika.com/4-reasons-

Feedback/comments welcome.

Happy Holidays!

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-22

To read an industry/research paper, understand/internalize it reasonably well, and, more importantly, to better *retain* the learnings from it (to be able to apply them later), two forcing functions I have found useful are:

- Presenting/talking about it in a forum; committing to doing this creates an accountability/deadline to get it done.

(and/or)

- Writing/blogging about it in my own words.

[Edit: And of course, the above should be done with the goal that attendees or readers get good value out of it (since they are "paying" for it with their time)].

Along those lines, I had the opportunity to share my learnings from the TAOBench paper (vldb.org/pvldb/vol15/p1965-che) last month in the "Distributed Systems Reading Group" (thanks for the opportunity!) that's organized by Aleksey Charapko (charap.co/).

In case you find it useful, here's a link to my presentation: youtube.com/watch?v=PClXmtEetg.

The same DistSys Reading Group channel also has recordings of a bunch of other presentations.

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-22

@dan Congrats!

Over time, as users here follow more people from other instances (and vice versa), I am assuming the capacity needs will go up (particularly for "hot" accounts)...

Hence, curious to learn how things are from a scalability perspective - e.g., which components can be horizontally scaled, and which ones cannot be?

J. Kalyana Sundaram boosted:
Murat Demirbas (Distributolog) (muratdemirbas@fediscience.org)
2022-12-22
J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-22

Learning a bit about the Fediverse and the W3C ActivityPub protocol, and came across this interesting article that makes the case for design decisions/protocols to incentivize decentralization:

ar.al/2022/11/09/is-the-fedive

J. Kalyana Sundaram boosted:
2022-12-22

Time to upgrade my skills! A nice CACM article on "The End of Programming" by Matt Welsh: cacm.acm.org/magazines/2023/1/ 🙂

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-21

@ricci Thanks for the paper recommendation! Yes, totally agree it is better to understand how things work behind the scenes.

I will check out this paper (thanks @rfonseca and co-authors)!

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-21

(Thread) The other paper I found very useful was Keeping CALM: When Distributed Consistency is Easy (arxiv.org/pdf/1901.01930.pdf).

It helped me learn about coordination cost, how to think about it, and patterns we can use to build efficient distributed systems.

I blogged about my learnings from this paper here: blog.techlanika.com/avoiding-c

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-21

(Thread) I liked "Metastable Failures in the Wild" (usenix.org/system/files/osdi22) quite a lot.

I found it useful to apply in a variety of real-world situations.

It helps understand the balance between capacity and load, the degrees of vulnerability of a system, types of triggers, and types of amplification mechanisms.

It also led me to learn about various patterns to design systems (e.g., the constant work pattern) that reduce the probability of getting into a metastable state.
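
As a tiny illustrative sketch of the constant work pattern (all names here are made up): instead of processing a backlog of deltas whose size grows with load (and can explode during overload), the system re-applies its full desired state on every tick, so each tick costs the same no matter how much churn occurred:

```python
# Desired configuration, e.g. periodically published by a control plane.
desired_state = {"a": 1, "b": 2}
actual_state = {}

def reconcile():
    # Fixed-size pass over the full desired state every tick: the cost is
    # constant whether 0 or 1000 entries changed since the last tick, which
    # avoids the work amplification that feeds metastable failures.
    for key, value in desired_state.items():
        actual_state[key] = value
    for key in list(actual_state):
        if key not in desired_state:
            del actual_state[key]

reconcile()
print(actual_state)
```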

I blogged about my observations from the V1 of this paper:
blog.techlanika.com/metastable

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-21

(Thread) Which was the best systems paper you read in 2022 and would recommend to others?

It could be something where:

- you learnt something new (or)
- found it useful to apply in many situations (or)
- found it fascinating
- etc.

I will start below with a couple of papers I liked:

J. Kalyana Sundaram (kalyanaj@discuss.systems)
2022-12-20

Thanks Carlos!
