#Monitorama24

Andrew Rodgers (@acedrew)
2024-06-12

Gathering data from lots of different sources, providing it to the model, then processing the output through post-processing steps? That sounds like an asynchronous workflow. USE TRACING ~ @cartermp lauding the benefits of tracing for deploying and iterating on AI applications

Speaker in front of slide showing a waterfall trace of context gathering in an AI application
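
The workflow in the post above (gather context, call the model, post-process) can be sketched as nested spans. This is a minimal, stdlib-only sketch and not a real tracing library; `span`, `SPANS`, `answer`, and the stand-in retrieval and model steps are all hypothetical names for illustration.

```python
import contextvars
import time
from contextlib import contextmanager

# Hypothetical in-memory span store; a real tracer would export these.
SPANS = []
_current = contextvars.ContextVar("current_span", default=None)


@contextmanager
def span(name):
    """Record one timed span, parented to whichever span is active."""
    record = {"name": name, "parent": _current.get(), "start": time.perf_counter()}
    token = _current.set(name)
    try:
        yield record
    finally:
        _current.reset(token)
        record["duration"] = time.perf_counter() - record["start"]
        SPANS.append(record)  # spans are stored in the order they finish


def answer(question):
    """Hypothetical AI request: each stage gets its own child span."""
    with span("answer"):
        with span("gather_context"):
            docs = [f"doc about {question}"]  # stand-in for retrieval calls
        with span("call_model"):
            raw = f"{question}: {len(docs)} source(s)"  # stand-in for the model call
        with span("post_process"):
            return raw.strip().capitalize()
```

Calling `answer("latency")` leaves four records in `SPANS`, three of them children of `answer`, which is the shape a waterfall view would render.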
Andrew Rodgers (@acedrew)
2024-06-12

“Disintegration is a feature, not a bug, of asynchronous systems, until you introduce telemetry and monitoring” ~ Johannes Tax from @grafana

The speaker on stage
Andrew Rodgers (@acedrew)
2024-06-12

Johannes Tax at @grafana describing the pain of distributed tracing in asynchronous systems in “Disintegrated telemetry: The pains of monitoring asynchronous workflows”

Speaker on stage in front of a slide defining the words in the title. Pain: an unpleasant sensory and emotional experience. Disintegration: the process of losing cohesion or strength. Workflows: asynchronous communication between different services (messaging and eventing).
Andrew Rodgers (@acedrew)
2024-06-12

“Name your metrics, alerts, and dashboards with the language you would use if you were having a conversation about the system at lunch, not the cryptic defaults or UUID hostnames” ~ @danslimmon

Speaker behind a lectern on stage
Andrew Rodgers (@acedrew)
2024-06-12

We’ve all heard the definition of observability, but who is “one”? What is involved in “determining”? ~ @danslimmon presenting “No observability without theory”. You need a theory of the system to build valid inferences.

Speaker at a lectern in front of a screen that says “Theory enables understanding.”
Andrew Rodgers (@acedrew)
2024-06-12

Julia Thoreson at Bloomberg sharing “Incident Management: Lessons from Emergency Services”, breaking down how the lessons learned in emergency services can apply to incident management in technical systems

Speaker on stage in front of screen gesturing
Andrew Rodgers (@acedrew)
2024-06-12

Pete Fritchman’s Takeaways on managing internal services effectively:
Internal Services impact customers
Leverage your observability tools
Talk to your internal customers
*APPLY SRE PRINCIPLES*

Speaker in front of slide, text in post
Andrew Rodgers (@acedrew)
2024-06-12

“New hires are super valuable in your internal customer interviews, they actually expect things to work and aren’t bitter yet” ~ Pete Fritchman

Speaker behind lectern and in front of screen
Andrew Rodgers (@acedrew)
2024-06-12

“Treat your internal tooling outages like the most critical production outages, because they’ll always hit when you’re trying to recover from a critical production outage” ~ Pete Fritchman

Speaker in front of screen
Andrew Rodgers (@acedrew)
2024-06-12

“The shoemaker’s children have no shoes - why SRE teams must help themselves” Pete Fritchman making the case for investing in watching the watchmen, and techniques for accomplishing it.

Speaker in front of screen with problem statement: internal services often lack the full “SRE treatment”
Andrew Rodgers (@acedrew)
2024-06-12

Hash maps of counts work great for small sets, but what happens when you need to count sets larger than memory? You need HyperLogLog or Disjunctive Normal Form (CVM) ~ @phredmoyer

Speaker behind a lectern with screen showing a breakdown of the disjunctive normal form algorithm
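
For the distinct-count problem in the post above, here is a minimal sketch of the CVM estimator (named for its authors Chakraborty, Vinodchandran, and Meel). It is not taken from the talk's slides; `cvm_estimate` and `buffer_size` are hypothetical names, and the retry-on-failure handling is left to the caller.

```python
import random


def cvm_estimate(stream, buffer_size, rng=None):
    """Estimate the number of distinct items in `stream` while storing
    at most `buffer_size` items at a time."""
    rng = rng or random.Random()
    p = 1.0        # current sampling probability
    kept = set()   # sampled items seen so far
    for item in stream:
        # Forget any earlier decision about this item, then re-sample it.
        kept.discard(item)
        if rng.random() < p:
            kept.add(item)
        if len(kept) == buffer_size:
            # Buffer is full: drop each kept item with probability 1/2
            # and halve the sampling probability to match.
            kept = {x for x in kept if rng.random() < 0.5}
            p /= 2
            if len(kept) == buffer_size:
                raise RuntimeError("buffer still full after halving; retry")
    return len(kept) / p
```

When the number of distinct items is below `buffer_size`, `p` never drops below 1 and the result is exact; above that, memory stays bounded and the result becomes an estimate.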
Andrew Rodgers (@acedrew)
2024-06-12

“Use counters to count things” ~ @phredmoyer providing some examples where counting things by processing petabytes of log or trace data is prohibitively expensive, which justifies spending a bit more on dealing with higher cardinality metrics.

Speaker behind lectern with screen behind him
Andrew Rodgers (@acedrew)
2024-06-12

Baggage is bad for your relationships, good for your service graphs. @kalyanaj makes the case for an arbitrary key value metadata store (baggage) to propagate through your services to enable controllability and observability use cases.

Speaker in front of a slide showing how metadata propagates through a dependency graph that enables better context
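
To make the baggage idea in the post above concrete, here is a minimal sketch of carrying key/value metadata across a service hop in a W3C-style `baggage` HTTP header. The function names and the `tenant`/`test_run` keys are hypothetical, and the sketch ignores member properties and size limits.

```python
from urllib.parse import quote, unquote


def inject_baggage(headers, baggage):
    """Write key/value pairs into a `baggage` header on an outgoing request."""
    members = [f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in baggage.items()]
    if members:
        headers["baggage"] = ",".join(members)
    return headers


def extract_baggage(headers):
    """Read the `baggage` header back into a dict, skipping malformed members."""
    baggage = {}
    for member in headers.get("baggage", "").split(","):
        key, sep, value = member.strip().partition("=")
        if sep and key:
            # Drop any ";property" suffix; this sketch does not model properties.
            baggage[unquote(key)] = unquote(value.split(";", 1)[0].strip())
    return baggage


def call_downstream(incoming_headers):
    """Hypothetical service hop: forward incoming baggage on the outgoing request."""
    return inject_baggage({}, extract_baggage(incoming_headers))
```

A service several hops away can then read `extract_baggage(headers).get("test_run")` to route test traffic, or attach `tenant` to its own telemetry.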
Andrew Rodgers (@acedrew)
2024-06-12

“Distributed Context Propagation: How you can use it to Improve Observability, Test in Production, and more...” @kalyanaj explaining the importance of context in interpreting observability data

Speaker behind lectern and in front of screen
Andrew Rodgers (@acedrew)
2024-06-12

“Every team has a different answer for discovering what the dependencies of their services are; some say firewall rules, some look at network flows. Tracing gives us a uniform answer to this” ~ Sudeep Kumar

Speaker making a point
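
As a rough illustration of how traces can give that uniform answer, the sketch below derives caller-to-callee service edges from parent/child span links. The span dictionaries and their field names (`span_id`, `parent_id`, `service`) are assumptions for illustration, not Salesforce's schema, and span IDs are assumed unique across the input.

```python
from collections import Counter


def service_edges(spans):
    """Count caller -> callee service edges from parent/child span links."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for s in spans:
        parent_service = service_of.get(s.get("parent_id"))
        # Only cross-service links are dependencies; skip in-process child spans.
        if parent_service and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges
```

Aggregating these edges over many traces yields a dependency graph that does not rely on firewall rules or network flow data.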
Andrew Rodgers (@acedrew)
2024-06-12

“We have so many microservices, people are always looking for an excuse to create more, and no one knows which ones they’re already dependent on” ~ Sudeep Kumar from Salesforce with “Tracing Service Dependencies at Salesforce”

Speaker behind lectern and in front of screen
Andrew Rodgers (@acedrew)
2024-06-11

“Low cardinality in Prometheus and low cardinality in Clickhouse are vastly different things” - @colind in his talk “Experiments in Backing Prometheus with Clickhouse”

Colin Douch in front of a slide that says “Low cardinality to Prometheus is vastly different to low cardinality to clickhouse”
Andrew Rodgers (@acedrew)
2024-06-11

“If we’re being honest, we all, generally, have a visceral reaction to people trying to get us to adopt new tools” ~ Noa Levi describing how a forced migration and the associated conversations between the engineering and observability teams facilitated adoption without adoption as a stated goal

Noa Levi in front of a screen presenting
Andrew Rodgers (@acedrew)
2024-06-11

Noa Levi presenting “How we tricked engineers into utilizing distributed tracing” on her experience getting tracing adopted at Strava

Noa Levi, speaking in front of a screen
Andrew Rodgers (@acedrew)
2024-06-11

“a little bit of work in reducing latency at the beginning of the data pipeline can provide orders of magnitude lower cost at the query end of the pipeline” ~ @djosephsen

Dave Josephsen making an amazingly interesting point, hand motions required