Distributed Tracing in Production: What We Learned Instrumenting Our .NET Services

There’s a specific kind of production incident I used to dread: the slow request that only happens under load, that doesn’t throw an exception, that isn’t obviously caused by any single service. It just takes too long. And you have no idea why.

Distributed tracing is the answer to that dread. Here’s what we actually learned rolling it out across our .NET services with OpenTelemetry and Tempo.

Starting with OpenTelemetry

We chose OpenTelemetry because it’s vendor-neutral. The instrumentation code is the same whether traces end up in Tempo, Jaeger, or a commercial platform, so we’re not locked in. For ASP.NET Core services, auto-instrumentation covers incoming HTTP requests, database calls via SqlClient or Entity Framework, and outgoing HTTP client calls once the matching instrumentation packages are registered, with no changes to application code. You get a surprising amount of signal just from the base setup.
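As a rough sketch of what that base setup can look like in a minimal-API Program.cs (not necessarily our exact configuration): the service name "patient-api" and the OTLP endpoint are placeholders, and the snippet assumes the OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore, OpenTelemetry.Instrumentation.Http, OpenTelemetry.Instrumentation.SqlClient and OpenTelemetry.Exporter.OpenTelemetryProtocol NuGet packages are referenced.

```csharp
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // "patient-api" is a hypothetical service name; it is what shows up in the trace UI.
    .ConfigureResource(resource => resource.AddService("patient-api"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // incoming HTTP requests
        .AddHttpClientInstrumentation()   // outgoing HttpClient calls
        .AddSqlClientInstrumentation()    // SqlClient database calls
        // Assumed endpoint: an OTLP gRPC receiver (Tempo or a collector) on localhost.
        .AddOtlpExporter(otlp => otlp.Endpoint = new Uri("http://localhost:4317")));

var app = builder.Build();
app.MapGet("/health", () => "ok");
app.Run();
```

Swapping the backend means changing only the exporter line, which is the vendor-neutrality argument in practice.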

Where Auto-Instrumentation Falls Short

Auto-instrumentation tells you what happened. It doesn’t always tell you why. For that, you need custom spans.

The biggest win we got was adding spans around our domain operations — not just “this HTTP request took 400ms” but “this specific business rule evaluation took 380ms of that 400ms.” Suddenly the trace told a story that matched how we actually think about the system.
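In .NET, a custom span is an Activity started from an ActivitySource. The sketch below wraps a made-up domain operation in one. The source name "Clinic.Domain" and the EvaluateEligibilityRules function are hypothetical, and the ActivityListener is only a stand-in so the example runs on its own. In a real service the OpenTelemetry SDK does the listening once the source is registered with AddSource.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical source name; one ActivitySource per component is a common convention.
var source = new ActivitySource("Clinic.Domain");

// Stand-in listener so spans are actually created without an exporter configured.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Clinic.Domain",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = a =>
        Console.WriteLine($"{a.OperationName} took {a.Duration.TotalMilliseconds:F0} ms")
};
ActivitySource.AddActivityListener(listener);

bool EvaluateEligibilityRules(int patientId)
{
    // StartActivity returns null when nothing is listening, so the span costs
    // almost nothing in that case. Disposing the activity ends the span.
    using var activity = source.StartActivity("EvaluateEligibilityRules");
    Thread.Sleep(20); // placeholder for the real business rule evaluation
    return patientId > 0;
}

var eligible = EvaluateEligibilityRules(42);
```

Because the span is started inside the request, it appears as a child of the auto-instrumented HTTP span, which is what lets the trace show how much of the request the rule evaluation consumed.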

We also learned to add attributes liberally. A span that says “database query” is less useful than one that says “database query, table=patient_records, rows_returned=4823.” The difference in diagnosability is enormous.
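Attributes are set with Activity.SetTag. Here is a minimal sketch using the attribute names from the example above. The source name "Clinic.Data", the LoadPatientRecords function and the in-memory rows are all placeholders for a real query, and the listener exists only to make the snippet runnable without an exporter.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

var source = new ActivitySource("Clinic.Data"); // hypothetical source name
Activity? lastSpan = null;

// Stand-in listener that keeps the most recently finished span for inspection.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Clinic.Data",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = a => lastSpan = a
};
ActivitySource.AddActivityListener(listener);

List<string> LoadPatientRecords(string clinicId)
{
    using var activity = source.StartActivity("LoadPatientRecords");
    activity?.SetTag("table", "patient_records");

    // Placeholder for the real database call.
    var rows = new List<string> { "record-1", "record-2", "record-3" };

    // Record the result size once it is known.
    activity?.SetTag("rows_returned", rows.Count);
    return rows;
}

var records = LoadPatientRecords("clinic-7");
Console.WriteLine(
    $"table={lastSpan?.GetTagItem("table")}, rows_returned={lastSpan?.GetTagItem("rows_returned")}");
```

Counts, identifiers of the thing being processed, and cache hit or miss flags tend to be the attributes that pay off. Keep sensitive values such as patient data out of tags, since they end up in the tracing backend.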

The Grafana + Tempo Stack

We send traces to Grafana Tempo and visualise them in Grafana alongside our Prometheus metrics and Loki logs. The correlation between the three is the real value — you can jump from a spike in a metrics dashboard straight into the traces happening at that moment, then into the logs for a specific trace ID. Once you have that, you stop guessing. The data tells you what happened.
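The jump from a trace to its logs only works if every log line carries the trace ID. The sketch below shows the idea by hand: it reads the ID from Activity.Current and puts it in the log line. FormatLogLine, the "Clinic.Api" source name and the key=value log format are illustrative assumptions. Many logging setups can attach the trace ID automatically, so treat this as an explanation of the mechanism rather than a recommended logger.

```csharp
using System;
using System.Diagnostics;

var source = new ActivitySource("Clinic.Api"); // hypothetical source name

// Stand-in listener so a span (and therefore a trace ID) exists in this sketch.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Clinic.Api",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded
};
ActivitySource.AddActivityListener(listener);

string FormatLogLine(string message)
{
    // Activity.Current is the span active on this async flow, if any.
    var traceId = Activity.Current?.TraceId.ToString() ?? "none";
    return $"trace_id={traceId} msg=\"{message}\"";
}

string line;
using (var activity = source.StartActivity("HandleRequest"))
{
    line = FormatLogLine("loaded patient records");
}
Console.WriteLine(line);
```

With the ID present in each line, the log store can be queried by trace ID, and the trace view can link straight to the matching logs.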

The Lesson

Instrument first. Don’t wait until you have a performance problem to add observability. By then you’re flying blind in a crisis. Add the instrumentation while the code is being written, and you’ll rarely be reduced to debugging production by reading log lines and guessing.
