Observability: X-Ray Vision for Distributed Systems
Why logs are not enough. A deep technical guide to OpenTelemetry, Structured Logging, and SRE Alerting philosophy (SLI/SLO).
“The site is slow.”
This four-word ticket is the bane of every engineering team. Slow where? Is the database CPU-bound? Is the mobile network flaky? Is the third-party shipping API timing out? Did we deploy bad code yesterday?
If your debugging strategy relies on console.log and checking server CPU graphs, you are debugging in the dark.
Monitoring tells you the system is broken (Red Dashboard).
Observability tells you why it broke.
At Maison Code Paris, we operate high-traffic commerce platforms where 1 minute of downtime equals tens of thousands of euros. We do not guess. We use code instrumentation to gain X-Ray vision into the distributed stack.
Why Maison Code Discusses This
At Maison Code Paris, we act as the architectural conscience for our clients. We often inherit “modern” stacks that were built without a foundational understanding of scale. We see simple APIs that take 4 seconds to respond because of N+1 query problems, and “Microservices” that cost $5,000/month in idle cloud fees.
We discuss this topic because it represents a critical pivot point in engineering maturity. Implementing this correctly differentiates a fragile MVP from a resilient, enterprise-grade platform that can handle Black Friday traffic without breaking a sweat.
The Three Pillars of Observability
To understand a complex system, you need three distinct types of telemetry data:
- Logs: Discrete events. “Something happened at 10:00:01.”
- Metrics: Aggregated numbers. “CPU was 80% for 5 minutes.”
- Traces: The lifecycle of a request. “User clicked button -> API -> DB -> API -> UI.”
1. Logs: Structured & Searchable
Developers love console.log("Error user not found " + id).
DevOps engineers hate it.
In a system producing 1,000 logs per second, searching for “Error user” is slow. Analyzing it is impossible. You cannot graph “Error rate per user ID.”
The Golden Rule: Logs must be Structured JSON.
{
"level": "error",
"message": "Payment Gateway Timeout",
"timestamp": "2025-10-23T14:00:00Z",
"service": "checkout-api",
"context": {
"userId": "u_123",
"cartId": "c_456",
"gateway": "stripe",
"latency": 5002
},
"traceId": "a1b2c3d4e5f6"
}
Now, in tools like Datadog or ELK (Elasticsearch), you can run queries:
service:checkout-api @context.gateway:stripe @level:error
"Show me a graph of payment errors grouped by Gateway."
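Emitting logs in this shape does not require hand-rolling JSON. A minimal sketch with pino (one structured logger among many for Node.js; the field values simply mirror the example above):
// logger.ts
import pino from 'pino';

const logger = pino({
  level: 'info',
  base: { service: 'checkout-api' }, // attached to every log line
});

logger.error(
  {
    context: { userId: 'u_123', cartId: 'c_456', gateway: 'stripe', latency: 5002 },
    traceId: 'a1b2c3d4e5f6',
  },
  'Payment Gateway Timeout',
);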
2. Metrics: The Pulse
Metrics are cheap to store. They are just numbers.
- Counters: “Total Requests” (Goes up).
- Gauges: “Memory Usage” (Goes up and down).
- Histograms: “Latency Distribution” (95% of requests < 200ms).
The Cardinality Trap:
Junior engineers often try to tag metrics with high-cardinality data.
http_request_duration_seconds{user_id="u_123"}.
If you have 1 million users, you create 1 million time-series. This will crash your Prometheus server or bankrupt your Datadog account.
Rule: Metrics are for System Health. Logs/Traces are for User Specifics.
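In practice, this means recording latency against a bounded set of labels. A sketch using the OpenTelemetry metrics API (assuming a MeterProvider has been registered, e.g. via the NodeSDK setup shown later):
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('checkout-api');

// One histogram, labelled with the route template and status code only
const requestDuration = meter.createHistogram('http_request_duration_seconds', {
  unit: 's',
  description: 'HTTP request latency',
});

requestDuration.record(0.183, {
  'http.route': '/checkout',   // low cardinality: the route template, not the raw URL
  'http.status_code': 200,
  // never: user_id, session_id, or a URL containing query strings
});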
3. Tracing: The Context
In a Microservices or Headless architecture, a single user click hits: CDN -> Edge Function -> Node API -> Database -> 3rd Party API. If the request takes 3 seconds, a Metric says “Latency High”. A Trace shows you:
- Node API: 10ms
- Database: 5ms
- Shipping API: 2900ms
Tracing solves the “Blame Game”.
OpenTelemetry: The Industry Standard
Historically, you had to install the Datadog SDK, or the New Relic SDK. If you wanted to switch vendors, you had to rewrite code.
OpenTelemetry (OTel) is the open standard (a CNCF project) for collecting telemetry.
You instrument your code once with OTel. You can then pipe that data to Datadog, Prometheus, Jaeger, or just stdout.
Node.js Auto-Instrumentation
Modern OTel allows “Auto-Instrumentation”. You don’t rewrite your Express handlers. It patches standard libraries (http, pg, redis) at runtime.
// instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({
url: 'https://api.honeycomb.io/v1/traces', // Or your collector
}),
instrumentations: [
new HttpInstrumentation(), // Traces all incoming/outgoing HTTP
new PgInstrumentation(), // Traces all Postgres Queries
],
serviceName: 'maison-code-api',
});
sdk.start();
With this small file, you instantly get waterfall charts for every database query and API call in your production system. Load it before any application code (for example by requiring the compiled file with node --require ./instrumentation.js, or importing it first in your entry point) so the patches are applied before http and pg are loaded.
Trace Context Propagation
How does the Database know which “User Click” caused the query?
Context Propagation.
When Service A calls Service B, it injects a standard HTTP header defined by the W3C Trace Context specification: traceparent.
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Service B reads this header. It starts its own span, using the ID from A as the “Parent ID”. This links the distributed operations into a single DAG (Directed Acyclic Graph).
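Auto-instrumentation handles this hand-off for you, but the mechanics on the receiving side look roughly like this (a sketch with the OpenTelemetry API; handleRequest is a hypothetical handler):
import { context, propagation, trace } from '@opentelemetry/api';
import type { IncomingMessage } from 'http';

// Service B: continue the trace that Service A started
function handleRequest(req: IncomingMessage) {
  // Parse the incoming traceparent header into an OTel context
  const parentCtx = propagation.extract(context.active(), req.headers);

  const tracer = trace.getTracer('service-b');
  tracer.startActiveSpan('process-order', {}, parentCtx, (span) => {
    try {
      // ...business logic; any child spans created here share the same traceId
    } finally {
      span.end();
    }
  });
}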
RUM (Real User Monitoring)
Server-side observability works perfectly, yet users still complain. “I clicked the button and nothing happened.” Server Logs: “No request received.” Conclusion: The JavaScript crashed in the browser before sending the request.
For frontend observability, we use RUM (Sentry, LogRocket). It automatically captures:
- Core Web Vitals: Largest Contentful Paint (LCP), Interaction to Next Paint (INP).
- Console Errors: undefined is not a function.
- Session Replay: A video-like recording of the user's mouse movements.
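Wiring this up is usually a few lines in the front-end bundle. A minimal sketch with Sentry's browser SDK (the DSN and sampling rates are illustrative):
import * as Sentry from '@sentry/browser';

Sentry.init({
  dsn: 'https://examplePublicKey@o0.ingest.sentry.io/0', // your project DSN
  integrations: [
    Sentry.browserTracingIntegration(), // Core Web Vitals + front-end traces
    Sentry.replayIntegration(),         // Session Replay
  ],
  tracesSampleRate: 0.1,          // trace 10% of page loads
  replaysSessionSampleRate: 0.01, // record 1% of ordinary sessions
  replaysOnErrorSampleRate: 1.0,  // always keep the replay when an error occurs
});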
Case Study: We had a checkout bug with 0 logs.
Watching Session Replay, we saw that users on the iPhone SE (small screen) couldn't scroll to the "Pay" button because a cookie banner with z-index: 9999 was covering it.
No log could have told us that. Visual Observability did.
Alerting Philosophy: SRE Principles
Don’t page the engineer at 3 AM because “CPU is 90%”. Maybe 90% is fine. Maybe the queue is processing fast. Alert on symptoms, not causes.
SLI (Service Level Indicator): A metric that matters to the user.
- Example: “Homepage Availability”.
- Example: “Checkout Latency”.
SLO (Service Level Objective): The goal.
- “Homepage should be 200 OK 99.9% of the time.”
- “Checkout should be under 2s 95% of the time.”
Alert Condition:
- “If Error Rate > 0.1% for 5 minutes -> PagerDuty.”
- “If Error Rate > 5% -> Wake up the CEO.”
The Cost of Observability
Observing everything is expensive. If you trace 100% of requests, your monitoring bill might exceed your hosting bill. Sampling is the solution.
- Head-Based Sampling: Decide at the start of a request. "Trace 10% of requests." (See the sketch after this list.)
- Tail-Based Sampling (Advanced): Buffer each trace in memory. If an error occurs, keep the full trace; if it succeeds, discard it.
- This ensures you capture 100% of errors while dropping most routine, successful traces.
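Head-based sampling can be configured directly in the SDK. A sketch, reusing the NodeSDK setup from earlier (the 10% ratio is illustrative):
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Root spans keep ~10% of traces; child spans follow their parent's decision
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...exporter and instrumentations as in the earlier example
});

sdk.start();
Tail-based sampling, by contrast, usually lives in an OpenTelemetry Collector sitting between your services and your vendor, because it needs to see the whole trace before it can decide.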
10. Business Logic Monitoring (The Golden Signals)
CPU graphs don’t pay the bills. Orders pay the bills. Access logs don’t tell you if the “Add to Cart” button is broken (if it’s broken, there are no logs). We define Business Metrics:
- orders_per_minute
- add_to_cart_value_total
- payment_gateway_rejection_rate
We set "Anomaly Detection" alerts: "If Orders/Min drops by 50% vs last Tuesday -> PagerDuty." This catches "Silent Failures" (e.g., a broken JavaScript event listener) that server logs will never see.
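Emitting these uses the same OpenTelemetry metrics API as before: increment a counter where the business event actually happens, and derive the per-minute rate in the dashboard (names and values below are illustrative):
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('checkout-api');
const orders = meter.createCounter('orders_total', { description: 'Completed orders' });
const cartValue = meter.createCounter('add_to_cart_value_total', { unit: 'EUR' });

// In the order-confirmation handler:
orders.add(1, { payment_gateway: 'stripe' });        // low-cardinality label only
cartValue.add(129.9, { payment_gateway: 'stripe' });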
11. Cost Control Strategies
Datadog is expensive.
Storing every console.log is wasteful.
We implement Log Levels by Environment:
- Development: Debug (Everything).
- Production: Info (Key Events) + Error.
We also use Exclusion Filters at the agent level: "Drop all logs containing 'Health Check OK'." This reduces volume by 40% without losing signal.
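With pino (or any leveled logger), that split is a one-line decision at startup; a sketch assuming NODE_ENV distinguishes the environments:
import pino from 'pino';

// Noisy in development, lean in production
const logger = pino({
  level: process.env.NODE_ENV === 'production' ? 'info' : 'debug',
});

logger.debug({ cart: { items: 3 } }, 'Cart recalculated'); // dropped in production
logger.info({ orderId: 'o_789' }, 'Order placed');         // kept everywhere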
12. The Cardinality Explosion (How to bankrupt observability)
It starts innocently enough.
metrics.increment('request_duration', { url: req.url })
Then a bot hits /login?random=123.
Then /login?random=456.
Suddenly, you have 1 million unique tag values.
Datadog charges you for “Custom Metrics”.
Bill: $15,000 / month.
Fix: Normalize your dimensions.
Replace /login?random=123 with /login?random=* in your middleware before emitting the metric.
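A sketch of that middleware-level normalization for an Express app (the helper name is hypothetical; adapt it to your router):
import type { Request } from 'express';

// Collapse raw URLs into a bounded set of route templates before tagging metrics
function metricRoute(req: Request): string {
  // Prefer the matched route template, e.g. '/users/:id'
  if (req.route?.path) return req.route.path;
  // Otherwise drop the query string: '/login?random=123' becomes '/login'
  return req.path;
}

// metrics.increment('request_duration', { url: metricRoute(req) });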
13. Logging ROI: Is this log worth $0.001?
We treat logs like real estate:
- "User logged in" -> High Value. Keep for 30 days.
- "Health check passed" -> Low Value. Drop immediately.
- "Debug array dump" -> Negative Value. It clutters the search.
We implement Dynamic Sampling: during an incident, turn DEBUG on for 1% of users; during peacetime, keep it off.
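A sketch of that dynamic sampling, assuming a per-user hash and an incident flag (both names are hypothetical):
import pino from 'pino';

// ~1% of users get DEBUG logs; an incident flag widens the net instantly
function logLevelFor(userId: string): 'debug' | 'info' {
  if (process.env.INCIDENT_MODE === 'on') return 'debug';
  const bucket = userId.split('').reduce((acc, c) => acc + c.charCodeAt(0), 0) % 100;
  return bucket < 1 ? 'debug' : 'info';
}

const logger = pino({ level: logLevelFor('u_123') });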
14. Conclusion
Observability is the difference between “I think it works” and “I know it works.” It shifts the culture from “Blame” (“The network team failed”) to “Data” (“The trace shows a 500ms timeout on the switch”).
In a modern, distributed, high-stakes environment, code without instrumentation is barely code at all. It’s a black box waiting to confuse you.
If you are debugging with console.log, you are losing money.