Logging, Monitoring & Observability

01

Why Observability Matters

Modern backend applications run in distributed environments — multiple services, multiple servers, multiple regions, users spread across the globe. Something will go wrong. The question is not if, but when — and more importantly, how fast can you find out and fix it?

Without logging, monitoring, and observability, when an incident happens you are completely blind. You know something is wrong (users are complaining, conversion dropped), but you have no idea what, where, or why. Every minute of debugging in the dark is revenue lost and user trust eroded.

Important Framing

Observability practices are implemented on a spectrum: no company can honestly say it follows 100% of them. The goal is to continuously improve your coverage, not to reach some mythical perfect state. Don't be intimidated by the sheer number of tools and practices in this space.

These practices also require a collective effort. As a developer, you instrument your code. DevOps / SRE teams configure the infrastructure that collects, stores, and displays that data. Neither side alone is sufficient.

02

The Three Pillars of Observability

Observability theory defines three foundational pillars. A system is only fully observable when all three are in place. Each pillar answers a different question about your system's behaviour.

📋 LOGS

What happened?
Timestamped records of discrete events in your application — a request came in, a database query ran, an error occurred. Think of logs as your backend's diary.

📊 METRICS

What are the patterns?
Numerical measurements aggregated over time — request rate, error rate, latency p99, CPU usage. Metrics tell you trends, not individual events.

🔗 TRACES

Where did it go?
A trace follows a single request through every component it touched — handler → service → DB → external API. Shows component interactions and latency per step.

Logs → Answer

"User 42 tried to log in at 14:32:01 and failed — invalid password."

Metrics → Answer

"In the last 5 minutes, 800 requests/sec, 12% error rate, p99 latency 340ms."

Traces → Answer

"This request spent 240ms in the DB query layer and 80ms calling the email provider — that's where the latency is."

03

Monitoring vs Observability — The Difference

These terms are often used interchangeably but they are distinct concepts at different levels of capability.

Dimension	Monitoring	Observability
Core question	"Is something wrong?"	"What exactly is wrong and why?"
Approach	Pre-defined dashboards and alerts on known metrics	Ad-hoc investigation using logs + metrics + traces together
Scope	Tracks metrics you decided to measure in advance	Can answer questions you didn't know to ask when you set it up
Output	"CPU is at 95%" → alert fires	"CPU is 95% because function X is running an N+1 DB query for user segment Y"
Analogy	Dashboard on a car: speed, fuel, temp	Full diagnostic system: tells you the exact faulty component and cause
What it requires	Metrics collection, alert rules	Logs + Metrics + Traces all correlated and linked

Historical Context

A decade ago, monitoring was the primary form of error detection. You'd set up Nagios alerts on CPU and memory and call it a day. The observability movement emerged because in microservices/distributed systems, knowing that something failed is almost useless without knowing where across dozens of services the failure actually originated.

Key insight: Monitoring tells you there is a problem. Observability tells you exactly what the problem is — the specific function, the specific service, the specific user, the specific query. This is only possible if you've properly implemented all three pillars.

04

How All Three Work Together — The Debugging Workflow

In practice, when an incident occurs, you navigate from coarse-grained signals to fine-grained details. Here's the canonical flow used in production systems:

1. Alert fires (Monitoring)
Grafana / New Relic detects error rate > 80% and fires a Slack webhook. You get a message: "API service error rate spiked at 14:32." This is monitoring doing its job.

2. Go to Metrics dashboard
Open Grafana / New Relic. You see concrete numbers: 840 requests/min, 82% returning 5xx, p99 latency jumped from 120ms to 4.2s at 14:31. You now know the scale of the problem and when it started.

3. Correlate to Logs
From the metrics view, click through to the associated error logs for that 2-minute window. You see hundreds of logs: ERROR: connection pool exhausted after 4000ms. Now you know what is failing.

4. Follow the Trace
Click on one of those error logs. It links to a trace. The trace shows the full request journey: AuthMiddleware (2ms) → UserHandler (5ms) → UserService (3ms) → DB.GetUser (3990ms ← HERE). The DB query is timing out. Root cause found.

5. Fix & Confirm with Metrics
You add a missing index to the query. Within minutes, the Grafana dashboard shows error rate dropping back to 0.2%, latency back to 110ms. Monitoring confirms the fix worked.

The Power of Correlation

Alert → Metric → Log → Trace is only possible when all three pillars are correlated. Tools like Grafana, New Relic, and Datadog link them using a shared trace_id / request_id that flows through every component. This is why setting a request ID in middleware and passing it everywhere is critical.
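The idea of a request ID that is set once and then shows up in every log line can be sketched with the Python standard library alone. This is a minimal illustration, not how Grafana, New Relic, or Datadog implement correlation; the names `request_id_var`, `RequestIdFilter`, `handle_request`, and `load_user` are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the current request's ID; each thread or async task sees its own value.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copies the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("correlation-demo")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def load_user(logger):
    # A deeper layer: it logs without ever being handed the request ID.
    logger.info("loading user from database")


def handle_request(logger, incoming_id=None):
    """Middleware-like wrapper: reuse the caller's ID or mint a new one."""
    request_id = incoming_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("request started")
        load_user(logger)
        logger.info("request completed")
    finally:
        request_id_var.reset(token)  # do not leak the ID into the next request
    return request_id


stream = io.StringIO()
demo_logger = build_logger(stream)
rid = handle_request(demo_logger, incoming_id="req_01J2KXYZ")
lines = stream.getvalue().splitlines()
```

Every line in `lines` carries the same `request_id`, which is what lets a log viewer jump from one error entry to all other entries of the same request.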

05

Logging — Deep Dive

Logging is the practice of recording all important events throughout your application's execution lifecycle. Think of it as a journal your backend keeps — timestamped, detailed, structured entries for every significant thing that happens.

What Should You Log?

A good rule of thumb: anything a developer would want to know when debugging an issue at 2 AM — without access to the live system, without the ability to reproduce the bug, with only the logs to go on.

Business events

User created a to-do. Order was placed. Payment succeeded. Subscription cancelled. These are the events that matter to the business and to auditing.

Security events

Login attempt (success or failure). Password reset. Token revoked. Admin action. These are critical for security audits and intrusion detection.

System events

Server startup / shutdown. Database connection established / lost. Background job started / finished. Config loaded.

Errors & exceptions

Every unhandled error, every caught exception, every failed external API call. Always include: timestamp, user ID, request ID, error type, stack trace.

Performance signals

Slow queries (> 500ms), large payload responses, DB connection pool warnings. These are logs that become metrics over time.
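As a rough sketch of the "slow query" signal above, a small context manager can time a block and log a warning only when it exceeds a threshold. The helper names (`timed`, `fake_clock`) and the injected `clock` parameter are assumptions made for illustration; the injected clock only exists so the example is deterministic.

```python
import io
import logging
import time
from contextlib import contextmanager

SLOW_THRESHOLD_MS = 500  # mirrors the "> 500ms" rule of thumb above


@contextmanager
def timed(logger, operation, threshold_ms=SLOW_THRESHOLD_MS, clock=time.perf_counter):
    """Log a WARNING if the wrapped block takes longer than threshold_ms."""
    start = clock()
    try:
        yield
    finally:
        duration_ms = (clock() - start) * 1000
        if duration_ms > threshold_ms:
            logger.warning("slow operation: %s duration_ms=%.0f", operation, duration_ms)
        else:
            logger.debug("operation ok: %s duration_ms=%.0f", operation, duration_ms)


def fake_clock(values):
    """Returns a clock that yields the given readings in order (for the demo only)."""
    readings = iter(values)
    return lambda: next(readings)


stream = io.StringIO()
logger = logging.getLogger("slow-demo")
logger.handlers.clear()
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.WARNING)

with timed(logger, "SELECT todos", clock=fake_clock([0.0, 0.520])):
    pass  # simulated 520 ms query
with timed(logger, "SELECT user", clock=fake_clock([0.0, 0.012])):
    pass  # simulated 12 ms query

output = stream.getvalue()
```

Only the 520 ms operation reaches the log at WARNING level; the fast one is emitted at DEBUG and filtered out.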

What Metadata to Include in Every Log

A log without context is almost useless. Every log entry should carry enough metadata to answer: who did what, when, where, and with what result?

// A well-structured log entry (JSON, production format)
{
  "timestamp":   "2024-01-15T14:32:01.234Z",
  "level":        "error",
  "message":      "database query failed",
  "service":      "todo-api",
  "environment":  "production",
  "request_id":   "req_01J2KXYZ",    // unique per request
  "trace_id":     "abc123def456",    // links to distributed trace
  "span_id":      "span_789",
  "user_id":      "usr_01J2K",
  "method":       "POST",
  "path":         "/todos",
  "error":        "pq: deadlock detected",
  "duration_ms":  4001,
  "host":         "worker-node-3"
}
JSON Log Entry
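One way to produce entries shaped like the one above is a custom formatter on top of Python's standard `logging` module. This is a sketch under assumptions: the class name `JsonFormatter` is hypothetical, and per-call context is passed through the standard `extra` argument rather than through a dedicated structured-logging library.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Attributes every LogRecord has by default; anything else was passed via `extra`.
_STANDARD_ATTRS = set(vars(logging.LogRecord("", 0, "", 0, "", (), None))) | {
    "message",
    "asctime",
}


class JsonFormatter(logging.Formatter):
    def __init__(self, service, environment):
        super().__init__()
        self.static_fields = {"service": service, "environment": environment}

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="milliseconds"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            **self.static_fields,
        }
        # Copy per-call context (request_id, user_id, ...) supplied through `extra`.
        for key, value in vars(record).items():
            if key not in _STANDARD_ATTRS:
                entry[key] = value
        return json.dumps(entry, default=str)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="todo-api", environment="production"))
logger = logging.getLogger("json-demo")
logger.handlers.clear()
logger.propagate = False
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "database query failed",
    extra={"request_id": "req_01J2KXYZ", "user_id": "usr_01J2K", "duration_ms": 4001},
)
entry = json.loads(stream.getvalue())
```

The static fields (`service`, `environment`) are attached once at setup, while request-scoped fields travel with each call, so no call site has to remember the full metadata list.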

06

Log Levels

Every log entry is assigned a severity level. This lets you filter what you see: in development you want everything, while in production you typically keep INFO and above. Your logging library will let you set a minimum level per environment.

DEBUG

Verbose diagnostic info for development only. Variable values, function entry/exit, loop iterations. Never enable in production — the volume would overwhelm your log pipeline and cost a fortune in storage. Used when troubleshooting a specific issue locally.

INFO

Normal, successful application operations. Server started. User logged in. To-do created. Background job completed. These record the normal flow of business events. This is the default level for production.

WARN

Something unexpected, but not a failure. A user typed the wrong password (not your fault). An API responded slowly but successfully. A deprecated config option is still in use. Something you should investigate eventually but the system is still working.

ERROR

Something failed and needs attention. Database query failed. External API returned 500. Validation failed in a way that should not have happened. A task handler threw an unhandled exception. This is the level you build alerts around.

FATAL

The application cannot continue and is shutting down. Cannot connect to database on startup. Config file missing. Critical dependency unavailable. After logging FATAL, the process terminates. Infrastructure will restart it. Use sparingly — only for truly unrecoverable states.

Level Hierarchy

Levels form a hierarchy: DEBUG < INFO < WARN < ERROR < FATAL. Setting your logger to INFO means it emits INFO, WARN, ERROR, and FATAL — but suppresses DEBUG. Setting to ERROR means only ERROR and FATAL are emitted. This is how you control log volume per environment.
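The threshold behaviour can be checked directly with Python's standard `logging` module. Note the naming difference: Python calls WARN `WARNING` and FATAL `CRITICAL` (`logging.FATAL` is an alias). The helper `emitted_levels` is a hypothetical name used only for this sketch.

```python
import logging

LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def emitted_levels(minimum):
    """Return the level names a logger would emit when set to `minimum`."""
    logger = logging.getLogger(f"level-demo-{minimum}")
    logger.setLevel(getattr(logging, minimum))
    return [name for name in LEVELS if logger.isEnabledFor(getattr(logging, name))]


info_levels = emitted_levels("INFO")    # DEBUG is suppressed
error_levels = emitted_levels("ERROR")  # only ERROR and CRITICAL remain
```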

Environment-Specific Level Configuration (Go)

// logger/logger.go — Environment-aware log level
package logger

import (
    "os"
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

func getLogLevel() zapcore.Level {
    env := os.Getenv("APP_ENV")
    switch env {
    case "production":
        return zapcore.InfoLevel   // production: INFO and above
    case "staging":
        return zapcore.WarnLevel   // staging: WARN and above only
    default:
        return zapcore.DebugLevel  // local dev: everything
    }
}

func New() *zap.Logger {
    env := os.Getenv("APP_ENV")
    level := getLogLevel()

    var cfg zap.Config
    if env == "production" {
        // Production: JSON format — parseable by Loki, ELK, New Relic
        cfg = zap.NewProductionConfig()
        cfg.EncoderConfig.TimeKey = "timestamp"
        cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
    } else {
        // Development: coloured console format — human-readable
        cfg = zap.NewDevelopmentConfig()
        cfg.EncoderConfig.EncodeLevel = zapcore.CapitalColorLevelEncoder
    }

    cfg.Level = zap.NewAtomicLevelAt(level)
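    logger, err := cfg.Build()
    if err != nil {
        // A nil logger would panic on first use, so fail fast at startup instead.
        panic("logger: failed to build: " + err.Error())
    }
    return logger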
}
Go / Zap

07

Structured vs Unstructured Logging

There are two fundamental ways to write a log entry. The choice of which to use depends on your environment.

Property	Unstructured (Plain Text)	Structured (JSON)
Format	`2024-01-15 14:32 ERROR: DB query failed for user 42`	`{"level":"error","user_id":"42","error":"pq: timeout",...}`
Human readability	✓ Easy to read in a terminal	✗ Harder to read without a viewer
Machine parseability	✗ Hard — requires regex to extract user_id, error message, etc.	✓ Native JSON — any tool can parse instantly
Field extraction	Error-prone regex patterns	Direct field access: `log.user_id`
Querying in Loki/ELK	Complex, fragile grep patterns	Simple selector: `{user_id="42"} |= "error"`
Use in	Local development	Staging & Production

What Structured Logs Look Like Side by Side

// UNSTRUCTURED — development console (human-friendly)
14:32:01 DEBUG  Connected to PostgreSQL on localhost:5432
14:32:01 INFO   Started background job worker
14:32:01 INFO   HTTP server listening on :8080
14:32:14 WARN   Slow query detected: 520ms  query=SELECT * FROM todos WHERE user_id=$1
14:32:15 ERROR  Failed to send email  user_id=usr_01J2K  error=connection refused
Console (Dev)

// STRUCTURED — production JSON (machine-friendly)
{"timestamp":"2024-01-15T14:32:14Z","level":"warn","msg":"slow query","duration_ms":520,"query":"SELECT * FROM todos WHERE user_id=$1","service":"todo-api","env":"production"}
{"timestamp":"2024-01-15T14:32:15Z","level":"error","msg":"failed to send email","user_id":"usr_01J2K","error":"connection refused","trace_id":"abc123","span_id":"span_001","service":"todo-api","env":"production"}
JSON (Production)

Why JSON in Production?

Log management systems (Loki, ELK, New Relic, Datadog) ingest thousands of log lines per second. If logs are plain text, these tools must use regex to extract fields like user_id, error, duration_ms, which is slow, error-prone, and fragile. With JSON, every field arrives already named and typed, so a standard parser extracts it with no per-format patterns. Always use JSON in production.
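The fragility of regex extraction is easy to demonstrate. In the sketch below, the pattern `PLAIN_PATTERN` is a hypothetical one written for the console line shown earlier; it works until the field order changes, while the JSON line needs no pattern at all.

```python
import json
import re

plain_line = "14:32:15 ERROR  Failed to send email  user_id=usr_01J2K  error=connection refused"
json_line = (
    '{"timestamp":"2024-01-15T14:32:15Z","level":"error",'
    '"msg":"failed to send email","user_id":"usr_01J2K","error":"connection refused"}'
)

# Plain text: one hand-written pattern per log shape.
PLAIN_PATTERN = re.compile(
    r"^(?P<time>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*?)"
    r"\s+user_id=(?P<user_id>\S+)\s+error=(?P<error>.*)$"
)


def parse_plain(line):
    match = PLAIN_PATTERN.match(line)
    return match.groupdict() if match else None


def parse_json(line):
    return json.loads(line)


plain_fields = parse_plain(plain_line)
json_fields = parse_json(json_line)

# Same information, fields swapped: the regex silently stops matching.
reordered = parse_plain(
    "14:32:15 ERROR  Failed to send email  error=connection refused  user_id=usr_01J2K"
)
```

Key order inside a JSON object does not matter to the parser, so the equivalent reordering on the JSON side would change nothing.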

08

Metrics — Deep Dive

Metrics are numerical measurements of your system's behaviour, aggregated over time. Unlike logs (one entry per event), a metric is a counter or gauge that summarises many events into a single number — e.g. "320 requests in the last 15 seconds". This makes them extremely efficient for dashboards and alerting.

The Four Core Metric Types (Prometheus Model)

Counter

A value that only ever goes up (and resets on restart). Used for: total requests served, total errors, total tasks processed. Query: rate(http_requests_total[5m]) = requests/sec over last 5 min.

Gauge

A value that can go up or down. Used for: current queue depth, current active DB connections, current goroutine count, memory usage right now.

Histogram

Records observations in configurable buckets. Used for: request latency, response size. Lets you compute percentiles (p50, p95, p99) — crucial for SLOs. "95% of requests complete under 200ms."

Summary

Similar to Histogram but calculates percentiles client-side. Less flexible for aggregation across multiple instances. Prefer Histograms in most cases.
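To make the Histogram entry concrete, here is a simplified bucketed histogram with a percentile estimate based on linear interpolation inside the matching bucket. It is loosely modelled on the Prometheus data model but is not the Prometheus implementation; the class and the bucket bounds are assumptions chosen for the example.

```python
import bisect


class Histogram:
    """Bucketed histogram with "less than or equal" upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)  # per bucket, not cumulative
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.total += value
        self.count += 1

    def quantile(self, q):
        """Estimate the q-quantile by interpolating inside the matching bucket."""
        if self.count == 0:
            return float("nan")
        rank = q * self.count
        cumulative = 0
        for index, in_bucket in enumerate(self.bucket_counts):
            if in_bucket > 0 and cumulative + in_bucket >= rank:
                upper = self.upper_bounds[index]
                if upper == float("inf"):
                    # Cannot interpolate into the overflow bucket.
                    return self.upper_bounds[index - 1]
                lower = self.upper_bounds[index - 1] if index > 0 else 0.0
                return lower + (upper - lower) * (rank - cumulative) / in_bucket
            cumulative += in_bucket
        return self.upper_bounds[-2]


latency = Histogram([0.1, 0.2, 0.5, 1.0])  # bucket bounds in seconds
for value, times in [(0.05, 50), (0.15, 45), (0.8, 5)]:
    for _ in range(times):
        latency.observe(value)

p50 = latency.quantile(0.50)
p95 = latency.quantile(0.95)
p99 = latency.quantile(0.99)
```

The estimates are only as precise as the bucket layout: all 45 observations of 0.15s are treated as spread evenly across the 0.1 to 0.2 bucket, which is why bucket boundaries should sit near the latencies your SLOs care about.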

Key Metrics Every Backend Should Expose

Metric	Type	Why It Matters
`http_requests_total`	Counter	Request throughput. Labels: method, path, status_code.
`http_request_duration_seconds`	Histogram	Latency distribution. p99 tells you worst-case user experience.
`http_errors_total`	Counter	Count of 4xx / 5xx responses. Spike = something broken.
`db_query_duration_seconds`	Histogram	Slow DB queries. Spikes indicate missing indexes or N+1 problems.
`db_connections_open`	Gauge	Pool exhaustion risk. Near max = you need more connections or fewer queries.
`task_queue_depth`	Gauge	Backlog in your background task system. Rising = workers need scaling.
`task_processing_duration_seconds`	Histogram	How long background tasks take. Useful for SLA tracking.
`go_goroutines` / `process_resident_memory_bytes`	Gauge	Runtime health. Goroutine leak = this keeps climbing.

Scrape Interval & Delay

Prometheus scrapes your /metrics endpoint at a configured interval (typically 15 seconds). This means there is a built-in delay of ~10–15 seconds between reality and what you see on a dashboard. This is acceptable for most use cases. For near-real-time alerting, use sub-15-second scrape intervals — but be aware this increases load on both your service and Prometheus.
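Because a counter is only sampled at each scrape, a per-second rate has to be derived from successive samples. The sketch below shows the basic arithmetic, including the counter-reset case; it is a simplification (the real `rate()` function also extrapolates to the window edges), and `per_second_rate` is a hypothetical helper.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (unix_seconds, counter_value) tuples ordered by time.
    Any drop in the value is treated as a restart from zero.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # counter reset: it resumed counting from 0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Two scrapes 15 seconds apart: 4800 new requests, i.e. 320 requests/sec.
steady = per_second_rate([(0, 0), (15, 4800)])

# The process restarted between t=30 and t=45, so the counter dropped.
with_restart = per_second_rate([(0, 1000), (15, 1300), (30, 1600), (45, 150), (60, 450)])
```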

09

Distributed Tracing

A trace is the complete record of a single request's journey through your system. Each step of that journey is called a span. Spans are linked by a shared trace_id that is generated when the request first enters your system and propagated through every subsequent call.

Trace vs Span — Definitions

Trace

The entire end-to-end journey of a single request. Identified by a globally unique trace_id. Contains all spans for that request.

Span

A single unit of work within a trace — one function call, one DB query, one HTTP call. Has a start time, duration, parent span ID, and arbitrary attributes (user_id, query, etc.).

Root Span

The first span in a trace. Typically the HTTP handler or the entry point of the request. Has no parent.

Context Propagation

The mechanism by which trace_id and span_id are passed from function to function (via Go's context.Context), and from service to service (via HTTP headers like traceparent).

traceparent header

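W3C standard header with four dash-separated fields: version, trace ID (32 hex characters), parent span ID (16 hex characters), and flags, e.g. traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01. When your service calls another service, it includes this header so the receiving service can join its spans to the same trace.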
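For illustration, the full-length header can be parsed and re-issued for an outgoing call in a few lines. This sketch handles only version 00 and is not a substitute for an OTel propagator; `parse_traceparent` and `child_traceparent` are hypothetical helper names.

```python
import re
import secrets

# version - trace_id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return the header's fields, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent):
    """Build the header for an outgoing call: same trace, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}", new_span_id


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing_header, outgoing_span_id = child_traceparent(incoming)
outgoing = parse_traceparent(outgoing_header)
```

The trace ID is carried over unchanged while the span ID is replaced, which is exactly what lets the receiving service attach its spans to the caller's trace.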

Why Traces Are Different from Logs

A log entry says "this error happened". A trace says "this error happened here, at this point in the request's journey, after these preceding steps, and it caused the total request to take 4.2 seconds instead of 50ms." Traces provide causal context that isolated logs cannot.
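A trace is essentially a tree of spans, and "where is the latency" becomes a question about each span's own time once its children are subtracted. The sketch below uses the durations from the debugging walkthrough earlier; the `Span` class, the span IDs, and the assumption that child spans run sequentially without overlap are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]
    duration_ms: float
    children: List["Span"] = field(default_factory=list)


def build_tree(spans):
    """Link spans to their parents and return the root span."""
    by_id = {span.span_id: span for span in spans}
    root = None
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            by_id[span.parent_id].children.append(span)
    return root


def self_time(span):
    """Time spent in the span itself, excluding its children."""
    return span.duration_ms - sum(child.duration_ms for child in span.children)


def slowest_self_time(root):
    """Walk the tree and return the span with the largest self time."""
    best = root
    stack = [root]
    while stack:
        span = stack.pop()
        if self_time(span) > self_time(best):
            best = span
        stack.extend(span.children)
    return best


spans = [
    Span("GET /users/:id", "a1", None, 4000),
    Span("AuthMiddleware", "b1", "a1", 2),
    Span("UserHandler", "b2", "a1", 3998),
    Span("UserService", "c1", "b2", 3993),
    Span("DB.GetUser", "d1", "c1", 3990),
]
root = build_tree(spans)
culprit = slowest_self_time(root)
handler = root.children[1]
```

Total duration alone would point at the handler (3998 ms); self time shows it spent only 5 ms of its own and that the database call is the real culprit.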

10

Instrumentation & OpenTelemetry

Instrumentation is the act of adding code to your application to measure its behaviour — recording spans, emitting metrics, enriching logs with context. You cannot observe what you haven't instrumented.

OpenTelemetry (OTel) is an open-source, vendor-neutral standard for instrumentation. It provides APIs, SDKs, and an agent (the Collector) for all major languages. The key value: instrument once, send to any backend (Jaeger, Grafana Tempo, New Relic, Datadog, Honeycomb) just by changing configuration — no code changes.

OTel Core Concepts

Tracer

The OTel object that creates spans. Each component (service, library) gets its own named tracer. otel.Tracer("todo-service")

Span

Created by the tracer. Has a name, start/end time, status, and key-value attributes. OTel spans automatically link to parent spans via context.

Context Propagation

OTel automatically injects/extracts traceparent headers in HTTP clients and servers (with the right libraries). Zero manual work for standard HTTP.

OTel Collector

A standalone agent/sidecar that receives telemetry from your app (via OTLP protocol), optionally processes/filters it, and exports to one or more backends. Decouples instrumentation from backend choice.

Auto-instrumentation

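OTel provides instrumentation libraries for popular frameworks (net/http, gin, gRPC, database/sql, redis, etc.). In Go you typically register a middleware or wrap a client or handler once; agent-based runtimes such as Python and Java can often be instrumented with no code changes. Either way, you get traces for DB queries and HTTP calls without hand-writing spans.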
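Conceptually, an instrumentation library wraps a unit of work, times it, and records the outcome. The decorator below is a toy stand-in for that idea and deliberately does not use the OpenTelemetry API; `traced` and `recorded_spans` are hypothetical names, and a real SDK would batch and export spans instead of appending to a list.

```python
import functools
import time

recorded_spans = []  # stand-in for an exporter


def traced(name=None):
    """Decorator that records a span-like dict for every call."""

    def decorator(func):
        span_name = name or func.__name__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise  # instrumentation must never swallow the error
            finally:
                recorded_spans.append(
                    {
                        "name": span_name,
                        "status": status,
                        "duration_ms": (time.perf_counter() - start) * 1000,
                    }
                )

        return wrapper

    return decorator


@traced()
def get_user(user_id):
    return {"id": user_id}


@traced("db.query")
def failing_query():
    raise TimeoutError("connection pool exhausted")


user = get_user("usr_01J2K")
try:
    failing_query()
except TimeoutError:
    pass
```

The wrapped functions keep their behaviour and their name; the only difference is that each call now leaves a timing record behind, whether it succeeded or failed.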

11

Code: Go — Full LMO Implementation

A complete example showing logging (Zap), metrics (Prometheus), and tracing (OpenTelemetry) wired into a Go service handler. This is the pattern from the lecture's to-do application.

Logger Setup (Zap)

// logger/logger.go
package logger

import (
    "os"
    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

var Log *zap.Logger

func Init() {
    env := os.Getenv("APP_ENV")
    var cfg zap.Config

    if env == "production" {
        cfg = zap.NewProductionConfig()          // JSON output
        cfg.Level = zap.NewAtomicLevelAt(zapcore.InfoLevel)
    } else {
        cfg = zap.NewDevelopmentConfig()         // coloured console
        cfg.Level = zap.NewAtomicLevelAt(zapcore.DebugLevel)
        cfg.EncoderConfig.EncodeLevel = zapcore.CapitalColorLevelEncoder
    }

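    var err error
    Log, err = cfg.Build(
        // Always include these fields in every log entry
        zap.Fields(
            zap.String("service", "todo-api"),
            zap.String("env", env),
        ),
    )
    if err != nil {
        // Without a logger nothing else can report errors, so fail fast at startup.
        panic("logger: failed to build: " + err.Error())
    }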
}
Go / Zap

Tracing Middleware (OpenTelemetry)

// middleware/tracing.go — Creates a trace span per request
package middleware

import (
    "github.com/gin-gonic/gin"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.opentelemetry.io/otel/propagation" // provides propagation.HeaderCarrier used below
    "go.uber.org/zap"
    "myapp/logger"
)

func EnhancedTracing() gin.HandlerFunc {
    tracer := otel.Tracer("todo-api")

    return func(c *gin.Context) {
        // Extract incoming trace context (from load balancer / front-end)
        ctx := otel.GetTextMapPropagator().Extract(
            c.Request.Context(),
            propagation.HeaderCarrier(c.Request.Header),
        )

        // Start root span for this request
        ctx, span := tracer.Start(ctx, c.FullPath())
        defer span.End()

        // Attach metadata to span (visible in trace UI)
        span.SetAttributes(
            attribute.String("http.method", c.Request.Method),
            attribute.String("http.path", c.FullPath()),
            attribute.String("http.user_agent", c.Request.UserAgent()),
            // getUserID is an app-specific helper (not shown) that reads the authenticated user.
            attribute.String("user.id", getUserID(c)),
        )

        // Inject context so downstream handlers can create child spans
        c.Request = c.Request.WithContext(ctx)
        c.Next()

        // After handler returns: record status code on span
        statusCode := c.Writer.Status()
        span.SetAttributes(attribute.Int("http.status_code", statusCode))
        if statusCode >= 500 {
            span.SetStatus(codes.Error, "server error")
        }

        // Structured log entry linking to this trace
        logger.Log.Info("request completed",
            zap.String("method", c.Request.Method),
            zap.String("path", c.FullPath()),
            zap.Int("status", statusCode),
            zap.String("trace_id", span.SpanContext().TraceID().String()),
        )
    }
}
Go / Gin + OTel

Service Layer — Logging + Tracing + Error Handling

// services/todo_service.go — The full LMO pattern per function
package services

import (
    "context"
    "fmt"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
    "go.uber.org/zap"
    "myapp/logger"
    "myapp/models"
)

var tracer = otel.Tracer("todo-service")

func CreateTodo(ctx context.Context, req models.CreateTodoRequest, userID string) (*models.Todo, error) {
    // 1. Start a child span (parent span is in ctx from the middleware)
    ctx, span := tracer.Start(ctx, "TodoService.CreateTodo")
    defer span.End()

    // 2. Add business context to the span — visible in trace UI
    span.SetAttributes(
        attribute.String("user.id", userID),
        attribute.String("todo.title", req.Title),
        attribute.String("todo.priority", req.Priority),
    )

    // 3. INFO log — record business event start
    logger.Log.Info("creating todo",
        zap.String("user_id", userID),
        zap.String("title", req.Title),
        zap.String("trace_id", span.SpanContext().TraceID().String()),
    )

    // 4. Execute DB operation (gets its own child span automatically via otelgorm).
    // repo is this package's repository dependency; its declaration is not shown here.
    todo, err := repo.CreateTodo(ctx, req, userID)
    if err != nil {
        // ERROR log with full context
        logger.Log.Error("failed to create todo",
            zap.String("user_id", userID),
            zap.String("error", err.Error()),
            zap.String("trace_id", span.SpanContext().TraceID().String()),
        )
        // Record error on span — shows as failed in Jaeger/Grafana Tempo
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, fmt.Errorf("repo.CreateTodo: %w", err)
    }

    // 5. DEBUG log — only visible in dev, suppressed in production
    logger.Log.Debug("todo created",
        zap.String("todo_id", todo.ID),
    )

    // 6. INFO — business event completion log (for audit trail)
    logger.Log.Info("todo created successfully",
        zap.String("todo_id", todo.ID),
        zap.String("user_id", userID),
        zap.String("title", todo.Title),
        zap.String("priority", todo.Priority),
        zap.String("category_id", todo.CategoryID),
    )

    return todo, nil
}
Go / Zap + OTel

Prometheus Middleware (Metrics)

// middleware/metrics.go — HTTP metrics instrumentation
package middleware

import (
    "strconv"
    "time"
    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    }, []string{"method", "path", "status"})

    httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency",
        Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
    }, []string{"method", "path"})
)

func PrometheusMetrics() gin.HandlerFunc {
    return func(c *gin.Context) {
        start := time.Now()
        c.Next()
        duration := time.Since(start)

        status := strconv.Itoa(c.Writer.Status())
        httpRequestsTotal.WithLabelValues(c.Request.Method, c.FullPath(), status).Inc()
        httpDuration.WithLabelValues(c.Request.Method, c.FullPath()).Observe(duration.Seconds())
    }
}
Go / Gin + Prometheus
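What Prometheus actually reads from the /metrics endpoint is a plain-text format with one line per label combination. The sketch below renders a labelled counter into that format by hand, purely to show the shape of the output; in practice a client library does this for you, and `LabeledCounter` is a hypothetical class that skips details such as label-value escaping.

```python
from collections import defaultdict


class LabeledCounter:
    """A counter with labels, rendered in the Prometheus text format."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self.values = defaultdict(float)

    def inc(self, *label_values, amount=1.0):
        if len(label_values) != len(self.label_names):
            raise ValueError("label count mismatch")
        self.values[tuple(label_values)] += amount

    def render(self):
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for label_values, value in sorted(self.values.items()):
            labels = ",".join(
                f'{key}="{val}"' for key, val in zip(self.label_names, label_values)
            )
            lines.append(f"{self.name}{{{labels}}} {value:g}")
        return "\n".join(lines) + "\n"


requests_total = LabeledCounter(
    "http_requests_total", "Total HTTP requests", ["method", "path", "status"]
)
for _ in range(3):
    requests_total.inc("GET", "/todos", "200")
requests_total.inc("POST", "/todos", "500")

exposition = requests_total.render()
```

Each distinct label combination becomes its own time series, which is why labels with unbounded values (user IDs, raw URLs) should be kept out of metrics.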

12

Code: Python — Full LMO Implementation

Structured Logging with structlog

# logging_config.py — structlog setup for Python
import logging
import os
import structlog

def configure_logging():
    env = os.getenv("APP_ENV", "development")

    if env == "production":
        # JSON renderer for production — parseable by Loki / ELK
        renderer = structlog.processors.JSONRenderer()
        level = logging.INFO
    else:
        # ColourfulConsole for local dev — human-readable
        renderer = structlog.dev.ConsoleRenderer(colors=True)
        level = logging.DEBUG

    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,      # merge request-scoped context
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.StackInfoRenderer(),
            structlog.processors.ExceptionRenderer(),     # auto-renders exceptions
            renderer,
        ],
        logger_factory=structlog.stdlib.LoggerFactory(),
        wrapper_class=structlog.stdlib.BoundLogger,
        cache_logger_on_first_use=True,
    )

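    # Emit only structlog's rendered output; the default stdlib format would
    # prefix each line with "LEVEL:name:" and break the JSON.
    logging.basicConfig(format="%(message)s", level=level)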

log = structlog.get_logger()
Python / structlog

OpenTelemetry Tracing in FastAPI

# tracing.py — OTel setup for FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

def setup_tracing(app, engine):
    # Tracer provider: sends spans to the OTel Collector via gRPC
    provider = TracerProvider()
    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)

    # Auto-instrument FastAPI: all routes get spans automatically
    FastAPIInstrumentor.instrument_app(app)

    # Auto-instrument SQLAlchemy: all DB queries on this engine get spans automatically
    SQLAlchemyInstrumentor().instrument(engine=engine)

# main.py: wire it all up
from fastapi import FastAPI
from tracing import setup_tracing
from logging_config import configure_logging, log
from db import engine  # assumption: a module of yours that creates the SQLAlchemy engine

configure_logging()
app = FastAPI()
setup_tracing(app, engine)
Python / FastAPI + OTel

Service Layer with Full LMO Pattern

# services/todo_service.py
from opentelemetry import trace
import structlog

tracer = trace.get_tracer("todo-service")
log = structlog.get_logger()

async def create_todo(user_id: str, req: CreateTodoRequest) -> Todo:
    # 1. Start child span
    with tracer.start_as_current_span("TodoService.create_todo") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("todo.title", req.title)

        # 2. INFO log — business event start
        log.info("creating_todo", user_id=user_id, title=req.title)

        try:
            todo = await repo.create_todo(user_id, req)

            # 3. Business event log — audit trail
            log.info("todo_created",
                todo_id=todo.id,
                user_id=user_id,
                title=todo.title,
                priority=todo.priority,
            )
            return todo

        except Exception as e:
            # 4. ERROR log + span error recording
            log.error("todo_creation_failed", user_id=user_id, error=str(e))
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
            raise
Python / FastAPI + OTel + structlog

13

Tooling: The Open Source Stack

The open-source observability stack is widely used, including at large enterprises. It offers more control, no vendor lock-in, and no per-seat pricing, but requires operational expertise to maintain.

Tool	Pillar	Role
Prometheus	Metrics	Scrapes `/metrics` endpoints every 15s, stores time-series data, evaluates alert rules. The de facto standard metrics backend.
Grafana	Dashboards	Visualization layer. Queries Prometheus, Loki, Tempo, and others. Build dashboards, set alert thresholds, create on-call runbooks.
Loki	Logs	Log aggregation system by Grafana Labs. Unlike ELK, it indexes only metadata (labels), not full-text — much cheaper. Queryable via LogQL. Push-based via Promtail agent.
Promtail	Logs (agent)	Runs on each server, tails log files or Docker/Kubernetes log streams, and ships them to Loki. Zero-code change required.
Grafana Tempo	Traces	Distributed tracing backend. Receives OTel spans, stores them cheaply (object storage), queryable from Grafana. Replaces Jaeger for new setups.
Jaeger	Traces	Older distributed tracing system (CNCF). Still widely used. UI for navigating traces, service dependency graphs. Its own client libraries are deprecated; the project recommends instrumenting with the OTel SDKs.
Alertmanager	Alerting	Routes Prometheus alerts to Slack, PagerDuty, email, etc. Handles deduplication, grouping, and silencing of alerts.
OTel Collector	All three	Receives all telemetry from apps, processes/enriches it, and fans out to Prometheus + Loki + Tempo. Central telemetry pipeline.

[Figure: Full Stack Architecture]

14

Tooling: Proprietary Solutions

If you don't have the team to manage the open-source stack, proprietary all-in-one solutions are the pragmatic choice. They bundle logging, metrics, tracing, and alerting into a single platform with managed storage and a polished UI.

Tool	Strengths	Best For	Pricing Model
New Relic	Full-stack observability, Go/Python/Node agents, APM, browser monitoring, mobile. Easy 10-minute setup.	Teams that want one tool with minimal ops overhead	Free tier (100GB/month), then per GB ingested
Datadog	Industry-leading dashboards, 500+ integrations, excellent APM, infra monitoring, security. Very powerful.	Enterprises with complex infra — Kubernetes, multi-cloud	Per host + per GB — can get expensive quickly
Sentry	Best-in-class error tracking. Groups identical errors, shows stack traces, user context, release tracking.	Error monitoring specifically — pairs well with Grafana for metrics	Free tier (5k errors/month), then per event
Honeycomb	High-cardinality event querying. Built for observability-first teams. Native OTel support.	Teams doing advanced distributed tracing, many microservices	Per event ingested

New Relic Go Agent — Quick Integration

// main.go: New Relic agent setup in Go
package main

import (
    "log"
    "os"

    "github.com/gin-gonic/gin"
    "github.com/newrelic/go-agent/v3/integrations/nrgin"
    "github.com/newrelic/go-agent/v3/newrelic"
)

func main() {
    // Initialize New Relic application agent
    nrApp, err := newrelic.NewApplication(
        newrelic.ConfigAppName("todo-api"),
        newrelic.ConfigLicense(os.Getenv("NEW_RELIC_LICENSE_KEY")),
        newrelic.ConfigAppLogForwardingEnabled(true),  // forward logs to NR
        newrelic.ConfigDistributedTracerEnabled(true), // enable distributed tracing
    )
    if err != nil {
        log.Fatalf("new relic: failed to initialise agent: %v", err)
    }

    r := gin.New()
    // nrgin middleware: auto-instruments every route with NR transactions
    r.Use(nrgin.Middleware(nrApp))

    if err := r.Run(":8080"); err != nil {
        log.Fatal(err)
    }
}
Go / New Relic

What New Relic shows you

After adding the middleware, New Relic automatically captures: transaction time per route, error rate, throughput (req/sec), Apdex score, Go runtime metrics (GC pause, goroutine count, memory), and distributed traces. The dashboard shows all of this within minutes of deploying.

15

ELK Stack — Elasticsearch, Logstash, Kibana

The ELK Stack (now called the Elastic Stack) is the classic enterprise logging platform. It provides full-text search across logs, powerful aggregations, and rich dashboards. It's heavier than Loki but offers more querying flexibility.

Elasticsearch

The storage and search engine. Indexes every field of every log document. Supports complex queries: full-text search, aggregations, range filters. Horizontally scalable.

Logstash

The log pipeline. Ingests logs from many sources (files, Kafka, Redis, Beats agents), transforms/filters them (parse JSON, add geo-IP, mask PII), and outputs to Elasticsearch. Powerful but resource-heavy.

Kibana

The visualization layer. Build dashboards, run queries against Elasticsearch using KQL (Kibana Query Language), and create alerts. It plays the same role for Elastic data that Grafana plays for the Prometheus / Loki stack.

Filebeat / Fluentd

Lightweight log shipping agents that run on your servers and ship log files to Logstash or directly to Elasticsearch. As a local agent, they use fewer resources than running Logstash on every host.

ELK vs Grafana Loki

Property	ELK Stack	Grafana Loki
Indexing	Full-text indexes every field → powerful queries	Only indexes labels (metadata) → cheaper storage
Storage cost	High — indexing all fields is expensive	Low — like Prometheus for logs
Query power	Very high — KQL, complex aggregations	Good for label-based queries (LogQL)
Integration	Self-contained Kibana UI	Grafana — one UI for metrics + logs + traces
Best for	High query complexity, security/compliance use cases	Cost-sensitive setups already using Grafana stack

16

Alert Design — What to Alert On

Not every metric needs an alert. Alert fatigue is a real problem — if engineers receive too many notifications, they start ignoring them. Good alert design means alerting on symptoms that impact users, not internal implementation details.

The Four Golden Signals (Google SRE Book)

📈

Latency

Time to serve a request. Track latency for successful and failed requests separately: errors that fail fast pull the overall numbers down and can hide slow successful requests. Alert when p99 exceeds a threshold.

🚦

Traffic

Demand on your system — requests/sec. Sudden drops can indicate an outage just as much as sudden spikes. Alert on both extremes.

🔥

Errors

Rate of failed requests (5xx). Alert on error rate, not raw count — a spike in traffic will naturally cause more errors even if the error rate is stable.

⚡

Saturation

How "full" your service is: CPU, memory, DB connections, queue depth. Saturation alerts should fire before the system tips into failure, for example a warning at 80% and a page at 95%.
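The point about alerting on error rate rather than raw error count can be made concrete with a little arithmetic. The sketch below is illustrative only: the `Window` type, its field names, and the 5% threshold are assumptions chosen to mirror the signals described above, not part of any monitoring library.

```go
package main

// Window holds request counters for one evaluation window.
// The type and its fields are hypothetical, used only for illustration.
type Window struct {
	Requests int
	Errors   int // responses with a 5xx status
}

// ErrorRatePercent returns failed requests as a percentage of all requests.
// An empty window reports 0 so the caller never divides by zero.
func ErrorRatePercent(w Window) float64 {
	if w.Requests == 0 {
		return 0
	}
	return float64(w.Errors) * 100 / float64(w.Requests)
}

// ShouldAlert fires on the error rate, not on the raw error count.
func ShouldAlert(w Window, thresholdPercent float64) bool {
	return ErrorRatePercent(w) > thresholdPercent
}
```

With a 5% threshold, a quiet window of 1,000 requests with 10 errors and a busy window of 10,000 requests with 100 errors both sit at 1% and stay silent, even though the busy window has ten times as many errors. A window of 1,000 requests with 80 errors (8%) would alert.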

Alert Rules (Grafana / Prometheus format)

# alert_rules.yml — Production-grade alert rules
groups:
  - name: api_health
    rules:

      # P1 — Error rate critical
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "HTTP error rate above 5% ({{ $value | printf \"%.1f\" }}%)"
          runbook: "https://wiki.company.com/runbooks/high-error-rate"

      # P2 — High latency
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2s ({{ $value | printf \"%.2f\" }}s)"

      # P2 — DB connection pool near exhaustion
      - alert: DBConnectionPoolExhaustion
        expr: db_connections_open / db_connections_max > 0.85
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "DB connection pool at {{ $value | humanizePercentage }}"

      # P1 — Service is down
      - alert: ServiceDown
        expr: up{job="todo-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "todo-api service is unreachable"
YAML / Prometheus Rules

Alert Severity Levels

Define at least two levels: warning (investigate when convenient, business hours) and critical (page on-call immediately, any hour). Waking engineers at 3 AM for a warning-level issue is how you burn out your team and teach them to ignore alerts.

17

Production Debugging Workflow — Full Example

Here is what the complete workflow looks like end-to-end when an incident occurs in a system with full logging, monitoring, and observability (LMO) in place.

Scenario: API Suddenly Slow + High Errors

1
🔔 Alert fires at 14:31 UTC
Alertmanager sends Slack message: "[CRITICAL] HighErrorRate: HTTP error rate is 82.4% (was 0.3%). Service: todo-api. Runbook: wiki/…"
2
📊 Open Grafana → Metrics dashboard
P99 latency jumped from 120ms to 4.3s at 14:31. Requests/sec unchanged (not a traffic spike). DB connection pool at 97%. All errors are 500s, not 4xx. Together with the saturated pool, this points to a server-side database problem.
3
📋 Correlate to Logs in Loki / New Relic
Query: {service="todo-api", level="error"} |= "connection". Hundreds of entries: ERROR: pq: remaining connection slots are reserved for non-replication superuser connections. DB connection pool is exhausted.
4
🔗 Jump to Trace from log entry
Click trace link in the log. Trace shows: HTTP Handler → AuthMiddleware (3ms) → TodoService (4290ms) → DB.GetTodos (4285ms ← STUCK). DB.GetTodos span has attribute: query="SELECT * FROM todos WHERE user_id=$1".
5
🔍 Root cause identified
A deployment 5 minutes ago introduced the query SELECT * FROM todos WHERE user_id=$1, and the missing index on its filter column forces a full table scan on every call. Under load, queries pile up, connections are held longer, and the pool is exhausted. Check git log: it confirms a deployment at 14:29 UTC.
6
✅ Fix + verify
Add missing index migration. Deploy. Grafana shows error rate drops to 0.1% within 2 minutes of deployment. DB connection pool back to 12%. Incident resolved. Total time: 11 minutes.

Without LMO

The same incident without observability: users complain on Twitter, you SSH into servers and run grep ERROR /var/log/app.log, manually correlate timestamps, eventually realize something is wrong with the DB, spend 90 minutes narrowing down which query, then guess at the missing index. Total time: 2–3 hours minimum.

18

Best Practices

1 · Always Include a Request ID

Generate a unique request_id (UUID or ULID) at the entry point of every request — either in your load balancer or in your first middleware. Inject it into the context and include it in every log entry and span for that request. This is what lets you pull all logs for a single failing request out of millions of entries.
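A minimal sketch of such a middleware using only the Go standard library is shown here. The `X-Request-ID` header name, the 16-byte random ID (standing in for a UUID or ULID), and the `Probe` helper are assumptions made for illustration. A real service would also pass the ID to its logger and tracer.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"net/http"
	"net/http/httptest"
)

// ctxKey is a private type so the context key cannot collide with other packages.
type ctxKey struct{}

// NewRequestID returns 16 random bytes encoded as 32 hex characters.
func NewRequestID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err) // no usable randomness source on this host
	}
	return hex.EncodeToString(b)
}

// RequestIDFrom returns the ID stored by the middleware, or "" if none is set.
func RequestIDFrom(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// RequestID reuses an ID set upstream (for example by a load balancer) or
// generates one, then exposes it on the request context and response header.
func RequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			id = NewRequestID()
		}
		w.Header().Set("X-Request-ID", id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// Probe sends one in-memory request through the middleware and reports the ID
// seen inside the handler and the ID echoed in the response header.
func Probe(incomingID string) (inHandler, inHeader string) {
	h := RequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		inHandler = RequestIDFrom(r.Context())
	}))
	req := httptest.NewRequest(http.MethodGet, "/todos", nil)
	if incomingID != "" {
		req.Header.Set("X-Request-ID", incomingID)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return inHandler, rec.Header().Get("X-Request-ID")
}
```

Every handler, service, and repository call that receives the request context can then call `RequestIDFrom(ctx)` and attach the value to its log entries and spans. Whether to trust an incoming header from outside your network is a separate decision. Many teams only accept it from their own load balancer.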

2 · Log at Boundaries, Not Inside Logic

Log at the entry and exit of major components (handler, service, repository), not scattered throughout every loop and condition. This gives you a clean, readable trail without noise. Inside functions, use spans for granular timing — not log.Debug spam.

3 · Never Log Sensitive Data

Passwords, credit card numbers, SSNs, auth tokens, and API keys must never appear in logs. Use structured logging libraries that let you mark fields as redacted. In your CI pipeline, consider a log scanning step that fails the build if certain patterns are found in log statements.
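One way to get field-level redaction in Go is the standard library's `log/slog` package, where a type that implements `slog.LogValuer` controls how it appears in log output. The sketch below assumes Go 1.21 or later. The `Secret` type, the `RenderLogin` function, and the `Leaks` helper are hypothetical names introduced for this example.

```go
package main

import (
	"bytes"
	"log/slog"
	"strings"
)

// Secret wraps a sensitive string. Because it implements slog.LogValuer,
// slog handlers log the placeholder instead of the real value.
type Secret string

// LogValue is called by slog when a Secret is logged as an attribute value.
func (Secret) LogValue() slog.Value {
	return slog.StringValue("[REDACTED]")
}

// RenderLogin logs a login attempt as JSON into a buffer and returns the output.
func RenderLogin(userID string, password Secret) string {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))
	logger.Info("login attempt", "user_id", userID, "password", password)
	return buf.String()
}

// Leaks reports whether the given text appears in the log output.
// A unit test can use it to guard against accidental exposure.
func Leaks(output, text string) bool {
	return strings.Contains(output, text)
}
```

This only protects values that pass through slog as attributes. Formatting a `Secret` into the message string with `fmt.Sprintf` would still expose it, which is why a scanning step in CI remains a useful second line of defence.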

4 · Use Sampling for High-Volume Traces

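At high traffic (1000+ req/sec), recording every trace is expensive. Use head-based sampling (decide at the start of the request, e.g. keep 10% of all requests) or tail-based sampling (decide after the trace completes, e.g. keep 100% of error traces + 1% of successful ones). The OTel Collector supports both. Never drop error traces: a head-based sampler decides before the outcome is known, so guaranteeing that every failure is recorded requires tail-based sampling.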
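The tail-based policy described above can be sketched as a single decision function. This is a simplified illustration, not the OTel Collector's implementation: the `FinishedTrace` type and the hash-based selection are assumptions for the example.

```go
package main

import "hash/fnv"

// FinishedTrace is a hypothetical summary of a completed trace.
type FinishedTrace struct {
	TraceID  string
	HasError bool
}

// KeepTrace makes a tail-based decision: it runs after the trace is complete,
// so it can keep every failed trace and only a fraction of successful ones.
// successPercent is the share of successful traces to keep (0 to 100).
func KeepTrace(t FinishedTrace, successPercent uint32) bool {
	if t.HasError {
		return true // failures are never dropped
	}
	// Hashing the trace ID makes the decision deterministic, so the same
	// trace always gets the same answer wherever it is evaluated.
	h := fnv.New32a()
	h.Write([]byte(t.TraceID))
	return h.Sum32()%100 < successPercent
}
```

Calling `KeepTrace(trace, 1)` corresponds to the "100% of errors + 1% of successes" policy. The cost of tail-based sampling is that all spans of a trace must be buffered until the trace finishes, which is why it usually runs in a collector rather than in the application.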

5 · Log Levels Discipline

Never use DEBUG in production — it generates enormous volume and costs storage
INFO should be your production default — meaningful events, not chatty
Treat every ERROR log as something that needs a ticket or an alert rule
FATAL should mean "I need someone paged right now"
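A small sketch of how the "INFO by default" rule might be enforced in code, again assuming Go 1.21+ and `log/slog`. The `ParseLevel`, `NewLogger`, and `Enabled` names are hypothetical, as is the idea of feeding `ParseLevel` from an environment variable such as `LOG_LEVEL`. Note that slog has no built-in FATAL level, so that level would have to be mapped to ERROR plus an explicit page or exit.

```go
package main

import (
	"context"
	"io"
	"log/slog"
	"strings"
)

// ParseLevel maps a configuration string to a slog level.
// Unknown or empty values fall back to INFO, the production default.
func ParseLevel(s string) slog.Level {
	switch strings.ToUpper(strings.TrimSpace(s)) {
	case "DEBUG":
		return slog.LevelDebug
	case "WARN":
		return slog.LevelWarn
	case "ERROR":
		return slog.LevelError
	default:
		return slog.LevelInfo
	}
}

// NewLogger builds a JSON logger that drops records below the given level.
func NewLogger(w io.Writer, level slog.Level) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{Level: level}))
}

// Enabled reports whether a logger configured at one level emits records
// at another level.
func Enabled(configured, record slog.Level) bool {
	return NewLogger(io.Discard, configured).Enabled(context.Background(), record)
}
```

With this in place, a misconfigured or missing setting yields INFO rather than DEBUG, so an accidental typo cannot flood production storage with debug output.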

6 · This is a Spectrum, Not a Checkbox

Start with basic structured logging in JSON format + a simple Prometheus counter for error rate. That alone will save you hours when the first production incident happens. Then iteratively add: histogram metrics, basic tracing, Grafana dashboards, alert rules. You don't need the full ELK + Jaeger + OTel Collector stack on day one. Build it incrementally.

Recap

Logs tell you what happened. Metrics tell you patterns and trends. Traces tell you where exactly things went wrong and how long each step took. A system is observable when all three are in place and correlated. This requires developer effort (instrument your code) and infrastructure effort (collect, store, display) working together. The tooling — whether Grafana stack or New Relic — is secondary to the practice of consistently instrumenting your code with the right context.