Error Handling & Fault-Tolerant Systems
Errors are not exceptional; they are inevitable. Database queries fail, external APIs time out, users send bad data, and business logic hits edge cases nobody predicted. This chapter builds the mindset and concrete toolkit for detecting, handling, and recovering from every class of backend error before it silently costs you money or trust.
The Fault-Tolerant Mindset
"The question is not whether errors will happen; it is how you will handle them when they do."
Every backend engineer must internalise a simple truth: your system will fail. Not might. Will. The sources are everywhere:
- Database queries will occasionally fail or time out.
- External APIs (payments, email, auth) will go down.
- Users will send malformed, missing, or malicious data.
- Business logic will hit edge cases no one thought of during design.
A fault-tolerant system is not one that never breaks. It is one that breaks predictably, recovers gracefully, and tells you exactly what happened. Achieving that requires a deliberate mindset shift from "I'll handle errors later" to "I'll design for failure from day one."
The Five Classes of Backend Errors
Backend errors can be grouped into five broad categories. Each has a different origin, detection strategy, and fix.
- Logic errors: the app runs but produces wrong results. Hardest to detect; can silently drain money for weeks.
- Database errors: connection failures, deadlocks, constraint violations, malformed SQL. Can bring the whole app down.
- External service errors: third-party APIs (payment, email, auth) time out, rate-limit, or go offline. You have no control.
- Input validation errors: users send bad, missing, or out-of-range data. Easiest to handle, if your validation layer is robust.
- Configuration errors: missing env vars, wrong credentials on deploy. These surface at startup rather than at runtime, if you do it right.
Logic Errors: The Silent Killers
Logic errors are the most dangerous class because your application keeps running; it just does the wrong thing. No crash, no stack trace, no 500 response. Just quietly wrong results accumulating over time.
Classic Example
An e-commerce platform applies a discount twice due to a bug in the promotion engine. The result: negative shipping costs. The app runs perfectly. Every order ships. The company loses money on every transaction. This goes unnoticed for weeks because no monitoring alert fires on "negative shipping cost."
Common Root Causes
- Misunderstood requirements: notes from a sprint meeting were ambiguous; you implemented what you thought was asked, not what was intended.
- Incorrect algorithms: a complex discount or pricing formula has an off-by-one error or a wrong operator (`*` instead of `+`).
- Unhandled edge cases: a user who has never purchased before triggers a "past-purchase-based" discount path that wasn't designed for zero-purchase users.
Prevention Strategies
- Write unit tests for every business rule, especially discount, pricing, and permission logic.
- Add business metric monitoring (e.g., alert if average order value drops by >20% in one hour).
- Use property-based testing (Go: gopter, Python: hypothesis) to auto-generate edge-case inputs.
- Require peer review for all payment and auth-related code changes.
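The property-based idea can be sketched with nothing but Python's standard library: instead of hand-picking cases, generate many random inputs and assert an invariant that must always hold. The pricing function `shipping_cost` and its rules are hypothetical, invented only to illustrate the "shipping is never negative" invariant from the example above; a library such as hypothesis would add smarter input generation and shrinking on top of this.

```python
import random


def shipping_cost(base_cents, discount_cents):
    """Hypothetical pricing rule: apply a discount, but never go below zero."""
    if base_cents < 0 or discount_cents < 0:
        raise ValueError("amounts must be non-negative")
    return max(base_cents - discount_cents, 0)


def check_invariant(trials=10_000, seed=0):
    """Feed random inputs to shipping_cost and assert the invariant holds."""
    rng = random.Random(seed)  # fixed seed keeps any failure reproducible
    for _ in range(trials):
        base = rng.randint(0, 100_000)
        discount = rng.randint(0, 200_000)  # deliberately allows discount > base
        cost = shipping_cost(base, discount)
        # Invariant: the result is never negative and never exceeds the base price.
        assert 0 <= cost <= base, (base, discount, cost)


check_invariant()
```

If the `max(..., 0)` clamp were removed, the check would fail as soon as a generated discount exceeded the base price, which is the class of bug that went unnoticed for weeks in the example.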
Database Errors
Most backend applications are meaningless without their database. A database error of any kind means your app cannot serve real data, which usually means a broken UI or cascading failures across services.
① Connection Errors
Your backend cannot reach the database server. Possible causes:
- Network partition or DNS failure between app server and DB server.
- Database server is overloaded or down.
- Connection pool exhausted: all pooled TCP connections are in use; new requests queue up or fail immediately.
Common pooling libraries: pgxpool (Go), psycopg2 pool or SQLAlchemy (Python), pg Pool (Node). Size your pool carefully: too small → bottleneck; too large → DB overload.
② Constraint Violation Errors
You are trying to perform an operation that violates a database-level rule:
| Constraint Type | Trigger | Appropriate Response |
|---|---|---|
| Unique | Insert a duplicate email / username | HTTP 409 Conflict or 400: "Email already in use" |
| Foreign Key | Reference a row that doesn't exist | HTTP 404 ("Author ID not found") or 400 |
| Not Null | Missing required column value | HTTP 400: "Field X is required" |
| Check | Value fails a custom rule (e.g. price > 0) | HTTP 400 with a domain-specific message |
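The table above can be sketched as a lookup keyed by PostgreSQL's SQLSTATE codes for constraint violations (class 23). The codes are the standard PostgreSQL ones; the function name, the client messages, and the choice of 409/404/400 where the table allows alternatives are illustrative assumptions.

```python
# PostgreSQL SQLSTATE codes for integrity-constraint violations (class 23).
CONSTRAINT_RESPONSES = {
    "23505": (409, "resource already exists"),           # unique_violation
    "23503": (404, "referenced resource not found"),     # foreign_key_violation
    "23502": (400, "a required field is missing"),       # not_null_violation
    "23514": (400, "a field failed a validation rule"),  # check_violation
}


def response_for_sqlstate(sqlstate):
    """Map a SQLSTATE code to (HTTP status, safe client-facing message)."""
    # Anything unrecognised becomes a generic 500 so no internals leak.
    return CONSTRAINT_RESPONSES.get(sqlstate, (500, "something went wrong"))
```

Keeping the mapping in one place means a message can be changed once rather than in every repository method.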
③ Query / Syntax Errors
Malformed SQL: a table name typo, referencing a column that was renamed, or a missing join condition. These are usually caught in development but can slip through if raw SQL strings are built dynamically.
④ Deadlocks
A deadlock occurs when two (or more) transactions each hold a lock that the other needs. For example, transaction A locks row 1 and waits for row 2, while transaction B locks row 2 and waits for row 1; neither can proceed.
Postgres detects deadlocks automatically and aborts one of the transactions with error code 40P01. Your application must retry that transaction. Prevention: always acquire locks in a consistent order across all code paths.
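Both halves of that advice can be sketched in a few lines: retry only when the failure is a deadlock, and sort lock targets so every code path acquires them in the same order. `DatabaseError` is a hypothetical stand-in for whatever exception your driver raises with a SQLSTATE attached, and `run_with_deadlock_retry` and `lock_order` are illustrative names, not part of any library.

```python
import random
import time

DEADLOCK_SQLSTATE = "40P01"


class DatabaseError(Exception):
    """Hypothetical stand-in for a driver error that carries a SQLSTATE."""

    def __init__(self, sqlstate):
        super().__init__(f"database error {sqlstate}")
        self.sqlstate = sqlstate


def run_with_deadlock_retry(txn, max_attempts=3, base_delay=0.05):
    """Run txn(); re-run it only if it was aborted because of a deadlock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except DatabaseError as exc:
            if exc.sqlstate != DEADLOCK_SQLSTATE or attempt == max_attempts:
                raise  # not a deadlock, or out of attempts
            # Short randomised pause so the competing transactions do not collide again.
            time.sleep(base_delay * attempt * random.random())


def lock_order(account_a, account_b):
    """Prevention: always lock rows in ascending ID order, whatever the call order."""
    return tuple(sorted((account_a, account_b)))
```

The callable passed as `txn` should open and commit its own transaction, because the aborted one cannot be resumed.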
External Service Errors
Modern SaaS backends depend on a constellation of third-party services: payment processors (Stripe), email (Resend, SendGrid), object storage (S3), auth (Clerk, Auth0), AI (OpenAI). Every one of these is a point of failure outside your control.
① Network Failures
The internet between your server and the external API is unreliable. You will encounter: connection timeouts, DNS resolution failures, network partitions, and TLS handshake errors. Set explicit timeouts on every outgoing HTTP call; never let a slow third-party API block your goroutine or thread indefinitely.
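A minimal sketch of an outgoing call with an explicit timeout, using only the standard library. `fetch` and `UpstreamError` are hypothetical names. Note that the `timeout` argument of `urllib.request.urlopen` bounds each blocking socket operation (connect, each read) rather than the total request time, so a hard overall deadline would need additional handling.

```python
import urllib.request


class UpstreamError(Exception):
    """Raised when an external call fails or exceeds its time budget."""


def fetch(url, timeout_s=3.0):
    """GET url, giving up if any socket operation blocks longer than timeout_s."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        raise UpstreamError(f"request to {url} failed: {exc}") from exc
```

Wrapping every failure in one application-level error type lets callers decide between retrying and degrading without knowing which library made the call.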
② Rate Limiting: HTTP 429
Every serious external API enforces rate limits to prevent abuse. If your app hammers an API (due to a bug, a traffic spike, or a loop error), you will receive HTTP 429 Too Many Requests.
The standard mitigation is exponential backoff with jitter: wait longer after each failed attempt, and randomise each wait so that many clients do not retry in lockstep.
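A minimal sketch of the delay calculation, assuming the "full jitter" variant (pick a random delay between zero and the exponential ceiling), which is one common choice among several. The function name and the default base and cap values are illustrative.

```python
import random


def backoff_delay(attempt, base_s=1.0, cap_s=30.0, rng=None):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    rng = rng or random
    ceiling = min(cap_s, base_s * (2 ** attempt))  # 1s, 2s, 4s, ... capped at cap_s
    return rng.uniform(0, ceiling)                 # random point between 0 and the ceiling
```

The cap keeps a long outage from producing multi-minute sleeps, and the randomisation spreads retries out so a recovering service is not hit by every client at the same instant.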
③ Service Outage / Downtime
Major cloud providers (AWS, GCP) and popular SaaS services go down occasionally. Your app needs a strategy for when a critical dependency is completely unavailable:
- Fallback: if Redis cache is down, fall back to direct DB reads for non-critical data.
- Graceful degradation: disable the affected feature (e.g., "AI suggestions temporarily unavailable") rather than crashing the whole app.
- Circuit breaker pattern: after N consecutive failures, stop sending requests to the broken service and return a cached/default response immediately. Retry the service after a cool-down period.
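The circuit breaker described in the last bullet can be sketched as a small class. This is a single-threaded illustration under stated assumptions, not a production implementation: the class and parameter names are hypothetical, there is no locking, and the "half-open" state is simplified to letting the next call through once the cool-down has elapsed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock       # injectable so tests do not need to sleep
        self.failures = 0        # consecutive failures seen so far
        self.opened_at = None    # None means the circuit is closed

    def call(self, fn, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                # Open: fail fast without touching the broken service.
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit open")
            # Cool-down elapsed: let this one call through as a trial.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The fallback is only consulted while the circuit is open; while it is closed, failures propagate to the caller so that normal error handling still applies.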
Input Validation Errors
These are the easiest errors to handle because you define the rules. Your validation layer is the first line of defence: catch bad data at the entry point, before it reaches your database or business logic.
Types of Validation
| Type | What It Checks | Example |
|---|---|---|
| Format | Shape/pattern of the value | Email regex, ISO date, E.164 phone |
| Range | Numeric bounds, string length, array size | Price: 0–99999, name: 2–100 chars |
| Required | Mandatory field present | user_id must not be null |
| Business Rule | Domain-specific constraint | Booking end_date > start_date |
| Referential | Related entity actually exists | category_id exists in categories table |
Always validate at both layers: frontend (UX) and backend (security). Never trust client-side validation alone. The backend is the authoritative gate.
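Several of the validation types in the table can be sketched as one backend function that collects every failure instead of stopping at the first. The booking payload, field names, limits, and the deliberately loose email pattern are all hypothetical; the response body reuses the code/message/details shape used elsewhere in this chapter.

```python
import re
from datetime import date

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose format check


def validate_booking(payload):
    """Return {field: message} for every failing field; an empty dict means valid."""
    errors = {}

    email = payload.get("email")
    if not email:
        errors["email"] = "is required"                    # required
    elif not EMAIL_RE.match(email):
        errors["email"] = "must be a valid email address"  # format

    guests = payload.get("guests")
    if not isinstance(guests, int) or isinstance(guests, bool) or not 1 <= guests <= 10:
        errors["guests"] = "must be an integer between 1 and 10"  # range

    try:
        start = date.fromisoformat(payload.get("start_date", ""))
        end = date.fromisoformat(payload.get("end_date", ""))
    except (TypeError, ValueError):
        errors["start_date"] = errors["end_date"] = "must be ISO dates (YYYY-MM-DD)"
    else:
        if end <= start:
            errors["end_date"] = "must be after start_date"  # business rule
    return errors


def to_response(errors):
    """Build (status, body): 400 listing every field error, or 200 when clean."""
    if errors:
        return 400, {"code": 400, "message": "validation failed", "details": errors}
    return 200, {"code": 200, "message": "ok"}
```

Referential checks (for example, that a category_id exists) need a database lookup and are left out of this sketch.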
Return 400 Bad Request with a structured error body listing every field that failed and why. Don't return a single generic message; help the user fix all their mistakes in one round-trip.
Configuration Errors
Configuration errors happen at the boundary between environments (dev → staging → production). A missing OPENAI_API_KEY, a wrong database URL, or a forgotten secret can silently break specific features while the rest of the app appears healthy.
Fail Fast at Startup, Not at Runtime
The golden rule: validate all required environment variables before the server starts accepting traffic. If any are missing or corrupt, crash immediately with a clear error message.
Bad: Runtime Failure
- App starts successfully
- First user hits the AI image endpoint
- OpenAI call fails because the key is missing
- User gets a mysterious 500 error
- Old deployment is already stopped
- Site is down until manually fixed
Good: Startup Failure
- New deployment starts
- Config validation runs immediately
- Missing key detected → process exits with a clear message
- Blue-green: old deployment still running
- Zero downtime: ops team fixes and redeploys
Go: Config Validation at Boot

    // config/config.go
    package config

    import (
        "fmt"
        "os"
        "strings"
    )

    type Config struct {
        DatabaseURL  string
        OpenAIKey    string
        JWTSecret    string
        ResendAPIKey string
    }

    // MustLoad panics if any required variable is missing.
    // Call this once in main() before http.ListenAndServe.
    func MustLoad() Config {
        required := []string{
            "DATABASE_URL", "OPENAI_API_KEY",
            "JWT_SECRET", "RESEND_API_KEY",
        }

        var missing []string
        for _, key := range required {
            if os.Getenv(key) == "" {
                missing = append(missing, key)
            }
        }

        if len(missing) > 0 {
            // Crash immediately, loud and clear
            panic(fmt.Sprintf("[FATAL] missing required env vars: %s",
                strings.Join(missing, ", ")))
        }

        return Config{
            DatabaseURL:  os.Getenv("DATABASE_URL"),
            OpenAIKey:    os.Getenv("OPENAI_API_KEY"),
            JWTSecret:    os.Getenv("JWT_SECRET"),
            ResendAPIKey: os.Getenv("RESEND_API_KEY"),
        }
    }

    // main.go (separate file; it needs to import "log" and the config package)
    func main() {
        cfg := config.MustLoad() // panics here if config invalid
        server := newServer(cfg)
        log.Fatal(server.ListenAndServe())
    }

Proactive Error Detection: Health Checks
"The best error handling starts before the error happens."
Health checks continuously verify that your system is working, not just that it is running. The difference is critical: a process can be up and accepting connections while its database or cache is unreachable.
What to Check
- Database: run a lightweight representative query. Track query time; if it jumps from 50ms to 4s, something is wrong before users notice.
- External services: for payment processors, run periodic test transactions; for email, send to an internal address; for auth, generate and validate a test token.
- Configuration: verify all required env vars are loaded and non-empty at startup.
- Cache warmup: ensure critical caches (session store, product catalogue) are populated before serving traffic.
Go: Deep Health Check Endpoint

    import (
        "context"
        "encoding/json"
        "net/http"
        "time"

        "github.com/jackc/pgx/v5/pgxpool"
        "github.com/redis/go-redis/v9"
    )

    type HealthStatus struct {
        Status string            `json:"status"`
        Checks map[string]string `json:"checks"`
    }

    func healthHandler(db *pgxpool.Pool, rdb *redis.Client) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            checks := map[string]string{}
            overall := "ok"

            // One 2-second budget shared by every check
            ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
            defer cancel()

            // DB check
            if err := db.Ping(ctx); err != nil {
                checks["database"] = "unhealthy: " + err.Error()
                overall = "degraded"
            } else {
                checks["database"] = "ok"
            }

            // Redis check (uses the same bounded context, so it cannot hang)
            if err := rdb.Ping(ctx).Err(); err != nil {
                checks["cache"] = "unhealthy: " + err.Error()
                overall = "degraded"
            } else {
                checks["cache"] = "ok"
            }

            status := http.StatusOK
            if overall != "ok" {
                status = http.StatusServiceUnavailable
            }

            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(status)
            json.NewEncoder(w).Encode(HealthStatus{Status: overall, Checks: checks})
        }
    }

Monitoring & Observability
Health checks tell you something is broken right now. Monitoring tells you something is about to break, and gives you the context to understand why something broke after the fact.
What to Track
| Category | Metrics to Monitor | Why |
|---|---|---|
| HTTP Layer | 4xx / 5xx rate, p50/p95/p99 latency | Surface user-facing issues immediately |
| Database | Query duration, connection pool usage, deadlock count | Detect slow queries before timeout |
| External Services | Call success rate, latency, 429 count | Know when a dependency is degrading |
| Business Metrics | Successful transactions/min, failed payments, sign-up rate | Catch logic errors invisible to error rates |
| Infrastructure | CPU, memory, disk I/O, network throughput | Resource exhaustion precedes crashes |
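The Business Metrics row is the one most often skipped, so here is a minimal sketch of such a check: compare the current window's average order value with a baseline and flag a drop beyond a threshold, in the spirit of the ">20% in one hour" alert suggested earlier. The function names and the way the baseline is obtained are assumptions; in practice this logic usually lives in your monitoring tool's alert rules rather than in application code.

```python
def average_order_value(order_totals_cents):
    """Mean order total in cents; 0.0 when there were no orders."""
    if not order_totals_cents:
        return 0.0
    return sum(order_totals_cents) / len(order_totals_cents)


def metric_dropped(baseline, current, threshold=0.20):
    """True when `current` has fallen more than `threshold` (a fraction) below `baseline`."""
    if baseline <= 0:
        return False  # no meaningful baseline yet, so do not alert
    return (baseline - current) / baseline > threshold
```

A check like this would have caught the double-discount bug from the logic-errors example, even though HTTP error rates stayed flat the whole time.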
Structured Logging (JSON)
Plain-text logs are hard to query at scale. Use structured JSON logs so log aggregation tools (Grafana Loki, Datadog, ELK) can parse, filter, and alert on them programmatically.
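Python: structlog

    import structlog

    # Render every event as one JSON object per line
    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ]
    )

    log = structlog.get_logger()

    # Good: structured, queryable, no sensitive data
    log.error(
        "payment_failed",
        user_id="u_9a3f",  # ID, not email
        correlation_id="req_abc123",
        provider="stripe",
        error_code="card_declined",
        amount_cents=4999,
    )

    # BAD: never log PII or secrets
    # log.error("payment_failed", email="alice@example.com", card="4242...")
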
Recovery Strategies
Recoverable vs Non-Recoverable
Recoverable Errors
- Transient network glitch to email API
- Database connection pool temporarily exhausted
- Rate limit 429 from external service
Strategy: Retry with exponential backoff. Queue the work. Don't give up immediately.
Non-Recoverable Errors
- Redis cluster completely down
- Payment processor offline for hours
- Corrupt data in the DB
Strategy: Graceful degradation. Fallback. Disable the feature. Protect core functionality.
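A minimal sketch of the fallback strategy for the cache-down case: treat the cache as optional, record that the system is degraded, and serve the request from the database. `CacheUnavailable`, `get_product`, and the injected `cache_get` / `db_get` callables are hypothetical names used only for illustration.

```python
class CacheUnavailable(Exception):
    """Hypothetical error raised when the cache cannot be reached."""


def get_product(product_id, cache_get, db_get, on_degraded=lambda exc: None):
    """Read through the cache, falling back to the database if the cache is down."""
    try:
        cached = cache_get(product_id)
        if cached is not None:
            return cached  # cache hit
    except CacheUnavailable as exc:
        on_degraded(exc)  # e.g. log or bump a metric; never fail the request
    return db_get(product_id)  # cache miss or cache outage
```

The `on_degraded` hook matters as much as the fallback itself: a silent fallback hides the outage, while a counted one shows up on a dashboard.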
Exponential Backoff in Go

    import (
        "errors"
        "fmt"
        "log/slog"
        "math/rand"
        "time"
    )

    // emailClient and HTTPError are assumed to be defined elsewhere in the application.

    func sendEmailWithRetry(to, subject, body string) error {
        const maxRetries = 5
        baseDelay := 1 * time.Second

        var lastErr error
        for attempt := 0; attempt < maxRetries; attempt++ {
            err := emailClient.Send(to, subject, body)
            if err == nil {
                return nil // success
            }
            if !isRetryable(err) {
                return fmt.Errorf("permanent failure: %w", err)
            }
            lastErr = err

            if attempt == maxRetries-1 {
                break // no point sleeping after the final attempt
            }

            // Exponential backoff: 1s, 2s, 4s, 8s
            wait := baseDelay * time.Duration(1<<attempt)
            // Add up to 20% random jitter to prevent a thundering herd
            jitter := time.Duration(rand.Int63n(int64(wait / 5)))

            slog.Warn("email send failed, retrying",
                "attempt", attempt+1,
                "wait_ms", (wait + jitter).Milliseconds(),
                "error", err)
            time.Sleep(wait + jitter)
        }
        return fmt.Errorf("all %d retries exhausted: %w", maxRetries, lastErr)
    }

    func isRetryable(err error) bool {
        // Retry on 429, 5xx, and network errors; not on 400, 401, 422
        var httpErr *HTTPError
        if errors.As(err, &httpErr) {
            return httpErr.StatusCode == 429 || httpErr.StatusCode >= 500
        }
        return true // anything that is not an HTTP error is treated as a transient network error
    }

Automatic vs Manual Recovery
- Automatic: restart crashed processes (systemd, Kubernetes restart policy), clean up corrupted caches, switch to backup systems. Design carefully: automatic recovery can sometimes amplify a problem.
- Manual: data corruption, payment discrepancies, security incidents. These require human judgment. Document the runbook. Test it. Know who is on-call.
Global Error Handler: The Final Safety Net
The global error handler is a single middleware that sits at the outermost layer of your application, catches every error that bubbles up from any layer, and converts it into a properly formatted HTTP response.
Two Major Advantages
- No forgotten error conditions: every unhandled error falls through to the global handler's default case (500 + "something went wrong"). Nothing is silently swallowed.
- Zero redundancy: database error handling logic lives in one file, not scattered across 40 repository methods. Change the unique-violation message once and it applies everywhere.
Go: Global Error Handler Implementation
Go doesn't have exceptions; errors are return values. The pattern is to return errors up the call stack and handle them in middleware.
Go: errors/types.go

    package apperr

    import "net/http"

    // AppError is the canonical error type for this application.
    type AppError struct {
        Code    int    // HTTP status code
        Message string // Safe, user-facing message
        Details any    // Optional: field-level errors for 400s
        Err     error  // Original error, for logging only, NEVER sent to client
    }

    func (e *AppError) Error() string { return e.Message }

    // Constructors

    func NotFound(resource string) *AppError {
        return &AppError{Code: http.StatusNotFound, Message: resource + " not found"}
    }

    func Conflict(msg string) *AppError {
        return &AppError{Code: http.StatusConflict, Message: msg}
    }

    func BadRequest(msg string, details any) *AppError {
        return &AppError{Code: http.StatusBadRequest, Message: msg, Details: details}
    }

    func Internal(err error) *AppError {
        return &AppError{
            Code:    http.StatusInternalServerError,
            Message: "something went wrong", // NEVER expose err.Error() here
            Err:     err,
        }
    }

Go: middleware/error_handler.go

    package middleware

    import (
        "encoding/json"
        "errors"
        "log/slog"
        "net/http"

        "github.com/jackc/pgx/v5" // needed for pgx.ErrNoRows
        "github.com/jackc/pgx/v5/pgconn"

        apperr "yourapp/errors"
    )

    type ErrorResponse struct {
        Code    int    `json:"code"`
        Message string `json:"message"`
        Details any    `json:"details,omitempty"`
    }

    // GlobalErrorHandler wraps a handler that returns an error.
    func GlobalErrorHandler(next func(http.ResponseWriter, *http.Request) error) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            err := next(w, r)
            if err == nil {
                return
            }

            var appErr *apperr.AppError
            switch {
            // Already wrapped as AppError
            case errors.As(err, &appErr):
                if appErr.Err != nil {
                    slog.Error("app error", "err", appErr.Err)
                }

            // Postgres unique constraint violation → 409
            case isPgError(err, "23505"):
                appErr = apperr.Conflict("resource already exists")

            // Postgres foreign key violation → 404
            case isPgError(err, "23503"):
                appErr = apperr.NotFound("referenced resource")

            // pgx no-rows → 404
            case errors.Is(err, pgx.ErrNoRows):
                appErr = apperr.NotFound("resource")

            // Everything else → 500 (never leak internal error)
            default:
                slog.Error("unhandled error", "err", err)
                appErr = apperr.Internal(err)
            }

            w.Header().Set("Content-Type", "application/json")
            w.WriteHeader(appErr.Code)
            json.NewEncoder(w).Encode(ErrorResponse{
                Code:    appErr.Code,
                Message: appErr.Message,
                Details: appErr.Details,
            })
        }
    }

    func isPgError(err error, code string) bool {
        var pgErr *pgconn.PgError
        return errors.As(err, &pgErr) && pgErr.Code == code
    }

Python โ Global Error Handler (FastAPI)
Python: FastAPI

    from fastapi import FastAPI, Request
    from fastapi.responses import JSONResponse
    from psycopg2 import errors as pg_errors
    import logging

    app = FastAPI()
    logger = logging.getLogger("app")

    # --- Custom exception types ---
    class AppError(Exception):
        def __init__(self, status: int, message: str, details=None):
            self.status = status
            self.message = message
            self.details = details

    class NotFoundError(AppError):
        def __init__(self, resource: str):
            super().__init__(404, f"{resource} not found")

    class ConflictError(AppError):
        def __init__(self, msg: str):
            super().__init__(409, msg)

    # --- Global exception handlers ---
    @app.exception_handler(AppError)
    async def app_error_handler(request: Request, exc: AppError):
        return JSONResponse(
            status_code=exc.status,
            content={"code": exc.status, "message": exc.message, "details": exc.details},
        )

    @app.exception_handler(pg_errors.UniqueViolation)
    async def unique_violation_handler(request: Request, exc):
        logger.warning("unique_violation", extra={"path": request.url.path})
        return JSONResponse(
            status_code=409,
            content={"code": 409, "message": "resource already exists"},
        )

    @app.exception_handler(pg_errors.ForeignKeyViolation)
    async def fk_violation_handler(request: Request, exc):
        return JSONResponse(
            status_code=404,
            content={"code": 404, "message": "referenced resource not found"},
        )

    @app.exception_handler(Exception)
    async def unhandled_error_handler(request: Request, exc: Exception):
        # Log the real error internally, never expose it
        logger.error("unhandled_exception", exc_info=exc, extra={"path": request.url.path})
        return JSONResponse(
            status_code=500,
            content={"code": 500, "message": "something went wrong"},
        )

Security: What to Expose, What to Hide
① Never Leak Internal Details
Database error messages from Postgres contain table names, column names, index names, and constraint names. If you forward a raw pgconn.PgError message directly to the client, an attacker learns your schema and can craft more targeted SQL injection attempts.
| What You Got | What to Send to Client |
|---|---|
| duplicate key value violates unique constraint "users_email_key" | "Email already in use" |
| relation "usres" does not exist (typo) | "Something went wrong" |
| deadlock detected on relation 42816 | "Something went wrong, please retry" |
| stack trace: panic at server.go:142 | "Internal server error" |
② Vague Auth Errors (On Purpose)
Login endpoints are among the most attacked surfaces in any application. If you return specific messages like "no user with this email exists" vs "password is incorrect", an attacker can enumerate valid emails through a simple loop. Return one generic message, such as "invalid email or password", for both cases.
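A minimal sketch of a login check that gives the same answer for an unknown email and a wrong password. Everything here is illustrative: the in-memory `users` dict stands in for a user table, and PBKDF2 with 100,000 iterations is an assumption chosen because it ships with the standard library, not a recommendation over your framework's password hasher. The dummy hash makes the unknown-email path do the same work as the known-email path, so response time does not reveal which case occurred.

```python
import hashlib
import hmac
import os

GENERIC_ERROR = "invalid email or password"


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256 (iteration count is illustrative)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


# Dummy credentials so unknown emails cost the same work as known ones.
_DUMMY_SALT = os.urandom(16)
_DUMMY_HASH = hash_password("dummy-password", _DUMMY_SALT)


def login(users, email, password):
    """Return (ok, message). `users` maps email -> (salt, password_hash)."""
    salt, stored = users.get(email, (_DUMMY_SALT, _DUMMY_HASH))
    candidate = hash_password(password, salt)
    matches = hmac.compare_digest(candidate, stored)  # constant-time comparison
    if email in users and matches:
        return True, "ok"
    return False, GENERIC_ERROR  # same message for unknown email and wrong password
```

The specific reason for the failure can still be recorded internally as a generic code, keyed by user ID or request ID rather than by email.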
③ Safe Logging Practices
Logs are often shipped to third-party aggregation services (Datadog, Grafana Cloud, ELK). A leaked log file can expose millions of records if engineers have carelessly logged sensitive fields.
- Never log: passwords, API keys, credit card numbers, SSNs, full email addresses, session tokens.
- Log instead: user ID (not email), correlation/request ID, operation name, error code.
- Use a log scrubbing library (e.g. Go: slog with a custom handler; Python: structlog processors) to automatically redact known sensitive fields.
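The redaction step in the last bullet can be sketched as a plain function over a log event dict. The key list is an assumption and should be extended for your own schema. The small wrapper follows the shape structlog expects of a processor (a callable taking the logger, the method name, and the event dict), so the same function could be added to a processor chain.

```python
SENSITIVE_KEYS = {"password", "api_key", "token", "card_number", "ssn", "email"}


def redact(event):
    """Return a copy of a log event with sensitive values masked, recursively."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # scrub nested context too
        else:
            clean[key] = value
    return clean


def redact_processor(logger, method_name, event_dict):
    """Adapter with the (logger, method_name, event_dict) processor signature."""
    return redact(event_dict)
```

Key-based redaction only catches fields with known names; secrets embedded in free-text messages still have to be kept out of logs at the call site.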
Go: safe vs unsafe logging

    // UNSAFE: never do this
    slog.Error("login_failed",
        "email", user.Email,      // PII leak
        "password", req.Password, // catastrophic
        "api_key", cfg.OpenAIKey, // secret leak
    )

    // SAFE: IDs and correlation only
    slog.Error("login_failed",
        "user_id", user.ID,
        "correlation_id", r.Header.Get("X-Request-ID"),
        "reason", "invalid_credentials", // generic code, not DB message
    )

Backend Field Manual · Error Handling & Fault Tolerance · Chapter 16