Automated testing, the proof your code does what you think.
A first-principles walkthrough of how backend engineers verify their systems — from a single
unit test and the test-double vocabulary, through integration tests against real databases and HTTP
handlers, contract tests between services, end-to-end checks, coverage and flakiness, and into TDD, load
testing, and tests in the deployment pipeline. Written to explain not just what each kind of test
does but why it exists and how to write it well. Examples in both Go (testing)
and Python (pytest).
Why Tests Exist
The honest reason to write automated tests is not "correctness" in the abstract — it's confidence to change code without fear. A system without tests calcifies: every edit risks breaking something invisible, so people stop refactoring, stop upgrading dependencies, and route around the scary parts. A well-tested system stays soft — you can restructure it aggressively because a green test suite tells you, in seconds, that behavior still holds. Tests are what let a codebase keep evolving instead of rotting.
A test is also executable specification. Prose docs drift from reality (the same problem the
gRPC chapter noted about REST contracts); a test that asserts Divide(10, 2) == 5 can't lie —
it either passes against the current code or it doesn't. Read a good test suite and you learn precisely what
the code promises, with examples. And tests are a regression net: once you've written a test
for a bug, that bug can never silently come back — the net catches it on every future run.
In short, tests buy four things: confidence to change (refactor fearlessly), executable specification (docs that can't drift), regression protection (fixed bugs stay fixed), and faster feedback (seconds, not a manual click-through). Correctness is a side effect; changeability is the prize.
The Testing Pyramid
Not all tests are equal, and the most important strategic decision is the mix. The testing pyramid is the guiding heuristic: have many fast, cheap unit tests at the base, fewer integration tests in the middle, and very few slow, expensive end-to-end tests at the top. The shape comes from a trade-off that runs through this entire manual: as you climb, tests get more realistic (they exercise more of the real system) but also slower, more brittle, and harder to debug.
| Layer | Tests | Speed | Realism | Use for |
|---|---|---|---|---|
| Unit | One unit, dependencies faked | Milliseconds | Low | Logic, edge cases, branches |
| Integration | Several units + real deps (DB, cache) | Seconds | Medium | Wiring: queries, handlers, serialization |
| E2E | The whole system via its real interface | Seconds–minutes | High | Critical user journeys only |
Invert the pyramid — lots of slow E2E tests, few unit tests — and you get the "ice-cream cone": a suite that takes ages to run, fails intermittently (§14), and gives vague signals (a failure could be anywhere in the stack). Teams in this trap stop trusting the suite and stop running it. Push detail down to fast unit tests; use E2E sparingly for confidence that the pieces connect.
Anatomy of a Test
Every good test, at every layer, has the same three-beat structure — Arrange, Act, Assert (AAA; the BDD world calls it Given–When–Then). Arrange sets up the world and inputs; Act performs the one operation under test; Assert checks the outcome is what you expect. Keeping these phases distinct and testing one behavior per test is what makes a failure instantly legible: you know exactly which behavior broke.
// File naming convention: foo.go is tested by foo_test.go in the same package.
// Test functions are func TestXxx(t *testing.T) — the `go test` tool finds them.
func TestDiscount_AppliesPercentage(t *testing.T) {
// Arrange — set up inputs (and any doubles)
cart := Cart{Subtotal: 200}
coupon := Coupon{Percent: 10}
// Act — the ONE operation under test
total := ApplyDiscount(cart, coupon)
// Assert — state the expectation; a good message explains the failure
if total != 180 {
t.Errorf("ApplyDiscount(200, 10%%) = %d; want 180", total)
}
}

# pytest discovers files named test_*.py and functions named test_*.
# A bare `assert` is enough: pytest rewrites it to show a rich failure diff.
def test_discount_applies_percentage():
    # Arrange: set up inputs (and any doubles)
    cart = Cart(subtotal=200)
    coupon = Coupon(percent=10)
    # Act: the ONE operation under test
    total = apply_discount(cart, coupon)
    # Assert: pytest prints actual vs expected automatically on failure
    assert total == 180

What makes an assertion good
- Test behavior, not implementation. Assert the observable result (the discount is applied), not internal mechanics (this private helper was called). Implementation-coupled tests break on every refactor and defeat the whole purpose (§1, §20).
- One reason to fail. A test asserting five unrelated things tells you little when it goes red. Prefer focused tests with descriptive names — the name should read like a sentence about the behavior.
- Deterministic. Same inputs, same result, every run, in any order — the property §14 is all about protecting.
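To make the first point concrete, here is a minimal sketch of a behavior-focused assertion. The `Cart`, `Coupon`, and `apply_discount` definitions are assumptions written for this illustration (integer arithmetic, with the discount rounded down); the manual itself does not show their implementation.

```python
from dataclasses import dataclass


@dataclass
class Cart:
    subtotal: int


@dataclass
class Coupon:
    percent: int


def _percent_of(amount: int, percent: int) -> int:
    # Private helper (hypothetical): an implementation detail that a
    # refactor is free to rename, inline, or replace.
    return amount * percent // 100


def apply_discount(cart: Cart, coupon: Coupon) -> int:
    return cart.subtotal - _percent_of(cart.subtotal, coupon.percent)


def test_discount_is_applied():
    # Behavior-focused: asserts only the observable result, so it keeps
    # passing if _percent_of is renamed or inlined.
    assert apply_discount(Cart(subtotal=200), Coupon(percent=10)) == 180

# An implementation-coupled variant would instead patch _percent_of and
# assert that it was called; that test breaks on a harmless refactor even
# though the discount is still computed correctly.
```

The test name reads as a sentence about the behavior, and there is exactly one reason for it to fail.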
This manual uses each language's mainstream stack: Go's built-in testing package with
go test (no framework needed; the standard library is the framework), and Python's
pytest, the de-facto standard with fixtures, parametrization, and plain-assert
rewriting. Both are what you'll find in real backend repos.
Unit Tests
A unit test verifies one small piece of logic — a function or a method — in isolation from the rest of the system. No database, no network, no filesystem, no clock: just inputs in, output checked. That isolation is what makes them fast (thousands run in seconds) and precise (a failure points at exactly one unit). They are the wide base of the pyramid (§2) and where the bulk of your edge-case and branch coverage should live.
The natural target for unit tests is your domain/business logic — the pure calculations and decisions that don't inherently need I/O: pricing rules, validation, state transitions, parsing. (This maps directly onto the service layer from the controllers-services-repositories chapter; pushing logic into pure functions there is precisely what makes it unit-testable here.)
// The unit: pure logic, no I/O — trivially testable.
func IsStrongPassword(p string) bool {
if len(p) < 8 { return false }
var hasDigit, hasUpper bool
for _, r := range p {
if r >= '0' && r <= '9' { hasDigit = true }
if r >= 'A' && r <= 'Z' { hasUpper = true }
}
return hasDigit && hasUpper
}
func TestIsStrongPassword(t *testing.T) {
if IsStrongPassword("short1A") { // 7 chars → too short
t.Error("expected short password to be rejected")
}
if !IsStrongPassword("longEnough9") { // meets all rules
t.Error("expected valid password to be accepted")
}
}
// Run: go test ./... (-v for verbose, -run TestIsStrong to filter)

# The unit: pure logic, no I/O, trivially testable.
def is_strong_password(p: str) -> bool:
    if len(p) < 8:
        return False
    return any(c.isdigit() for c in p) and any(c.isupper() for c in p)

def test_rejects_short_password():
    assert is_strong_password("short1A") is False  # 7 chars: too short

def test_accepts_valid_password():
    assert is_strong_password("longEnough9") is True  # meets all rules

# Run: pytest (-v verbose, -k strong to filter by name)

When a function is painful to unit-test (it reaches into a global, opens a connection, reads the clock, or does ten things), the test is telling you the code is poorly factored, not that testing is hard. The fix is almost always to separate pure logic from I/O and inject dependencies (§6). Testability and good design are the same property viewed from two angles.
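As a rough illustration of separating pure logic from I/O, the sketch below splits a clock-reading function into a pure decision plus a thin shell. The happy-hour rule and all function names are hypothetical, chosen only for this example.

```python
from datetime import datetime, timezone


# Hard to test: reads the real clock internally, so the result depends on
# when the test happens to run.
def is_happy_hour_hidden_clock() -> bool:
    return 17 <= datetime.now(timezone.utc).hour < 19


# Easier to test: the decision is a pure function of its input.
def is_happy_hour(now: datetime) -> bool:
    return 17 <= now.hour < 19


# Thin I/O shell: the only place that touches the real clock.
def is_happy_hour_now() -> bool:
    return is_happy_hour(datetime.now(timezone.utc))


def test_happy_hour_boundaries():
    day = dict(year=2026, month=1, day=1, tzinfo=timezone.utc)
    assert is_happy_hour(datetime(hour=16, minute=59, **day)) is False
    assert is_happy_hour(datetime(hour=17, **day)) is True
    assert is_happy_hour(datetime(hour=18, minute=59, **day)) is True
    assert is_happy_hour(datetime(hour=19, **day)) is False
```

All the boundary cases live in the pure function; the shell is small enough that it hardly needs a test of its own.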
Test Doubles
Real units depend on other things — a repository, a payment gateway, an email sender. To test a unit in isolation you replace those collaborators with test doubles: stand-ins that you control. "Mock" is used loosely in conversation, but the precise vocabulary (from Gerard Meszaros / Martin Fowler) distinguishes five kinds, and knowing the difference sharpens how you test.
| Double | What it is | You use it to… |
|---|---|---|
| Dummy | A placeholder passed but never used | satisfy a required parameter |
| Stub | Returns canned answers to calls | feed the unit fixed inputs (e.g. "the DB returns this user") |
| Spy | A stub that also records how it was called | assert later that something was called, and how |
| Mock | Pre-programmed with expectations it verifies | assert an interaction happened (e.g. "email was sent once") |
| Fake | A working but lightweight implementation | replace a real dependency cheaply (e.g. an in-memory DB) |
State verification vs interaction verification
The deepest distinction: stubs/fakes support state verification (you assert the result), while mocks/spies support interaction verification (you assert a call happened). Prefer state verification — it's robust to refactoring. Reach for interaction verification only when the interaction is the behavior (an email genuinely must be sent, a charge must be issued exactly once). Over-mocking interactions is a leading cause of brittle tests (§20).
// The unit depends on an interface — the seam that lets us substitute a double (§6).
type UserRepo interface{ FindByID(id string) (*User, error) }
type Notifier interface{ Send(to, msg string) error }
func Greet(repo UserRepo, n Notifier, id string) error {
u, err := repo.FindByID(id) // collaborator 1
if err != nil { return err }
return n.Send(u.Email, "Hi "+u.Name) // collaborator 2 (interaction we care about)
}
// STUB: returns a canned user. SPY: also records what Send received.
type stubRepo struct{ user *User }
func (s stubRepo) FindByID(string) (*User, error) { return s.user, nil }
type spyNotifier struct{ calls int; lastTo string }
func (s *spyNotifier) Send(to, msg string) error { s.calls++; s.lastTo = to; return nil }
func TestGreet_SendsEmail(t *testing.T) {
repo := stubRepo{user: &User{Email: "a@x.com", Name: "Ada"}} // arrange (stub)
spy := &spyNotifier{}
_ = Greet(repo, spy, "u1") // act
if spy.calls != 1 || spy.lastTo != "a@x.com" { // assert the interaction
t.Errorf("expected one email to a@x.com, got %d to %q", spy.calls, spy.lastTo)
}
}

from unittest.mock import Mock

# The unit depends on injected collaborators (§6).
def greet(repo, notifier, user_id):
    user = repo.find_by_id(user_id)  # collaborator 1
    notifier.send(user.email, f"Hi {user.name}")  # collaborator 2 (the interaction)

def test_greet_sends_email():
    # STUB: a Mock configured to return a canned user
    repo = Mock()
    repo.find_by_id.return_value = User(email="a@x.com", name="Ada")
    notifier = Mock()  # will record calls (acts as spy/mock)
    greet(repo, notifier, "u1")  # act
    # assert the INTERACTION happened exactly as expected
    notifier.send.assert_called_once_with("a@x.com", "Hi Ada")

Go tends to hand-write small doubles against an interface (explicit, no magic); Python's unittest.mock generates flexible doubles that both stub returns and record calls. Same concept, different ergonomics.
Avoid mocking third-party libraries or external HTTP APIs directly — your mock encodes your assumption of how they behave, which can be wrong, so the test passes while production breaks. Instead wrap the dependency behind your own interface and mock that, or use a fake/contract test (§11, §17). Mock your own seams, not someone else's internals.
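One way to read "mock your own seams" is sketched below. The `PaymentGateway` interface, `FakeGateway`, and `checkout` are hypothetical names invented for this example; no real vendor SDK is modeled. In production, a small adapter class (not shown) would implement the same interface by calling the vendor's client.

```python
from typing import Protocol


class PaymentGateway(Protocol):
    """Our own seam: the only payment interface the rest of the code sees."""

    def charge(self, customer_id: str, amount_cents: int) -> str: ...


class FakeGateway:
    """In-memory fake of OUR interface, not of any vendor's internals."""

    def __init__(self):
        self.charges = []

    def charge(self, customer_id: str, amount_cents: int) -> str:
        self.charges.append((customer_id, amount_cents))
        return f"ch_{len(self.charges)}"


def checkout(gateway: PaymentGateway, customer_id: str, amount_cents: int) -> str:
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return gateway.charge(customer_id, amount_cents)


def test_checkout_charges_exactly_once():
    gateway = FakeGateway()
    receipt = checkout(gateway, "cust_1", 1999)
    # State verification: inspect what the fake ended up holding.
    assert receipt == "ch_1"
    assert gateway.charges == [("cust_1", 1999)]
```

The adapter that talks to the real vendor is then the one place that needs an HTTP-level stub or a contract test.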
Dependency Injection for Testability
The single technique that makes unit testing possible is dependency injection (DI): instead of a unit creating or reaching for its collaborators, it receives them from the outside. That hand-off point is a seam — the place where a test can slip in a double (§5). No seam, no isolation. This is why the controllers-services-repositories architecture insists on passing dependencies down: it's not ceremony, it's what keeps the code testable.
// Depend on an INTERFACE, not a concrete type — that interface is the seam.
type OrderRepo interface{ Save(o Order) error }
type OrderService struct{ repo OrderRepo } // injected, not constructed inside
func NewOrderService(r OrderRepo) *OrderService { return &OrderService{repo: r} }
func (s *OrderService) Place(o Order) error {
if o.Total <= 0 { return errors.New("invalid total") } // pure logic, easily tested
return s.repo.Save(o)
}
// In tests: a FAKE in-memory repo (no real DB needed)
type fakeRepo struct{ saved []Order }
func (f *fakeRepo) Save(o Order) error { f.saved = append(f.saved, o); return nil }
func TestPlace_RejectsInvalidTotal(t *testing.T) {
svc := NewOrderService(&fakeRepo{})
if err := svc.Place(Order{Total: 0}); err == nil {
t.Error("expected error for non-positive total")
}
}
// In production: NewOrderService(postgresRepo) runs the same code with the real dependency.

import pytest

# Accept the collaborator as a constructor argument: that parameter is the seam.
class OrderService:
    def __init__(self, repo):  # injected, not created inside
        self.repo = repo

    def place(self, order):
        if order.total <= 0:
            raise ValueError("invalid total")  # pure logic, easily tested
        self.repo.save(order)

# In tests: a FAKE in-memory repo (no real DB needed)
class FakeRepo:
    def __init__(self):
        self.saved = []

    def save(self, order):
        self.saved.append(order)

def test_place_rejects_invalid_total():
    svc = OrderService(FakeRepo())
    with pytest.raises(ValueError):
        svc.place(Order(total=0))

# In production: OrderService(postgres_repo) runs the same code with the real dependency.

DI, small interfaces, and separating pure logic from I/O aren't "test tricks"; they're the same principles that make code modular and maintainable. A codebase that's easy to test is, almost by definition, well-structured. The test suite is a continuous design-quality signal.
Table-Driven & Parametrized Tests
Most logic needs checking against many input/output pairs — happy path, edge cases, boundaries, error cases. Copy-pasting a test per case is noise. Both languages have an idiom for "same test body, many cases": Go's table-driven tests and pytest's parametrize. This is where you get cheap, dense coverage of branches and edges — the real workhorse of the unit layer.
func TestApplyDiscount(t *testing.T) {
// The TABLE: each row is one case with a name.
cases := []struct {
name string
subtotal int
percent int
want int
}{
{"ten percent off 200", 200, 10, 180},
{"zero discount", 100, 0, 100},
{"full discount", 100, 100, 0},
{"rounds down", 99, 10, 90}, // boundary/edge case
}
for _, c := range cases {
// t.Run makes each row a named subtest — failures report the case name.
t.Run(c.name, func(t *testing.T) {
got := ApplyDiscount(Cart{Subtotal: c.subtotal}, Coupon{Percent: c.percent})
if got != c.want {
t.Errorf("got %d, want %d", got, c.want)
}
})
}
}
// `go test -run TestApplyDiscount/rounds_down` runs just one case.

import pytest

# Each tuple is one case; ids give readable names in the output.
@pytest.mark.parametrize(
    "subtotal, percent, expected",
    [
        (200, 10, 180),
        (100, 0, 100),
        (100, 100, 0),
        (99, 10, 90),  # boundary/edge case
    ],
    ids=["ten-off-200", "zero", "full", "rounds-down"],
)
def test_apply_discount(subtotal, percent, expected):
    got = apply_discount(Cart(subtotal=subtotal), Coupon(percent=percent))
    assert got == expected

# `pytest -k rounds-down` runs just one case; each shows up as its own test.

The payoff of these idioms isn't just less code: each case becomes an independently named, independently runnable test. When one breaks, the report names the exact case ("rounds-down failed"), and you can re-run only it. Always give cases descriptive names; an anonymous row that fails just says "case 3," which helps no one.
Integration Tests
Unit tests prove each piece works alone; integration tests prove the pieces work together — and crucially, that they integrate correctly with the real external systems you mocked away at the unit layer: the database, the cache, the message broker, another service. They sit in the middle of the pyramid (§2): slower than unit tests because real I/O is involved, but they catch a whole category of bugs unit tests structurally cannot.
Typical examples: a SQL query with a typo'd column, an ORM mapping that doesn't match the schema, a serialization mismatch between services, a migration that didn't run, a wrong connection string, a transaction that doesn't commit. Every one of these lives in the seam between your code and a real system, which is exactly the seam a unit test stubs out. Mocking the DB can't reveal that your query is invalid SQL; only running it against a real database can.
Integration tests touch shared state, so they're prone to flakiness (§14) if they leak into each other. Each test must start from a known state and clean up after itself — a fresh schema, a transaction rolled back, or a truncated table (§9). Tests that depend on a manually-seeded shared database, or on running in a particular order, will eventually betray you.
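The rollback tactic can be sketched as follows. This example uses the standard library's SQLite purely so that it runs without any setup; that is an assumption of the sketch, not a recommendation, and a real suite would apply the same pattern to the production engine. The `users` table and helper names are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


def make_db() -> sqlite3.Connection:
    # isolation_level=None: transactions are managed explicitly below.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT NOT NULL)")
    return conn


@contextmanager
def rolled_back(conn):
    """Run a test body inside a transaction that is always rolled back."""
    conn.execute("BEGIN")
    try:
        yield conn
    finally:
        conn.execute("ROLLBACK")


def count_users(conn) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


conn = make_db()  # shared by both tests, yet neither sees the other's rows


def test_insert_is_visible_inside_the_test():
    with rolled_back(conn) as db:
        db.execute("INSERT INTO users VALUES ('u1', 'a@x.com')")
        assert count_users(db) == 1


def test_starts_from_an_empty_table():
    with rolled_back(conn) as db:
        assert count_users(db) == 0
```

Because every test body is rolled back, the two tests pass in either order, which is the property the paragraph above asks for.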
Testing with a Real Database
The recurring question: how do you get a "real" database in a test without depending on a hand-maintained shared server? The modern answer is Testcontainers — a library that spins up a real database in a throwaway Docker container (the containers of chapter 21) at test time, gives you its connection string, and tears it down after. You test against the actual engine (same Postgres version as prod), fully isolated, with nothing to install. For keeping tests independent of each other, the two standard tactics are transaction rollback (wrap each test in a transaction, roll back at the end) or truncating tables between tests.
// Build tag keeps slow integration tests out of the fast unit run:
// //go:build integration → run with: go test -tags=integration ./...
func TestUserRepo_SaveAndFind(t *testing.T) {
ctx := context.Background()
// Arrange: start a REAL Postgres in a throwaway container (testcontainers-go)
pg, err := postgres.Run(ctx, "postgres:16",
postgres.WithDatabase("app"), postgres.WithUsername("u"), postgres.WithPassword("p"))
if err != nil { t.Fatal(err) }
t.Cleanup(func() { pg.Terminate(ctx) }) // torn down automatically after the test
dsn, _ := pg.ConnectionString(ctx, "sslmode=disable")
db, _ := sql.Open("postgres", dsn)
runMigrations(t, db) // apply the same schema as production
repo := NewUserRepo(db)
// Act: exercise the REAL query path
_ = repo.Save(User{ID: "u1", Email: "a@x.com"})
got, err := repo.FindByID("u1")
// Assert: round-trips correctly through actual SQL
if err != nil || got.Email != "a@x.com" {
t.Fatalf("round-trip failed: %v / %+v", err, got)
}
}

import pytest
from testcontainers.postgres import PostgresContainer

# A fixture spins up a REAL Postgres once per module, then tears it down.
@pytest.fixture(scope="module")
def db_url():
    with PostgresContainer("postgres:16") as pg:  # throwaway container
        url = pg.get_connection_url()
        run_migrations(url)  # same schema as production
        yield url
    # container stopped automatically on exit

@pytest.fixture
def repo(db_url):
    # function-scoped: clean state per test (truncate / rollback)
    r = UserRepo(db_url)
    yield r
    r.truncate_all()  # keep tests independent (§8)

@pytest.mark.integration  # marker: pytest -m integration
def test_save_and_find(repo):
    repo.save(User(id="u1", email="a@x.com"))  # act through REAL SQL
    got = repo.find_by_id("u1")
    assert got.email == "a@x.com"  # round-trips correctly

Both isolate slow tests behind a flag/marker (Go build tags, pytest markers) so the millisecond unit suite stays fast and the heavier DB tests run on demand or in CI (§19).
Swapping in SQLite "because it's fast" when production runs Postgres is a classic trap: dialects differ (types, constraints, JSON, upserts, locking), so a test can pass on SQLite and the same query fail on Postgres. Testcontainers exists precisely so you can use the actual production engine cheaply — test what you ship.
Testing HTTP Handlers & APIs
Your handlers (the HTTP and REST chapters) are a prime integration target: you want to fire a real request at
your routing + handler + serialization stack and assert on the real response — status code, headers, body
— without binding a socket or running a server over the network. Both ecosystems provide an
in-process test client that drives the handler directly: Go's net/http/httptest and the
TestClient shipped with FastAPI/Starlette (and Flask's equivalent).
import (
"net/http"
"net/http/httptest"
"strings"
"testing"
)
func TestCreateUser_Returns201(t *testing.T) {
handler := NewRouter(deps) // the REAL router + handlers (deps may use a fake repo §6)
// Arrange: build a request and an in-memory response recorder (no network)
body := strings.NewReader(`{"email":"a@x.com","name":"Ada"}`)
req := httptest.NewRequest(http.MethodPost, "/api/v1/users", body)
req.Header.Set("Content-Type", "application/json")
rec := httptest.NewRecorder()
// Act: drive the handler directly
handler.ServeHTTP(rec, req)
// Assert: on the real response — status, then body
if rec.Code != http.StatusCreated {
t.Fatalf("status = %d; want 201; body=%s", rec.Code, rec.Body.String())
}
if ct := rec.Header().Get("Content-Type"); !strings.Contains(ct, "application/json") {
t.Errorf("content-type = %q", ct)
}
}

from fastapi.testclient import TestClient
from app import app  # the REAL FastAPI app (dependencies may be overridden with fakes §6)

client = TestClient(app)  # drives the app in-process: no network, no running server

def test_create_user_returns_201():
    # Act: fire a real request through routing + handler + serialization
    resp = client.post("/api/v1/users", json={"email": "a@x.com", "name": "Ada"})
    # Assert: on the real response (status, headers, body)
    assert resp.status_code == 201
    assert resp.headers["content-type"].startswith("application/json")
    assert resp.json()["email"] == "a@x.com"

# FastAPI tip: override real dependencies with fakes via app.dependency_overrides
# so the handler runs but the repo is in-memory: an integration test of the web layer.

These tests are flexible: wire in a fake repository and you're integration-testing just the web layer (routing, decoding, status codes, error envelopes) fast and deterministically; wire in a real DB via testcontainers (§9) and you've got a broader integration test of the whole request path. Pick the scope deliberately per test: narrower is faster and more precise; wider is more realistic.
Contract Testing
In a microservice world (the gRPC and API-design chapters), a service rarely lives alone — it consumes others and is consumed by others. The danger: a provider changes its response shape, all its own tests pass, but it silently breaks every consumer. End-to-end testing every combination is prohibitively slow. Contract testing solves this by verifying both sides agree on the interface — without running them together.
Tools like Pact implement this: the consumer's tests record the requests it makes and the
responses it expects into a contract file; the provider's CI replays that contract against the real provider
and fails if it no longer holds. For gRPC/Protobuf services, the .proto plus schema-breaking-change
detection (the buf tooling from the gRPC chapter) plays a similar role — the schema is a machine-
checked contract, and CI rejects incompatible changes.
Contract tests are the pragmatic middle ground between "mock the other service and hope your mock is accurate" (fast, but the mock can drift from reality, which is the §5 warning) and "spin up everything and test end-to-end" (realistic but slow and brittle). Contract tests give cross-service confidence at unit-test speed, which is why they scale to large service fleets where full E2E across all services is impractical.
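The core idea can be shown with a deliberately hand-rolled sketch. This is not Pact's API or file format; the contract dictionary, `violations` helper, and the two provider handlers are all hypothetical and exist only to show the consumer stating its expectations and the provider being checked against them in isolation.

```python
# Written by the CONSUMER: the fields it reads and the types it expects.
USER_CONTRACT = {"id": str, "email": str, "active": bool}


def violations(contract: dict, payload: dict) -> list:
    """Return a list of ways `payload` breaks `contract` (empty if it holds)."""
    problems = []
    for field, expected in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems


# Run in the PROVIDER's CI against the real handler output.
def get_user_v1(user_id: str) -> dict:
    return {"id": user_id, "email": "a@x.com", "active": True, "name": "Ada"}


def get_user_v2(user_id: str) -> dict:
    # A "harmless" rename on the provider side that breaks the consumer.
    return {"id": user_id, "mail": "a@x.com", "active": True}


def test_provider_v1_honors_contract():
    assert violations(USER_CONTRACT, get_user_v1("u1")) == []


def test_provider_v2_breaks_contract():
    assert violations(USER_CONTRACT, get_user_v2("u1")) == ["missing field: email"]
```

Extra fields such as `name` are tolerated because the consumer never reads them; only the fields the consumer depends on are enforced.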
End-to-End Tests
An end-to-end (E2E) test exercises the fully assembled system through its real external interface, exactly as a user or client would — no internals stubbed, everything running: the API, the database, the cache, dependent services. It answers the one question no lower layer can: do all the pieces, wired together for real, actually deliver the user-facing outcome? It's the tip of the pyramid (§2): maximum realism, maximum cost.
E2E tests are slow, expensive to maintain, and the most flake-prone (§14): a hiccup anywhere — network, timing, a dependency — fails them, and a failure doesn't tell you where the problem is. So reserve E2E for a handful of critical journeys ("user signs up → logs in → places an order → payment succeeds"). Don't test edge cases or every branch here — those belong in fast unit tests (§4). E2E confirms the system connects; lower layers confirm it's correct.
A close cousin is the smoke test: a tiny set of E2E checks run right after a deploy to
confirm the system is alive ("is /health green? can I log in?") — pairing naturally with
the health probes and rollouts of chapter 21. Mature teams extend this with testing in production:
synthetic monitoring, canary analysis (§20 there), and feature flags — because some properties only
reveal themselves under real traffic.
Test Coverage
Coverage measures how much of your code the tests execute — typically as a percentage of lines or branches run during the suite. It's useful as a flashlight for finding code nothing tests at all, but it is dangerously easy to misread as a measure of test quality. It isn't. Coverage tells you what was executed, never what was meaningfully asserted.
# Built into the toolchain — no extra dependency.
go test -cover ./... # prints % per package
go test -coverprofile=cover.out ./... # write a detailed profile
go tool cover -func=cover.out # per-function breakdown
go tool cover -html=cover.out # open an annotated, line-by-line HTML view

# Via the coverage.py / pytest-cov plugin.
pip install pytest-cov
pytest --cov=app # coverage for the `app` package
pytest --cov=app --cov-report=term-missing # show exactly which lines are UNcovered
pytest --cov=app --cov-report=html # browsable htmlcov/ report
pytest --cov=app --cov-branch # branch coverage, not just line

Line vs branch coverage
Line coverage asks "was this line run?" Branch coverage asks the harder question "was each direction of each decision taken?" A single test can hit 100% of the lines of an if statement that has no else while only ever taking the true branch: the implicit false path, where the body is skipped, is never exercised even though line coverage looks complete. Branch coverage is the more honest number; prefer it.
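A small sketch of the gap between the two metrics, using a hypothetical `shipping_fee` function invented for this example:

```python
def shipping_fee(subtotal: int, is_member: bool) -> int:
    fee = 500
    if is_member:  # an `if` with no `else`
        fee = 0
    return fee


def test_member_ships_free():
    # This single test executes every line of shipping_fee, so line
    # coverage reports 100%. The path where the condition is false is
    # never taken, which branch coverage would flag.
    assert shipping_fee(1000, is_member=True) == 0


def test_non_member_pays_fee():
    # The test that branch coverage prompts you to write.
    assert shipping_fee(1000, is_member=False) == 500
```

With only the first test, a bug in the default fee would go unnoticed while the line-coverage number still looked perfect.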
You can reach 100% coverage with tests that assert nothing — just call every function and check no exception. Fully covered, totally worthless. The moment a coverage percentage becomes a target, people game it with assertion-free tests (Goodhart's law: a measure that becomes a target stops being a good measure). Use coverage to find untested areas worth attention; judge quality by whether tests assert meaningful behavior (§3, §20). High coverage with weak assertions is false confidence.
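To see how coverage and assertion quality come apart, compare the tests in this sketch. The `parse_port` function is a hypothetical example, not something from the manual.

```python
def parse_port(value: str) -> int:
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def test_parse_port_without_assertions():
    # Executes the happy path and counts toward coverage, but it would
    # still pass if parse_port returned the wrong number.
    parse_port("8080")


def test_parse_port_returns_the_number():
    # Covers the same lines, and actually pins the behavior down.
    assert parse_port("8080") == 8080


def test_parse_port_rejects_out_of_range():
    try:
        parse_port("70000")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an out-of-range port")
```

The first two tests contribute identical coverage; only the second would catch a regression.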
Flaky Tests
A flaky test passes sometimes and fails other times without any code change. Flakiness is corrosive: it trains the team to ignore red builds ("just re-run it"), and once people stop trusting the suite, the suite is dead — a green run no longer means anything. Flaky tests are arguably worse than no test, because they cost time and erode the trust that gives tests their value (§1).
| Cause of flakiness | Fix |
|---|---|
| Time / clock — "now", timeouts, sleeps, date boundaries | Inject the clock; assert ranges not exact instants (§17). Never sleep to "wait"; poll a condition. |
| Test-order dependence — one test relies on another's leftovers | Make each test set up & tear down its own state (§9, §15). Run in random order to flush these out. |
| Shared mutable state — a global, a shared DB row, a singleton | Isolate per test; avoid global state; use fresh fixtures. |
| Concurrency / races — goroutines, threads, async timing | Synchronize on signals not timing; run with the race detector (go test -race). |
| Real network / external API — latency, rate limits, downtime | Mock the boundary (§17); keep real calls to a tiny, tolerant E2E set. |
| Unordered data — map iteration, DB rows without ORDER BY | Sort before comparing, or assert set membership not sequence. |
Every flaky test is a hidden non-determinism — time, ordering, concurrency, or shared state leaking in. The fix is always to remove the source of randomness: control the clock, isolate state, synchronize on events instead of sleeps, and stub the network. Quarantine a flaky test (mark and skip from the gating run) only as a temporary measure while you find the real cause — never paper over it with a blanket "retry until green," which just hides the bug.
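The "poll a condition instead of sleeping" fix might look like the sketch below. The `wait_until` helper is a hypothetical utility written for this example; when you control the code under test, waiting directly on a signal such as `threading.Event.wait` is simpler still.

```python
import threading
import time


def wait_until(condition, timeout: float = 2.0, interval: float = 0.01) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one last check at the deadline


def test_background_job_completes():
    done = threading.Event()
    worker = threading.Thread(target=done.set)
    worker.start()
    # Instead of time.sleep(1) and hoping the job has finished, wait for
    # the actual condition, with an upper bound so a hang still fails.
    assert wait_until(done.is_set)
    worker.join()


def test_unordered_rows_compare_as_sorted():
    rows = ["b@x.com", "a@x.com"]  # e.g. rows fetched without ORDER BY
    assert sorted(rows) == ["a@x.com", "b@x.com"]
```

The test finishes as soon as the condition holds, so it is both faster on a good day and more tolerant on a slow CI machine than a fixed sleep.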
Fixtures & Factories
Tests need data and a known starting state, and how you build them decides whether your suite stays readable or rots into setup soup. Two related tools: fixtures manage setup and teardown (the world a test runs in — a DB connection, a temp dir, a logged-in client), and factories build test objects with sensible defaults so each test states only the fields it actually cares about.
// Go favors explicit helpers over magic. t.Helper() keeps failure line numbers useful.
// A "factory" with functional options: specify only the fields that matter to the test.
func newUser(t *testing.T, opts ...func(*User)) User {
t.Helper()
u := User{ID: "u1", Email: "default@x.com", Active: true} // sensible defaults
for _, o := range opts { o(&u) }
return u
}
func withEmail(e string) func(*User) { return func(u *User) { u.Email = e } }
// A "fixture" via t.Cleanup: setup returns the thing + auto-teardown.
func newTempStore(t *testing.T) *Store {
t.Helper()
dir := t.TempDir() // auto-removed after the test
s := OpenStore(dir)
t.Cleanup(func() { s.Close() }) // teardown runs even if the test fails
return s
}
func TestSignup(t *testing.T) {
store := newTempStore(t)
u := newUser(t, withEmail("ada@x.com")) // only the relevant field is stated
_ = store.Save(u)
// ...assert...
}

import pytest

# pytest FIXTURES: setup before `yield`, teardown after. Injected by parameter name.
@pytest.fixture
def temp_store(tmp_path):  # tmp_path is a built-in fixture (auto-cleaned dir)
    store = Store(tmp_path)
    yield store  # the test runs here
    store.close()  # teardown: runs even if the test fails

# A FACTORY: defaults + overrides, so tests state only what they care about.
def make_user(**overrides):
    return User(**{"id": "u1", "email": "default@x.com", "active": True, **overrides})

def test_signup(temp_store):
    user = make_user(email="ada@x.com")  # only the relevant field is stated
    temp_store.save(user)
    # ...assert...

# Libraries like factory_boy / model_bakery scale this up for complex object graphs.

The point of factories is to hide irrelevant setup, not to hide the data a test depends on. If a test asserts something about a premium user, the word "premium" should appear in that test (make_user(tier="premium")), not be buried in a shared fixture three files away. A reader should understand a test's premise without spelunking. Over-shared, opaque fixtures (the "mystery guest") make failures hard to diagnose and tests hard to trust.
TDD — Red / Green / Refactor
Test-Driven Development inverts the usual order: you write a failing test first, then the minimum code to pass it, then clean up — repeating in tiny cycles. The discipline is the famous three-beat loop: Red (write a test, watch it fail), Green (make it pass as simply as possible), Refactor (improve the design while the test stays green). Writing the test first forces you to define the desired behavior and a clean interface before you're entangled in implementation.
TDD's real benefits are second-order: it produces code that is testable by construction (you can't write an untestable unit if the test came first), it keeps you focused on one small behavior at a time, and the "refactor" step is safe precisely because the test you just wrote guards it. It is a design technique as much as a testing one.
TDD shines when requirements are clear and logic is non-trivial (parsers, business rules, algorithms). It fits awkwardly when you're exploring — spiking a prototype, sketching a UI, probing an unfamiliar API — where the design isn't yet known. Plenty of excellent engineers write tests immediately after the code rather than strictly before. What matters is that the behavior ends up well-tested; test-first is one effective path to that, not the only valid one.
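One pass through the loop might look like this sketch. The `slugify` function and its rules are hypothetical, picked only because the behavior is easy to state before the code exists.

```python
import re


# RED: these tests are written first. At that point slugify does not
# exist, so running them fails.
def test_slugify_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


def test_slugify_collapses_punctuation():
    assert slugify("  Rock & Roll!  ") == "rock-roll"


# GREEN: the first passing version might be a chain of str.replace calls.
# REFACTOR: with both tests green, that is tidied into one regex. The
# tests above guard the cleanup.
def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
```

Only the final, refactored implementation is shown; the intermediate green version is described in the comment rather than reproduced.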
Mocking Time, Randomness & External APIs
Three sources of non-determinism wreck tests if left real — the clock, the random generator, and the network — and they're the usual suspects behind flakiness (§14). The cure for all three is the same: turn the hidden dependency into an injected one (§6) so the test controls it.
Time & randomness: inject them
Code that calls time.Now() or random() directly is hard to test and prone to flakiness, because its
output changes on every run. Make the clock and the RNG dependencies that the unit receives, then pass a
fixed clock (one that always returns the same instant) or a seeded RNG in tests. With that seam in place,
"token expires in 1 hour" or "pick a random shard" is fully deterministic and assertable.
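For the randomness half, a minimal sketch, assuming a hypothetical `pick_shard` function that receives its generator as a parameter. `random.Random` is the standard library's seedable generator; `AlwaysThree` is a hand-written stub invented for this example.

```python
import random


def pick_shard(rng, shard_count: int) -> int:
    """Pick a shard index using whatever RNG the caller supplies."""
    return rng.randrange(shard_count)


def test_pick_shard_is_reproducible_with_a_seed():
    # Two generators built from the same seed yield the same sequence,
    # so the "random" choices repeat exactly from run to run.
    rng_a, rng_b = random.Random(42), random.Random(42)
    picks_a = [pick_shard(rng_a, 8) for _ in range(20)]
    picks_b = [pick_shard(rng_b, 8) for _ in range(20)]
    assert picks_a == picks_b
    assert all(0 <= p < 8 for p in picks_a)


class AlwaysThree:
    """Stub RNG (hypothetical): lets a test pin one exact outcome."""

    def randrange(self, stop: int) -> int:
        return 3


def test_pick_shard_returns_what_the_rng_chose():
    assert pick_shard(AlwaysThree(), 8) == 3
```

Production code would pass an unseeded `random.Random()`; only the tests fix the seed or substitute the stub.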
External HTTP: stub the boundary
Never hit a real third-party API in a unit/integration test. It's slow, flaky, rate-limited, and may
have side effects. Both ecosystems let you intercept HTTP locally: Go's httptest.Server stands up
a fake server you point your client at; in Python, responses (for requests) and respx (for httpx) patch
the HTTP layer to return canned responses. (And recall §5: prefer wrapping the API behind your own
interface so most tests mock that, reserving HTTP-level stubs for testing the client adapter itself.)
// Needs imports: net/http, net/http/httptest, testing, time.

// 1) Injected clock: deterministic time.
type Clock interface{ Now() time.Time }

type fixedClock struct{ t time.Time }

func (f fixedClock) Now() time.Time { return f.t }

func TestTokenExpiry(t *testing.T) {
    clk := fixedClock{t: time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC)} // frozen
    tok := NewToken(clk, time.Hour)
    // Compare instants with Equal rather than !=, which also compares location and monotonic data.
    if !tok.ExpiresAt.Equal(clk.Now().Add(time.Hour)) {
        t.Error("expiry not computed from the injected clock")
    }
}

// 2) Stub an external HTTP API with httptest.Server: no real network.
func TestFetchRate(t *testing.T) {
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
        w.Write([]byte(`{"usd_inr": 83.2}`)) // canned response
    }))
    defer srv.Close()

    client := NewRatesClient(srv.URL) // point the client at the fake server's URL
    rate, err := client.USDINR()
    if err != nil {
        t.Fatalf("USDINR: %v", err) // don't assert on a value from a failed call
    }
    if rate != 83.2 {
        t.Errorf("got %v, want 83.2", rate)
    }
}

import responses
from datetime import datetime, timezone

# 1) Injected clock: deterministic time (freezegun is a popular alternative).
def test_token_expiry():
    frozen = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    tok = make_token(now=lambda: frozen, ttl=3600)  # clock injected as a callable
    assert tok.expires_at == frozen.timestamp() + 3600

# 2) Stub external HTTP with `responses`: patches the HTTP layer, no real network.
@responses.activate
def test_fetch_rate():
    responses.add(
        responses.GET, "https://api.rates.com/usd-inr",
        json={"usd_inr": 83.2}, status=200,  # canned response
    )
    client = RatesClient("https://api.rates.com")
    assert client.usd_inr() == 83.2

A test must produce the same result on every run, on every machine, in any order. The instant real time, real randomness, or a real network sneaks in, you've planted a future flaky failure. Inject and control all three; it's the same DI seam from §6 applied to the three things that change on their own.
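The earlier advice to wrap a third-party API behind an interface you own, so that most tests never touch HTTP at all, could be sketched like this. `RateProvider`, `FakeRateProvider`, and `price_in_inr` are hypothetical names for illustration; the only assumption about the real client is that it exposes a `usd_inr()` method, as in the example above.

```python
from typing import Protocol


class RateProvider(Protocol):
    """Hypothetical interface the application owns; the HTTP client implements it."""

    def usd_inr(self) -> float: ...


class FakeRateProvider:
    """In-memory stand-in: no HTTP and no patching, just a canned rate."""

    def __init__(self, rate: float) -> None:
        self._rate = rate

    def usd_inr(self) -> float:
        return self._rate


def price_in_inr(usd_cents: int, rates: RateProvider) -> int:
    """Hypothetical business logic: convert US cents to whole paise."""
    return round(usd_cents * rates.usd_inr())


def test_price_in_inr_uses_the_provided_rate():
    # 1000 cents is 10 USD; at 80 INR per USD that is 800 INR, or 80000 paise.
    assert price_in_inr(1000, FakeRateProvider(rate=80.0)) == 80000
```

With this seam, HTTP-level stubs are needed only in the handful of tests that exercise the real client adapter; everything built on top of it takes the fake.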
Performance & Load Testing
Functional tests prove the code is correct; performance tests prove it's fast enough. That is a different axis entirely, and one that matters acutely for the scaling concerns of chapters 18–19. There are two distinct flavors: benchmarks (micro: how fast is this function/endpoint?) and load tests (macro: how does the whole system behave under N concurrent users?).
| Kind | Question | Tool |
|---|---|---|
| Benchmark | How fast / how many allocations for this code path? | Go testing.B; Python pytest-benchmark |
| Load test | Throughput & latency under sustained concurrent traffic? | k6, Locust, Gatling, vegeta |
| Stress test | Where does it break, and how does it fail? | same tools, ramped past capacity |
| Soak test | Does it degrade over hours (leaks, fragmentation)? | same tools, long duration |
// Benchmarks live beside tests; the tool tunes b.N until timing is stable.
func BenchmarkParseEvent(b *testing.B) {
    payload := []byte(`{"type":"click","ts":1717000000}`)
    b.ReportAllocs() // also report allocations/op
    for i := 0; i < b.N; i++ { // the loop the framework times
        _, _ = ParseEvent(payload)
    }
}

// Run: go test -bench=. -benchmem
// BenchmarkParseEvent-8    3142051    382 ns/op    96 B/op    2 allocs/op
// Compare runs with `benchstat` to catch performance REGRESSIONS over time.

# pip install pytest-benchmark: the `benchmark` fixture times the callable.
def test_parse_event_perf(benchmark):
    payload = b'{"type":"click","ts":1717000000}'
    result = benchmark(parse_event, payload)  # runs it many times, reports stats
    assert result.type == "click"  # still assert correctness

# Output reports min/mean/median/stddev; --benchmark-compare flags regressions.

import http from 'k6/http';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 100 }, // ramp up to 100 virtual users
    { duration: '1m', target: 100 },  // hold
    { duration: '30s', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<300'], // 95th-percentile latency must stay < 300ms
    http_req_failed: ['rate<0.01'],   // error rate must stay under 1%
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/v1/health');
  check(res, { 'status 200': (r) => r.status === 200 });
}

// Run: k6 run load_test.js (k6 exits non-zero, failing the build, if a threshold is breached)
Averages lie — a 50ms mean can hide a 2-second p99 that's wrecking real users. Always track tail latency (p95/p99). And run load tests against a production-like environment, never your dev machine, and judge results against an explicit SLO (the threshold), not a vibe. A benchmark or load test without a target number is just trivia.
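To make the "averages lie" point concrete, here is a small worked example with invented numbers: 98 requests at 10 ms and 2 at 2000 ms. The `percentile` helper is a hypothetical nearest-rank implementation written for this sketch; real load tools compute percentiles for you and may interpolate differently.

```python
from statistics import mean


def percentile(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    ordered = sorted(samples_ms)
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = max(1, -((-pct * len(ordered)) // 100))
    return ordered[rank - 1]


# Invented latencies: almost every request is fast, two are disastrous.
latencies_ms = [10.0] * 98 + [2000.0] * 2

mean_ms = mean(latencies_ms)           # (98 * 10 + 2 * 2000) / 100 = 49.8
p50_ms = percentile(latencies_ms, 50)  # rank 50 of 100 is a 10 ms sample
p95_ms = percentile(latencies_ms, 95)  # rank 95 is still a 10 ms sample
p99_ms = percentile(latencies_ms, 99)  # rank 99 is one of the 2000 ms samples

# Judge against an explicit target (a hypothetical SLO of p99 under 300 ms).
SLO_P99_MS = 300
meets_slo = p99_ms < SLO_P99_MS
```

The mean comes out just under 50 ms and even p95 looks healthy at 10 ms, yet p99 is 2000 ms and the SLO check fails. Which percentile exposes a problem depends on how large the slow fraction is, which is why the tail is tracked at more than one point.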
Tests in CI/CD
A test suite is only valuable if it runs automatically, on every change, as a gate. This is where testing meets the pipeline from chapter 21 (§19 there): on every push, CI runs the tests, and a failure blocks the merge or deploy. The whole point of the pyramid's speed gradient (§2) is to give fast feedback here — run the millisecond unit tests on every commit, the heavier integration/E2E tests at the right stage.
name: tests
on: [push, pull_request]
jobs:
  unit:                        # fast: runs on every push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: go test -race -cover ./...              # Go: race detector + coverage
      # - run: pytest -m "not integration" --cov     # Python: skip slow tests here
  integration:                 # heavier: real DB via services / testcontainers
    runs-on: ubuntu-latest
    needs: unit                # only if unit tests passed
    services:
      postgres:
        image: postgres:16
        env: { POSTGRES_PASSWORD: p }
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - run: go test -tags=integration ./...         # or: pytest -m integration
# A failed job fails the check → branch protection blocks the merge.
If tests are slow, people skip them, so speed is a feature. Run tests in parallel (Go runs separate
packages in parallel by default, and tests within a package opt in via t.Parallel(); pytest does it via
the pytest-xdist plugin), split the fast unit gate from slow integration/E2E stages, and cache dependencies.
A suite that returns in under a few minutes gets run; one that takes thirty gets bypassed, and an un-run
test protects nothing.
A Testing Strategy
Tests have a cost — to write, to run, and to maintain as the code changes — so the goal is never "test everything" but maximum confidence per unit of effort. A coherent strategy is mostly a set of judgment calls about what to test, at which layer, and what to leave alone.
What to test heavily
- Business logic & rules — pricing, validation, state machines, anything with branches and edge cases. Cheap, fast unit tests; the highest payoff (§4, §7).
- Bugs you've fixed — every fix gets a regression test so it can't return (§1).
- Critical paths & money — auth, payments, data integrity get coverage at multiple layers, including a thin E2E (§12).
- Integration seams — your queries, serialization, and the contracts between services (§8–11), where the subtle bugs hide.
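As an illustration of the "bugs you've fixed" item in the list above, a regression test usually reproduces the exact input from the bug report. Everything in this sketch is hypothetical: the `parse_quantity` function, and the imagined bug in which an earlier version called `int(text)` directly and crashed on "1,000".

```python
def parse_quantity(text: str) -> int:
    """Hypothetical parser for a quantity field typed by a user."""
    # The imagined original was just int(text), which raises ValueError on
    # "1,000". Stripping thousands separators is the fix being protected.
    return int(text.replace(",", ""))


def test_parse_quantity_accepts_thousands_separator():
    # Regression test: reproduces the failing input from the imagined bug
    # report and stays in the suite so the fix cannot be silently undone.
    assert parse_quantity("1,000") == 1000


def test_parse_quantity_still_handles_plain_digits():
    # Guards the behavior that already worked before the fix.
    assert parse_quantity("42") == 42
```

Writing the test before applying the fix, and watching it fail, confirms that it really captures the bug.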
What to test lightly or not at all
- Trivial code — plain getters, framework glue, generated code. Testing them is effort without insight.
- Third-party libraries — assume they work; test your use of them, not their internals (§5).
- Implementation details — private helpers and exact call sequences. Test observable behavior so refactors don't trigger spurious failures.
The single principle that separates a suite that helps from one that hurts: assert what the code does (its observable inputs→outputs and side effects), never how it does it internally. Behavior-focused tests survive refactoring: they stay green as long as the contract holds, which is exactly the confidence-to-change that justified tests in the first place (§1). Implementation-coupled tests break on every restructure, punish improvement, and slowly get deleted. Write tests your future self will thank you for, not curse.
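The contrast can be sketched with a hypothetical `PriceCalculator` (all names here are invented for illustration). The first test asserts behavior; the second is included only as the anti-pattern, using the standard library's `unittest.mock.Mock` to pin an internal call.

```python
from unittest.mock import Mock


class PriceCalculator:
    """Hypothetical unit: totals a cart in cents, 10 percent off for premium."""

    def total(self, prices_cents: list[int], tier: str) -> int:
        subtotal = self._subtotal(prices_cents)
        return subtotal * self._percent_charged(tier) // 100

    def _subtotal(self, prices_cents: list[int]) -> int:
        return sum(prices_cents)

    def _percent_charged(self, tier: str) -> int:
        return 90 if tier == "premium" else 100


def test_premium_total_is_ten_percent_off():
    # Behavior: inputs in, observable result out. Any internal restructure
    # that keeps the contract leaves this test green.
    assert PriceCalculator().total([5000, 5000], tier="premium") == 9000


def test_total_calls_subtotal_helper():
    # Anti-pattern, shown for contrast: this passes today but breaks as soon
    # as _subtotal is inlined or renamed, even though every customer would
    # still be charged exactly the same amount.
    calc = PriceCalculator()
    calc._subtotal = Mock(return_value=10000)
    calc.total([5000, 5000], tier="premium")
    calc._subtotal.assert_called_once_with([5000, 5000])
```

Both tests pass against this code, which is the trap: the difference only shows up on the first refactor, when the second one fails without any behavior having changed.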
Cheat-Sheet
The whole manual compressed to what you reach for under pressure.
| Concept | One-liner |
|---|---|
| Why test | Confidence to change, executable spec, regression net, fast feedback. |
| Pyramid | Many unit (fast) · some integration · few E2E (slow). Don't invert it. |
| AAA | Arrange, Act, Assert — one behavior per test, descriptive name. |
| Unit test | One unit in isolation, no I/O — milliseconds, precise. |
| Test doubles | Dummy / Stub / Spy / Mock / Fake. Prefer state over interaction verification. |
| Don't mock | What you don't own — wrap it behind your own interface and mock that. |
| DI | Inject collaborators → the seam where a double slips in. Untestable = design smell. |
| Table/parametrize | One test body, many named cases — dense edge-case coverage. |
| Integration | Real DB/deps wired together — catches SQL, mapping, serialization bugs. |
| Real DB | Testcontainers: real engine, throwaway, isolated. Not SQLite-for-Postgres. |
| HTTP tests | httptest / TestClient — drive handlers in-process, no socket. |
| Contract test | Verify provider/consumer agree on the interface, without running both (Pact, buf). |
| E2E | Whole system via real interface — critical journeys only; smoke-test after deploy. |
| Coverage | Finds untested code; not a quality measure. Prefer branch; never a target (Goodhart). |
| Flaky | Non-determinism: time, order, shared state, concurrency, network. Cure = determinism. |
| Fixtures/factories | Manage setup/teardown & build data with defaults; keep relevant data visible. |
| TDD | Red → Green → Refactor. A design tool; use where it fits, not dogmatically. |
| Determinism | Inject the clock, seed the RNG, stub the network — control all three. |
| Perf | Benchmarks (micro) + load tests (macro). Track p95/p99 vs an SLO, not averages. |
| CI gate | Tests run on every push and block merge/deploy. Keep them fast & parallel. |
| Strategy | Confidence per effort. Test logic, bugs, money, seams; skip trivia & internals. |
| Golden rule | Test behavior, not implementation — survives refactoring. |
The whole topic in one breath: automated tests exist to give you confidence to change code (§1), and the strategic shape is a pyramid — many fast unit tests, fewer integration, a thin E2E layer (§2). Every test is Arrange–Act–Assert on one behavior (§3). Unit tests isolate a unit (§4) using test doubles (§5) made possible by dependency injection (§6), with table-driven and parametrized cases for dense coverage (§7). Integration tests wire real dependencies together (§8) — a real database via testcontainers (§9), handlers via in-process clients (§10), and cross-service agreement via contract tests (§11). E2E covers critical journeys only (§12); coverage is a flashlight not a goal (§13); flakiness is non-determinism to be eliminated (§14); and fixtures and factories keep setup clean (§15). TDD drives design in red/green/refactor loops (§16), determinism comes from injecting time, randomness, and the network (§17), performance is its own axis of benchmarks and load tests against an SLO (§18), and the suite earns its keep only by running as a fast gate in CI/CD (§19). Above all: test behavior, not implementation (§20).
Grounded in the Go testing docs & pytest docs · Testcontainers · the testing-pyramid
literature (Cohn, Fowler) · Go 1.22+ / Python 3.11+ examples.