Chapter 12 · Databases

Mastering Databases
with Postgres

Interacting with a database is the single most frequent thing a backend engineer does. This chapter builds the whole picture from the ground up — why databases exist at all, why they live on disk, how to model real relationships, how to query them safely and fast, and the two performance levers (indexes and triggers) that separate a toy from production. Everything is grounded in one running example: a project-management platform.

Why Databases Exist

At its core a database is simply a way to persist information across sessions. Persistence means storing data so it survives even after the program that created it has stopped. Think of a to-do app: you add items, check some off, close the app — and when you reopen it, everything is exactly as you left it. Without persistence you'd start from an empty list every single time and lose all your progress. That continuity, across time and across machines, is the whole point.

"Database" is a surprisingly broad word

In the simplest sense, any structured storage is a database. Your phone's contact list is a database. The browser's localStorage, sessionStorage, and cookies — all key-value stores you can inspect under DevTools → Application — are databases. Even a plain text file you jot notes into is a very basic database. The common thread: a persistent system that lets you create, read, update, and delete data — CRUD.

What "database" means in a backend context

That generic definition isn't what engineers usually mean. In the context of servers and backend systems, "database" almost always means a disk-based database — data stored on a hard disk (HDD) or solid-state drive (SSD). Why disk? That's the next section, and it's the most important foundational idea here.

Mental model

A database is a persistence layer offering CRUD. In backend work specifically, it's a disk-based persistence layer — chosen deliberately for capacity over raw speed. Everything else in this chapter follows from that one trade-off.

Disk vs RAM

Your CPU can reach two kinds of memory. RAM (primary memory) is extremely fast but expensive and limited in supply — most machines have 8, 16, 32, maybe 64–128 GB. Disk (secondary memory) — HDD or SSD — is much cheaper and far more plentiful, commonly 512 GB to several TB, but it's slower to read from and write to.

The core trade-off

Caches (Redis, in-memory stores) live in RAM because fetching from primary memory is blazing fast. Traditional databases live on disk because what matters most for storage is capacity — and we can trade some speed for far more space at far less cost.

So the pattern is: caching = RAM (speed-critical, small), databases = disk (capacity-critical, large). Postgres, MySQL, MongoDB — all store their data on disk. How they lay bytes out on disk is a deep, fascinating area, but as an application-level backend engineer the essential takeaway is just this trade-off. (And it's exactly why indexes, in section 11, matter so much: they're how we claw back speed from slow disk lookups.)

DBMS & Why Not Just Text Files

Storing data on disk isn't enough — we need efficient ways to retrieve, change, and delete it, across hundreds of gigabytes. That's the job of a DBMS (Database Management System): software whose sole purpose is to store data and provide efficient CRUD operations on it. On a high level a DBMS has two jobs — store the data and serve operations on it — plus a set of responsibilities:

Responsibility	What it means
Data organization	Lay data out so fetch/update/insert are efficient.
Access	Provide methods for CRUD — create, read, update, delete.
Integrity	Keep data accurate and valid — e.g. a `payment` field accepts only numbers, never a stray string.
Security	Protect data from unauthorized access via users and roles.

Why a plain text file falls apart

Before DBMS software existed, people tried storing everything in text files. It breaks down for three concrete reasons:

1 · Parsing is slow and error-prone

To find one customer you'd write application code to read the file, split every line, and compare each field — a slow operation, and slower still in languages like JavaScript or Python than in something like Rust. Worse, hand-rolled parsing is fragile: one mistake and you corrupt data or hand a customer the wrong record.

2 · There's no structure

Text files have no formal schema. You can dump any text in any shape, so you cannot enforce "this field must always be a number." There's no way to guarantee consistency — it's all just strings.

3 · No concurrency control

This is the killer. What happens when two people update the same value at once?

The lost-update race — amount starts at 40

Both read 40 at the same time. A computes 60, B computes 20, and whichever writes last clobbers the other — you can't predict which. A text file simply can't arbitrate this. A real DBMS uses transactions and locking to make concurrent writes consistent.

These three failures — slow parsing, no structure, no concurrency control — are exactly why DBMS software exists. Try to grow a text-file "database" today and you'll hit every one of them.

Relational vs Non-Relational

On a high level there are two major families of DBMS.

Two ways to shape data

Relational (SQL)

A relational database organizes data into tables of rows and columns, with relationships between tables defined by foreign keys. Its defining feature is a strict, predefined schema: before inserting anything you must declare the table, its columns, and each column's data type. Nothing happens "on the fly." That strictness buys the big payoff — data integrity: at any moment you can bet on the shape and types of your data. You interact with it using SQL (Structured Query Language). Examples: PostgreSQL, MySQL, SQL Server.

Non-relational (NoSQL)

NoSQL databases like MongoDB don't enforce a schema. In Mongo's vocabulary a table is a collection and a row is a document, and two documents in the same collection can have completely different shapes. The advantage is flexibility — great when prototyping fast or when you genuinely don't know the data's shape in advance. The cost is weaker integrity: with no schema enforced at the database level, you must enforce correctness in application code, which adds complexity and is more error-prone (application code changes constantly and bugs slip in).

Picking with two examples

CRM (customer relationship management) → relational. Contacts, sales opportunities, and their relationships are critical, structured data that demands strong integrity and complex queries. Postgres is a great fit.
CMS (content management system) → non-relational (or JSON in Postgres). An article can contain an image, a code block, a YouTube embed — wildly varied content you can't pin to a fixed schema. "Take it all and dump it" suits a document store.

Why Postgres

Among all the options, Postgres is the sensible default for most projects, for five reasons:

Open source & free. Not proprietary — companies can self-host it on their own servers and inspect the source.
SQL-standard compliant. Standard SQL runs the same way it would on MySQL or SQL Server, so migrating to another engine later means far less rework.
Extensible. The docs run ~1400 pages; it covers nearly every use case a typical SaaS hits, and has a strong extension system to customize it.
Reliable & scalable. A proven track record at scale.
Excellent JSON support — the deciding feature. The main reason people reach for MongoDB is "store any JSON." But Postgres offers a JSON/JSONB type with strong indexing and query capability, so you can store dynamic, schemaless data inside a relational database. That removes the usual reason to leave Postgres for a NoSQL store.

The pragmatic call

You'll find articles claiming MySQL has performance edges — but until you're serving millions of users and chasing a specific bottleneck, that difference won't matter. The rich feature set plus first-class JSON makes Postgres a strong first choice for nearly any project. That's what the rest of this chapter uses.

SQL vs Postgres — the distinction

SQL is the language you write queries in. Postgres is the database system — the software that executes those queries. This chapter assumes you've picked up SQL basics (SELECT, CREATE TABLE, ORDER BY, GROUP BY) elsewhere, and focuses on the concepts that actually matter day-to-day in backend work.

Data Types

Every column in a relational table has a fixed type. Here are the ones you'll actually use, with the decisions that matter.

Type	What it stores · when to use
`serial` / `bigserial`	Auto-incrementing integer (1, 2, 3…) when you omit it on insert. Use `bigserial` in production for higher capacity. Common for integer IDs / primary keys.
`smallint` · `integer` · `bigint`	Integers of increasing capacity. Pick by the largest number you need to store.
`decimal(10,2)` / `numeric`	Exact-precision number — here, up to 10 total digits, 2 after the decimal. Always use for money/price.
`real` · `double precision`	Floating-point. Fast, but values can differ slightly across systems. Use for scientific/measurement data where tiny discrepancies don't matter.
`char(n)`	Fixed-length text, padded with spaces to `n`. Avoid — legacy.
`varchar(n)`	Variable-length text with a max of `n`. The famous `255` is just a MySQL habit, meaningless in Postgres.
`text`	Variable-length, no limit. Prefer this in Postgres (see below).
`boolean`	`true` / `false`.
`date` · `time` · `timestamp`	A date, a time, or both together.
`timestamptz`	Timestamp with time zone — stores the zone too. Prefer for created/updated columns.
`interval`	A span like "10 days" or "1 week".
`uuid`	Universally unique identifier. Great for primary keys, especially in distributed systems. Postgres has it natively.
`json` · `jsonb`	JSON data. Prefer `jsonb` (binary) — Postgres serializes it to its own format so it can index and query it fast. Plain `json` is stored as text.
`type[]` (array)	An array of any type — `integer[]`, `text[]`, etc.
inet · macaddr · point · xml	Network addresses, MAC addresses, geometric points, XML. Rarely needed day-to-day.

Decimal vs float — the money rule

First, decode the parameters: decimal(10,2) means 10 total digits, with 2 after the decimal point. So you can store 12345678.90 — count them, that's 10 digits across the whole number, two of them on the right of the point. Try to store a number needing 11 digits and the database rejects it.

Floating-point numbers are fast but represented in a way that can differ slightly across systems, so the same value can render as 89.87 here and 89.9 there. For a price, where the number feeds calculations and accuracy is critical, that's unacceptable — always use decimal/numeric. For something like an area measurement (567.8987) where a tiny rounding difference is harmless, use float for the speed.

And the natural follow-up: if decimals are always accurate, why not use them everywhere? Because that exactness costs performance — floats are stored and computed faster, which is exactly why scientific/measurement domains prefer them. Rule of thumb: accuracy matters → decimal; speed matters and small discrepancies are fine → float.

char vs varchar vs text — prefer text

All three store text; the difference is how they handle length. char(10) always pads to the full length with spaces — store "ab" and the database adds 8 trailing spaces before saving, because a char field always occupies its declared length. varchar(255) stores only what you give (storing "ab" uses length 2, "abcd" uses 4) up to a maximum of 255 — exceed it and you get a database-level error. text is varchar with no length limit at all (technically a very high ceiling, ~1 GB, which you'll never approach).

Storing "ab" in each string type

Postgres's own docs recommend text — there's essentially no performance difference between them, and you can index text just fine (a common misconception borrowed from other engines like MySQL/SQL Server, which behave differently and confuse people switching over). Two practical reasons to default to text:

No migration headaches. If you set varchar(255) and later need longer values, you must run a schema migration that touches lots of data — riskier than just enforcing length in application code. As backend engineers we interact with the DB through a driver and have application code anyway, so enforce length there and keep the migration simple.
Self-documenting. A number like varchar(255) looks meaningful, so a new teammate wastes time hunting for why it's 255 (answer: it isn't meaningful — it's a copied MySQL habit). text reads cleanly as "a text field," with no phantom number to explain.

Only use char(n) when the length is genuinely always identical — e.g. two-letter day codes Mo, Tu, We — where padding never happens and you save space versus variable-length overhead.

json vs jsonb — prefer jsonb

json is stored as plain text; jsonb is serialized into Postgres's own binary format. Almost always choose jsonb — it queries and indexes far faster. (It's a Postgres feature, not standard SQL.)

Migrations

In production you don't just open a GUI like TablePlus and start running queries — there'd be no record of what changed, when, or who did it, and no way to control the database's state over time. Instead, databases use migrations: version control for your schema.

How the workflow works

You keep a folder of sequentially-named SQL files. A command-line tool (dbmate, golang-migrate, etc.) walks them in order and applies the SQL inside each to the configured database.

Migration files run in sequence; a tracking table remembers where you are

Each file has two halves: an up migration (the change — CREATE TABLE, CREATE INDEX…) and a down migration (the exact reverse — DROP TABLE…). The down half exists so that if a change breaks production you can roll back to the previous state cleanly. The tool maintains a schema_migrations table holding the current version, so it never re-runs an already-applied file (which would error with "table already exists").

a dbmate migration file — up & down halves

-- migrate:up
CREATE TABLE users (
  id    UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email TEXT NOT NULL UNIQUE
);

-- migrate:down
DROP TABLE users;

Why migrations matter

Tracked history. The files live in git alongside your code, so every schema change over months or years is recorded, reviewable, and reproducible on any machine or server.
Rollback. The down migrations let you revert a bad change to a known-good version.

Seeding

For local/dev testing you want sample data to work against before deploying. That's seeding — a separate migration that inserts test rows (sample users, projects, etc.). Best practice keeps seed data in its own migration file so it's clearly distinct from schema changes.

Modeling Relationships

We'll model a project-management platform (think Jira/Linear). From the requirements we pull the nouns → tables: users, user_profiles, projects, tasks, project_members.

Naming conventions

Plural table names (users, projects) by industry standard, and snake_case columns (full_name, password_hash). Why snake_case? Postgres is case-insensitive by nature — write fullName and it reads it as fullname unless you wrap it in double quotes everywhere, which makes application code ugly. Lowercase snake_case sidesteps the whole problem.

The three relationship shapes are the heart of data modeling. Here's how each is implemented:

The three relationship shapes

One-to-one — users ↔ user_profiles

Why split profile data into its own table instead of piling it onto users? Because a profile grows over time (bio, phone, avatar, social links, websites…) and would force constant migrations on the core users table. Abstracting it into a separate table that can evolve independently keeps the primary table lightweight. Since each user has exactly one profile, you don't even need a separate ID — make the profile's user_id serve as both its primary key and its foreign key.

One-to-many — projects ↔ tasks

One project has many tasks; each task belongs to one project. Implement it by putting a project_id foreign key on the tasks table (not a primary key). Many task rows can share the same project_id — that's the "many" side.

Many-to-many — users ↔ projects

A user can be in many projects, and a project can have many users — and you want to query it both directions. When the same relationship makes sense from both sides, it's many-to-many. You can't express that with a single foreign key, so you create a linking table (project_members) holding two foreign keys — user_id and project_id — and make the pair a composite primary key. That combination is implicitly unique and not-null, so "user 1 is in project 2" can exist exactly once.

the full schema — enums, then the five tables

-- custom enum types: a fixed set of allowed values
CREATE TYPE project_status AS ENUM ('active', 'completed', 'archived');
CREATE TYPE task_status    AS ENUM ('pending', 'in_progress', 'completed', 'cancelled');
CREATE TYPE member_role    AS ENUM ('owner', 'admin', 'member');

CREATE TABLE users (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email         TEXT NOT NULL UNIQUE,
  full_name     TEXT NOT NULL,
  password_hash TEXT NOT NULL,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- ONE-TO-ONE: user_id is BOTH primary key and foreign key
CREATE TABLE user_profiles (
  user_id     UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
  avatar_url  TEXT,             -- optional → no NOT NULL
  bio         TEXT,
  phone       TEXT,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE projects (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name        TEXT NOT NULL,
  description TEXT,
  status      project_status NOT NULL DEFAULT 'active',
  owner_id    UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- ONE-TO-MANY: a project has many tasks via project_id
CREATE TABLE tasks (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id  UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  title       TEXT NOT NULL,
  description TEXT,
  priority    INTEGER NOT NULL DEFAULT 1 CHECK (priority BETWEEN 1 AND 5),
  status      task_status NOT NULL DEFAULT 'pending',
  due_date    DATE,
  assigned_to UUID REFERENCES users(id) ON DELETE SET NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- MANY-TO-MANY: linking table with a COMPOSITE primary key
CREATE TABLE project_members (
  project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  user_id    UUID NOT NULL REFERENCES users(id)    ON DELETE CASCADE,
  role       member_role NOT NULL DEFAULT 'member',
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (project_id, user_id)   -- the pair is unique & not-null
);

Version note · gen_random_uuid()

This schema is verified against PostgreSQL 16. gen_random_uuid() is built into Postgres core from 13 onward. On older versions (≤12) you must first enable the crypto extension: CREATE EXTENSION IF NOT EXISTS pgcrypto; — then the same call works.

Constraints & Integrity

Constraints are conditions you attach to a column so the database itself protects the data from going corrupt. The whole philosophy: push integrity checks down to the database rather than trusting application code alone. App code is one layer of defense; a database constraint is the guarantee.

Primary key

A unique field that identifies one row. It implicitly carries two constraints — not null and unique — plus foreign-key referencing behavior. By convention every table has an id, often UUID PRIMARY KEY DEFAULT gen_random_uuid() so Postgres generates a random ID when you omit it on insert.

Not null — the 70% rule

By default every Postgres column allows null unless you say otherwise. More than 70% of your columns should be NOT NULL, because nulls quietly creep in — especially via automated scripts with bugs — and leave you with a messy, inconsistent table. Add NOT NULL unless a field is genuinely optional (like a user's bio or phone).

Unique

No two rows may share that value — e.g. email TEXT NOT NULL UNIQUE means inserting a duplicate email throws a database-level error.

Foreign key constraint

REFERENCES users(id) guarantees that whatever you put in assigned_to or owner_id must be a real, existing user ID. Pass a random (even valid-looking) UUID with no matching user and the insert fails. You can't reference something that doesn't exist.

Check constraint

A custom rule the row must satisfy — the transaction only goes through while the condition holds. CHECK (priority BETWEEN 1 AND 5) blocks anyone from inserting priority = 55.

Enums — integrity + documentation

An enum is a custom type limited to a fixed set of values (e.g. project_status ∈ {active, completed, archived}). Two wins:

Data integrity — insert any other string into a project_status column and you get a database-level error; you don't have to rely solely on app-code validation.
Self-documentation — arguably the bigger win. A future teammate reading the migration sees the allowed values at a glance. If the column were plain text, they'd have to grep the entire application code to discover which strings are actually valid.

Referential integrity — ON DELETE behavior

When you delete a row other tables point at, what happens to those references? You declare it per foreign key:

Clause	On deleting the referenced row…	Used here for
`ON DELETE RESTRICT`	Block the delete if any child rows still reference it.	`projects.owner_id` — can't delete a user who still owns projects.
`ON DELETE CASCADE`	Delete the children too.	`tasks.project_id` — deleting a project deletes its tasks.
`ON DELETE SET NULL`	Set the child's reference to null (column must allow null).	`tasks.assigned_to` — deleting a user un-assigns their tasks.
`ON DELETE SET DEFAULT`	Set the child's reference to its declared default.	—

The integrity mindset

Every constraint — primary key, not null, unique, foreign key, check, enum, ON DELETE — exists so the database guarantees correctness, not just hopeful application code. The more you encode at this level, the harder it is for a future bug to corrupt your data.

Querying & Joins

Build queries from FROM, not SELECT

A habit worth adopting: when writing a SQL query, start with the FROM clause, not SELECT. Decide where your data comes from first — which tables, joined how — and only then decide what to pull out. It keeps you oriented: you always know the source before you reach for columns. We'll build the "list users with profile" query in exactly that order — source first, then shape, then ordering.

Source. FROM users u — and give it a short alias (u) so you can refer to it concisely throughout the query.

Join the related table. The profile lives elsewhere, linked by the foreign key, so LEFT JOIN user_profiles up ON u.id = up.user_id.

Shape. Now pick columns: u.* plus the folded profile, to_jsonb(up.*) AS profile.

Order. Finish with ORDER BY u.created_at DESC so results are deterministic and newest-first.

Joining tables — and why LEFT JOIN

To return each user with their profile in one API call, you join users to user_profiles on the foreign key. Use LEFT JOIN, not INNER JOIN: a user might never have created a profile row, and an INNER JOIN only returns rows where both sides have a match — so those profile-less users would silently vanish from the result, returning fewer users than exist. A LEFT JOIN keeps every row from the left (main) table regardless, filling the profile side with null when there's no match. Since we always want the user back, profile or not, LEFT JOIN is correct here. (INNER and LEFT joins cover the vast majority of real queries; the others are rare.)

Two more details in that query. The short aliases (u, up) just make a long query readable — you assign them once in FROM/JOIN and reuse them everywhere. And to_jsonb(up.*) is a built-in Postgres function that converts the matched profile row into a JSON object, embedded as a single profile field on each user — so the frontend gets user.profile nested, in one round trip, instead of making a second call. Always end a list query with an ORDER BY (here created_at DESC): query results come back in arbitrary order otherwise, and that order can change between calls, so you sort explicitly — usually newest-first by creation time.

GET /v1/users — list users with embedded profile

SELECT
  u.*,
  to_jsonb(up.*) AS profile          -- fold the profile row into one JSON field
FROM users u
LEFT JOIN user_profiles up           -- LEFT so users without a profile still appear
  ON u.id = up.user_id               -- the foreign-key join condition
ORDER BY u.created_at DESC;          -- newest first; never rely on default order

Parameterized queries — never concatenate

This is the single most important security habit with databases. A parameterized query leaves a placeholder slot and supplies the value separately; the database treats whatever you pass as pure data (a string), never executable SQL. If an attacker passes '; DELETE FROM users; -- into a parameter, it's stored/compared as a harmless literal string, not run as a command.

SQL injection — the cardinal sin

The vulnerability appears when you build queries by concatenating strings with user input. Then a crafted input can break out and run its own SQL — reading, altering, or wiping your data. Never string-concatenate user values into a query. Always use parameter placeholders; your driver/ORM handles the safe substitution. This isn't optional.

GET /v1/users/:id — same query, parameter-safe in each layer

SELECT u.*, to_jsonb(up.*) AS profile
FROM users u
LEFT JOIN user_profiles up ON u.id = up.user_id
WHERE u.id = $1;        -- $1 is a parameter slot; the value is sent separately

// database/sql — the driver substitutes the parameter safely
row := db.QueryRow(`
  SELECT u.*, to_jsonb(up.*) AS profile
  FROM users u
  LEFT JOIN user_profiles up ON u.id = up.user_id
  WHERE u.id = $1`, userID) // userID is treated strictly as data, never SQL

# psycopg — pass params as a tuple; NEVER use f-strings to build SQL
cur.execute(
    """
    SELECT u.*, to_jsonb(up.*) AS profile
    FROM users u
    LEFT JOIN user_profiles up ON u.id = up.user_id
    WHERE u.id = %s
    """,
    (user_id,),  # the driver escapes it; injection is impossible
)

Insert & update

POST create · PATCH partial update

-- POST /v1/users  → returning the new row back to the client
INSERT INTO users (email, full_name, password_hash)
VALUES ($1, $2, $3)
RETURNING *;

-- PATCH /v1/users/:id  → only the fields the client actually sent
UPDATE user_profiles
SET bio = $1, phone = $2
WHERE user_id = $3
RETURNING *;

Dynamic filter, sort & pagination

List endpoints support filtering, sorting, and pagination — built dynamically in app code from the query params, with sane defaults (no page → 1, no limit → 10, no sort → created_at DESC). At the SQL level:

Filter with ILIKE — case-insensitive pattern match. WHERE u.full_name ILIKE $1 || '%' matches names starting with a letter regardless of case.
Sort from an allow-list of columns the client may sort by (full_name, email, created_at) plus a direction.
Paginate with LIMIT + OFFSET. Note: the API's page 1 maps to database OFFSET 0 — the offset is zero-based.

list with filter + sort + pagination

SELECT u.*, to_jsonb(up.*) AS profile
FROM users u
LEFT JOIN user_profiles up ON u.id = up.user_id
WHERE u.full_name ILIKE $1 || '%'   -- filter: names starting with $1 (case-insensitive)
ORDER BY u.created_at DESC           -- sort (column from an allow-list, direction chosen)
LIMIT $2 OFFSET $3;                  -- page 1 → OFFSET 0, page 2 → OFFSET (limit), ...

Indexes

Indexes are the biggest performance lever you have, and they're best understood through the book analogy. A book's index lists each chapter and its page number, so to reach chapter 4 you don't flip through 50 pages one by one — you look it up and jump straight there.

The problem: sequential scan

Remember databases store rows scattered across the disk, not in neat order. Without an index, finding one task by ID means a sequential scan: the database checks row after row, location after location, until it finds a match. With six rows that's instant; with a million or a billion rows it's painfully slow.

Sequential scan vs index lookup

The solution: a lookup table

An index is essentially a separate, ordered lookup table mapping each indexed value to the disk location of its row. Query by an indexed field and the database scans this compact, sorted structure (internally a B-tree), finds the entry fast, reads the location, and jumps straight to the row — no full scan. Two properties matter:

Direct lookup — value → location, instead of checking every row.
Order — an index can be ascending or descending. If you frequently run ORDER BY created_at DESC, build the index descending so the rows come back already in order, with no extra sort step.

When to create an index

The rule of thumb — index a column when it's used in a WHERE clause, a JOIN condition, or an ORDER BY. One important detail: primary keys are auto-indexed, but foreign keys are not — so foreign keys you join on (like tasks.project_id, tasks.assigned_to) are prime index candidates.

Walking the actual reasoning behind each index in our schema makes the rule concrete:

users.email → there's a "find user by email" lookup and joins on email, so it appears in WHERE/JOIN → index it.
users.created_at DESC → the list-users API sorts newest-first by default and is called constantly, so index it descending to match the ORDER BY exactly.
tasks.project_id → "fetch all tasks of a project" joins projects to tasks on this foreign key. The projects.id side is the primary key (already indexed), but project_id here is a foreign key and isn't — so index it.
tasks.assigned_to → "fetch all tasks for a user" joins on this foreign key. Same logic — foreign key, not auto-indexed, used in a join → index it.
tasks.status → "fetch all pending tasks" filters on status in a WHERE clause → index it.
tasks.created_at DESC → the list-tasks API sorts newest-first → index descending.

Internally a B-tree keeps these entries sorted, which is what makes both the direct lookup and the ordered scan fast — but you don't need the internals to apply the rule; the three triggers (WHERE / JOIN / ORDER BY) are enough to decide.

indexes chosen from real WHERE / JOIN / ORDER BY usage

CREATE INDEX idx_users_email        ON users (email);              -- WHERE / JOIN on email
CREATE INDEX idx_users_created_at    ON users (created_at DESC);    -- ORDER BY newest first
CREATE INDEX idx_tasks_project_id    ON tasks (project_id);         -- FK join (not auto-indexed)
CREATE INDEX idx_tasks_assigned_to   ON tasks (assigned_to);        -- FK join (not auto-indexed)
CREATE INDEX idx_tasks_status        ON tasks (status);             -- WHERE status = ...
CREATE INDEX idx_tasks_created_at    ON tasks (created_at DESC);    -- ORDER BY newest first

The write-overhead trade-off

Indexes speed up reads but slightly slow down writes: every INSERT/UPDATE must also keep the index up to date. So don't blindly index everything. Evaluate how frequently the query runs and whether the read win is worth the write cost. Practical move: create indexes for hot WHERE/JOIN/ORDER BY columns, then monitor — drop ones whose queries turn out to be rare.

Triggers

Every table here has an updated_at column that should be refreshed to "now" each time the row changes. You could set it manually in every UPDATE — or you can automate it with a trigger so the application never has to think about it.

A trigger runs a function automatically when a condition is met. Here: define a function that sets NEW.updated_at = now() and returns the row, then attach a BEFORE UPDATE trigger on each table that calls it.

UPDATE fires the trigger automatically

one function, reused as a trigger on every table

-- the function: stamp updated_at, return the modified row
CREATE FUNCTION set_updated_at() RETURNS TRIGGER AS $$
BEGIN
  NEW.updated_at = now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- attach it before every UPDATE on each table
CREATE TRIGGER trg_users_updated
  BEFORE UPDATE ON users
  FOR EACH ROW EXECUTE FUNCTION set_updated_at();

CREATE TRIGGER trg_projects_updated
  BEFORE UPDATE ON projects
  FOR EACH ROW EXECUTE FUNCTION set_updated_at();
-- ...and likewise for user_profiles, tasks, project_members

Now any UPDATE — from anywhere, app code or a manual query — automatically refreshes updated_at. No column gets forgotten, and the logic lives in one place.

The chapter in one breath

Databases persist data on cheap, roomy disk; a DBMS gives you efficient, safe CRUD that text files can't. Default to Postgres for its integrity plus JSONB. Model with the right relationships (one-to-one, one-to-many, many-to-many via a linking table), enforce correctness with constraints & enums, evolve safely through migrations, query with joins and always-parameterized SQL, and reach for the two performance levers — indexes (on WHERE/JOIN/ORDER BY columns) and triggers (to automate the tedious). That's ~80% of database work as a backend engineer.

Transactions & ACID

Back in section 3 we saw why a text file can't handle two people editing the same value at once — the lost-update race. A real DBMS solves that with transactions: a group of operations that either all succeed together or none do. This is the machinery that makes concurrency safe, and it's worth understanding because money, inventory, and any "read-then-write" logic depend on it.

The classic example — a money transfer

Move ₹500 from account A to account B. That's two writes: subtract from A, add to B. If the server crashes between them, A has lost ₹500 that never arrived at B — money vanished. A transaction makes the pair atomic: wrap them in BEGIN … COMMIT, and either both happen or, on any failure, ROLLBACK undoes everything as if nothing occurred.

both writes commit together, or neither does

BEGIN;                                              -- start the transaction
  UPDATE accounts SET balance = balance - 500 WHERE id = $1;  -- debit A
  UPDATE accounts SET balance = balance + 500 WHERE id = $2;  -- credit B
COMMIT;                                             -- both are now permanent
-- if anything fails between BEGIN and COMMIT → ROLLBACK undoes it all

tx, err := db.Begin()
if err != nil { return err }
defer tx.Rollback() // safety net: a no-op if we already committed

if _, err = tx.Exec(`UPDATE accounts SET balance = balance - 500 WHERE id = $1`, a); err != nil {
	return err // deferred Rollback fires → A is untouched
}
if _, err = tx.Exec(`UPDATE accounts SET balance = balance + 500 WHERE id = $1`, b); err != nil {
	return err // deferred Rollback fires → both reverted
}
return tx.Commit() // only here do both writes become permanent

conn = pool.getconn()
try:
    with conn:                       # commits on success, ROLLBACKs on any exception
        with conn.cursor() as cur:
            cur.execute("UPDATE accounts SET balance = balance - 500 WHERE id = %s", (a,))
            cur.execute("UPDATE accounts SET balance = balance + 500 WHERE id = %s", (b,))
    # reaching here means both writes committed atomically
finally:
    pool.putconn(conn)

ACID — the four guarantees

Transactions give you four properties, abbreviated ACID:

Property	Guarantee	In the transfer example
Atomicity	All operations succeed, or none do.	You never debit A without crediting B.
Consistency	The DB moves from one valid state to another; constraints always hold.	Total money across accounts stays correct.
Isolation	Concurrent transactions don't step on each other; each runs as if alone.	Two transfers at once can't corrupt a balance.
Durability	Once committed, it survives crashes/power loss.	After COMMIT, the transfer is permanent.

This is a major reason to reach for a relational database like Postgres when data is critical — ACID transactions are a first-class guarantee, exactly the thing the section-3 text file could never provide.

Isolation & the lost-update race, revisited

Isolation is the property that actually fixes our two-writers-on-40 problem. Under the hood the database uses locking and MVCC (multi-version concurrency control) so that concurrent transactions see a consistent snapshot and serialize their conflicting writes, rather than blindly overwriting each other.

For read-then-write logic specifically, there are two standard tools:

Atomic write — push the computation into the SQL itself: UPDATE accounts SET balance = balance - 500 … reads and writes in one locked step, so there's no gap for another writer to slip into. (Far safer than reading the balance into app code, computing, and writing it back.)
Row locking — SELECT … FOR UPDATE locks the rows you read until your transaction ends, so a concurrent transaction must wait its turn instead of racing you.

SELECT ... FOR UPDATE — lock the row, then safely modify

BEGIN;
  -- lock this account's row so no other transaction can change it until we're done
  SELECT balance FROM accounts WHERE id = $1 FOR UPDATE;
  -- ... compute safely in app code ...
  UPDATE accounts SET balance = $2 WHERE id = $1;
COMMIT;

Don't read-then-write in app code without protection

The naive pattern — SELECT balance into your program, add in code, UPDATE it back — is precisely the section-3 race condition, just moved into your backend. Two requests both read the old balance and one overwrites the other. Use an atomic SQL update (balance = balance - 500) or FOR UPDATE row locking inside a transaction. This is one of the most common real-world concurrency bugs.

Where transactions fit

Any time a single logical operation spans multiple writes that must all hold together — transfers, "create order + decrement stock," "archive org + cascade cleanup" — wrap it in a transaction. ACID is what lets you treat that group as one indivisible, crash-safe, concurrency-safe unit. It's the database keeping the promise a text file never could.

Production Essentials

The transcript stops at the fundamentals. These four topics aren't in it, but you'll hit every one of them the moment your database serves real traffic. They're the difference between "it works on my machine with six rows" and "it works at scale."

The N+1 query problem — the most common DB performance bug

You fetch a list of N projects, then loop over them and fire one more query per project to load its tasks. That's 1 + N queries — for 100 projects, 101 round trips to the database. Each round trip has network + planning overhead, so this is death by a thousand cuts. It's especially insidious with ORMs, where lazy-loading a relationship inside a loop hides the extra queries behind innocent-looking code.

N+1 vs a single join

The fix is to fetch related data in one query — a JOIN, or a batched WHERE project_id = ANY($1), or an aggregate that builds nested JSON server-side. With an ORM, the equivalent is "eager loading" (e.g. JOIN/preload/include) instead of lazy loading inside a loop.

N+1 collapsed into one query with aggregated tasks

-- instead of: SELECT * FROM projects;  then per row: SELECT * FROM tasks WHERE project_id = ?
SELECT
  p.*,
  COALESCE(jsonb_agg(t.*) FILTER (WHERE t.id IS NOT NULL), '[]') AS tasks
FROM projects p
LEFT JOIN tasks t ON t.project_id = p.id
GROUP BY p.id;        -- one round trip, every project with its tasks nested

Connection pooling

Opening a Postgres connection is expensive — Postgres spawns a process per connection, and there's a hard ceiling (default ~100 max_connections). If every web request opens its own connection, you exhaust the server and pay the setup cost constantly. The fix is a connection pool: a small set of long-lived connections that requests borrow and return.

Many requests, a few reused connections

Two layers, often combined: the driver's built-in pool (Go's database/sql pools automatically; Python uses psycopg_pool/SQLAlchemy's pool) for a single app instance, and an external pooler like PgBouncer when many app instances would otherwise each hold their own pool and overwhelm Postgres. Tune pool size to your workload — bigger isn't better; too many connections degrade Postgres.

EXPLAIN ANALYZE — verify the index is actually used

Creating an index doesn't guarantee the planner uses it. EXPLAIN shows the query plan; EXPLAIN ANALYZE actually runs the query and reports real timings. You read it looking for "Index Scan" (good — your index is working) versus "Seq Scan" (the planner read every row).

Verified live · this is real output

Running EXPLAIN SELECT * FROM tasks WHERE status = 'pending' against the chapter's actual schema on Postgres 16 produced: Bitmap Index Scan on idx_tasks_status — confirming the status index from section 11 is genuinely used for that filter, not just created and ignored.

One caveat that confuses people: on a tiny table the planner may correctly choose a Seq Scan anyway (reading 6 rows directly is faster than the index round-trip). That's not a bug — index benefits show up at scale. EXPLAIN ANALYZE on production-sized data is how you confirm an index earns its keep, and how you debug a slow query that should be using one but isn't.

Soft deletes

Often you don't want a row truly gone — you want it kept for audit, recovery, or historical integrity, but hidden from normal queries. That's a soft delete: add a nullable deleted_at TIMESTAMPTZ column (null = live, a timestamp = deleted). "Deleting" becomes an UPDATE; every read filters out the deleted rows.

soft delete · the column, the "delete", the reads, and the unique-index gotcha

ALTER TABLE users ADD COLUMN deleted_at TIMESTAMPTZ;   -- null = alive

-- "delete" = mark, don't remove
UPDATE users SET deleted_at = now() WHERE id = $1;

-- every read must exclude the dead rows
SELECT * FROM users WHERE deleted_at IS NULL ORDER BY created_at DESC;

-- gotcha: a plain UNIQUE now blocks reusing a soft-deleted email.
-- a PARTIAL unique index fixes it — uniqueness only among LIVE rows:
CREATE UNIQUE INDEX uniq_active_email
  ON users (email) WHERE deleted_at IS NULL;

The trade-offs are real: every query must remember the WHERE deleted_at IS NULL filter (easy to forget — a leak), unique constraints need the partial-index treatment shown above, and the table only ever grows. Use soft deletes where history matters (orders, user accounts, anything auditable); use hard DELETE for genuinely disposable data.

Production mindset

The fundamentals get you a correct database; these get you a fast, safe, operable one. Collapse N+1 into joins, pool your connections, prove your indexes with EXPLAIN ANALYZE, and reach for soft deletes when data must survive its own deletion. None of it is exotic — it's the day-two reality of running a database in production.

Mastering Databaseswith Postgres

Why Databases Exist

"Database" is a surprisingly broad word

What "database" means in a backend context

Disk vs RAM

DBMS & Why Not Just Text Files

Why a plain text file falls apart

1 · Parsing is slow and error-prone

2 · There's no structure

3 · No concurrency control

Relational vs Non-Relational

Relational (SQL)

Non-relational (NoSQL)

Picking with two examples

Why Postgres

Data Types

Decimal vs float — the money rule

char vs varchar vs text — prefer text

json vs jsonb — prefer jsonb

Migrations

How the workflow works

Why migrations matter

Seeding

Modeling Relationships

One-to-one — users ↔ user_profiles

One-to-many — projects ↔ tasks

Many-to-many — users ↔ projects

Constraints & Integrity

Primary key

Not null — the 70% rule

Unique

Foreign key constraint

Check constraint

Enums — integrity + documentation

Referential integrity — ON DELETE behavior

Querying & Joins

Build queries from FROM, not SELECT

Joining tables — and why LEFT JOIN

Parameterized queries — never concatenate

Insert & update

Dynamic filter, sort & pagination

Indexes

The problem: sequential scan

The solution: a lookup table

When to create an index

Triggers

Transactions & ACID

The classic example — a money transfer

ACID — the four guarantees

Isolation & the lost-update race, revisited

Production Essentials

The N+1 query problem — the most common DB performance bug

Connection pooling

EXPLAIN ANALYZE — verify the index is actually used

Soft deletes

Mastering Databases
with Postgres