Mastering Databases
with Postgres
Interacting with a database is the single most frequent thing a backend engineer does. This chapter builds the whole picture from the ground up — why databases exist at all, why they live on disk, how to model real relationships, how to query them safely and fast, and the two performance levers (indexes and triggers) that separate a toy from production. Everything is grounded in one running example: a project-management platform.
Why Databases Exist
At its core a database is simply a way to persist information across sessions. Persistence means storing data so it survives even after the program that created it has stopped. Think of a to-do app: you add items, check some off, close the app — and when you reopen it, everything is exactly as you left it. Without persistence you'd start from an empty list every single time and lose all your progress. That continuity, across time and across machines, is the whole point.
"Database" is a surprisingly broad word
In the simplest sense, any structured storage is a database. Your phone's contact list is a database. The browser's localStorage, sessionStorage, and cookies — all key-value stores you can inspect under DevTools → Application — are databases. Even a plain text file you jot notes into is a very basic database. The common thread: a persistent system that lets you create, read, update, and delete data — CRUD.
What "database" means in a backend context
That generic definition isn't what engineers usually mean. In the context of servers and backend systems, "database" almost always means a disk-based database — data stored on a hard disk (HDD) or solid-state drive (SSD). Why disk? That's the next section, and it's the most important foundational idea here.
A database is a persistence layer offering CRUD. In backend work specifically, it's a disk-based persistence layer — chosen deliberately for capacity over raw speed. Everything else in this chapter follows from that one trade-off.
Disk vs RAM
Your CPU can reach two kinds of memory. RAM (primary memory) is extremely fast but expensive and limited in supply — most machines have 8, 16, 32, maybe 64–128 GB. Disk (secondary memory) — HDD or SSD — is much cheaper and far more plentiful, commonly 512 GB to several TB, but it's slower to read from and write to.
Caches (Redis, in-memory stores) live in RAM because fetching from primary memory is blazing fast. Traditional databases live on disk because what matters most for storage is capacity — and we can trade some speed for far more space at far less cost.
So the pattern is: caching = RAM (speed-critical, small), databases = disk (capacity-critical, large). Postgres, MySQL, MongoDB — all store their data on disk. How they lay bytes out on disk is a deep, fascinating area, but as an application-level backend engineer the essential takeaway is just this trade-off. (And it's exactly why indexes, in section 11, matter so much: they're how we claw back speed from slow disk lookups.)
DBMS & Why Not Just Text Files
Storing data on disk isn't enough — we need efficient ways to retrieve, change, and delete it, across hundreds of gigabytes. That's the job of a DBMS (Database Management System): software whose sole purpose is to store data and provide efficient CRUD operations on it. On a high level a DBMS has two jobs — store the data and serve operations on it — plus a set of responsibilities:
| Responsibility | What it means |
|---|---|
| Data organization | Lay data out so fetch/update/insert are efficient. |
| Access | Provide methods for CRUD — create, read, update, delete. |
| Integrity | Keep data accurate and valid — e.g. a payment field accepts only numbers, never a stray string. |
| Security | Protect data from unauthorized access via users and roles. |
Why a plain text file falls apart
Before DBMS software existed, people tried storing everything in text files. It breaks down for three concrete reasons:
1 · Parsing is slow and error-prone
To find one customer you'd write application code to read the file, split every line, and compare each field — a slow operation, and slower still in languages like JavaScript or Python than in something like Rust. Worse, hand-rolled parsing is fragile: one mistake and you corrupt data or hand a customer the wrong record.
2 · There's no structure
Text files have no formal schema. You can dump any text in any shape, so you cannot enforce "this field must always be a number." There's no way to guarantee consistency — it's all just strings.
3 · No concurrency control
This is the killer. What happens when two people update the same value at once?
Both read 40 at the same time. A computes 60, B computes 20, and whichever writes last clobbers the other — you can't predict which. A text file simply can't arbitrate this. A real DBMS uses transactions and locking to make concurrent writes consistent.
These three failures — slow parsing, no structure, no concurrency control — are exactly why DBMS software exists. Try to grow a text-file "database" today and you'll hit every one of them.
Relational vs Non-Relational
On a high level there are two major families of DBMS.
Relational (SQL)
A relational database organizes data into tables of rows and columns, with relationships between tables defined by foreign keys. Its defining feature is a strict, predefined schema: before inserting anything you must declare the table, its columns, and each column's data type. Nothing happens "on the fly." That strictness buys the big payoff — data integrity: at any moment you can bet on the shape and types of your data. You interact with it using SQL (Structured Query Language). Examples: PostgreSQL, MySQL, SQL Server.
Non-relational (NoSQL)
NoSQL databases like MongoDB don't enforce a schema. In Mongo's vocabulary a table is a collection and a row is a document, and two documents in the same collection can have completely different shapes. The advantage is flexibility — great when prototyping fast or when you genuinely don't know the data's shape in advance. The cost is weaker integrity: with no schema enforced at the database level, you must enforce correctness in application code, which adds complexity and is more error-prone (application code changes constantly and bugs slip in).
Picking with two examples
- CRM (customer relationship management) → relational. Contacts, sales opportunities, and their relationships are critical, structured data that demands strong integrity and complex queries. Postgres is a great fit.
- CMS (content management system) → non-relational (or JSON in Postgres). An article can contain an image, a code block, a YouTube embed — wildly varied content you can't pin to a fixed schema. "Take it all and dump it" suits a document store.
Why Postgres
Among all the options, Postgres is the sensible default for most projects, for five reasons:
- Open source & free. Not proprietary — companies can self-host it on their own servers and inspect the source.
- SQL-standard compliant. Standard SQL runs the same way it would on MySQL or SQL Server, so migrating to another engine later means far less rework.
- Extensible. The docs run ~1400 pages; it covers nearly every use case a typical SaaS hits, and has a strong extension system to customize it.
- Reliable & scalable. A proven track record at scale.
- Excellent JSON support — the deciding feature. The main reason people reach for MongoDB is "store any JSON." But Postgres offers a
JSON/JSONBtype with strong indexing and query capability, so you can store dynamic, schemaless data inside a relational database. That removes the usual reason to leave Postgres for a NoSQL store.
You'll find articles claiming MySQL has performance edges — but until you're serving millions of users and chasing a specific bottleneck, that difference won't matter. The rich feature set plus first-class JSON makes Postgres a strong first choice for nearly any project. That's what the rest of this chapter uses.
SQL is the language you write queries in. Postgres is the database system — the software that executes those queries. This chapter assumes you've picked up SQL basics (SELECT, CREATE TABLE, ORDER BY, GROUP BY) elsewhere, and focuses on the concepts that actually matter day-to-day in backend work.
Data Types
Every column in a relational table has a fixed type. Here are the ones you'll actually use, with the decisions that matter.
| Type | What it stores · when to use |
|---|---|
serial / bigserial | Auto-incrementing integer (1, 2, 3…) when you omit it on insert. Use bigserial in production for higher capacity. Common for integer IDs / primary keys. |
smallint · integer · bigint | Integers of increasing capacity. Pick by the largest number you need to store. |
decimal(10,2) / numeric | Exact-precision number — here, up to 10 total digits, 2 after the decimal. Always use for money/price. |
real · double precision | Floating-point. Fast, but values can differ slightly across systems. Use for scientific/measurement data where tiny discrepancies don't matter. |
char(n) | Fixed-length text, padded with spaces to n. Avoid — legacy. |
varchar(n) | Variable-length text with a max of n. The famous 255 is just a MySQL habit, meaningless in Postgres. |
text | Variable-length, no limit. Prefer this in Postgres (see below). |
boolean | true / false. |
date · time · timestamp | A date, a time, or both together. |
timestamptz | Timestamp with time zone — stores the zone too. Prefer for created/updated columns. |
interval | A span like "10 days" or "1 week". |
uuid | Universally unique identifier. Great for primary keys, especially in distributed systems. Postgres has it natively. |
json · jsonb | JSON data. Prefer jsonb (binary) — Postgres serializes it to its own format so it can index and query it fast. Plain json is stored as text. |
type[] (array) | An array of any type — integer[], text[], etc. |
| inet · macaddr · point · xml | Network addresses, MAC addresses, geometric points, XML. Rarely needed day-to-day. |
Decimal vs float — the money rule
First, decode the parameters: decimal(10,2) means 10 total digits, with 2 after the decimal point. So you can store 12345678.90 — count them, that's 10 digits across the whole number, two of them on the right of the point. Try to store a number needing 11 digits and the database rejects it.
Floating-point numbers are fast but represented in a way that can differ slightly across systems, so the same value can render as 89.87 here and 89.9 there. For a price, where the number feeds calculations and accuracy is critical, that's unacceptable — always use decimal/numeric. For something like an area measurement (567.8987) where a tiny rounding difference is harmless, use float for the speed.
And the natural follow-up: if decimals are always accurate, why not use them everywhere? Because that exactness costs performance — floats are stored and computed faster, which is exactly why scientific/measurement domains prefer them. Rule of thumb: accuracy matters → decimal; speed matters and small discrepancies are fine → float.
char vs varchar vs text — prefer text
All three store text; the difference is how they handle length. char(10) always pads to the full length with spaces — store "ab" and the database adds 8 trailing spaces before saving, because a char field always occupies its declared length. varchar(255) stores only what you give (storing "ab" uses length 2, "abcd" uses 4) up to a maximum of 255 — exceed it and you get a database-level error. text is varchar with no length limit at all (technically a very high ceiling, ~1 GB, which you'll never approach).
Postgres's own docs recommend text — there's essentially no performance difference between them, and you can index text just fine (a common misconception borrowed from other engines like MySQL/SQL Server, which behave differently and confuse people switching over). Two practical reasons to default to text:
- No migration headaches. If you set
varchar(255)and later need longer values, you must run a schema migration that touches lots of data — riskier than just enforcing length in application code. As backend engineers we interact with the DB through a driver and have application code anyway, so enforce length there and keep the migration simple. - Self-documenting. A number like
varchar(255)looks meaningful, so a new teammate wastes time hunting for why it's 255 (answer: it isn't meaningful — it's a copied MySQL habit).textreads cleanly as "a text field," with no phantom number to explain.
Only use char(n) when the length is genuinely always identical — e.g. two-letter day codes Mo, Tu, We — where padding never happens and you save space versus variable-length overhead.
json vs jsonb — prefer jsonb
json is stored as plain text; jsonb is serialized into Postgres's own binary format. Almost always choose jsonb — it queries and indexes far faster. (It's a Postgres feature, not standard SQL.)
Migrations
In production you don't just open a GUI like TablePlus and start running queries — there'd be no record of what changed, when, or who did it, and no way to control the database's state over time. Instead, databases use migrations: version control for your schema.
How the workflow works
You keep a folder of sequentially-named SQL files. A command-line tool (dbmate, golang-migrate, etc.) walks them in order and applies the SQL inside each to the configured database.
Each file has two halves: an up migration (the change — CREATE TABLE, CREATE INDEX…) and a down migration (the exact reverse — DROP TABLE…). The down half exists so that if a change breaks production you can roll back to the previous state cleanly. The tool maintains a schema_migrations table holding the current version, so it never re-runs an already-applied file (which would error with "table already exists").
-- migrate:up CREATE TABLE users ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), email TEXT NOT NULL UNIQUE ); -- migrate:down DROP TABLE users;
Why migrations matter
- Tracked history. The files live in git alongside your code, so every schema change over months or years is recorded, reviewable, and reproducible on any machine or server.
- Rollback. The down migrations let you revert a bad change to a known-good version.
Seeding
For local/dev testing you want sample data to work against before deploying. That's seeding — a separate migration that inserts test rows (sample users, projects, etc.). Best practice keeps seed data in its own migration file so it's clearly distinct from schema changes.
Modeling Relationships
We'll model a project-management platform (think Jira/Linear). From the requirements we pull the nouns → tables: users, user_profiles, projects, tasks, project_members.
Plural table names (users, projects) by industry standard, and snake_case columns (full_name, password_hash). Why snake_case? Postgres is case-insensitive by nature — write fullName and it reads it as fullname unless you wrap it in double quotes everywhere, which makes application code ugly. Lowercase snake_case sidesteps the whole problem.
The three relationship shapes are the heart of data modeling. Here's how each is implemented:
One-to-one — users ↔ user_profiles
Why split profile data into its own table instead of piling it onto users? Because a profile grows over time (bio, phone, avatar, social links, websites…) and would force constant migrations on the core users table. Abstracting it into a separate table that can evolve independently keeps the primary table lightweight. Since each user has exactly one profile, you don't even need a separate ID — make the profile's user_id serve as both its primary key and its foreign key.
One-to-many — projects ↔ tasks
One project has many tasks; each task belongs to one project. Implement it by putting a project_id foreign key on the tasks table (not a primary key). Many task rows can share the same project_id — that's the "many" side.
Many-to-many — users ↔ projects
A user can be in many projects, and a project can have many users — and you want to query it both directions. When the same relationship makes sense from both sides, it's many-to-many. You can't express that with a single foreign key, so you create a linking table (project_members) holding two foreign keys — user_id and project_id — and make the pair a composite primary key. That combination is implicitly unique and not-null, so "user 1 is in project 2" can exist exactly once.
-- custom enum types: a fixed set of allowed values
CREATE TYPE project_status AS ENUM ('active', 'completed', 'archived');
CREATE TYPE task_status AS ENUM ('pending', 'in_progress', 'completed', 'cancelled');
CREATE TYPE member_role AS ENUM ('owner', 'admin', 'member');
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT NOT NULL UNIQUE,
full_name TEXT NOT NULL,
password_hash TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- ONE-TO-ONE: user_id is BOTH primary key and foreign key
CREATE TABLE user_profiles (
user_id UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
avatar_url TEXT, -- optional → no NOT NULL
bio TEXT,
phone TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE projects (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name TEXT NOT NULL,
description TEXT,
status project_status NOT NULL DEFAULT 'active',
owner_id UUID NOT NULL REFERENCES users(id) ON DELETE RESTRICT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- ONE-TO-MANY: a project has many tasks via project_id
CREATE TABLE tasks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
title TEXT NOT NULL,
description TEXT,
priority INTEGER NOT NULL DEFAULT 1 CHECK (priority BETWEEN 1 AND 5),
status task_status NOT NULL DEFAULT 'pending',
due_date DATE,
assigned_to UUID REFERENCES users(id) ON DELETE SET NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- MANY-TO-MANY: linking table with a COMPOSITE primary key
CREATE TABLE project_members (
project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
role member_role NOT NULL DEFAULT 'member',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (project_id, user_id) -- the pair is unique & not-null
);
This schema is verified against PostgreSQL 16. gen_random_uuid() is built into Postgres core from 13 onward. On older versions (≤12) you must first enable the crypto extension: CREATE EXTENSION IF NOT EXISTS pgcrypto; — then the same call works.
Constraints & Integrity
Constraints are conditions you attach to a column so the database itself protects the data from going corrupt. The whole philosophy: push integrity checks down to the database rather than trusting application code alone. App code is one layer of defense; a database constraint is the guarantee.
Primary key
A unique field that identifies one row. It implicitly carries two constraints — not null and unique — plus foreign-key referencing behavior. By convention every table has an id, often UUID PRIMARY KEY DEFAULT gen_random_uuid() so Postgres generates a random ID when you omit it on insert.
Not null — the 70% rule
By default every Postgres column allows null unless you say otherwise. More than 70% of your columns should be NOT NULL, because nulls quietly creep in — especially via automated scripts with bugs — and leave you with a messy, inconsistent table. Add NOT NULL unless a field is genuinely optional (like a user's bio or phone).
Unique
No two rows may share that value — e.g. email TEXT NOT NULL UNIQUE means inserting a duplicate email throws a database-level error.
Foreign key constraint
REFERENCES users(id) guarantees that whatever you put in assigned_to or owner_id must be a real, existing user ID. Pass a random (even valid-looking) UUID with no matching user and the insert fails. You can't reference something that doesn't exist.
Check constraint
A custom rule the row must satisfy — the transaction only goes through while the condition holds. CHECK (priority BETWEEN 1 AND 5) blocks anyone from inserting priority = 55.
Enums — integrity + documentation
An enum is a custom type limited to a fixed set of values (e.g. project_status ∈ {active, completed, archived}). Two wins:
- Data integrity — insert any other string into a
project_statuscolumn and you get a database-level error; you don't have to rely solely on app-code validation. - Self-documentation — arguably the bigger win. A future teammate reading the migration sees the allowed values at a glance. If the column were plain
text, they'd have to grep the entire application code to discover which strings are actually valid.
Referential integrity — ON DELETE behavior
When you delete a row other tables point at, what happens to those references? You declare it per foreign key:
| Clause | On deleting the referenced row… | Used here for |
|---|---|---|
ON DELETE RESTRICT | Block the delete if any child rows still reference it. | projects.owner_id — can't delete a user who still owns projects. |
ON DELETE CASCADE | Delete the children too. | tasks.project_id — deleting a project deletes its tasks. |
ON DELETE SET NULL | Set the child's reference to null (column must allow null). | tasks.assigned_to — deleting a user un-assigns their tasks. |
ON DELETE SET DEFAULT | Set the child's reference to its declared default. | — |
Every constraint — primary key, not null, unique, foreign key, check, enum, ON DELETE — exists so the database guarantees correctness, not just hopeful application code. The more you encode at this level, the harder it is for a future bug to corrupt your data.
Querying & Joins
Build queries from FROM, not SELECT
A habit worth adopting: when writing a SQL query, start with the FROM clause, not SELECT. Decide where your data comes from first — which tables, joined how — and only then decide what to pull out. It keeps you oriented: you always know the source before you reach for columns. We'll build the "list users with profile" query in exactly that order — source first, then shape, then ordering.
FROM users u — and give it a short alias (u) so you can refer to it concisely throughout the query.LEFT JOIN user_profiles up ON u.id = up.user_id.u.* plus the folded profile, to_jsonb(up.*) AS profile.ORDER BY u.created_at DESC so results are deterministic and newest-first.Joining tables — and why LEFT JOIN
To return each user with their profile in one API call, you join users to user_profiles on the foreign key. Use LEFT JOIN, not INNER JOIN: a user might never have created a profile row, and an INNER JOIN only returns rows where both sides have a match — so those profile-less users would silently vanish from the result, returning fewer users than exist. A LEFT JOIN keeps every row from the left (main) table regardless, filling the profile side with null when there's no match. Since we always want the user back, profile or not, LEFT JOIN is correct here. (INNER and LEFT joins cover the vast majority of real queries; the others are rare.)
Two more details in that query. The short aliases (u, up) just make a long query readable — you assign them once in FROM/JOIN and reuse them everywhere. And to_jsonb(up.*) is a built-in Postgres function that converts the matched profile row into a JSON object, embedded as a single profile field on each user — so the frontend gets user.profile nested, in one round trip, instead of making a second call. Always end a list query with an ORDER BY (here created_at DESC): query results come back in arbitrary order otherwise, and that order can change between calls, so you sort explicitly — usually newest-first by creation time.
SELECT u.*, to_jsonb(up.*) AS profile -- fold the profile row into one JSON field FROM users u LEFT JOIN user_profiles up -- LEFT so users without a profile still appear ON u.id = up.user_id -- the foreign-key join condition ORDER BY u.created_at DESC; -- newest first; never rely on default order
Parameterized queries — never concatenate
This is the single most important security habit with databases. A parameterized query leaves a placeholder slot and supplies the value separately; the database treats whatever you pass as pure data (a string), never executable SQL. If an attacker passes '; DELETE FROM users; -- into a parameter, it's stored/compared as a harmless literal string, not run as a command.
The vulnerability appears when you build queries by concatenating strings with user input. Then a crafted input can break out and run its own SQL — reading, altering, or wiping your data. Never string-concatenate user values into a query. Always use parameter placeholders; your driver/ORM handles the safe substitution. This isn't optional.
SELECT u.*, to_jsonb(up.*) AS profile FROM users u LEFT JOIN user_profiles up ON u.id = up.user_id WHERE u.id = $1; -- $1 is a parameter slot; the value is sent separately
Insert & update
-- POST /v1/users → returning the new row back to the client INSERT INTO users (email, full_name, password_hash) VALUES ($1, $2, $3) RETURNING *; -- PATCH /v1/users/:id → only the fields the client actually sent UPDATE user_profiles SET bio = $1, phone = $2 WHERE user_id = $3 RETURNING *;
Dynamic filter, sort & pagination
List endpoints support filtering, sorting, and pagination — built dynamically in app code from the query params, with sane defaults (no page → 1, no limit → 10, no sort → created_at DESC). At the SQL level:
- Filter with
ILIKE— case-insensitive pattern match.WHERE u.full_name ILIKE $1 || '%'matches names starting with a letter regardless of case. - Sort from an allow-list of columns the client may sort by (
full_name,email,created_at) plus a direction. - Paginate with
LIMIT+OFFSET. Note: the API's page 1 maps to databaseOFFSET 0— the offset is zero-based.
SELECT u.*, to_jsonb(up.*) AS profile FROM users u LEFT JOIN user_profiles up ON u.id = up.user_id WHERE u.full_name ILIKE $1 || '%' -- filter: names starting with $1 (case-insensitive) ORDER BY u.created_at DESC -- sort (column from an allow-list, direction chosen) LIMIT $2 OFFSET $3; -- page 1 → OFFSET 0, page 2 → OFFSET (limit), ...
Indexes
Indexes are the biggest performance lever you have, and they're best understood through the book analogy. A book's index lists each chapter and its page number, so to reach chapter 4 you don't flip through 50 pages one by one — you look it up and jump straight there.
The problem: sequential scan
Remember databases store rows scattered across the disk, not in neat order. Without an index, finding one task by ID means a sequential scan: the database checks row after row, location after location, until it finds a match. With six rows that's instant; with a million or a billion rows it's painfully slow.
The solution: a lookup table
An index is essentially a separate, ordered lookup table mapping each indexed value to the disk location of its row. Query by an indexed field and the database scans this compact, sorted structure (internally a B-tree), finds the entry fast, reads the location, and jumps straight to the row — no full scan. Two properties matter:
- Direct lookup — value → location, instead of checking every row.
- Order — an index can be ascending or descending. If you frequently run
ORDER BY created_at DESC, build the index descending so the rows come back already in order, with no extra sort step.
When to create an index
The rule of thumb — index a column when it's used in a WHERE clause, a JOIN condition, or an ORDER BY. One important detail: primary keys are auto-indexed, but foreign keys are not — so foreign keys you join on (like tasks.project_id, tasks.assigned_to) are prime index candidates.
Walking the actual reasoning behind each index in our schema makes the rule concrete:
users.email→ there's a "find user by email" lookup and joins on email, so it appears inWHERE/JOIN→ index it.users.created_at DESC→ the list-users API sorts newest-first by default and is called constantly, so index it descending to match theORDER BYexactly.tasks.project_id→ "fetch all tasks of a project" joinsprojectstotaskson this foreign key. Theprojects.idside is the primary key (already indexed), butproject_idhere is a foreign key and isn't — so index it.tasks.assigned_to→ "fetch all tasks for a user" joins on this foreign key. Same logic — foreign key, not auto-indexed, used in a join → index it.tasks.status→ "fetch all pending tasks" filters on status in aWHEREclause → index it.tasks.created_at DESC→ the list-tasks API sorts newest-first → index descending.
Internally a B-tree keeps these entries sorted, which is what makes both the direct lookup and the ordered scan fast — but you don't need the internals to apply the rule; the three triggers (WHERE / JOIN / ORDER BY) are enough to decide.
CREATE INDEX idx_users_email ON users (email); -- WHERE / JOIN on email CREATE INDEX idx_users_created_at ON users (created_at DESC); -- ORDER BY newest first CREATE INDEX idx_tasks_project_id ON tasks (project_id); -- FK join (not auto-indexed) CREATE INDEX idx_tasks_assigned_to ON tasks (assigned_to); -- FK join (not auto-indexed) CREATE INDEX idx_tasks_status ON tasks (status); -- WHERE status = ... CREATE INDEX idx_tasks_created_at ON tasks (created_at DESC); -- ORDER BY newest first
Indexes speed up reads but slightly slow down writes: every INSERT/UPDATE must also keep the index up to date. So don't blindly index everything. Evaluate how frequently the query runs and whether the read win is worth the write cost. Practical move: create indexes for hot WHERE/JOIN/ORDER BY columns, then monitor — drop ones whose queries turn out to be rare.
Triggers
Every table here has an updated_at column that should be refreshed to "now" each time the row changes. You could set it manually in every UPDATE — or you can automate it with a trigger so the application never has to think about it.
A trigger runs a function automatically when a condition is met. Here: define a function that sets NEW.updated_at = now() and returns the row, then attach a BEFORE UPDATE trigger on each table that calls it.
-- the function: stamp updated_at, return the modified row CREATE FUNCTION set_updated_at() RETURNS TRIGGER AS $$ BEGIN NEW.updated_at = now(); RETURN NEW; END; $$ LANGUAGE plpgsql; -- attach it before every UPDATE on each table CREATE TRIGGER trg_users_updated BEFORE UPDATE ON users FOR EACH ROW EXECUTE FUNCTION set_updated_at(); CREATE TRIGGER trg_projects_updated BEFORE UPDATE ON projects FOR EACH ROW EXECUTE FUNCTION set_updated_at(); -- ...and likewise for user_profiles, tasks, project_members
Now any UPDATE — from anywhere, app code or a manual query — automatically refreshes updated_at. No column gets forgotten, and the logic lives in one place.
Databases persist data on cheap, roomy disk; a DBMS gives you efficient, safe CRUD that text files can't. Default to Postgres for its integrity plus JSONB. Model with the right relationships (one-to-one, one-to-many, many-to-many via a linking table), enforce correctness with constraints & enums, evolve safely through migrations, query with joins and always-parameterized SQL, and reach for the two performance levers — indexes (on WHERE/JOIN/ORDER BY columns) and triggers (to automate the tedious). That's ~80% of database work as a backend engineer.
Transactions & ACID
Back in section 3 we saw why a text file can't handle two people editing the same value at once — the lost-update race. A real DBMS solves that with transactions: a group of operations that either all succeed together or none do. This is the machinery that makes concurrency safe, and it's worth understanding because money, inventory, and any "read-then-write" logic depend on it.
The classic example — a money transfer
Move ₹500 from account A to account B. That's two writes: subtract from A, add to B. If the server crashes between them, A has lost ₹500 that never arrived at B — money vanished. A transaction makes the pair atomic: wrap them in BEGIN … COMMIT, and either both happen or, on any failure, ROLLBACK undoes everything as if nothing occurred.
BEGIN; -- start the transaction UPDATE accounts SET balance = balance - 500 WHERE id = $1; -- debit A UPDATE accounts SET balance = balance + 500 WHERE id = $2; -- credit B COMMIT; -- both are now permanent -- if anything fails between BEGIN and COMMIT → ROLLBACK undoes it all
ACID — the four guarantees
Transactions give you four properties, abbreviated ACID:
| Property | Guarantee | In the transfer example |
|---|---|---|
| Atomicity | All operations succeed, or none do. | You never debit A without crediting B. |
| Consistency | The DB moves from one valid state to another; constraints always hold. | Total money across accounts stays correct. |
| Isolation | Concurrent transactions don't step on each other; each runs as if alone. | Two transfers at once can't corrupt a balance. |
| Durability | Once committed, it survives crashes/power loss. | After COMMIT, the transfer is permanent. |
This is a major reason to reach for a relational database like Postgres when data is critical — ACID transactions are a first-class guarantee, exactly the thing the section-3 text file could never provide.
Isolation & the lost-update race, revisited
Isolation is the property that actually fixes our two-writers-on-40 problem. Under the hood the database uses locking and MVCC (multi-version concurrency control) so that concurrent transactions see a consistent snapshot and serialize their conflicting writes, rather than blindly overwriting each other.
For read-then-write logic specifically, there are two standard tools:
- Atomic write — push the computation into the SQL itself:
UPDATE accounts SET balance = balance - 500 …reads and writes in one locked step, so there's no gap for another writer to slip into. (Far safer than reading the balance into app code, computing, and writing it back.) - Row locking —
SELECT … FOR UPDATElocks the rows you read until your transaction ends, so a concurrent transaction must wait its turn instead of racing you.
BEGIN; -- lock this account's row so no other transaction can change it until we're done SELECT balance FROM accounts WHERE id = $1 FOR UPDATE; -- ... compute safely in app code ... UPDATE accounts SET balance = $2 WHERE id = $1; COMMIT;
The naive pattern — SELECT balance into your program, add in code, UPDATE it back — is precisely the section-3 race condition, just moved into your backend. Two requests both read the old balance and one overwrites the other. Use an atomic SQL update (balance = balance - 500) or FOR UPDATE row locking inside a transaction. This is one of the most common real-world concurrency bugs.
Any time a single logical operation spans multiple writes that must all hold together — transfers, "create order + decrement stock," "archive org + cascade cleanup" — wrap it in a transaction. ACID is what lets you treat that group as one indivisible, crash-safe, concurrency-safe unit. It's the database keeping the promise a text file never could.
Production Essentials
The transcript stops at the fundamentals. These four topics aren't in it, but you'll hit every one of them the moment your database serves real traffic. They're the difference between "it works on my machine with six rows" and "it works at scale."
The N+1 query problem — the most common DB performance bug
You fetch a list of N projects, then loop over them and fire one more query per project to load its tasks. That's 1 + N queries — for 100 projects, 101 round trips to the database. Each round trip has network + planning overhead, so this is death by a thousand cuts. It's especially insidious with ORMs, where lazy-loading a relationship inside a loop hides the extra queries behind innocent-looking code.
The fix is to fetch related data in one query — a JOIN, or a batched WHERE project_id = ANY($1), or an aggregate that builds nested JSON server-side. With an ORM, the equivalent is "eager loading" (e.g. JOIN/preload/include) instead of lazy loading inside a loop.
-- instead of: SELECT * FROM projects; then per row: SELECT * FROM tasks WHERE project_id = ? SELECT p.*, COALESCE(jsonb_agg(t.*) FILTER (WHERE t.id IS NOT NULL), '[]') AS tasks FROM projects p LEFT JOIN tasks t ON t.project_id = p.id GROUP BY p.id; -- one round trip, every project with its tasks nested
Connection pooling
Opening a Postgres connection is expensive — Postgres spawns a process per connection, and there's a hard ceiling (default ~100 max_connections). If every web request opens its own connection, you exhaust the server and pay the setup cost constantly. The fix is a connection pool: a small set of long-lived connections that requests borrow and return.
Two layers, often combined: the driver's built-in pool (Go's database/sql pools automatically; Python uses psycopg_pool/SQLAlchemy's pool) for a single app instance, and an external pooler like PgBouncer when many app instances would otherwise each hold their own pool and overwhelm Postgres. Tune pool size to your workload — bigger isn't better; too many connections degrade Postgres.
EXPLAIN ANALYZE — verify the index is actually used
Creating an index doesn't guarantee the planner uses it. EXPLAIN shows the query plan; EXPLAIN ANALYZE actually runs the query and reports real timings. You read it looking for "Index Scan" (good — your index is working) versus "Seq Scan" (the planner read every row).
Running EXPLAIN SELECT * FROM tasks WHERE status = 'pending' against the chapter's actual schema on Postgres 16 produced: Bitmap Index Scan on idx_tasks_status — confirming the status index from section 11 is genuinely used for that filter, not just created and ignored.
One caveat that confuses people: on a tiny table the planner may correctly choose a Seq Scan anyway (reading 6 rows directly is faster than the index round-trip). That's not a bug — index benefits show up at scale. EXPLAIN ANALYZE on production-sized data is how you confirm an index earns its keep, and how you debug a slow query that should be using one but isn't.
Soft deletes
Often you don't want a row truly gone — you want it kept for audit, recovery, or historical integrity, but hidden from normal queries. That's a soft delete: add a nullable deleted_at TIMESTAMPTZ column (null = live, a timestamp = deleted). "Deleting" becomes an UPDATE; every read filters out the deleted rows.
ALTER TABLE users ADD COLUMN deleted_at TIMESTAMPTZ; -- null = alive -- "delete" = mark, don't remove UPDATE users SET deleted_at = now() WHERE id = $1; -- every read must exclude the dead rows SELECT * FROM users WHERE deleted_at IS NULL ORDER BY created_at DESC; -- gotcha: a plain UNIQUE now blocks reusing a soft-deleted email. -- a PARTIAL unique index fixes it — uniqueness only among LIVE rows: CREATE UNIQUE INDEX uniq_active_email ON users (email) WHERE deleted_at IS NULL;
The trade-offs are real: every query must remember the WHERE deleted_at IS NULL filter (easy to forget — a leak), unique constraints need the partial-index treatment shown above, and the table only ever grows. Use soft deletes where history matters (orders, user accounts, anything auditable); use hard DELETE for genuinely disposable data.
The fundamentals get you a correct database; these get you a fast, safe, operable one. Collapse N+1 into joins, pool your connections, prove your indexes with EXPLAIN ANALYZE, and reach for soft deletes when data must survive its own deletion. None of it is exotic — it's the day-two reality of running a database in production.