Normalisation — Designing Tables That Scale

Database normalisation is the process of organising tables to minimise data redundancy and prevent anomalies — situations where inserting, updating, or deleting data produces incorrect results. The Normal Forms are a set of rules that guide this organisation. In practice, following normalisation principles produces schemas that are easier to maintain, less prone to bugs, and easier to query correctly. Understanding normalisation helps you design the blog application schema used throughout this series and explains why certain column arrangements are correct while others lead to subtle data consistency problems.

The Problems Normalisation Solves

-- ── UNNORMALISED: denormalised posts table ────────────────────────────────────
-- Problem: author information duplicated in every post row
CREATE TABLE posts_bad (
    id           BIGSERIAL PRIMARY KEY,
    title        TEXT,
    body         TEXT,
    author_name  TEXT,       -- ← duplicated in every post by this author
    author_email TEXT,       -- ← if email changes, must update every post row!
    author_role  TEXT,
    tag1         TEXT,       -- ← fixed number of tags — breaks with 4 tags
    tag2         TEXT,
    tag3         TEXT
);
-- Insert anomaly: cannot add an author without a post
-- Update anomaly: changing author email requires updating N rows
-- Delete anomaly: deleting last post by an author loses their info
-- Tag limitation: arbitrary column count, hard to query, max 3 tags

-- ── NORMALISED: separate tables ───────────────────────────────────────────────
CREATE TABLE users  (id BIGSERIAL PRIMARY KEY, name TEXT, email TEXT UNIQUE, role TEXT);
CREATE TABLE posts  (id BIGSERIAL PRIMARY KEY, author_id BIGINT REFERENCES users(id), title TEXT, body TEXT);
CREATE TABLE tags   (id BIGSERIAL PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE post_tags (post_id BIGINT REFERENCES posts(id), tag_id BIGINT REFERENCES tags(id), PRIMARY KEY(post_id, tag_id));
-- Each fact stored once — no duplication, no anomalies, unlimited tags
Note: The three most common Normal Forms are: 1NF — no repeating groups, each column holds atomic (indivisible) values (no comma-separated lists, no tag1/tag2/tag3 columns); 2NF — 1NF plus no partial dependencies (every non-key column depends on the whole primary key, not just part of it); 3NF — 2NF plus no transitive dependencies (non-key columns depend only on the primary key, not on other non-key columns). For most web applications, designing to 3NF produces good schemas without needing the higher Normal Forms.
Tip: Think of normalisation as asking “where does this fact live?” for every piece of data. The author’s email is a fact about the user, not about the post — so it belongs in the users table. If a fact appears in multiple rows, that is a signal it belongs in its own table with a foreign key relationship. The rule of thumb: if updating one real-world thing requires updating multiple rows in the database, the schema is not normalised enough.
Warning: Normalisation is a design principle, not an absolute rule. Denormalisation — intentionally storing redundant data for performance — is sometimes the right choice. Read-heavy tables with complex JOINs may benefit from storing computed or duplicated values to avoid expensive joins on every request. The key is to denormalise consciously and deliberately, with a clear understanding of what consistency guarantees you are trading away and how you will maintain consistency (triggers, application logic, periodic sync jobs).

Normal Forms in Practice

-- ── First Normal Form (1NF) ───────────────────────────────────────────────────
-- Rule: no repeating groups, atomic values only

-- Violates 1NF: tags stored as comma-separated string (not atomic)
CREATE TABLE posts_1nf_bad (
    id    BIGSERIAL PRIMARY KEY,
    title TEXT,
    tags  TEXT   -- "python,fastapi,postgresql" — not atomic!
);

-- Satisfies 1NF: separate junction table
-- See post_tags table above ✓

-- ── Second Normal Form (2NF) ─────────────────────────────────────────────────
-- Rule: 1NF + no partial dependencies (applies when PK is composite)

-- Violates 2NF: order_items with partial dependency
CREATE TABLE order_items_bad (
    order_id    BIGINT,
    product_id  BIGINT,
    quantity    INTEGER,
    product_name TEXT,   -- depends on product_id alone, not (order_id, product_id)!
    unit_price  NUMERIC, -- same issue
    PRIMARY KEY (order_id, product_id)
);
-- Fix: move product_name and unit_price to products table; store price_at_purchase
-- in order_items (that IS specific to the order)

-- ── Third Normal Form (3NF) ──────────────────────────────────────────────────
-- Rule: 2NF + no transitive dependencies

-- Violates 3NF: zip_code → city → state (transitive dependency)
CREATE TABLE users_3nf_bad (
    id       BIGSERIAL PRIMARY KEY,
    name     TEXT,
    zip_code TEXT,
    city     TEXT,   -- depends on zip_code, not directly on id
    state    TEXT    -- depends on zip_code, not directly on id
);
-- Fix: separate addresses table with zip_code as PK,
-- users references addresses by zip_code

Practical Denormalisation Patterns

-- ── Counter caches — store aggregate for fast reads ──────────────────────────
ALTER TABLE posts ADD COLUMN comment_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE posts ADD COLUMN like_count    INTEGER NOT NULL DEFAULT 0;

-- Update the cache when a comment is added (trigger or application logic)
UPDATE posts SET comment_count = comment_count + 1 WHERE id = :post_id;
-- Reading: SELECT comment_count FROM posts WHERE id = ? (no JOIN needed)
-- Writing: must keep in sync (risk of inconsistency if not managed carefully)

-- ── Materialised author name — avoid join for display ────────────────────────
ALTER TABLE posts ADD COLUMN author_name_cache TEXT;
-- Set on insert, update when author name changes (via trigger)
-- Trade-off: faster reads, but author name changes require cache update

-- ── When denormalisation is appropriate ──────────────────────────────────────
-- ✓ Counts that are read thousands of times per second (like count, view count)
-- ✓ Data that rarely changes (country name, static configuration)
-- ✓ Reporting tables that aggregate from transactional tables
-- ✗ Data that changes frequently (user email, post body)
-- ✗ Data where consistency is critical (financial balances)

Common Mistakes

Mistake 1 — Storing multiple values in a single column

❌ Wrong — comma-separated tags in one column:

tags TEXT DEFAULT ''   -- "python,fastapi" — cannot query individual tags efficiently

✅ Correct — separate junction table or PostgreSQL array:

-- Option A: junction table (recommended for relational schemas)
-- Option B: ARRAY type (covered in Lesson 4)
tags TEXT[] DEFAULT '{}'   -- PostgreSQL native array ✓

Mistake 2 — Duplicating data that changes

❌ Wrong — author email stored in each post:

author_email TEXT   -- when email changes: UPDATE every post by that author!

✅ Correct — store foreign key, join when needed:

author_id BIGINT REFERENCES users(id)   -- ✓ one update when email changes

Mistake 3 — Over-normalising lookup tables

❌ Wrong — separate table for a static, bounded list of statuses:

CREATE TABLE post_statuses (id SERIAL, name TEXT);   -- 3 rows, never changes
post_status_id INTEGER REFERENCES post_statuses(id)  -- unnecessary JOIN for a static list

✅ Correct — use a CHECK constraint or ENUM type:

status TEXT CHECK (status IN ('draft', 'published', 'archived'))   -- ✓

Quick Reference — Normal Forms

Form Rule Common Violation
1NF Atomic values, no repeating groups Comma-separated lists, tag1/tag2/tag3
2NF 1NF + no partial dependencies Non-key column depends on part of composite PK
3NF 2NF + no transitive dependencies A→B→C: city depends on zip, not on user id
Denormalise Intentional redundancy for performance Counter caches, materialised aggregates

🧠 Test Yourself

A posts table stores author_name and author_email directly. An author changes their email. What anomaly does this cause, and what is the fix?