Data Modelling — Embedding vs Referencing and Schema Design Patterns

Schema design is the most consequential architectural decision in a MongoDB application. A good data model keeps the common queries fast and new features easy to implement. A poor one leads to expensive aggregations, frequent joins, and data consistency problems. Unlike SQL, where normalisation rules are well established, MongoDB schema design depends on your application’s specific access patterns. The fundamental question is always: should related data be embedded in the same document or stored in separate collections with references? This lesson gives you a systematic framework for making that decision, with real-world patterns from the task manager application and beyond.

Embedding vs Referencing Decision Framework

| Factor | Favour Embedding | Favour Referencing |
| --- | --- | --- |
| Read pattern | Data always accessed together | Data often accessed independently |
| Write pattern | Data always updated together | Data updated independently |
| Relationship cardinality | One-to-few (user → 3 addresses) | One-to-many, one-to-squillions |
| Data duplication | Tolerable: read performance matters | Intolerable: data must stay in sync |
| Document size | Sub-document is small and bounded | Sub-document could grow unboundedly |
| Atomicity needed | Parent + child must update atomically | Updates are independent operations |

Cardinality Relationships

| Relationship | Example | Best Approach |
| --- | --- | --- |
| One-to-One | User → Profile | Embed profile in user document |
| One-to-Few (2-10) | Post → Comments (small posts) | Embed array in parent document |
| One-to-Many (10-1000) | User → Tasks | Reference: store userId in task |
| One-to-Squillions (1000+) | Sensor → Log entries | Reference: never embed (document size limit) |
| Many-to-Many | Task → Tags, User → Projects | Array of IDs on the “many” side or junction collection |
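
The cardinality table can be read as a rule of thumb. The sketch below encodes it as a small function. The name `suggestApproach` and the exact cut-offs (taken from the ranges in the table, with the overlapping boundaries at 10 and 1000 assigned to the lower band) are illustrative assumptions, not a MongoDB API. Real decisions should also weigh the access-pattern factors listed earlier.

```javascript
// Hypothetical helper: maps an expected child count to the approach
// suggested by the cardinality table. Thresholds mirror the table's ranges.
function suggestApproach(expectedChildren) {
    if (!Number.isInteger(expectedChildren) || expectedChildren < 1) {
        throw new RangeError('expectedChildren must be a positive integer');
    }
    if (expectedChildren === 1) return 'embed sub-document';
    if (expectedChildren <= 10) return 'embed array';
    if (expectedChildren <= 1000) return 'reference (parent id in child)';
    return 'reference (never embed)';
}

console.log(suggestApproach(3));      // embed array
console.log(suggestApproach(250));    // reference (parent id in child)
console.log(suggestApproach(50000));  // reference (never embed)
```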

MongoDB Document Size Limits

| Limit | Value | Implication |
| --- | --- | --- |
| Maximum document size | 16 MB | Embedding unbounded arrays can hit this limit |
| Recommended document size | < 1 MB | Large documents are slower to read even when projecting |
| Maximum array elements | No hard limit (but < 16 MB total) | Arrays that grow indefinitely must be in separate collections |
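
To get a feel for how quickly an embedded array approaches these limits, a back-of-the-envelope estimate helps. The sketch below is plain arithmetic. The function names and the sample workload (one entry per minute, about 100 bytes per entry) are made-up assumptions, and real BSON sizes depend on field names and types.

```javascript
const MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's 16 MB document limit

// Hypothetical estimate: bytes used by an embedded array after `days` days.
function estimateArrayBytes(entriesPerDay, avgEntryBytes, days) {
    return entriesPerDay * avgEntryBytes * days;
}

// Whole days until the embedded array alone would reach the 16 MB limit.
function daysUntilLimit(entriesPerDay, avgEntryBytes) {
    return Math.floor(MAX_DOC_BYTES / (entriesPerDay * avgEntryBytes));
}

// Assumed workload: a sensor logging once a minute, about 100 bytes per entry.
const perDay = 24 * 60;                            // 1440 entries per day
console.log(estimateArrayBytes(perDay, 100, 30));  // 4320000 bytes after 30 days
console.log(daysUntilLimit(perDay, 100));          // 116
```

Under these assumed numbers the array would outgrow a single document in under four months, which is why this kind of data belongs in its own collection.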

Note: The most common mistake is applying SQL normalisation rules to MongoDB. In SQL, normalisation avoids duplication and prevents update anomalies. In MongoDB, some strategic duplication is acceptable and often desirable. If your task document embeds { assignee: { name: 'Alice', email: 'alice@example.com' } }, updating Alice’s email requires updating every task she is assigned to. If you read that data together constantly and Alice rarely changes her email, the trade-off is worth it. Always model for your application’s access patterns, not for theoretical purity.

Tip: The subset pattern is a powerful middle ground. Instead of embedding all comments in a blog post (which could be thousands), embed only the 5 most recent comments in the post document and keep the full list in a separate comments collection. The common case (showing the latest comments on a post) is a single document read. Viewing all comments requires a second query, but that is an infrequent operation. This pattern gives you embedding’s read performance for the common case without the unbounded growth problem.

Warning: Never embed data that will grow without bound. A task document that embeds all activity log entries will eventually hit MongoDB’s 16 MB document limit. Activity logs, comments on popular posts, sensor readings, and chat messages all belong in separate collections. Ask yourself: “Could this array have 10,000 entries in a year?” If yes, use a separate collection with a reference.
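
The bounded array in the subset pattern relies on `$push` with `$each` and a negative `$slice`. As a rough in-memory illustration of that combination (append, then keep only the last N elements), the sketch below mimics the behaviour on a plain array. `pushWithSlice` is a hypothetical helper, not a MongoDB or Mongoose function.

```javascript
// Mimics { $push: { arr: { $each: items, $slice: -limit } } } on a plain array:
// append the new items, then keep only the last `limit` elements.
function pushWithSlice(arr, items, limit) {
    if (!Number.isInteger(limit) || limit < 1) {
        throw new RangeError('limit must be a positive integer');
    }
    return [...arr, ...items].slice(-limit);
}

let recentComments = [];
for (let i = 1; i <= 7; i++) {
    recentComments = pushWithSlice(recentComments, [{ text: `comment ${i}` }], 5);
}

console.log(recentComments.map((c) => c.text));
// [ 'comment 3', 'comment 4', 'comment 5', 'comment 6', 'comment 7' ]
```

However many comments arrive, the embedded array never holds more than five entries, so the post document stays small.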

Schema Design Patterns

const mongoose = require('mongoose');

// ── Pattern 1: Embed for one-to-one (User → Preferences) ─────────────────
const userSchema = new mongoose.Schema({
    name:     { type: String, required: true },
    email:    { type: String, required: true, unique: true },
    password: { type: String, required: true, select: false },

    // Embed: preferences are small, always read with user, user-specific
    preferences: {
        theme:             { type: String, enum: ['light', 'dark'], default: 'light' },
        emailNotifications:{ type: Boolean, default: true },
        defaultView:       { type: String, enum: ['list', 'board', 'calendar'], default: 'list' },
        timezone:          { type: String, default: 'UTC' },
    },

    // Embed: social links are few and always shown on profile
    socialLinks: {
        github:   String,
        linkedin: String,
        website:  String,
    },
}, { timestamps: true });

// ── Pattern 2: Reference for one-to-many (User → Tasks) ───────────────────
// Store userId in the task (NOT array of taskIds in the user)
const taskSchema = new mongoose.Schema({
    title:       { type: String, required: true },
    status:      { type: String, enum: ['pending', 'in-progress', 'completed'] },
    priority:    { type: String, enum: ['low', 'medium', 'high'] },
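    user:        { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
    // ↑ Reference: tasks can number in thousands, queried independently
}, { timestamps: true });

// Index the reference so per-user queries do not scan the whole collection
taskSchema.index({ user: 1 });
const Task = mongoose.model('Task', taskSchema);

// Query tasks for a user (uses the index on the user field)
const tasks = await Task.find({ user: userId });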

// ── Pattern 3: Embed for one-to-few (Task → Attachments) ──────────────────
// Attachments are few (max 10), always read with task, task-specific
const taskSchemaWithAttachments = new mongoose.Schema({
    title: String,
    // Embed: small, bounded, always accessed with task
    attachments: [{
        filename:   { type: String, required: true },
        url:        { type: String, required: true },
        size:       Number,
        mimeType:   String,
        uploadedAt: { type: Date, default: Date.now },
    }],
});

// ── Pattern 4: Subset Pattern (Post → Comments, embed latest 5 only) ──────
const postSchema = new mongoose.Schema({
    title:   String,
    content: String,
    author:  { type: mongoose.Schema.Types.ObjectId, ref: 'User' },
    // Subset: embed only the 5 latest comments for fast display
    // Full comments list is in the comments collection
    recentComments: [{
        _id:        mongoose.Schema.Types.ObjectId,
        text:       String,
        authorName: String,
        createdAt:  Date,
    }],
    commentCount: { type: Number, default: 0 },
});

const commentSchema = new mongoose.Schema({
    postId:    { type: mongoose.Schema.Types.ObjectId, ref: 'Post', required: true },
    author:    { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
    text:      String,
}, { timestamps: true });

const Post    = mongoose.model('Post', postSchema);
const Comment = mongoose.model('Comment', commentSchema);

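// On new comment: insert into the comments collection AND update post.recentComments.
// `author` is the commenting user's document (for example req.user) with _id and name.
// The two writes are separate operations; use a transaction if they must succeed together.
async function addComment(postId, author, text) {
    const comment = await Comment.create({ postId, author: author._id, text });

    // Update the post's recentComments array (subset pattern):
    // $each + $slice: -5 appends the new entry and keeps only the last five
    await Post.findByIdAndUpdate(postId, {
        $push: {
            recentComments: {
                $each:  [{ _id: comment._id, text, authorName: author.name, createdAt: comment.createdAt }],
                $slice: -5,
            },
        },
        $inc: { commentCount: 1 },
    });
    return comment;
}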

// ── Pattern 5: Extended Reference (denormalise for read performance) ───────
// Task embeds key user fields to avoid a join on every task read
const taskSchemaWithRef = new mongoose.Schema({
    title:  String,
    user:   { type: mongoose.Schema.Types.ObjectId, ref: 'User' },  // reference for updates

    // Extended reference: denormalise critical display fields
    // Avoids populate() on every task list request
    assigneeName:   String,
    assigneeEmail:  String,
    assigneeAvatar: String,
});

// When a user updates their profile, update the copies on all their tasks too.
// Register this hook before compiling the User model. `doc` is the pre-update
// document unless the update was run with { new: true }.
userSchema.post('findOneAndUpdate', async function (doc) {
    if (doc) {
        await Task.updateMany({ user: doc._id }, {
            $set: {
                assigneeName:   doc.name,
                assigneeEmail:  doc.email,
                assigneeAvatar: doc.avatar,
            },
        });
    }
});

// ── Pattern 6: Many-to-Many (Task → Tags) ────────────────────────────────
// Option A: Array of strings in task (for user-specific, non-shared tags)
const taskTagSchema = new mongoose.Schema({
    title: String,
    tags:  [String],  // ['urgent', 'Q4', 'client']
});

// Option B: Array of ObjectIds referencing a tags collection (for shared/global tags)
const taskTagRefSchema = new mongoose.Schema({
    title: String,
    tags:  [{ type: mongoose.Schema.Types.ObjectId, ref: 'Tag' }],
});

const tagSchema = new mongoose.Schema({
    name:  { type: String, unique: true },
    color: String,
    usageCount: { type: Number, default: 0 },
});

How It Works

Step 1 — Access Pattern Drives Schema Design

The first question for any schema decision is: how will this data be accessed? Write down the five most frequent read queries and the five most frequent write operations. The schema should make the most frequent reads as fast as possible. If you always read a user’s preferences with their profile, embed preferences. If you rarely read comments when viewing a post list, put comments in a separate collection.

Step 2 — Embedding Eliminates Joins at the Cost of Duplication

When you embed a user’s name in every task document, you pay in disk space and update complexity for the benefit of displaying task lists without joining the users collection. This is MongoDB’s fundamental trade-off: data is pre-joined at write time so that reads are fast. The trade-off pays off when the embedded data changes rarely and the document is read constantly.

Step 3 — Referencing Keeps Data Consistent at the Cost of Extra Queries

Storing user: ObjectId in a task means there is one authoritative place for the user’s name and email. When the user changes their name, only the user document needs updating, and the task shows the current name once the reference is resolved, for example via Mongoose’s populate(). The downside is extra work at read time: displaying a list of 100 tasks with user names takes 101 queries if each user is fetched separately (the N+1 problem), so the lookups have to be batched with populate() or a $lookup aggregation.
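
For reference, a `$lookup` stage that joins tasks to their users in one aggregation might look like the sketch below. The collection name `users` assumes Mongoose's default pluralisation of the `User` model, and `buildTaskListPipeline` is an illustrative helper. It only builds the pipeline array, which would then be passed to `Task.aggregate()`.

```javascript
// Builds an aggregation pipeline that returns tasks with selected user fields.
function buildTaskListPipeline(filter, limit = 100) {
    return [
        { $match: filter },
        { $sort: { createdAt: -1 } },
        { $limit: limit },
        {
            $lookup: {
                from: 'users',          // assumed collection name for the User model
                localField: 'user',     // reference field on the task document
                foreignField: '_id',    // matching field on the user document
                as: 'user',             // replace the id with the joined array
            },
        },
        { $unwind: '$user' },           // one user per task, so flatten the array
        { $project: { title: 1, status: 1, 'user.name': 1, 'user.email': 1 } },
    ];
}

const pipeline = buildTaskListPipeline({ status: 'pending' }, 20);
console.log(pipeline.map((stage) => Object.keys(stage)[0]));
// [ '$match', '$sort', '$limit', '$lookup', '$unwind', '$project' ]
```

Placing `$match`, `$sort`, and `$limit` before `$lookup` keeps the join limited to the tasks that will actually be returned.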

Step 4 — Mongoose populate() Resolves References Automatically

Mongoose’s populate() method issues a second query to resolve ObjectId references: Task.find().populate('user', 'name email avatar') fetches all matching tasks, collects the unique user IDs, issues a single User.find({ _id: { $in: ids } }) for the selected fields, and merges the results. It is two queries, not N+1. The populated user then appears in the returned document as if it had been embedded.
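
The two-query behaviour described above can be sketched without a database. The code below is an in-memory imitation, not Mongoose's actual implementation: `findTasks` and `findUsersByIds` stand in for the two queries, and the sample data and query counter are invented for illustration.

```javascript
// Sample data standing in for the users and tasks collections.
const users = [
    { _id: 'u1', name: 'Alice' },
    { _id: 'u2', name: 'Bob' },
];
const tasks = [
    { title: 'Write report', user: 'u1' },
    { title: 'Review PR', user: 'u2' },
    { title: 'Plan sprint', user: 'u1' },
];

let queryCount = 0;

// Stand-in for Task.find(): one query.
function findTasks() {
    queryCount += 1;
    return tasks.map((t) => ({ ...t }));
}

// Stand-in for User.find({ _id: { $in: ids } }): one query for all ids.
function findUsersByIds(ids) {
    queryCount += 1;
    return users.filter((u) => ids.includes(u._id));
}

// Populate-style resolution: collect unique ids, fetch once, merge.
function findTasksWithUsers() {
    const found = findTasks();
    const uniqueIds = [...new Set(found.map((t) => t.user))];
    const byId = new Map(findUsersByIds(uniqueIds).map((u) => [u._id, u]));
    return found.map((t) => ({ ...t, user: byId.get(t.user) ?? null }));
}

const populated = findTasksWithUsers();
console.log(populated[2].user.name); // Alice
console.log(queryCount);             // 2
```

The query count stays at two regardless of how many tasks are returned, which is the difference from fetching each user inside a loop.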

Step 5 — Hybrid Patterns Combine Both Approaches

The most pragmatic approach is hybrid: reference the complete entity (for updates and independent access) while also embedding a subset of its most-needed fields for fast reads. Store user: ObjectId for relational integrity AND assigneeName: 'Alice' for fast display. Accept that if Alice changes her name, you must also update the copy stored on all her tasks (for example with updateMany). This is the Extended Reference pattern, and it is widely used in production MongoDB applications.
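
One way to keep the denormalised copy in step is to derive both the embedded fields and the later bulk update from a single function. The sketch below only builds plain objects. `toAssigneeFields` and `buildAssigneeSync` are hypothetical helpers, and the field names follow the `assigneeName` / `assigneeEmail` / `assigneeAvatar` convention used in Pattern 5.

```javascript
// Picks the display fields that are copied onto each task.
function toAssigneeFields(user) {
    return {
        assigneeName: user.name,
        assigneeEmail: user.email,
        assigneeAvatar: user.avatar ?? null,
    };
}

// Builds the arguments for Task.updateMany(filter, update) after a user changes.
function buildAssigneeSync(user) {
    return {
        filter: { user: user._id },
        update: { $set: toAssigneeFields(user) },
    };
}

const alice = { _id: 'u1', name: 'Alice Smith', email: 'alice@example.com' };

// On task creation: store the reference and the copied fields together.
const newTask = { title: 'Write report', user: alice._id, ...toAssigneeFields(alice) };
console.log(newTask.assigneeName); // Alice Smith

// After a profile change, with Mongoose this would be used roughly as:
// const { filter, update } = buildAssigneeSync(alice);
// await Task.updateMany(filter, update);
const sync = buildAssigneeSync(alice);
console.log(sync.filter); // { user: 'u1' }
```

Using one function for both paths means the set of copied fields cannot drift between task creation and later synchronisation.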

Real-World Example: Task Manager Complete Schema

// Complete schema decisions for the Task Manager application

// USERS — parent entity
const userSchema = new mongoose.Schema({
    name:     { type: String, required: true, trim: true },
    email:    { type: String, required: true, unique: true, lowercase: true },
    password: { type: String, select: false },     // excluded from query results by default
    role:     { type: String, enum: ['user', 'admin'], default: 'user' },
    avatar:   String,

    // EMBED: small, always read with user, user-owned data
    preferences: { theme: String, notifications: Boolean, timezone: String },

    // EMBED: refresh tokens array (bounded — max 5 per user)
    refreshTokens: [{
        token:     String,
        expiresAt: Date,
        device:    String,
    }],
}, { timestamps: true });

// TASKS — child entity, references parent
const taskSchema = new mongoose.Schema({
    title:       { type: String, required: true, maxlength: 200 },
    description: { type: String, maxlength: 2000 },
    status:      { type: String, enum: ['pending', 'in-progress', 'completed'], default: 'pending' },
    priority:    { type: String, enum: ['low', 'medium', 'high'], default: 'medium' },
    dueDate:     Date,
    completedAt: Date,
    tags:        [String],  // EMBED: user-specific string tags, bounded, always with task

    // REFERENCE: user can have many tasks; user updated independently
    user: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },

    // EMBED (one-to-few): attachments bounded to ~10 per task
    attachments: [{
        filename: String, url: String, size: Number, mimeType: String,
    }],

    // SOFT DELETE support
    deletedAt: Date,
}, { timestamps: true });

// Indexes on tasks
taskSchema.index({ user: 1, status: 1, createdAt: -1 });
taskSchema.index({ user: 1, priority: 1, createdAt: -1 });
taskSchema.index({ user: 1, dueDate: 1 });
taskSchema.index({ tags: 1 });
taskSchema.index({ title: 'text', description: 'text' });

Common Mistakes

Mistake 1 — Storing array of child IDs in the parent for one-to-many

❌ Wrong — array grows without bound; hard to query:

const userSchema = new mongoose.Schema({
    taskIds: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Task' }],  // grows to thousands!
    // User document bloat; complex to paginate; difficult to filter tasks
});

✅ Correct — store the parent reference in the child document:

const taskSchema = new mongoose.Schema({
    user: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },  // reference in child
    // Query: Task.find({ user: userId }) — uses index, paginatable, filterable
});
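
With the reference stored in the child, paginating a user's tasks is a filter plus `skip` and `limit`. The helper below is a sketch that only computes the query parts. `buildTaskPageQuery`, the default page size of 20, and the cap of 100 are assumptions for illustration.

```javascript
// Computes filter and paging options for listing one user's tasks.
function buildTaskPageQuery(userId, page = 1, pageSize = 20) {
    const safePage = Math.max(1, Math.floor(page));
    const safeSize = Math.min(100, Math.max(1, Math.floor(pageSize)));
    return {
        filter: { user: userId },
        sort: { createdAt: -1 },
        skip: (safePage - 1) * safeSize,
        limit: safeSize,
    };
}

const pageQuery = buildTaskPageQuery('u1', 3, 25);
console.log(pageQuery.skip, pageQuery.limit); // 50 25

// With Mongoose this would be used roughly as:
// Task.find(pageQuery.filter).sort(pageQuery.sort).skip(pageQuery.skip).limit(pageQuery.limit)
```

The same query is awkward when the parent holds an array of task IDs, because filtering and sorting then require loading the IDs first.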

Mistake 2 — Embedding unbounded arrays that grow indefinitely

❌ Wrong — activity log embedded in task grows without limit:

const taskSchema = new mongoose.Schema({
    activityLog: [{  // every status change, comment, edit logged here
        action: String, timestamp: Date, userId: mongoose.Schema.Types.ObjectId,
    }],
    // The array grows with every change: reads and writes get slower,
    // and a long-lived task can eventually hit the 16 MB document limit
});

✅ Correct — separate collection for unbounded data:

const activitySchema = new mongoose.Schema({
    task:      { type: mongoose.Schema.Types.ObjectId, ref: 'Task', required: true },
    action:    String,
    userId:    mongoose.Schema.Types.ObjectId,
    timestamp: { type: Date, default: Date.now },
});
activitySchema.index({ task: 1, timestamp: -1 });  // fast timeline queries

Mistake 3 — Using populate() for large list queries (N+1 performance)

❌ Wrong — populate on a list of 1000 tasks creates extra round-trips:

const tasks = await Task.find({ user: userId }).populate('user');
// Mongoose: one query for the 1000 tasks + one query for the unique user ids
// Not a true N+1, but still a second round-trip for data that could be embedded

✅ Consider: embed display fields for frequently accessed list data:

// On task creation: embed the creator's display name
const task = await Task.create({ ...data, creatorName: req.user.name });
// List query: no populate needed — name is already in the document

Quick Reference — Decision Guide

| Data | Relationship | Pattern |
| --- | --- | --- |
| User preferences | One-to-one | Embed in user document |
| Task attachments | One-to-few (bounded) | Embed array in task |
| User tasks | One-to-many | Reference (userId in task) |
| Task activity log | One-to-squillions | Separate collection |
| Post recent comments | Subset pattern | Embed latest 5, reference rest |
| Creator name in task | Extended reference | Embed name + store userId |
| Task tags | Many-to-many (strings) | Embed string array in task |
| Shared task tags | Many-to-many (entities) | ObjectId array + Tag collection |

🧠 Test Yourself

A user can have thousands of tasks. The task list API queries tasks by userId with pagination. Where should the userId be stored?