Schema design is among the most consequential architectural decisions in a MongoDB application. A data model that fits your queries keeps reads fast and features simple to build. A model that does not fit leads to expensive aggregations, frequent $lookup joins, and data consistency problems. Unlike SQL, where normalisation rules are well established, MongoDB schema design depends on your application’s specific access patterns. The fundamental question is always: should related data be embedded in the same document or stored in separate collections with references? This lesson gives you a systematic framework for making that decision, with real-world patterns from the task manager application and beyond.
Embedding vs Referencing Decision Framework
| Factor | Favour Embedding | Favour Referencing |
|---|---|---|
| Read pattern | Data always accessed together | Data often accessed independently |
| Write pattern | Data always updated together | Data updated independently |
| Relationship cardinality | One-to-few (user → 3 addresses) | One-to-many, one-to-squillions |
| Data duplication | Tolerable — read performance matters | Intolerable — data must stay in sync |
| Document size | Sub-document is small and bounded | Sub-document could grow unboundedly |
| Atomicity needed | Parent + child must update atomically | Updates are independent operations |
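The table can be read as a checklist. The sketch below encodes it as a small voting function. The name `suggestModel`, the factor field names, and the rule that a simple majority decides are all assumptions made for illustration; nothing here is a MongoDB or Mongoose API.

```javascript
// Hypothetical helper: turns the decision table into a checklist.
// Field names and the majority rule are assumptions for illustration.
function suggestModel(factors) {
  // Unbounded growth rules out embedding regardless of the other factors,
  // because a single document cannot exceed 16 MB.
  if (factors.growsUnbounded) return 'reference';

  const embedVotes = [
    factors.readTogether,      // data always accessed together
    factors.updatedTogether,   // data always updated together
    factors.fewChildren,       // one-to-few cardinality
    factors.duplicationOk,     // duplication is tolerable
    factors.needsAtomicUpdate, // parent + child must change atomically
  ].filter(Boolean).length;

  // A majority of the five remaining factors decides.
  return embedVotes >= 3 ? 'embed' : 'reference';
}

const preferences = suggestModel({
  readTogether: true, updatedTogether: true, fewChildren: true,
  duplicationOk: true, needsAtomicUpdate: true, growsUnbounded: false,
});
const activityLog = suggestModel({
  readTogether: false, updatedTogether: false, fewChildren: false,
  duplicationOk: false, needsAtomicUpdate: false, growsUnbounded: true,
});
console.log(preferences, activityLog); // embed reference
```

In practice the factors are rarely weighted equally, so treat the result as a starting point for discussion rather than a verdict.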
Cardinality Relationships
| Relationship | Example | Best Approach |
|---|---|---|
| One-to-One | User → Profile | Embed profile in user document |
| One-to-Few (2-10) | Post → Comments (small posts) | Embed array in parent document |
| One-to-Many (10-1000) | User → Tasks | Reference — store userId in task |
| One-to-Squillions (1000+) | Sensor → Log entries | Reference — never embed (document size limit) |
| Many-to-Many | Task → Tags, User → Projects | Array of IDs on the “many” side or junction collection |
MongoDB Document Size Limits
| Limit | Value | Implication |
|---|---|---|
| Maximum document size | 16 MB | Embedding unbounded arrays can hit this limit |
| Recommended document size | < 1 MB | Large documents are slower to read even when projecting |
| Maximum array elements | No hard limit (but < 16MB total) | Arrays that grow indefinitely must be in separate collections |
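To get a feel for how quickly an embedded array approaches these limits, you can estimate the size of one element and divide. The sketch below uses the JSON byte length as a stand-in for BSON size, which is only an approximation, and the sample entry is a hypothetical activity-log record.

```javascript
// Rough size estimate: JSON byte length is used as a stand-in for BSON size.
// The two encodings differ, so treat the numbers as order-of-magnitude only.
const MAX_DOC_BYTES = 16 * 1024 * 1024; // MongoDB's 16 MB document limit

function approxBytes(value) {
  return Buffer.byteLength(JSON.stringify(value), 'utf8');
}

// Hypothetical activity-log entry embedded in a task document.
const sampleEntry = {
  action: 'status-changed',
  timestamp: '2024-01-01T00:00:00.000Z',
  userId: '507f1f77bcf86cd799439011',
};

const bytesPerEntry = approxBytes(sampleEntry);
const entriesUntilLimit = Math.floor(MAX_DOC_BYTES / bytesPerEntry);

console.log(`~${bytesPerEntry} bytes per entry`);
console.log(`~${entriesUntilLimit} entries before reaching 16 MB`);
```

For small entries like this one the hard limit is a long way off, but the recommended 1 MB size is reached much sooner, and every append rewrites an ever larger document.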
If a task embeds a copy of its assignee, for example { assignee: { name: 'Alice', email: 'alice@example.com' } }, updating Alice’s email requires updating every task she’s assigned to. If you read that data together constantly and Alice rarely changes her email, the trade-off is worth it. Always model for your application’s access patterns, not for theoretical purity.
Schema Design Patterns
// ── Pattern 1: Embed for one-to-one (User → Preferences) ─────────────────
const userSchema = new mongoose.Schema({
name: { type: String, required: true },
email: { type: String, required: true, unique: true },
password: { type: String, required: true, select: false },
// Embed: preferences are small, always read with user, user-specific
preferences: {
theme: { type: String, enum: ['light', 'dark'], default: 'light' },
emailNotifications:{ type: Boolean, default: true },
defaultView: { type: String, enum: ['list', 'board', 'calendar'], default: 'list' },
timezone: { type: String, default: 'UTC' },
},
// Embed: social links are few and always shown on profile
socialLinks: {
github: String,
linkedin: String,
website: String,
},
}, { timestamps: true });
// ── Pattern 2: Reference for one-to-many (User → Tasks) ───────────────────
// Store userId in the task (NOT array of taskIds in the user)
const taskSchema = new mongoose.Schema({
title: { type: String, required: true },
status: { type: String, enum: ['pending', 'in-progress', 'completed'] },
priority: { type: String, enum: ['low', 'medium', 'high'] },
user: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
// ↑ Reference: tasks can number in thousands, queried independently
}, { timestamps: true });
// Query tasks for a user (fast once `user` is indexed, as in the complete schema later in this lesson)
const tasks = await Task.find({ user: userId });
// ── Pattern 3: Embed for one-to-few (Task → Attachments) ──────────────────
// Attachments are few (max 10), always read with task, task-specific
const taskSchemaWithAttachments = new mongoose.Schema({
title: String,
// Embed: small, bounded, always accessed with task
attachments: [{
filename: { type: String, required: true },
url: { type: String, required: true },
size: Number,
mimeType: String,
uploadedAt: { type: Date, default: Date.now },
}],
});
// ── Pattern 4: Subset Pattern (Post → Comments — show top 5 only) ─────────
const postSchema = new mongoose.Schema({
title: String,
content: String,
author: { type: mongoose.Schema.Types.ObjectId, ref: 'User' },
// Subset: embed only the 5 latest comments for fast display
// Full comments list is in the comments collection
recentComments: [{
_id: mongoose.Schema.Types.ObjectId,
text: String,
authorName:String,
createdAt: Date,
}],
commentCount: { type: Number, default: 0 },
});
const commentSchema = new mongoose.Schema({
postId: { type: mongoose.Schema.Types.ObjectId, ref: 'Post', required: true },
author: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
text: String,
}, { timestamps: true });
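// On new comment: insert into the comments collection AND refresh Post.recentComments
// Assumes Comment, Post and User models have been compiled from the schemas above
async function addComment(postId, userId, text) {
const comment = await Comment.create({ postId, author: userId, text });
// Look up the author's display name for the denormalised copy
const author = await User.findById(userId).select('name').lean();
// 'Unknown' is a placeholder fallback for a missing user, chosen for illustration
const authorName = author ? author.name : 'Unknown';
// Update the post's recentComments array (subset pattern): $slice: -5 keeps the last 5
await Post.findByIdAndUpdate(postId, {
$push: { recentComments: { $each: [{ _id: comment._id, text, authorName, createdAt: comment.createdAt }], $slice: -5 } },
$inc: { commentCount: 1 },
});
return comment;
}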
// ── Pattern 5: Extended Reference (denormalise for read performance) ───────
// Task embeds key user fields to avoid a join on every task read
const taskSchemaWithRef = new mongoose.Schema({
title: String,
user: { type: mongoose.Schema.Types.ObjectId, ref: 'User' }, // reference for updates
// Extended reference: denormalise critical display fields
// Avoids populate() on every task list request
assigneeName: String,
assigneeEmail: String,
assigneeAvatar:String,
});
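// When a user's profile changes, sync the denormalised copies on their tasks.
// Register this hook before calling mongoose.model('User', userSchema), and pass
// { new: true } to findOneAndUpdate so that `doc` is the updated document.
userSchema.post('findOneAndUpdate', async function(doc) {
if (doc) {
await Task.updateMany({ user: doc._id }, {
$set: { assigneeName: doc.name, assigneeEmail: doc.email, assigneeAvatar: doc.avatar }
});
}
});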
// ── Pattern 6: Many-to-Many (Task → Tags) ────────────────────────────────
// Option A: Array of strings in task (for user-specific, non-shared tags)
const taskTagSchema = new mongoose.Schema({
title: String,
tags: [String], // ['urgent', 'Q4', 'client']
});
// Option B: Array of ObjectIds referencing a tags collection (for shared/global tags)
const taskTagRefSchema = new mongoose.Schema({
title: String,
tags: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Tag' }],
});
const tagSchema = new mongoose.Schema({
name: { type: String, unique: true },
color: String,
usageCount: { type: Number, default: 0 },
});
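The `$push` with `$each` and `$slice: -5` in Pattern 4 is what keeps `recentComments` bounded. The sketch below imitates that behaviour on a plain in-memory array so the effect is easy to see. The helper `pushWithSlice` is hypothetical; MongoDB applies the real operator atomically on the server.

```javascript
// In-memory imitation of { $push: { arr: { $each: [...], $slice: N } } }.
// Illustrative only: MongoDB applies this atomically on the server.
function pushWithSlice(array, newItems, slice) {
  const combined = array.concat(newItems);
  // A negative $slice keeps the last |slice| elements,
  // a positive one keeps the first `slice` elements.
  return slice < 0 ? combined.slice(slice) : combined.slice(0, slice);
}

const post = { recentComments: [], commentCount: 0 };

for (let i = 1; i <= 7; i += 1) {
  post.recentComments = pushWithSlice(
    post.recentComments,
    [{ text: `comment ${i}` }],
    -5
  );
  post.commentCount += 1; // mirrors $inc: { commentCount: 1 }
}

console.log(post.recentComments.map((c) => c.text));
// [ 'comment 3', 'comment 4', 'comment 5', 'comment 6', 'comment 7' ]
console.log(post.commentCount); // 7
```

The embedded array never holds more than five entries, while `commentCount` still reflects every comment stored in the separate collection.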
How It Works
Step 1 — Access Pattern Drives Schema Design
The first question for any schema decision is: how will this data be accessed? Write down the five most frequent read queries and the five most frequent write operations. The schema should make the most frequent reads as fast as possible. If you always read a user’s preferences with their profile, embed preferences. If you rarely read comments when viewing a post list, put comments in a separate collection.
Step 2 — Embedding Eliminates Joins at the Cost of Duplication
When you embed a user’s name in every task document, you pay in disk space and update complexity for the benefit of displaying task lists without joining the users collection. This is MongoDB’s fundamental trade-off: data is pre-joined at write time so that reads are fast. The trade-off is acceptable when the embedded data changes rarely and the document is read constantly.
Step 3 — Referencing Keeps Data Consistent at the Cost of Extra Queries
Storing user: ObjectId in a task means there is one authoritative place for the user’s name and email. When the user changes their name, only the user document needs updating, and any task read through Mongoose’s populate() shows the current name. The downside is extra work at read time: displaying a list of 100 tasks with user names takes 101 queries if each user is looked up separately (the N+1 problem), so the lookups have to be batched with populate() or a $lookup aggregation.
Step 4 — Mongoose populate() Resolves References Automatically
Mongoose’s populate() method issues a second query to resolve ObjectId references: Task.find().populate('user', 'name email avatar') fetches all matching tasks, collects the unique user IDs, issues a single User.find({ _id: { $in: ids } }) limited to the requested fields, and merges the results. It is two queries, not N+1. The result appears in the returned document as if the user data were embedded.
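The difference between a naive per-task lookup and the batched approach is easiest to see by counting round-trips. The sketch below simulates both against in-memory data. The collections, the helper functions, and the `queryCount` counter are stand-ins invented for this example; this is not Mongoose’s actual implementation.

```javascript
// Simulated collections; `queryCount` stands in for database round-trips.
const users = new Map([
  ['u1', { _id: 'u1', name: 'Alice' }],
  ['u2', { _id: 'u2', name: 'Bob' }],
]);
const tasks = [
  { title: 'Write docs', user: 'u1' },
  { title: 'Fix bug', user: 'u2' },
  { title: 'Review PR', user: 'u1' },
];
let queryCount = 0;

function findTasks() { queryCount += 1; return tasks.map((t) => ({ ...t })); }
function findUserById(id) { queryCount += 1; return users.get(id); }
function findUsersByIds(ids) { queryCount += 1; return ids.map((id) => users.get(id)); }

// Naive approach: one user query per task (N+1).
function naive() {
  return findTasks().map((t) => ({ ...t, user: findUserById(t.user) }));
}

// populate()-style approach: collect unique IDs, one batched lookup, merge.
function batched() {
  const found = findTasks();
  const ids = [...new Set(found.map((t) => t.user))];
  const byId = new Map(findUsersByIds(ids).map((u) => [u._id, u]));
  return found.map((t) => ({ ...t, user: byId.get(t.user) }));
}

queryCount = 0;
naive();
const naiveQueries = queryCount; // 1 task query + 3 user queries

queryCount = 0;
const populated = batched();
const batchedQueries = queryCount; // 1 task query + 1 user query

console.log(naiveQueries, batchedQueries, populated[2].user.name); // 4 2 Alice
```

The naive count grows with the number of tasks, while the batched count stays at two no matter how many tasks are returned.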
Step 5 — Hybrid Patterns Combine Both Approaches
The most pragmatic approach is hybrid: reference the complete entity (for updates and independent access) while also embedding a subset of its most-needed fields for fast reads. Store user: ObjectId for relational integrity AND assigneeName: 'Alice' for fast display. Accept that if Alice changes her name, you need a bulk update, such as the updateMany hook in Pattern 5, to refresh all her tasks. This is the Extended Reference pattern, and it is widely used in production MongoDB applications.
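The cost of the hybrid approach shows up at write time. The sketch below models it with plain objects: renaming a user is one write to the source of truth plus one write per denormalised copy. The helper `renameUser` and the sample data are hypothetical, and the loop plays the role that `Task.updateMany` plays against a real database.

```javascript
// In-memory stand-ins for the users and tasks collections.
const user = { _id: 'u1', name: 'Alice', email: 'alice@example.com' };
const taskDocs = [
  { title: 'Write docs', user: 'u1', assigneeName: 'Alice' },
  { title: 'Review PR', user: 'u1', assigneeName: 'Alice' },
  { title: 'Fix bug', user: 'u2', assigneeName: 'Bob' },
];

// Hypothetical helper: rename the user, then fan the change out to every
// task that carries a denormalised copy of the name.
function renameUser(targetUser, allTasks, newName) {
  targetUser.name = newName; // 1 write to the source of truth
  let copiesUpdated = 0;
  for (const task of allTasks) {
    if (task.user === targetUser._id) {
      task.assigneeName = newName; // 1 write per denormalised copy
      copiesUpdated += 1;
    }
  }
  return copiesUpdated;
}

const copiesUpdated = renameUser(user, taskDocs, 'Alice Smith');
console.log(copiesUpdated); // 2
console.log(taskDocs.map((t) => t.assigneeName));
// [ 'Alice Smith', 'Alice Smith', 'Bob' ]
```

The number of copies to refresh grows with the number of tasks a user owns, which is why the pattern suits fields that change rarely.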
Real-World Example: Task Manager Complete Schema
// Complete schema decisions for the Task Manager application
// USERS — parent entity
const userSchema = new mongoose.Schema({
name: { type: String, required: true, trim: true },
email: { type: String, required: true, unique: true, lowercase: true },
password: { type: String, select: false }, // excluded from query results unless requested with .select('+password')
role: { type: String, enum: ['user', 'admin'], default: 'user' },
avatar: String,
// EMBED: small, always read with user, user-owned data
preferences: { theme: String, notifications: Boolean, timezone: String },
// EMBED: refresh tokens array (bounded — max 5 per user)
refreshTokens: [{
token: String,
expiresAt: Date,
device: String,
}],
}, { timestamps: true });
// TASKS — child entity, references parent
const taskSchema = new mongoose.Schema({
title: { type: String, required: true, maxlength: 200 },
description: { type: String, maxlength: 2000 },
status: { type: String, enum: ['pending', 'in-progress', 'completed'], default: 'pending' },
priority: { type: String, enum: ['low', 'medium', 'high'], default: 'medium' },
dueDate: Date,
completedAt: Date,
tags: [String], // EMBED: user-specific string tags, bounded, always with task
// REFERENCE: user can have many tasks; user updated independently
user: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
// EMBED (one-to-few): attachments bounded to ~10 per task
attachments: [{
filename: String, url: String, size: Number, mimeType: String,
}],
// SOFT DELETE support
deletedAt: Date,
}, { timestamps: true });
// Indexes on tasks
taskSchema.index({ user: 1, status: 1, createdAt: -1 });
taskSchema.index({ user: 1, priority: 1, createdAt: -1 });
taskSchema.index({ user: 1, dueDate: 1 });
taskSchema.index({ tags: 1 });
taskSchema.index({ title: 'text', description: 'text' });
Common Mistakes
Mistake 1 — Storing array of child IDs in the parent for one-to-many
❌ Wrong — array grows without bound; hard to query:
const userSchema = new mongoose.Schema({
taskIds: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Task' }], // grows to thousands!
// User document bloat; complex to paginate; difficult to filter tasks
});
✅ Correct — store the parent reference in the child document:
const taskSchema = new mongoose.Schema({
user: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true }, // reference in child
// Query: Task.find({ user: userId }) — uses index, paginatable, filterable
});
Mistake 2 — Embedding unbounded arrays that grow indefinitely
❌ Wrong — activity log embedded in task grows without limit:
const taskSchema = new mongoose.Schema({
activityLog: [{ // every status change, comment, edit logged here
action: String, timestamp: Date, userId: mongoose.Schema.Types.ObjectId,
}],
// Every change rewrites a larger document; a long-lived task drifts past the
// recommended 1 MB size and can eventually hit the 16 MB limit
});
✅ Correct — separate collection for unbounded data:
const activitySchema = new mongoose.Schema({
task: { type: mongoose.Schema.Types.ObjectId, ref: 'Task', required: true },
action: String,
userId: mongoose.Schema.Types.ObjectId,
timestamp: { type: Date, default: Date.now },
});
activitySchema.index({ task: 1, timestamp: -1 }); // fast timeline queries
Mistake 3 — Using populate() for large list queries (N+1 performance)
❌ Wrong — populate on a list of 1000 tasks creates extra round-trips:
const tasks = await Task.find({ user: userId }).populate('user');
// Mongoose: find 1000 tasks + find users for all unique user IDs
// Better than true N+1, but still 2 queries for data that could be embedded
✅ Consider: embed display fields for frequently accessed list data:
// On task creation: embed the creator's display name
const task = await Task.create({ ...data, creatorName: req.user.name });
// List query: no populate needed — name is already in the document
Quick Reference — Decision Guide
| Data | Relationship | Pattern |
|---|---|---|
| User preferences | One-to-one | Embed in user document |
| Task attachments | One-to-few (bounded) | Embed array in task |
| User tasks | One-to-many | Reference (userId in task) |
| Task activity log | One-to-squillions | Separate collection |
| Post recent comments | One-to-many (subset pattern) | Embed latest 5, reference the rest |
| Creator name in task | Many-to-one (extended reference) | Embed name + store userId |
| Task tags | Many-to-many (strings) | Embed string array in task |
| Shared/global tags | Many-to-many (entities) | ObjectId array + Tag collection |