The foundational pipeline stages — $match, $group, $sort, $project — handle most analytical queries. But real applications demand more: joining data from multiple collections ($lookup), computing multiple independent analytics in a single query ($facet), bucketing numeric data into ranges ($bucket), and traversing hierarchical or graph-structured data ($graphLookup). These advanced stages are what separate basic reporting from true data intelligence. This lesson covers each one with real-world task manager examples that you can adapt directly to your own applications.
Advanced Stage Overview
| Stage | Purpose | SQL Equivalent |
|---|---|---|
$lookup |
Join documents from another collection | LEFT OUTER JOIN |
$facet |
Run multiple sub-pipelines in parallel on the same input | Multiple SELECT queries in one |
$bucket |
Group values into manually defined ranges | CASE WHEN BETWEEN … GROUP BY |
$bucketAuto |
Auto-distribute values into N equal-sized buckets | NTILE(N) GROUP BY |
$graphLookup |
Recursive traversal of a graph structure | Recursive CTE (WITH RECURSIVE) |
$out |
Write pipeline results to a collection | SELECT INTO / CREATE TABLE AS |
$merge |
Merge pipeline results into an existing collection | MERGE / UPSERT |
$unionWith |
Combine results from two collections | UNION ALL |
$lookup Variants
| Variant | Syntax | Use When |
|---|---|---|
| Simple equality join | { from, localField, foreignField, as } |
Join on matching field values |
| Pipeline lookup | { from, let, pipeline, as } |
Complex join conditions, filtering joined data |
| Uncorrelated subquery | { from, pipeline, as } |
Join entire collection with no correlation |
$lookup performs a LEFT OUTER JOIN — every document from the left collection is included in the output, even if no matching documents exist in the joined collection. The result is an array field (specified by as) containing the matched documents. If no documents match, the array is empty. Use { $unwind: { path: '$joinedField', preserveNullAndEmptyArrays: false } } after $lookup to convert to an INNER JOIN (excluding unmatched documents).$lookup stages to assemble data you frequently need together, consider denormalising — embedding that data directly in the document instead. Use $lookup for genuinely optional or infrequently needed joins, not as a workaround for a normalised data model that would be better restructured for document databases.$lookup is expensive for large collections — it performs a join for every document in the pipeline. Always use the pipeline variant to add a $match stage inside the lookup to filter joined documents, reducing the join’s cost. Also place your $lookup after a $match that has already reduced the number of documents in the pipeline — joining 50 documents is far cheaper than joining 50,000.Complete Advanced Pipeline Examples
const { Types: { ObjectId } } = require('mongoose');
// ── $lookup — join tasks with user data ───────────────────────────────────
// Get admin users' tasks with user details embedded
const tasksWithUsers = await Task.aggregate([
{ $match: { status: 'in-progress' } },
// Simple $lookup: join users collection on task.user = user._id
{
$lookup: {
from: 'users', // collection name (not Mongoose model name)
localField: 'user', // field in tasks collection
foreignField: '_id', // field in users collection
as: 'userDetails',
},
},
// Convert userDetails array to single object (INNER JOIN — exclude unmatched)
{ $unwind: '$userDetails' },
// Keep only fields we need
{
$project: {
title: 1,
status: 1,
'userDetails.name': 1,
'userDetails.email': 1,
'userDetails.role': 1,
},
},
]);
// ── Pipeline $lookup — join with filtering ────────────────────────────────
// Get users with their active task count and most recent task
const usersWithTasks = await User.aggregate([
{ $match: { role: 'admin' } },
{
$lookup: {
from: 'tasks',
let: { userId: '$_id' }, // variable from left side
pipeline: [
{
$match: {
$expr: {
$and: [
{ $eq: ['$$userId', '$user'] }, // join condition
{ $ne: ['$status', 'completed'] }, // filter joined docs
],
},
},
},
{ $sort: { createdAt: -1 } },
{ $limit: 5 },
{ $project: { title: 1, priority: 1, status: 1 } },
],
as: 'activeTasks',
},
},
{
$addFields: {
activeTaskCount: { $size: '$activeTasks' },
},
},
{ $sort: { activeTaskCount: -1 } },
]);
// ── $facet — multiple analytics in one query ──────────────────────────────
const [analytics] = await Task.aggregate([
{ $match: { user: new ObjectId(userId) } },
{
$facet: {
// Status distribution
statusBreakdown: [
{ $group: { _id: '$status', count: { $sum: 1 } } },
{ $sort: { count: -1 } },
],
// Tasks by month (last 12 months)
monthlyCreated: [
{ $match: { createdAt: { $gte: new Date(Date.now() - 365 * 86400000) } } },
{
$group: {
_id: {
year: { $year: '$createdAt' },
month: { $month: '$createdAt' },
},
count: { $sum: 1 },
},
},
{ $sort: { '_id.year': 1, '_id.month': 1 } },
],
// Completion rate
completionRate: [
{
$group: {
_id: null,
total: { $sum: 1 },
completed: { $sum: { $cond: [{ $eq: ['$status', 'completed'] }, 1, 0] } },
},
},
{
$project: {
_id: 0,
rate: {
$round: [
{ $multiply: [{ $divide: ['$completed', '$total'] }, 100] },
1,
],
},
},
},
],
// Top tags
topTags: [
{ $unwind: '$tags' },
{ $group: { _id: '$tags', count: { $sum: 1 } } },
{ $sort: { count: -1 } },
{ $limit: 5 },
],
},
},
]);
// ── $bucket — group by completion time ranges ──────────────────────────────
const completionTimeDistribution = await Task.aggregate([
{
$match: {
user: new ObjectId(userId),
status: 'completed',
completedAt: { $exists: true },
},
},
{
$addFields: {
completionHours: {
$divide: [
{ $subtract: ['$completedAt', '$createdAt'] },
3600000, // ms to hours
],
},
},
},
{
$bucket: {
groupBy: '$completionHours',
boundaries: [0, 1, 4, 24, 72, 168, Infinity], // 0-1h, 1-4h, 4-24h, 1-3d, 3d-1wk, 1wk+
default: 'over-1-week',
output: {
count: { $sum: 1 },
tasks: { $push: '$title' },
},
},
},
{
$addFields: {
range: {
$switch: {
branches: [
{ case: { $eq: ['$_id', 0] }, then: 'Under 1 hour' },
{ case: { $eq: ['$_id', 1] }, then: '1-4 hours' },
{ case: { $eq: ['$_id', 4] }, then: '4-24 hours' },
{ case: { $eq: ['$_id', 24] }, then: '1-3 days' },
{ case: { $eq: ['$_id', 72] }, then: '3-7 days' },
{ case: { $eq: ['$_id', 168] }, then: 'Over 1 week' },
],
default: 'Unknown',
},
},
},
},
]);
// ── $graphLookup — find all tasks in a category hierarchy ─────────────────
// Suppose categories have: { _id, name, parentId }
const allSubcategories = await Category.aggregate([
// Start from the root category
{ $match: { _id: new ObjectId(rootCategoryId) } },
{
$graphLookup: {
from: 'categories', // collection to traverse
startWith: '$_id', // initial value(s) to look up
connectFromField: '_id', // field in the traversed docs
connectToField: 'parentId', // field pointing to parent
as: 'descendants', // output array field name
maxDepth: 10, // prevent infinite loops
depthField: 'depth', // track traversal depth
},
},
]);
How It Works
Step 1 — $lookup Executes a Subquery per Document
For each document in the pipeline, $lookup queries the foreign collection for matching documents. Simple equality lookup translates to: “for each task, find all users where _id equals task.user.” The pipeline variant lets you pass variables from the current document into the subquery with let and reference them in $match using $expr — this is how you filter joined documents while still using the join condition.
Step 2 — $facet Creates Independent Sub-pipelines
$facet takes the stream of documents at that point in the pipeline and passes it simultaneously to multiple named sub-pipelines. Each sub-pipeline runs independently on the same input — one might group by status, another filter for overdue, another unwind tags. All results arrive in a single aggregation output document. This is far more efficient than running multiple separate aggregations — you incur the cost of filtering to the user’s documents only once.
Step 3 — $bucket Distributes Numeric Values into Ranges
$bucket requires you to specify the boundaries array — values that define the edges of each bucket. Values are assigned to a bucket if they are >= the left boundary and < the right boundary. The default option catches values outside all specified boundaries. $bucketAuto is the automatic variant — you specify only the number of buckets and MongoDB distributes values evenly, making it useful for exploratory analysis where you do not know the distribution in advance.
Step 4 — $graphLookup Traverses Recursive Relationships
$graphLookup is a recursive lookup. Starting from the documents in the pipeline, it follows references — connectFromField to connectToField — as many levels as needed (up to maxDepth). It accumulates all discovered documents in the output array. This is ideal for organisational hierarchies (employee reports to manager), nested categories, comment threads, and any data structure where items reference other items in the same collection.
Step 5 — $out and $merge Write Results to Collections
$out writes the pipeline output to a named collection, replacing it entirely. $merge merges results into an existing collection — matching documents are updated; unmatched documents are inserted, ignored, or replaced depending on configuration. These stages enable materialised views — pre-computed aggregation results stored as documents, refreshed on a schedule. Querying a materialised view is as fast as a simple find(), even for complex aggregations that would otherwise take seconds.
Common Mistakes
Mistake 1 — $lookup without filtering in the pipeline — full join
❌ Wrong — joins all documents in the foreign collection:
{
$lookup: {
from: 'tasks',
let: { userId: '$_id' },
pipeline: [
// No $match — returns ALL tasks, not just this user's!
{ $limit: 10 }
],
as: 'tasks',
}
}
✅ Correct — always filter inside the pipeline lookup:
pipeline: [
{ $match: { $expr: { $eq: ['$user', '$$userId'] } } }, // filter first
{ $sort: { createdAt: -1 } },
{ $limit: 10 },
]
Mistake 2 — Placing $lookup before $match — joins all documents
❌ Wrong — performs join on entire collection before filtering:
Task.aggregate([
{ $lookup: { from: 'users', localField: 'user', foreignField: '_id', as: 'user' } },
{ $match: { 'user.role': 'admin' } }, // too late — joined all tasks then filtered
]);
✅ Correct — filter with $match before $lookup to minimise join size:
Task.aggregate([
{ $match: { status: 'in-progress' } }, // reduce documents first
{ $lookup: { from: 'users', ... } }, // then join the smaller set
]);
Mistake 3 — $bucketAuto when you need specific business-meaningful ranges
❌ Wrong — auto boundaries don’t align with business definitions:
{ $bucketAuto: { groupBy: '$completionHours', buckets: 5 } }
// MongoDB creates arbitrary boundaries like: 0-2.4h, 2.4-8.1h, etc.
// Meaningless to users
✅ Correct — use $bucket with business-meaningful boundaries:
{ $bucket: { groupBy: '$completionHours', boundaries: [0, 1, 4, 24, 72], default: 'over-3-days' } }
// Meaningful ranges: under 1h, 1-4h, 4-24h, 1-3 days
Quick Reference
| Need | Stage |
|---|---|
| Join another collection | { $lookup: { from, localField, foreignField, as } } |
| Conditional join / filtered join | { $lookup: { from, let, pipeline, as } } |
| Multiple stats in one query | { $facet: { stat1: [...], stat2: [...] } } |
| Bucket by range | { $bucket: { groupBy, boundaries, default, output } } |
| Auto-distribute buckets | { $bucketAuto: { groupBy, buckets, output } } |
| Traverse hierarchy | { $graphLookup: { from, startWith, connectFromField, connectToField, as } } |
| Write to collection | { $out: 'collectionName' } |
| Merge into collection | { $merge: { into, on, whenMatched, whenNotMatched } } |