Advanced Aggregation — $lookup, $facet, $bucket, and $graphLookup

📚 MEAN Stack 📂 Chapter 8: MongoDB Intermediate 📄 Lesson 8030 Advanced 🕒 May 15, 2025

The foundational pipeline stages — $match, $group, $sort, $project — handle most analytical queries. But real applications demand more: joining data from multiple collections ($lookup), computing multiple independent analytics in a single query ($facet), bucketing numeric data into ranges ($bucket), and traversing hierarchical or graph-structured data ($graphLookup). These advanced stages are what separate basic reporting from true data intelligence. This lesson covers each one with real-world task manager examples that you can adapt directly to your own applications.

Advanced Stage Overview

Stage	Purpose	SQL Equivalent
`$lookup`	Join documents from another collection	LEFT OUTER JOIN
`$facet`	Run multiple sub-pipelines in parallel on the same input	Multiple SELECT queries in one
`$bucket`	Group values into manually defined ranges	CASE WHEN BETWEEN … GROUP BY
`$bucketAuto`	Auto-distribute values into N equal-sized buckets	NTILE(N) GROUP BY
`$graphLookup`	Recursive traversal of a graph structure	Recursive CTE (WITH RECURSIVE)
`$out`	Write pipeline results to a collection	SELECT INTO / CREATE TABLE AS
`$merge`	Merge pipeline results into an existing collection	MERGE / UPSERT
`$unionWith`	Combine results from two collections	UNION ALL

$lookup Variants

Variant	Syntax	Use When
Simple equality join	`{ from, localField, foreignField, as }`	Join on matching field values
Pipeline lookup	`{ from, let, pipeline, as }`	Complex join conditions, filtering joined data
Uncorrelated subquery	`{ from, pipeline, as }`	Join entire collection with no correlation

Note: $lookup performs a LEFT OUTER JOIN — every document from the left collection is included in the output, even if no matching documents exist in the joined collection. The result is an array field (specified by as) containing the matched documents. If no documents match, the array is empty. Use { $unwind: { path: '$joinedField', preserveNullAndEmptyArrays: false } } after $lookup to convert to an INNER JOIN (excluding unmatched documents).

Tip: MongoDB embeds related data in documents specifically to avoid joins. If you find yourself writing many $lookup stages to assemble data you frequently need together, consider denormalising — embedding that data directly in the document instead. Use $lookup for genuinely optional or infrequently needed joins, not as a workaround for a normalised data model that would be better restructured for document databases.

Warning: $lookup is expensive for large collections — it performs a join for every document in the pipeline. Always use the pipeline variant to add a $match stage inside the lookup to filter joined documents, reducing the join’s cost. Also place your $lookup after a $match that has already reduced the number of documents in the pipeline — joining 50 documents is far cheaper than joining 50,000.

Complete Advanced Pipeline Examples

const { Types: { ObjectId } } = require('mongoose');

// ── $lookup — join tasks with user data ───────────────────────────────────
// Get admin users' tasks with user details embedded

const tasksWithUsers = await Task.aggregate([
    { $match: { status: 'in-progress' } },

    // Simple $lookup: join users collection on task.user = user._id
    {
        $lookup: {
            from:         'users',      // collection name (not Mongoose model name)
            localField:   'user',       // field in tasks collection
            foreignField: '_id',        // field in users collection
            as:           'userDetails',
        },
    },

    // Convert userDetails array to single object (INNER JOIN — exclude unmatched)
    { $unwind: '$userDetails' },

    // Keep only fields we need
    {
        $project: {
            title:             1,
            status:            1,
            'userDetails.name':  1,
            'userDetails.email': 1,
            'userDetails.role':  1,
        },
    },
]);

// ── Pipeline $lookup — join with filtering ────────────────────────────────
// Get users with their active task count and most recent task

const usersWithTasks = await User.aggregate([
    { $match: { role: 'admin' } },

    {
        $lookup: {
            from: 'tasks',
            let:  { userId: '$_id' },   // variable from left side
            pipeline: [
                {
                    $match: {
                        $expr: {
                            $and: [
                                { $eq: ['$$userId', '$user'] },         // join condition
                                { $ne: ['$status', 'completed'] },      // filter joined docs
                            ],
                        },
                    },
                },
                { $sort: { createdAt: -1 } },
                { $limit: 5 },
                { $project: { title: 1, priority: 1, status: 1 } },
            ],
            as: 'activeTasks',
        },
    },

    {
        $addFields: {
            activeTaskCount: { $size: '$activeTasks' },
        },
    },

    { $sort: { activeTaskCount: -1 } },
]);

// ── $facet — multiple analytics in one query ──────────────────────────────
const [analytics] = await Task.aggregate([
    { $match: { user: new ObjectId(userId) } },

    {
        $facet: {
            // Status distribution
            statusBreakdown: [
                { $group: { _id: '$status', count: { $sum: 1 } } },
                { $sort: { count: -1 } },
            ],

            // Tasks by month (last 12 months)
            monthlyCreated: [
                { $match: { createdAt: { $gte: new Date(Date.now() - 365 * 86400000) } } },
                {
                    $group: {
                        _id: {
                            year:  { $year: '$createdAt' },
                            month: { $month: '$createdAt' },
                        },
                        count: { $sum: 1 },
                    },
                },
                { $sort: { '_id.year': 1, '_id.month': 1 } },
            ],

            // Completion rate
            completionRate: [
                {
                    $group: {
                        _id:       null,
                        total:     { $sum: 1 },
                        completed: { $sum: { $cond: [{ $eq: ['$status', 'completed'] }, 1, 0] } },
                    },
                },
                {
                    $project: {
                        _id:  0,
                        rate: {
                            $round: [
                                { $multiply: [{ $divide: ['$completed', '$total'] }, 100] },
                                1,
                            ],
                        },
                    },
                },
            ],

            // Top tags
            topTags: [
                { $unwind: '$tags' },
                { $group: { _id: '$tags', count: { $sum: 1 } } },
                { $sort: { count: -1 } },
                { $limit: 5 },
            ],
        },
    },
]);

// ── $bucket — group by completion time ranges ──────────────────────────────
const completionTimeDistribution = await Task.aggregate([
    {
        $match: {
            user:        new ObjectId(userId),
            status:      'completed',
            completedAt: { $exists: true },
        },
    },
    {
        $addFields: {
            completionHours: {
                $divide: [
                    { $subtract: ['$completedAt', '$createdAt'] },
                    3600000,  // ms to hours
                ],
            },
        },
    },
    {
        $bucket: {
            groupBy:    '$completionHours',
            boundaries: [0, 1, 4, 24, 72, 168, Infinity],  // 0-1h, 1-4h, 4-24h, 1-3d, 3d-1wk, 1wk+
            default:    'over-1-week',
            output: {
                count: { $sum: 1 },
                tasks: { $push: '$title' },
            },
        },
    },
    {
        $addFields: {
            range: {
                $switch: {
                    branches: [
                        { case: { $eq: ['$_id', 0] },   then: 'Under 1 hour' },
                        { case: { $eq: ['$_id', 1] },   then: '1-4 hours' },
                        { case: { $eq: ['$_id', 4] },   then: '4-24 hours' },
                        { case: { $eq: ['$_id', 24] },  then: '1-3 days' },
                        { case: { $eq: ['$_id', 72] },  then: '3-7 days' },
                        { case: { $eq: ['$_id', 168] }, then: 'Over 1 week' },
                    ],
                    default: 'Unknown',
                },
            },
        },
    },
]);

// ── $graphLookup — find all tasks in a category hierarchy ─────────────────
// Suppose categories have: { _id, name, parentId }

const allSubcategories = await Category.aggregate([
    // Start from the root category
    { $match: { _id: new ObjectId(rootCategoryId) } },

    {
        $graphLookup: {
            from:             'categories',     // collection to traverse
            startWith:        '$_id',           // initial value(s) to look up
            connectFromField: '_id',            // field in the traversed docs
            connectToField:   'parentId',       // field pointing to parent
            as:               'descendants',    // output array field name
            maxDepth:         10,               // prevent infinite loops
            depthField:       'depth',          // track traversal depth
        },
    },
]);

How It Works

Step 1 — $lookup Executes a Subquery per Document

For each document in the pipeline, $lookup queries the foreign collection for matching documents. Simple equality lookup translates to: “for each task, find all users where _id equals task.user.” The pipeline variant lets you pass variables from the current document into the subquery with let and reference them in $match using $expr — this is how you filter joined documents while still using the join condition.

$facet takes the stream of documents at that point in the pipeline and passes it simultaneously to multiple named sub-pipelines. Each sub-pipeline runs independently on the same input — one might group by status, another filter for overdue, another unwind tags. All results arrive in a single aggregation output document. This is far more efficient than running multiple separate aggregations — you incur the cost of filtering to the user’s documents only once.

Step 3 — $bucket Distributes Numeric Values into Ranges

$bucket requires you to specify the boundaries array — values that define the edges of each bucket. Values are assigned to a bucket if they are >= the left boundary and < the right boundary. The default option catches values outside all specified boundaries. $bucketAuto is the automatic variant — you specify only the number of buckets and MongoDB distributes values evenly, making it useful for exploratory analysis where you do not know the distribution in advance.

Step 4 — $graphLookup Traverses Recursive Relationships

$graphLookup is a recursive lookup. Starting from the documents in the pipeline, it follows references — connectFromField to connectToField — as many levels as needed (up to maxDepth). It accumulates all discovered documents in the output array. This is ideal for organisational hierarchies (employee reports to manager), nested categories, comment threads, and any data structure where items reference other items in the same collection.

Step 5 — $out and $merge Write Results to Collections

$out writes the pipeline output to a named collection, replacing it entirely. $merge merges results into an existing collection — matching documents are updated; unmatched documents are inserted, ignored, or replaced depending on configuration. These stages enable materialised views — pre-computed aggregation results stored as documents, refreshed on a schedule. Querying a materialised view is as fast as a simple find(), even for complex aggregations that would otherwise take seconds.

Common Mistakes

Mistake 1 — $lookup without filtering in the pipeline — full join

❌ Wrong — joins all documents in the foreign collection:

{
    $lookup: {
        from: 'tasks',
        let:  { userId: '$_id' },
        pipeline: [
            // No $match — returns ALL tasks, not just this user's!
            { $limit: 10 }
        ],
        as: 'tasks',
    }
}

✅ Correct — always filter inside the pipeline lookup:

pipeline: [
    { $match: { $expr: { $eq: ['$user', '$$userId'] } } },  // filter first
    { $sort: { createdAt: -1 } },
    { $limit: 10 },
]

Mistake 2 — Placing $lookup before $match — joins all documents

❌ Wrong — performs join on entire collection before filtering:

Task.aggregate([
    { $lookup: { from: 'users', localField: 'user', foreignField: '_id', as: 'user' } },
    { $match: { 'user.role': 'admin' } },   // too late — joined all tasks then filtered
]);

✅ Correct — filter with $match before $lookup to minimise join size:

Task.aggregate([
    { $match: { status: 'in-progress' } },  // reduce documents first
    { $lookup: { from: 'users', ... } },     // then join the smaller set
]);

Mistake 3 — $bucketAuto when you need specific business-meaningful ranges

❌ Wrong — auto boundaries don’t align with business definitions:

{ $bucketAuto: { groupBy: '$completionHours', buckets: 5 } }
// MongoDB creates arbitrary boundaries like: 0-2.4h, 2.4-8.1h, etc.
// Meaningless to users

✅ Correct — use $bucket with business-meaningful boundaries:

{ $bucket: { groupBy: '$completionHours', boundaries: [0, 1, 4, 24, 72], default: 'over-3-days' } }
// Meaningful ranges: under 1h, 1-4h, 4-24h, 1-3 days

Quick Reference

Need	Stage
Join another collection	`{ $lookup: { from, localField, foreignField, as } }`
Conditional join / filtered join	`{ $lookup: { from, let, pipeline, as } }`
Multiple stats in one query	`{ $facet: { stat1: [...], stat2: [...] } }`
Bucket by range	`{ $bucket: { groupBy, boundaries, default, output } }`
Auto-distribute buckets	`{ $bucketAuto: { groupBy, buckets, output } }`
Traverse hierarchy	`{ $graphLookup: { from, startWith, connectFromField, connectToField, as } }`
Write to collection	`{ $out: 'collectionName' }`
Merge into collection	`{ $merge: { into, on, whenMatched, whenNotMatched } }`

Advanced Aggregation — $lookup, $facet, $bucket, and $graphLookup

Advanced Stage Overview

$lookup Variants

Complete Advanced Pipeline Examples

How It Works

Step 1 — $lookup Executes a Subquery per Document

Step 2 — $facet Creates Independent Sub-pipelines

Step 3 — $bucket Distributes Numeric Values into Ranges

Step 4 — $graphLookup Traverses Recursive Relationships

Step 5 — $out and $merge Write Results to Collections

Common Mistakes

Mistake 1 — $lookup without filtering in the pipeline — full join

Mistake 2 — Placing $lookup before $match — joins all documents

Mistake 3 — $bucketAuto when you need specific business-meaningful ranges

Quick Reference

🧠 Test Yourself

Advanced Stage Overview #

$lookup Variants #

Complete Advanced Pipeline Examples #

How It Works #

Step 1 — $lookup Executes a Subquery per Document #

Step 2 — $facet Creates Independent Sub-pipelines #

Step 3 — $bucket Distributes Numeric Values into Ranges #

Step 4 — $graphLookup Traverses Recursive Relationships #

Step 5 — $out and $merge Write Results to Collections #

Common Mistakes #

Mistake 1 — $lookup without filtering in the pipeline — full join #

Mistake 2 — Placing $lookup before $match — joins all documents #

Mistake 3 — $bucketAuto when you need specific business-meaningful ranges #

Quick Reference #

🧠 Test Yourself #

📚 More in this Tutorial Series

Advanced Stage Overview

$lookup Variants

Complete Advanced Pipeline Examples

How It Works

Step 1 — $lookup Executes a Subquery per Document

Step 2 — $facet Creates Independent Sub-pipelines

Step 3 — $bucket Distributes Numeric Values into Ranges

Step 4 — $graphLookup Traverses Recursive Relationships

Step 5 — $out and $merge Write Results to Collections

Common Mistakes

Mistake 1 — $lookup without filtering in the pipeline — full join

Mistake 2 — Placing $lookup before $match — joins all documents

Mistake 3 — $bucketAuto when you need specific business-meaningful ranges

Quick Reference

🧠 Test Yourself