Generator Expressions and Lazy Pipelines

Generator expressions are the lazy equivalent of list comprehensions: they use parentheses instead of square brackets and produce values one at a time rather than building an entire list in memory. When chained together, multiple generator expressions form a lazy pipeline where data flows through each stage without any intermediate collection being fully materialised. This is usually the most memory-efficient way to process large datasets in Python, and it is the pattern behind SQLAlchemy cursor iteration, file line processing, and FastAPI streaming responses. The performance difference between list and generator approaches becomes significant at scale.
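
The laziness is directly observable. In this sketch (the square helper and the calls list are illustrative, not part of any library), no work happens until the generator is consumed:

```python
calls = []  # records which inputs have actually been processed

def square(x):
    calls.append(x)
    return x * x

gen = (square(x) for x in range(3))   # builds a generator; square() not called yet
assert calls == []                    # no work has happened so far

values = list(gen)                    # consuming the generator drives the calls
assert calls == [0, 1, 2]
assert values == [0, 1, 4]
```

This is the behaviour a chained pipeline relies on: each stage does work only when the stage after it asks for the next value.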

Generator Expressions

# List comprehension: creates the ENTIRE list in memory immediately
squares_list = [x ** 2 for x in range(1_000_000)]   # ~8 MB for the list object alone; the int objects add more

# Generator expression: lazy, produces one value at a time
squares_gen  = (x ** 2 for x in range(1_000_000))   # ~100-200 bytes: just the generator object

# Same interface: both are iterable
for sq in squares_gen:
    if sq > 100:
        break

# Works with built-in functions that accept iterables
total     = sum(x ** 2 for x in range(1000))          # no list created
maximum   = max(len(line) for line in open("f.txt"))   # no list created
has_admin = any(u.role == "admin" for u in users)      # stops at first match
all_valid = all(is_valid(item) for item in items)      # stops at first failure

# Single-argument functions: outer () can be omitted
total = sum(x ** 2 for x in range(10))      # not sum((x ** 2 for x in range(10)))
result = ",".join(str(x) for x in [1,2,3]) # "1,2,3"

# Filtering with condition
evens = (x for x in range(20) if x % 2 == 0)
even_squares = (x ** 2 for x in range(20) if x % 2 == 0)
Note: Generator expressions use parentheses () while list comprehensions use square brackets []. The resulting objects behave differently: a list comprehension evaluates everything immediately and stores it all in RAM; a generator expression creates a generator object that produces values lazily. When you pass a generator expression directly as the only argument to a function, Python allows you to drop the extra set of parentheses: sum(x for x in range(10)) is valid.
Tip: For functions that consume an iterable and produce a single result, such as sum(), max(), min(), any(), and all(), prefer generator expressions over list comprehensions: the function iterates the input once and discards each item as it goes, so building a full list first wastes memory proportional to the collection size. One exception is str.join(): because join makes two passes over its input (one to size the result, one to copy), CPython converts a generator argument into a list internally, so ",".join([str(x) for x in items]) is typically no worse, and often slightly faster, than ",".join(str(x) for x in items). The memory argument simply does not apply to join().
Warning: Generator expressions cannot be reused; they are exhausted after one pass. If you need to iterate the same data multiple times (e.g., to find the max and then iterate again), store it in a list first: items = list(generator). Also note that a generator expression evaluates its outermost iterable immediately but looks up its other free variables lazily, only when values are produced; the late-binding bug from Chapter 9 applies here too. Avoid complex generator expressions that capture loop variables.
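
Both warnings are easy to demonstrate in a minimal sketch:

```python
# Exhaustion: a generator yields its values once, then is empty forever
gen = (x * 2 for x in range(3))
print(list(gen))   # [0, 2, 4]
print(list(gen))   # [] -- already exhausted, no error raised

# Late binding: the outermost iterable (range(3)) is evaluated immediately,
# but other free variables are looked up only when values are produced
factor = 2
scaled = (x * factor for x in range(3))
factor = 10                      # rebinding BEFORE the generator is consumed...
print(list(scaled))              # [0, 10, 20] -- the new value is used
```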

Lazy Pipelines

import csv
from itertools import islice

# ── Lazy pipeline: read → filter → transform → limit ─────────────────────────
# None of these generators actually reads or processes the file until iteration

# Stage 1: open file and yield lines (generator)
def read_lines(path: str):
    with open(path, encoding="utf-8") as f:
        yield from f

# Stage 2: parse CSV rows
def parse_csv(lines):
    reader = csv.DictReader(lines)
    yield from reader

# Stage 3: filter rows
def filter_active(rows):
    for row in rows:
        if row.get("active", "").lower() == "true":
            yield row

# Stage 4: transform rows
def extract_emails(rows):
    for row in rows:
        yield row["email"].strip().lower()

# Compose the pipeline: no I/O or computation yet
lines  = read_lines("users.csv")
rows   = parse_csv(lines)
active = filter_active(rows)
emails = extract_emails(active)

# ONLY NOW does the file get read, one line at a time
first_10_emails = []
for email in emails:
    first_10_emails.append(email)
    if len(first_10_emails) == 10:
        break
# In CPython the with block inside read_lines closes the file once that
# generator is finalised (reference counting makes this prompt); for
# deterministic cleanup on any interpreter, close the pipeline explicitly
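
Deterministic cleanup can be shown with a small stand-in generator (the events list here is purely illustrative): calling .close() raises GeneratorExit at the paused yield, which runs the generator's finally/with cleanup immediately instead of waiting for garbage collection.

```python
events = []

def read_numbers():
    try:
        yield from range(5)          # stands in for 'yield from f' over a file
    finally:
        events.append("cleaned up")  # runs on exhaustion OR on .close()

gen = read_numbers()
print(next(gen))        # 0
gen.close()             # raises GeneratorExit inside the generator
print(events)           # ['cleaned up'] -- cleanup ran immediately
```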

# ── Equivalent with generator expressions ────────────────────────────────────
with open("users.csv", encoding="utf-8") as f:
    active_emails = (
        row["email"].strip().lower()
        for row in csv.DictReader(f)
        if row.get("active", "").lower() == "true"
    )
    first_10 = list(islice(active_emails, 10))   # from itertools

Memory Comparison

import sys

# ── Measure memory difference ────────────────────────────────────────────────
n = 1_000_000

# List approach: all values in RAM
squares_list = [x ** 2 for x in range(n)]
print(f"List size: {sys.getsizeof(squares_list):,} bytes")   # ~8,700,000 bytes

# Generator approach: one value at a time
squares_gen = (x ** 2 for x in range(n))
print(f"Generator size: {sys.getsizeof(squares_gen)} bytes")   # ~100-200 bytes, independent of n

# Both produce the same sum:
# sum(squares_list) == sum(squares_gen) == 333332833333500000

# ── When to use list vs generator ────────────────────────────────────────────
# Use LIST when:
# - You need to iterate multiple times
# - You need random access by index (items[3])
# - You need len() or reversed()
# - The collection is small (under ~1000 items)
# - You need to check membership frequently

# Use GENERATOR when:
# - Processing large datasets (files, DB results)
# - You only iterate once
# - Memory is a constraint
# - You want to start processing before all data is available (streaming)
# - The sequence is infinite
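
The infinite-sequence case only works lazily. A sketch using itertools.count (an eager list over it would never finish):

```python
from itertools import count, islice

# count(1) is an infinite generator: 1, 2, 3, ...
# A list comprehension over it would never terminate; islice takes a
# finite, lazy slice instead.
first_squares = list(islice((n * n for n in count(1)), 5))
print(first_squares)   # [1, 4, 9, 16, 25]
```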

Common Mistakes

Mistake 1: Building a list just to pass to sum/max/all

❌ Wrong: unnecessarily materialises the list:

total = sum([x ** 2 for x in large_range])   # list built, then summed, then discarded

✅ Correct: generator expression:

total = sum(x ** 2 for x in large_range)   # ✓ no list, one-pass streaming

Mistake 2: Mixing lazy pipeline with length check

❌ Wrong: generators have no len():

gen = (x for x in range(10))
print(len(gen))   # TypeError: object of type 'generator' has no len()

✅ Correct: convert to list first if length is needed:

items = list(x for x in range(10))
print(len(items))   # 10 ✓

Mistake 3: Nested generator expressions that are hard to debug

❌ Wrong: difficult to understand and debug (and the file is never closed):

result = list(v for row in (parse(l) for l in open("f") if l.strip()) for v in row if v)

✅ Correct: named generator stages, consumed while the file is open:

with open("f", encoding="utf-8") as f:
    lines  = (l for l in f if l.strip())
    rows   = (parse(l) for l in lines)
    values = (v for row in rows for v in row if v)
    result = list(values)   # ✓ readable, debuggable

Quick Reference

Pattern                 Code                          Memory
List comprehension      [expr for x in it]            O(n): all at once
Generator expression    (expr for x in it)            O(1): one at a time
Sum without list        sum(expr for x in it)         O(1)
Any/all early exit      any(cond for x in it)         O(1): stops early
Lazy filter             (x for x in it if cond)       O(1)
Take first N            islice(gen, N)                O(N)
Chain generators        yield from other_gen          O(1)

🧠 Test Yourself

You have a CSV file with 500,000 rows. You need the email addresses of the first 100 active users. Which approach is most memory-efficient?
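
One memory-efficient approach, as a sketch: a generator-expression pipeline over csv.DictReader, cut off with islice. The column names and the in-memory sample file standing in for the 500,000-row CSV are assumptions for illustration; reading stops as soon as enough active rows have been seen.

```python
import csv
from io import StringIO
from itertools import islice

# Tiny in-memory stand-in for the real file (email/active columns assumed)
data = StringIO(
    "email,active\n"
    "A@x.com,true\n"
    "b@x.com,false\n"
    "C@x.com,true\n"
)

# Lazy pipeline: parse -> filter -> transform; islice stops after N matches,
# so only as many rows are read as needed to find them
active_emails = (
    row["email"].strip().lower()
    for row in csv.DictReader(data)
    if row.get("active", "").lower() == "true"
)
first_two = list(islice(active_emails, 2))
print(first_two)   # ['a@x.com', 'c@x.com']
```

Against the real file, replacing StringIO with `open(...)` inside a with block keeps memory flat at O(1) rows regardless of file size.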