The Day I Learned SQL Doesn’t Care About My Feelings

    Vishnu Kainadan

    The Day I Learned SQL Doesn’t Care About My Feelings

    Introduction

    Last Thursday, right after standup, one of our senior engineers walked over to my desk.

    “Got an hour?” he asked.
    “Let’s look at how queries actually run.”

    No alarms. No urgency. Just curiosity.

    What followed wasn’t a lecture or a code review. It was a quiet, focused session—one of those conversations that subtly rewires how you think about a core tool you use every day.

    We didn’t talk about syntax.
    We didn’t talk about ORMs.

    We talked about how the database thinks.

    This is what I learned.

    The Ghost in the Machine

    What Is the Query Planner, Really?

    When you write SQL, you’re not writing instructions—you’re writing a wish list.

    \r SELECT *\r FROM orders\r WHERE customer_id = 42;
    \r

    This doesn’t tell PostgreSQL how to find those rows.
    It just says what you want.

    The query planner is the invisible architect that figures out the how.

    It evaluates multiple strategies:

    • Should I scan the whole table?
    • Should I use an index?
    • In what order should I join tables?

    Then it picks the cheapest option based on:

    • Statistics
    • Row estimates
    • Available indexes

    The Catch

    The planner doesn’t always get it right.

    It makes educated guesses using:

    • Outdated statistics
    • Incomplete information
    • Assumptions about data distribution

    Two queries that look similar can behave very differently.

    \r — Query A\r \r SELECT *FROM users WHERE email = 'alice@example.com';
    \r \r — Query B\r \r SELECT * FROM users WHERE status = 'active';
    \r

    If email has an index and status doesn’t:

    • Query A → instant
    • Query B → scans millions of rows

    The planner knows this.
    You need to know it too.

    Key Insight
    The planner is cost-based, not rule-based.
    Having an index does not guarantee it’ll be used.

    If status = ‘active’ matches 90% of rows, PostgreSQL may ignore your beautiful index entirely.

    CTEs: The Helper You Trust a Bit Too Much

    What’s a CTE?

    A Common Table Expression (CTE) is a named subquery you can reference later.

    \r
    WITH recent_orders AS (\r
    \r
      SELECT user_id, order_id, total_amount\r
      FROM orders\r
      WHERE created_at > NOW()  - INTERVAL '7 days'\r
    )\r
    \r
    SELECT user_id,\r
      COUNT(*) AS order_count,\r
      SUM(total_amount) AS total_spent\r
    FROM recent_orders\r
    GROUP BY user_id;\r
    

    Developers love CTEs because they:

    • Improve readability
    • Replace deeply nested subqueries
    • Make complex logic feel clean and logical

    But here’s what many junior devs miss:

    CTEs can lie to the optimizer.

    Optimization Fences (The Old Days)

    In PostgreSQL versions before 12, CTEs were always materialized.

    That means:

    1. PostgreSQL executes the CTE
    2. Stores the result
    3. Uses that stored result—even if it blocks better optimizations

    Modern PostgreSQL is smarter, but materialization still matters.

    When the Database Takes Your CTE Literally

    What Is Materialization?

    Materialization means:

    • The CTE runs once
    • Results are written to memory (or disk)
    • The rest of the query treats it like a physical table

    When It’s Good

    \r
    WITH expensive_calculation AS (\r
    \r
      SELECT product_id, AVG(rating) AS avg_rating\r
      FROM reviews\r
      WHERE review_date > '2024-01-01'\r
      GROUP BY product_id\r
    )\r
    \r
    SELECT p.name, ec.avg_rating\r
    FROM products p\r
    JOIN expensive_calculation ec\r
      ON p.id = ec.product_id\r
    \r
    WHERE ec.avg_rating > 4.5;\r
    

    If expensive_calculation is reused multiple times:

    • Compute once
    • Reuse everywhere
    • 👍 Worth it

    When It’s Terrible

    \r
    WITH all_users AS (\r
      SELECT * FROM users -- 10 million rows\r
    )\r
    \r
    SELECT * FROM all_users\r
    WHERE account_type = 'premium';\r
    

    Here:

    • PostgreSQL materializes 10 million rows
    • Then applies the filter

    Without the CTE, PostgreSQL could:

    • Push the filter down
    • Use an index

    Avoid loading everything

    PostgreSQL 12+ Controls

    \r — Force materialization\r WITH active_users AS MATERIALIZED (...)\r \r — Let the planner inline it\r WITH active_users AS NOT MATERIALIZED (...)\r

    Rule of Thumb

    • Used once + downstream filters → NOT MATERIALIZED
    • Expensive + reused → MATERIALIZED

    Two Ways to Dance: Nested Loop vs Hash Join

    Why Joins Decide Performance

    Joins are where queries live or die.

    PostgreSQL chooses join strategies based on:

    • Data size
    • Indexes
    • Available memory

    Nested Loop Join

    Conceptually:

    for each row in outer_table:

    find matching rows in inner_table

    Example:

    \r
    SELECT  o.order_id, c.name\r
    \r
    FROM orders o\r
    JOIN customers c \r
      ON o.customer_id = c.id\r
    \r
    WHERE o.created_at = '2025-01-10';\r
    

    If:

    • The date filter returns ~50 rows
    • id is indexed

    → Perfect use case.

    When it’s good

    • Small outer table
    • Indexed inner table
    • High selectivity

    When it’s a disaster

    • Large outer table
    • No index
    • Two big tables joined blindly

    Hash Join

    \r
    SELECT u.name, COUNT(p.post_id) AS post_count\r
    FROM users u\r
    JOIN posts p \r
      ON u.id = p.author_id\r
    GROUP BY u.name;\r
    

    How it works:

    1. Build a hash table in memory
    2. Probe it for matches

    Time complexity: O(n + m)

    When it’s good

    • Large tables
    • Equality joins
    • Enough work_mem

    The catch
    If the hash doesn’t fit in memory → disk spill → slow query.

    Breaking the Fourth Wall: LATERAL Joins

    What Is LATERAL?

    A LATERAL join allows a subquery to reference columns from earlier tables.

    Perfect for Top-N per group problems.

    \r
    SELECT c.name,  recent.order_id,  recent.created_at\r
    FROM customers c\r
    CROSS JOIN LATERAL (\r
      SELECT order_id, created_at\r
      FROM orders\r
      WHERE orders.customer_id = c.id\r
      ORDER BY created_at DESC\r
      LIMIT 3\r
    ) \r
    recent;\r
    

    Why it’s powerful

    • Clean syntax
    • Avoids window-function complexity
    • Planner can optimize it well

    Why it’s risky

    • Subquery runs once per outer row
    • 100k customers → 100k subqueries

    Use it when

    • Outer table is small or heavily filtered
    • Proper indexes exist

    Why GROUP BY Is Quietly Expensive

    The Hidden Cost of Aggregation

    GROUP BY does more than it looks:

    1. Groups rows
    2. Applies aggregates
    3. Returns one row per group

    Two strategies:

    Hash Aggregation

    • Fast
    • Memory-heavy

    Group Aggregation

    • Sorts first
    • Slower without pre-sorted data
    \r SELECT department, COUNT(*) AS employee_count\r FROM employees\r GROUP BY department;
    \r

    If there are 50 departments → tiny hash table → efficient.

    But grouping by a unique column?
    You’re rebuilding the entire table in memory.

    Optimization tip
    Only group by what you need.
    Group by IDs, join names later.

    Indexes that match GROUP BY columns can eliminate sorting entirely.

    The Autopsy Report

    Why EXPLAIN ANALYZE Is Your Best Friend

    You can’t optimize what you don’t measure.

    \r EXPLAIN (ANALYZE, BUFFERS)\r SELECT department, COUNT(*) FROM employees\r WHERE hire_date > '2024-01-01'\r GROUP BY department;
    \r

    Key things to watch:

    • Seq Scan on large tables → missing index
    • Estimated vs actual rows → stale statistics
    • Nested Loop with many loops → danger
    • High buffer reads → disk I/O

    The most important line:

    actual time=45.231..45.298 rows=48 loops=1

    If loops = 10,000, even “fast” steps become slow.

    Before vs After

    CREATE INDEX idx_hire_date ON employees (hire_date);

    Rerun:

    Index Scan using idx_hire_date

    (actual time=0.045..12.123)

    45ms → 12ms

    Measurement wins.

    Closing Thoughts: Thinking Like the Database

    What stayed with me wasn’t a trick or command—it was a shift in perspective.

    SQL isn’t a black box.
    Performance issues aren’t random.

    Every slow query is the result of a logical decision made by the planner using the information it had.

    Senior developers don’t fight the database.
    They listen to it.

    They:

    • Open EXPLAIN ANALYZE
    • Read the plan top to bottom
    • Ask simple questions:
      • Why this scan?
      • Why this join?
      • Why this estimate?

    Then they change one thing and measure again.

    That’s the real skill.

    Not memorizing syntax—but learning how to have a conversation with the planner.

    And once you do, SQL stops hurting your feelings.

    See More Articles

    We use cookies to improve your experience.