Talha IjlalTalha Ijlal
arrow_backBack

Building Scalable ETL with Airflow

An exploration of DAG optimization techniques for handling high-throughput, mission-critical data pipelines with minimal latency.

Airflow is easy to get running and surprisingly hard to keep boring.

Most teams start with a few DAGs and a couple of operators. Then the number of pipelines grows, backfills become common, and failures stop being “rare events” and become a daily operational surface. The scaling problem is not “Airflow can’t run enough tasks.” It is that your pipeline design forces humans to do too much thinking under pressure.

This post is a practical checklist of patterns that keep ETL predictable as volume, team size, and expectations increase.

The Core Thesis

Airflow scales when you:

  • Treat DAGs as a product surface (not a side script).
  • Make every task idempotent and partition-aware.
  • Design for backfills, retries, and observability from day one.

Everything else is implementation detail.

Design for Backfills First

Backfills are where hidden coupling is exposed. If your DAG assumes “today’s state,” then backfilling becomes a manual, error-prone ritual.

Patterns that help:

  • Partition inputs explicitly: date, hour, tenant, source, whatever your domain needs.
  • Avoid implicit “latest” behavior in tasks; pass partitions in.
  • Ensure downstream tables can be rebuilt deterministically from raw sources.

If you can’t backfill safely, you also can’t debug safely.

Idempotency Is Not Optional

Idempotency is the difference between “retry” and “duplicate.”

In practice that means:

  • Prefer upserts/merge semantics over blind inserts.
  • Use stable keys derived from source identifiers, not “current time.”
  • Keep raw ingestion append-only when possible, and apply transforms on top.

The faster your system can retry, the less you fear failures.

Keep Tasks Small and Composable

Airflow rewards small, well-named tasks. It punishes “do everything” operators that hide multiple concerns behind one green checkmark.

A useful split:

  • extract_*: pull bytes from a source (API, file, HTML, PDF).
  • normalize_*: parse and clean into a stable intermediate schema.
  • load_*: write to the warehouse/database with constraints.
  • validate_*: sanity checks, constraints, and row-count expectations.
  • publish_*: materialize views, search indexes, or downstream assets.

This is not about purity. It’s about giving yourself handles to pull during incidents.

Measure Scheduling and Queuing, Not Just Runtime

Teams often watch task duration and miss the real problem: tasks that sit in a queue for minutes.

Track:

  • DAG parsing time.
  • Scheduler loop delay.
  • Queue time (task “scheduled” to “running”).
  • Worker saturation (CPU, memory, I/O).

If your queue time grows, your system feels slow even when tasks are “fast.”

Use Explicit Timeouts and Retries (and Justify Them)

Defaults drift into production and become policy by accident.

For each task, decide:

  • How long is too long? (execution_timeout)
  • How many retries make sense?
  • What failures are transient vs deterministic?

If a task fails deterministically (bad input), retries just burn time and page people later.

Make Validation a First-Class Stage

The highest ROI habit in ETL is validation that runs every time.

Examples:

  • Row-count deltas within expected bounds.
  • Non-null constraints on key fields.
  • Percentage of “unknown” categories.
  • Freshness checks (latest partition present, no missing partitions).

When validation is a separate task, you can alert on it, re-run it, and build trust in it.

Prefer a Warehouse-Centric Contract

As DAG count grows, tight coupling between pipelines becomes a failure amplifier. Use your database/warehouse as the contract boundary.

Practical steps:

  • Materialize intermediate tables with explicit schemas.
  • Version breaking changes (new table, new schema) instead of in-place mutation.
  • Keep “gold” models separate from ingestion tables.

This is how you reduce the number of systems you must reason about during a failure.

A Simple Operating Rule

If you can’t answer these questions quickly, your system is not yet operationally scalable:

  • What partition failed?
  • What changed since the last success?
  • Is the failure transient or deterministic?
  • Can I re-run safely without duplicates?
  • Can I backfill without “special steps”?

Closing Thought

The point of orchestration is not to run code. It’s to make data behavior predictable over time. If you build your DAGs so that “re-run” is safe and “backfill” is routine, scaling becomes a capacity problem again, not a stress problem.

  • Prefer small, composable tasks over monolithic operators.
  • Use idempotent transformations and explicit partitioning.
  • Measure scheduling latency and task queue contention early.