
Why your spatial pipelines are probably broken (And how Airflow can help)

In the world of spatial data, building pipelines often feels like duct-taping together cron jobs, brittle scripts, and whatever cloud service happens to be in reach. Whether you’re transforming geospatial imagery, generating features for machine learning, or managing time-sensitive datasets like weather and river gauges, chances are your pipelines could use an upgrade.

In the latest episode of the Spatial Stack Podcast, I sat down with Kenten Danas, Senior Manager of Developer Relations at Astronomer, to talk about the one tool that’s quietly become the backbone of modern data workflows: Apache Airflow.

“You’re almost never going to have just the LLM task, you’re probably starting way earlier than that with raw data from a bunch of different sources.”

Kenten brings a unique perspective to this conversation, not just as a core contributor to the Airflow ecosystem, but as someone deeply focused on how people actually build and scale data pipelines in the real world.


What Is Apache Airflow, Really?

Airflow is an open-source orchestrator built to manage complex workflows. Originally developed at Airbnb in 2014, it has since grown into a top-level Apache project with over 30 million downloads per month. At its core, Airflow lets you define your data pipelines (called DAGs) using Python, schedule them flexibly, and monitor their performance with rich observability tools.

That makes it ideal not just for analytics, but also for MLOps, LLM pipelines, and geospatial batch processing.
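
For a sense of what that looks like in practice, here’s a minimal sketch of a DAG written with the TaskFlow API. The DAG id, schedule, and task bodies are placeholders, and the import path shown is the Airflow 2.x one (in 3.0 the same decorators are also exposed via airflow.sdk):

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def spatial_etl():
    @task
    def extract():
        # e.g. pull raw imagery or gauge readings from object storage
        return ["tile_001", "tile_002"]

    @task
    def transform(tiles):
        # reproject, clip, or compute features for each tile
        return [f"{t}_processed" for t in tiles]

    transform(extract())


spatial_etl()
```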


Airflow in the Spatial World

One of my favorite parts of the conversation was hearing about how Vibrant Planet, a company working on wildfire risk analysis, uses Airflow to orchestrate predictive modeling over complex geospatial datasets. Their workflows ingest raw imagery and spatial data, break it into chunks of varying size, and dynamically assign the right compute resources depending on complexity.

“They’re using dynamic task mapping and KubernetesPodOperators to predict the resource needs for each area because the size and complexity of spatial data varies so much.”

This kind of flexibility is essential when working with spatial workloads. You don’t want to over-allocate expensive GPU time to a basic reprojection job. You also can’t afford constant retries due to memory overloads on heavier models.

Airflow gives teams the ability to control where, how, and when each task runs — while providing one central place to monitor everything.
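
To make that concrete, here’s a rough sketch of the pattern: a planning task estimates per-chunk resource needs, and dynamic task mapping fans out one KubernetesPodOperator per chunk with matching resource requests. The chunk sizes, container image, and sizing logic below are illustrative stand-ins, not Vibrant Planet’s actual code:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s


@dag(schedule=None, start_date=datetime(2025, 1, 1), catchup=False)
def chunked_spatial_processing():
    @task
    def plan_chunks():
        # In practice this would inspect the input data and estimate memory/CPU
        # needs per chunk; hard-coded here for illustration.
        return [
            {"chunk_id": "tile_small", "memory": "2Gi", "cpu": "1"},
            {"chunk_id": "tile_large", "memory": "16Gi", "cpu": "4"},
        ]

    def pod_kwargs(chunk):
        # Build per-chunk arguments and Kubernetes resource requests.
        resources = k8s.V1ResourceRequirements(
            requests={"memory": chunk["memory"], "cpu": chunk["cpu"]},
            limits={"memory": chunk["memory"], "cpu": chunk["cpu"]},
        )
        return {"arguments": [chunk["chunk_id"]], "container_resources": resources}

    chunks = plan_chunks()

    # One pod per chunk, sized from the planning task's output.
    KubernetesPodOperator.partial(
        task_id="process_chunk",
        name="process-chunk",
        image="my-registry/spatial-processor:latest",  # placeholder image
    ).expand_kwargs(chunks.map(pod_kwargs))


chunked_spatial_processing()
```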


Airflow 3.0: Major Upgrades for AI and Spatial Workloads

Kenten walked through some of the most exciting new features in Airflow 3.0, including:

  • Remote execution: Run compute-heavy tasks (like ML model inference or geospatial feature extraction) on GPU clusters without deploying Airflow itself there.
  • Assets: Define event-triggered workflows that kick off when a file arrives or an external signal is received — perfect for irregular, streaming spatial datasets (see the sketch after this list).
  • Backfills & DAG versioning: Reprocess historical runs cleanly and track changes to your pipelines over time.

These features push Airflow beyond traditional batch ETL into a broader orchestration role for AI agents, LLMs, and streaming spatial data.


LLMs + Orchestration = The Next Frontier

We also explored how Airflow is becoming a key part of the LLM/agentic workflow stack — especially when paired with Astronomer’s new AI SDK.

“Agents might handle one piece of the workflow but they’re a black box. Airflow gives you the observability, reliability, and control across the whole stack.”

This orchestration layer becomes even more important when you’re dealing with large-scale, structured data — like spatial arrays, raster tiles, or vector features — that feed into downstream AI pipelines.


Getting Started

If you’re new to Airflow, Kenten recommends:

  • Using the Astro CLI to run Airflow locally for fast experimentation
  • Checking out Astronomer Academy for free courses and certifications
  • Exploring Airflow 3.0 if you’re building new pipelines or migrating existing ones

I didn’t know about the CLI until this conversation — and it’s a game-changer for local dev.


Final Thoughts

If you’ve ever stitched together a geospatial data pipeline, struggled with failed cron jobs, or wanted more control over your AI workflows — this episode is worth your time.

Airflow isn’t just for data engineers anymore. It’s becoming a central layer for anyone working with spatial data, machine learning, or event-driven analysis — especially as use cases get more complex.

Listen to the full episode here: https://open.spotify.com/embed/show/5vPpnDqdk4K2xuwqNbF6Gx?utm_source=generator