
Scaling GIS Workflows with COGs, Airflow, and Apache Iceberg

April 25, 2025 · Matt Forrest

TOP OF THE STACK

What we need to do with COGs

COGs (Cloud-Optimized GeoTIFFs) are one of the most promising tools we have for making raster data truly cloud-native. They let you stream just the pieces you need, work with data remotely, and plug into modern geospatial systems without downloading giant files. But after working closely with them over the past few weeks, I’ve run into a few critical gaps that are holding the ecosystem back:

1. We need a better way to explore COGs—especially in QGIS or ArcGIS.
If your raster data lives in a STAC catalog (as it should), and that catalog points to COGs on cloud storage, you’re probably going to end up in Python with rasterio. From there, it’s actually pretty slick to read a window or grab metadata. But outside Python? It gets clunky fast. There’s no great QGIS or ArcGIS plugin yet that makes this intuitive – something like a “STAC browser for rasters” that lets you preview tiles and stream them directly into your map without downloading the whole thing. That tool needs to exist.
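For context, here’s a minimal sketch of that Python path: streaming a single window from a remote COG with rasterio. The URL is a hypothetical stand-in for a real STAC asset href.

```python
# A minimal sketch of streaming a window from a remote COG with rasterio.
# The URL is a hypothetical stand-in for a STAC item's asset href.
import rasterio
from rasterio.windows import Window

cog_url = "https://example.com/data/scene_cog.tif"  # hypothetical asset href

with rasterio.open(cog_url) as src:
    print(src.profile)  # header read only; no full download
    # HTTP range requests fetch just the tiles covering this window
    block = src.read(1, window=Window(col_off=0, row_off=0, width=512, height=512))

print(block.shape)  # (512, 512)
```

That’s the experience we need outside Python too: point at an asset, preview it, pull only the window you care about.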


2. COGs are still a pain to make.
I found this fascinating global dataset that tracks human modification from 1990 to 2017 in 5-year intervals—sliced into dozens of rasters on Google Cloud. It’s in Earth Engine, but not optimized for broader use. So I brute-forced them into COGs using GDAL. Not elegant, but it worked.
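For reference, the brute-force step looks roughly like this with GDAL’s Python bindings. This is a minimal sketch assuming GDAL 3.1+ (which ships the COG driver); the file names are hypothetical.

```python
# A minimal sketch: converting a plain GeoTIFF to a COG with GDAL's
# Python bindings. Assumes GDAL >= 3.1; file names are hypothetical.
from osgeo import gdal

gdal.UseExceptions()

gdal.Translate(
    "hm_1990_cog.tif",   # output Cloud-Optimized GeoTIFF
    "hm_1990.tif",       # source raster pulled down from cloud storage
    format="COG",
    creationOptions=["COMPRESS=DEFLATE", "BLOCKSIZE=512"],
)
```

Loop that over dozens of rasters and you have the “not elegant, but it worked” version.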

Even Earth Engine and Carto recommend doing it this way. That tells you something: we still don’t have great, opinionated tooling for going from raw raster → validated, STAC-ready, cloud-hosted COG. We’re getting there, but there’s more to do.

3. Without better training and clearer best practices, this stuff stays locked in silos.
Right now, a lot of great raster data lives in Earth Engine, and that’s fine. But if we want to build an open, interoperable ecosystem, we need to make it easier to extract that data, convert it into COGs, and publish it to cloud storage with metadata that fits into STAC. That workflow is still opaque for many people. And unless we make it easier, most of this valuable data stays behind closed APIs.
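To sketch what the publish step can look like, here’s one way to register a COG as a STAC Item with pystac. Every id, href, geometry, and timestamp below is a hypothetical placeholder, not a reference to the actual dataset.

```python
# A hedged sketch of wrapping a cloud-hosted COG in a STAC Item with pystac.
# All ids, hrefs, geometries, and timestamps are hypothetical placeholders.
from datetime import datetime, timezone

import pystac

item = pystac.Item(
    id="human-modification-1990",
    geometry={
        "type": "Polygon",
        "coordinates": [[[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]],
    },
    bbox=[-180.0, -90.0, 180.0, 90.0],
    datetime=datetime(1990, 1, 1, tzinfo=timezone.utc),
    properties={},
)

item.add_asset(
    "data",
    pystac.Asset(
        href="https://storage.googleapis.com/my-bucket/hm_1990_cog.tif",
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)

item.validate()  # requires the pystac validation extras (jsonschema)
```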

The good news? We’re not far off. A few small improvements—better plugins, clearer tooling, a handful of battle-tested workflows—could unlock a whole new level of access and scalability.

AIRFLOW

Why Airflow Isn’t Just for Big Tech

If you’ve ever stitched together a data pipeline with cron jobs, Jupyter notebooks, and a few Python scripts duct-taped with bash… you’re not alone.

Airflow might seem like overkill at first, until you hit that moment when things start breaking, or worse, silently stop running.

Even outside the geospatial world, Airflow has become the go-to orchestrator for one simple reason: it solves problems every data team runs into.

Here are the non-spatial reasons why people use Airflow—and why it’s showing up everywhere from fintech to climate analytics (a minimal DAG sketch follows the list):

1. Scheduling without chaos
Airflow replaces fragile cron jobs with readable, version-controlled DAGs. You can define complex schedules, dependencies, and retries in Python, not by guessing at timestamps.

2. Dependency management that actually makes sense
Your task doesn’t run until the one before it succeeds. It’s simple, but powerful. No more worrying about your script running on incomplete data or half-processed files.

3. Observability built in
Airflow gives you a visual DAG, logs for every task, and retry buttons when something goes sideways. You know exactly what ran, when, and why it failed, without SSH-ing into a random VM or EC2 box.

4. Modularity and reusability
Each task is just a function. Want to swap out the source from Postgres to S3? Easy. Want to run the same logic across 50 datasets? Done.

5. It plays nicely with everything
Airflow isn’t opinionated about what you’re running. Bash, Python, Spark, SQL, cloud functions: if you can script it, you can run it in Airflow.

Airflow isn’t just about “big data” or “data engineering.” It’s about making workflows predictable, observable, and maintainable—three things that matter whether you’re crunching building footprints or syncing sales data from Salesforce.

Want the spatial-specific reasons next, or a full tutorial? Hit reply and let me know; I can follow up with exactly why Airflow works so well for geospatial workflows.

NEW VIDEO/PODCAST

Is this the worst job in GIS?

A recent GIS job posting made waves online: a contract role at one of the world’s biggest tech companies, Apple, offering just $20–$22 an hour to update and maintain spatial data. For a job that asks for skills in ArcGIS, QGIS, Python, and SQL, that’s not just underwhelming, it’s a signal.

In the video, I break down why this isn’t just about one bad listing, it’s about the broader pay disparity in GIS and a structural issue I call the “technician trap.”

Too many early-career GIS professionals are stuck in roles where value is tied to tasks, not outcomes. You’re paid to complete data updates, not to contribute strategically to a business. That mindset is limiting, and it’s everywhere.

Contrast that with another job I featured in the same video, also at Apple, offering up to $300K+ for geospatial data engineering. The key difference? Strategic value. Same domain, same datasets, but one role builds systems, collaborates across teams, automates processes, and drives insight at scale.

What’s the takeaway? If you’re trying to grow your GIS career, it’s not just about learning more tools, it’s about positioning. Are you a technician or a strategic partner? Are you automating workflows or repeating manual tasks? Are you solving problems for others or just executing requests?

The good news is you can shift that narrative. In fact, I built the Modern GIS Accelerator to help people do just that—learn modern tools and reframe how they talk about their work.

The next time you see a job like this, don’t get discouraged. Get curious. Ask: How can I grow from here to where I want to be? And more importantly—who actually values what I bring to the table?

LEARN ONE THING

Apache Iceberg

Apache Iceberg isn’t just for big data anymore: it’s officially gone geospatial. With the latest updates, both Iceberg and Parquet now support spatial types, which means you can run scalable spatial analytics using open table formats built for the cloud.

This changes the game for modern GIS workflows. Think versioned spatial datasets, time travel queries, and lightning-fast reads on massive GeoParquet files, all in a way that’s interoperable and vendor-neutral.
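To make that concrete, here’s a hedged sketch of a spatial filter plus a time travel query against an Iceberg table, using Apache Sedona on Spark. The catalog config, warehouse path, table, and column names are all hypothetical placeholders.

```python
# A hedged sketch: spatial SQL over an Iceberg table with Apache Sedona
# on Spark. Assumes the sedona-spark and iceberg-spark-runtime packages
# are on the classpath; all names and paths are hypothetical.
from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)
sedona = SedonaContext.create(config)

# A spatial predicate over an open table format
df = sedona.sql("""
    SELECT id, geom
    FROM demo.db.buildings
    WHERE ST_Intersects(
        geom,
        ST_GeomFromWKT('POLYGON((-74.03 40.68, -73.93 40.68,
                                 -73.93 40.78, -74.03 40.78, -74.03 40.68))')
    )
""")

# Iceberg time travel: the same table as of an earlier point in time
old = sedona.sql(
    "SELECT count(*) FROM demo.db.buildings TIMESTAMP AS OF '2025-01-01 00:00:00'"
)
```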

To get you up to speed, check out this great intro from The Data Guy on YouTube.

Read the full announcement from Wherobots on launching geospatial support in Iceberg here as well.

If you care about scaling geospatial, this is one update you can’t ignore.

A quick note: I hope you like this new format for the weekly newsletter. If so, just hit reply and let me know – it helps to hear your feedback.
