
Scaling GIS Workflows with COGs, Airflow, and Apache Iceberg

April 25, 2025 · Matt Forrest

TOP OF THE STACK

What we need to do with COGs

COGs (Cloud-Optimized GeoTIFFs) are one of the most promising tools we have for making raster data truly cloud-native. They let you stream just the pieces you need, work remotely, and plug into modern geospatial systems without downloading giant files. But after working closely with them over the past few weeks, I’ve run into a few critical gaps that are holding the ecosystem back:

1. We need a better way to explore COGs—especially in QGIS or ArcGIS.
If your raster data lives in a STAC catalog (as it should), and that catalog points to COGs on cloud storage, you’re probably going to end up in Python with rasterio. From there, it’s actually pretty slick to read a window or grab metadata. But outside Python? It gets clunky fast. There’s no great QGIS or ArcGIS plugin yet that makes this intuitive – something like a “STAC browser for rasters” that lets you preview tiles and stream them directly into your map without downloading the whole thing. That tool needs to exist.
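To show what that rasterio workflow looks like, here's a minimal sketch with a hypothetical COG URL. rasterio reads the header first and then fetches only the bytes covered by the window you request:

```python
import rasterio
from rasterio.windows import Window

# Hypothetical COG on cloud storage (any HTTPS or s3:// URL works)
COG_URL = "https://example.com/rasters/human-modification-2017.tif"

with rasterio.open(COG_URL) as src:
    # Metadata comes from the file header alone, no full download
    print(src.profile)

    # Stream just a 512x512 window of band 1
    window = Window(col_off=0, row_off=0, width=512, height=512)
    data = src.read(1, window=window)
    print(data.shape, data.dtype)
```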

This is one of my weekly newsletters. To get a copy sent right to your inbox, subscribe below!

2. COGs are still a pain to make.
I found this fascinating global dataset that tracks human modification from 1990 to 2017 in 5-year intervals—sliced into dozens of rasters on Google Cloud. It’s in Earth Engine, but not optimized for broader use. So I brute-forced them into COGs using GDAL. Not elegant, but it worked.

Even Earth Engine and Carto recommend doing it this way. That tells you something: we still don’t have great, opinionated tooling for going from raw raster → validated, STAC-ready, cloud-hosted COG. We’re getting there, but there’s more to do.
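If you’re curious, the brute-force step itself is short. Here’s a minimal sketch with hypothetical filenames, using GDAL’s dedicated COG driver (available since GDAL 3.1), which handles tiling and overviews in one call:

```python
from osgeo import gdal

gdal.UseExceptions()

# Hypothetical input/output names for one of the 5-year slices
SRC = "gHM_2017.tif"
DST = "gHM_2017_cog.tif"

# The COG driver writes internal tiling and overviews automatically
gdal.Translate(
    DST,
    SRC,
    format="COG",
    creationOptions=["COMPRESS=DEFLATE", "BLOCKSIZE=512"],
)
```

Run that in a loop over the dozens of rasters and you have COGs; the opinionated tooling we’re missing would validate and publish them too.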

3. Without better training and clearer best practices, this stuff stays locked in silos.
Right now, a lot of great raster data lives in Earth Engine, and that’s fine. But if we want to build an open, interoperable ecosystem, we need to make it easier to extract that data, convert it into COGs, and publish it to cloud storage with metadata that fits into STAC. That workflow is still opaque for many people. And unless we make it easier, most of this valuable data stays behind closed APIs.
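To make the publishing half of that workflow concrete, here’s a minimal sketch of registering one COG as a STAC item with pystac; the item ID, asset URL, and global bbox are hypothetical:

```python
from datetime import datetime

import pystac

# Hypothetical item for a global, single-date raster
item = pystac.Item(
    id="ghm-2017",
    geometry={
        "type": "Polygon",
        "coordinates": [
            [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]
        ],
    },
    bbox=[-180.0, -90.0, 180.0, 90.0],
    datetime=datetime(2017, 1, 1),
    properties={},
)

# Point the item at the COG sitting on cloud storage
item.add_asset(
    "data",
    pystac.Asset(
        href="https://example.com/rasters/gHM_2017_cog.tif",
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)

item.validate()  # checks the item against the STAC schema
```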

The good news? We’re not far off. A few small improvements—better plugins, clearer tooling, a handful of battle-tested workflows—could unlock a whole new level of access and scalability.

AIRFLOW

Why Airflow Isn’t Just for Big Tech

If you’ve ever stitched together a data pipeline with cron jobs, Jupyter notebooks, and a few Python scripts duct-taped with bash… you’re not alone.

Airflow might seem like overkill at first, until you hit that moment when things start breaking, or worse, silently stop running.

Even outside the geospatial world, Airflow has become the go-to orchestrator for one simple reason: it solves problems every data team runs into.

Here are the non-spatial reasons why people use Airflow—and why it’s showing up everywhere from fintech to climate analytics:

1. Scheduling without chaos
Airflow replaces fragile cron jobs with readable, version-controlled DAGs. You can define complex schedules, dependencies, and retries in Python, not by guessing at timestamps (see the sketch after this list).

2. Dependency management that actually makes sense
Your task doesn’t run until the one before it succeeds. It’s simple, but powerful. No more worrying about your script running on incomplete data or half-processed files.

3. Observability built in
Airflow gives you a visual DAG, logs for every task, and retry buttons when something goes sideways. You know exactly what ran, when, and why it failed, without SSH-ing into a random VM or EC2 box.

4. Modularity and reusability
Each task is just a function. Want to swap the source from Postgres to S3? Easy. Want to run the same logic across 50 datasets? Done.

5. It plays nicely with everything
Airflow isn’t opinionated about what you’re running. Bash, Python, Spark, SQL, cloud functions: if you can script it, you can run it in Airflow.
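Here’s the sketch: a minimal DAG using the TaskFlow API (Airflow 2.x), with a hypothetical download → convert → publish pipeline. The schedule, retries, and task dependencies from points 1 and 2 all live in a few lines of Python:

```python
from datetime import datetime, timedelta

from airflow.decorators import dag, task


@dag(
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
)
def raster_pipeline():
    @task
    def download() -> str:
        # Fetch the source raster; the return value flows downstream
        return "/tmp/raw.tif"

    @task
    def convert_to_cog(src_path: str) -> str:
        # Convert to COG here (e.g. gdal.Translate, as above)
        return "/tmp/cog.tif"

    @task
    def publish(cog_path: str) -> None:
        # Upload to cloud storage and register in a STAC catalog
        print(f"published {cog_path}")

    # Each call wires a dependency: publish waits on convert,
    # convert waits on download
    publish(convert_to_cog(download()))


raster_pipeline()
```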

Airflow isn’t just about “big data” or “data engineering.” It’s about making workflows predictable, observable, and maintainable—three things that matter whether you’re crunching building footprints or syncing sales data from Salesforce.

Let me know if you’d like me to follow this up with the spatial-specific reasons Airflow works so well for geospatial workflows, or a full tutorial.

NEW VIDEO/PODCAST

Is this the worst job in GIS?

A recent GIS job posting made waves online: a contract role at one of the world’s biggest tech companies, Apple, offering just $20–$22 an hour to update and maintain spatial data. For a job that requests skills in ArcGIS, QGIS, Python, and SQL, that’s not just underwhelming; it’s a signal.

In the video, I break down why this isn’t just about one bad listing: it’s about the broader pay disparity in GIS and a structural issue I call the “technician trap.”

Too many early-career GIS professionals are stuck in roles where value is tied to tasks, not outcomes. You’re paid to complete data updates, not to contribute strategically to a business. That mindset is limiting, and it’s everywhere.

Contrast that with another job I featured in the same video, also at Apple, offering up to $300K+ for geospatial data engineering. The key difference? Strategic value. Same domain, same datasets, but one role builds systems, collaborates across teams, automates processes, and drives insight at scale.

What’s the takeaway? If you’re trying to grow your GIS career, it’s not just about learning more tools, it’s about positioning. Are you a technician or a strategic partner? Are you automating workflows or repeating manual tasks? Are you solving problems for others or just executing requests?

The good news is you can shift that narrative. In fact, I built the Modern GIS Accelerator to help people do just that—learn modern tools and reframe how they talk about their work.

The next time you see a job like this, don’t get discouraged. Get curious. Ask: How can I grow from here to where I want to be? And more importantly—who actually values what I bring to the table?

LEARN ONE THING

Apache Iceberg

Apache Iceberg isn’t just for big data anymore, it’s officially gone geospatial. With the latest updates, both Iceberg and Parquet now support spatial types, which means you can run scalable spatial analytics using open table formats built for the cloud.

This changes the game for modern GIS workflows. Think versioned spatial datasets, time travel queries, and lightning-fast reads on massive GeoParquet files, all in a way that’s interoperable and vendor-neutral.
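As a rough illustration of what that unlocks, here’s a sketch of a time-travel query against a spatial Iceberg table. Everything in it is hypothetical: it assumes a Spark session already configured with the Iceberg and Apache Sedona extensions, an engine that supports Iceberg’s new geometry type (support is still rolling out across the ecosystem), and a table named demo.db.buildings:

```python
from sedona.spark import SedonaContext

# Assumes the Iceberg and Sedona packages are on the classpath
# and wired into the Spark config for your catalog
spark = SedonaContext.builder().appName("iceberg-spatial").getOrCreate()
sedona = SedonaContext.create(spark)

# Iceberg time travel plus a spatial predicate in one query:
# read the table as it existed on a past date
df = sedona.sql("""
    SELECT id, geom
    FROM demo.db.buildings TIMESTAMP AS OF '2025-04-01 00:00:00'
    WHERE ST_Intersects(
        geom,
        ST_GeomFromText('POLYGON((-74.05 40.65, -73.85 40.65,
                                  -73.85 40.85, -74.05 40.85,
                                  -74.05 40.65))')
    )
""")
df.show()
```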

To get you up to speed, check out this great intro from The Data Guy on YouTube.

Read the full announcement from Wherobots on launching geospatial support in Iceberg here as well.

If you care about scaling geospatial, this is one update you can’t ignore.

A quick note: I hope you like this new format for the weekly newsletter. If so, just hit reply and let me know – it helps to hear your feedback.
