Scaling GIS Workflows with COGs, Airflow, and Apache Iceberg

TOP OF THE STACK
What we need to do with COGs
COGs (Cloud-Optimized GeoTIFFs) are one of the most promising tools we have for making raster data truly cloud-native. They let you stream just the pieces you need, work remotely, and plug into modern geospatial systems without downloading giant files. But after working closely with them over the past few weeks, I’ve run into a few critical gaps that are holding the ecosystem back:
1. We need a better way to explore COGs—especially in QGIS or ArcGIS.
If your raster data lives in a STAC catalog (as it should), and that catalog points to COGs on cloud storage, you’re probably going to end up in Python with rasterio. From there, it’s actually pretty slick to read a window or grab metadata. But outside Python? It gets clunky fast. There’s no great QGIS or ArcGIS plugin yet that makes this intuitive – something like a “STAC browser for rasters” that lets you preview tiles and stream them directly into your map without downloading the whole thing. That tool needs to exist.
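If you are in Python, a windowed read really is just a few lines. Here’s a minimal sketch, assuming rasterio is installed and the COG is reachable over HTTP; the URL is a placeholder, not a real dataset.

```python
# Minimal sketch: stream one window of a remote COG with rasterio.
# The URL is a placeholder; point it at any COG you can reach.
import rasterio
from rasterio.windows import Window

cog_url = "https://example.com/rasters/human_modification_2017.tif"  # hypothetical

with rasterio.open(cog_url) as src:
    print(src.profile)  # metadata comes from the file header, no full download
    window = Window(col_off=0, row_off=0, width=512, height=512)
    block = src.read(1, window=window)  # only the bytes for this window are fetched
    print(block.shape, block.dtype)
```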
2. COGs are still a pain to make.
I found this fascinating global dataset that tracks human modification from 1990 to 2017 in 5-year intervals—sliced into dozens of rasters on Google Cloud. It’s in Earth Engine, but not optimized for broader use. So I brute-forced them into COGs using GDAL. Not elegant, but it worked.
Even Earth Engine and Carto recommend doing it this way. That tells you something: we still don’t have great, opinionated tooling for going from raw raster → validated, STAC-ready, cloud-hosted COG. We’re getting there, but there’s more to do.
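For reference, the brute-force conversion is only a few lines if you drive GDAL from Python. This is a rough sketch, assuming GDAL 3.1+ (which ships the COG driver); the file names are placeholders.

```python
# Rough sketch: convert a plain GeoTIFF to a COG with GDAL's COG driver (GDAL 3.1+).
# File names are placeholders.
from osgeo import gdal

gdal.UseExceptions()

gdal.Translate(
    "gHM_2017_cog.tif",   # output COG
    "gHM_2017.tif",       # input raster pulled out of Earth Engine / Google Cloud
    format="COG",
    creationOptions=["COMPRESS=DEFLATE", "BLOCKSIZE=512"],
)
```

If you have rio-cogeo installed, `rio cogeo validate` is a quick sanity check that the output really is cloud-optimized.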
3. Without better training and clearer best practices, this stuff stays locked in silos.
Right now, a lot of great raster data lives in Earth Engine, and that’s fine. But if we want to build an open, interoperable ecosystem, we need to make it easier to extract that data, convert it into COGs, and publish it to cloud storage with metadata that fits into STAC. That workflow is still opaque for many people. And unless we make it easier, most of this valuable data stays behind closed APIs.
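For what it’s worth, the publishing step isn’t much code once the COG exists. Here’s a hedged sketch with pystac; the ID, footprint, and asset URL are made up for illustration, and a real item would carry the dataset’s actual extent and timestamps.

```python
# Hedged sketch: describe a cloud-hosted COG as a STAC item with pystac.
# The ID, geometry, and asset URL are illustrative placeholders.
from datetime import datetime, timezone

import pystac

item = pystac.Item(
    id="gHM-2017",
    geometry={
        "type": "Polygon",
        "coordinates": [[[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]],
    },
    bbox=[-180.0, -90.0, 180.0, 90.0],
    datetime=datetime(2017, 1, 1, tzinfo=timezone.utc),
    properties={},
)

item.add_asset(
    "data",
    pystac.Asset(
        href="https://storage.googleapis.com/my-bucket/gHM_2017_cog.tif",  # placeholder
        media_type=pystac.MediaType.COG,
        roles=["data"],
    ),
)

item.validate()  # needs the jsonschema extra; checks against the STAC item spec
print(item.to_dict())
```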
The good news? We’re not far off. A few small improvements—better plugins, clearer tooling, a handful of battle-tested workflows—could unlock a whole new level of access and scalability.
AIRFLOW
Why Airflow Isn’t Just for Big Tech
If you’ve ever stitched together a data pipeline with cron jobs, Jupyter notebooks, and a few Python scripts duct-taped with bash… you’re not alone.
Airflow might seem like overkill at first, until you hit that moment when things start breaking, or worse, silently stop running.
Even outside the geospatial world, Airflow has become the go-to orchestrator for one simple reason: it solves problems every data team runs into.
Here are the non-spatial reasons why people use Airflow—and why it’s showing up everywhere from fintech to climate analytics:
1. Scheduling without chaos
Airflow replaces fragile cron jobs with readable, version-controlled DAGs. You can define complex schedules, dependencies, and retries in Python, not by guessing at timestamps.
2. Dependency management that actually makes sense
Your task doesn’t run until the one before it succeeds. It’s simple, but powerful. No more worrying about your script running on incomplete data or half-processed files.
3. Observability built in
Airflow gives you a visual DAG, logs for every task, and retry buttons when something goes sideways. You know exactly what ran, when, and why it failed, without SSH-ing into a random VM or EC2 box.
4. Modularity and reusability
Each task is just a function. Want to swap out the source from Postgres to S3? Easy. Want to run the same logic across 50 datasets? Done.
5. It plays nicely with everything
Airflow isn’t opinionated about what you’re running. Bash, Python, Spark, SQL, cloud functions: if you can script it, you can run it in Airflow.
Airflow isn’t just about “big data” or “data engineering.” It’s about making workflows predictable, observable, and maintainable—three things that matter whether you’re crunching building footprints or syncing sales data from Salesforce.
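To make that concrete, here’s a minimal sketch of a DAG, assuming Airflow 2.4+ and the TaskFlow API; the tasks and schedule are illustrative, not a real pipeline.

```python
# Minimal sketch of an Airflow DAG (Airflow 2.4+, TaskFlow API).
# Task bodies and names are placeholders, not a real pipeline.
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    schedule="@daily",            # readable, version-controlled schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2},  # retries handled by the scheduler, not cron
)
def raster_pipeline():
    @task
    def download() -> str:
        return "gHM_2017.tif"     # e.g. pull a raster from cloud storage

    @task
    def convert_to_cog(path: str) -> str:
        return path.replace(".tif", "_cog.tif")  # e.g. the GDAL step from earlier

    @task
    def publish(path: str) -> None:
        print(f"published {path}")  # e.g. upload and register a STAC item

    # Dependencies are explicit: each task only runs after the previous one succeeds
    publish(convert_to_cog(download()))


raster_pipeline()
```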
Let me know if you’d like the spatial-specific reasons next, or a tutorial; I can follow up with exactly why Airflow works so well for geospatial workflows.
NEW VIDEO/PODCAST
Is this the worst job in GIS?
A recent GIS job posting made waves online: a contract role at one of the world’s biggest tech companies, Apple, offering just $20–$22 an hour to update and maintain spatial data. For a job that asks for ArcGIS, QGIS, Python, and SQL, that’s not just underwhelming; it’s a signal.
In the video, I break down why this isn’t just about one bad listing; it’s about the broader pay disparity in GIS and a structural issue I call the “technician trap.”
Too many early-career GIS professionals are stuck in roles where value is tied to tasks, not outcomes. You’re paid to complete data updates, not to contribute strategically to a business. That mindset is limiting, and it’s everywhere.
Contrast that with another job I featured in the same video, also at Apple, offering upwards of $300K for geospatial data engineering. The key difference? Strategic value. Same domain, same datasets, but one role builds systems, collaborates across teams, automates processes, and drives insight at scale.
What’s the takeaway? If you’re trying to grow your GIS career, it’s not just about learning more tools, it’s about positioning. Are you a technician or a strategic partner? Are you automating workflows or repeating manual tasks? Are you solving problems for others or just executing requests?
The good news is you can shift that narrative. In fact, I built the Modern GIS Accelerator to help people do just that—learn modern tools and reframe how they talk about their work.
The next time you see a job like this, don’t get discouraged. Get curious. Ask: How can I grow from here to where I want to be? And more importantly—who actually values what I bring to the table?
LEARN ONE THING
Apache Iceberg
Apache Iceberg isn’t just for big data anymore; it’s officially gone geospatial. With the latest updates, both Iceberg and Parquet now support spatial types, which means you can run scalable spatial analytics using open table formats built for the cloud.
This changes the game for modern GIS workflows. Think versioned spatial datasets, time travel queries, and lightning-fast reads on massive GeoParquet files, all in a way that’s interoperable and vendor-neutral.
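As a rough illustration (not a recipe), here’s what that can look like from PySpark, assuming a Spark session already wired to an Iceberg catalog and Apache Sedona registered for the ST_* SQL functions. The catalog, table, and snapshot ID are placeholders, and support for the new native geometry types still depends on your engine version.

```python
# Hedged sketch: a spatial filter plus Iceberg time travel from PySpark.
# Assumes an Iceberg catalog named "lake" and Apache Sedona registered for ST_* SQL;
# the table and snapshot ID are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

aoi = "POLYGON((-105 39, -104 39, -104 40, -105 40, -105 39))"

# Spatial predicate against the current state of the table
current = spark.sql(f"""
    SELECT id, geom
    FROM lake.gis.buildings
    WHERE ST_Intersects(geom, ST_GeomFromWKT('{aoi}'))
""")

# Time travel: the same query against an earlier snapshot of the table
previous = spark.sql(f"""
    SELECT id, geom
    FROM lake.gis.buildings VERSION AS OF 1234567890123456789
    WHERE ST_Intersects(geom, ST_GeomFromWKT('{aoi}'))
""")

print(current.count(), previous.count())
```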
To get you up to speed, check out this great intro from The Data Guy on YouTube.
Read the full announcement from Wherobots on launching geospatial support in Iceberg here as well.
If you care about scaling geospatial, this is one update you can’t ignore.
A quick note: I hope you like this new format for the weekly newsletter. If so, just hit reply and let me know – it helps to hear your feedback.