What the Modern Data Stack can teach us about the future of geospatial

There are a few truths you can always count on in geospatial.
Something is always changing.
There’s always a new way to do something.
Spatial is special.
Spatial data needs to be democratized.
Spatial data is in a silo.
Spatial data is a second class citizen.
I have heard so many of these over the years, and many a presentation starts with these phrases. And look, I agree with a lot of them and have experienced them firsthand. That's not to say we haven't made inroads on many of them either.
Nearly every business intelligence tool includes support for maps. PostGIS brought true spatial support to one of the most popular databases in the world. GeoPandas brings spatial support to the Pandas API. By my last count there are close to 50 SQL tools that support geospatial data.
And yet for some reason, many in our industry find ourselves working uphill to try to drive adoption and get people to play in the spatial sandbox. You know it and I know it, and it is an uncomfortable and inconvenient truth: geospatial and GIS professionals are usually on the outside looking in at the rest of the data world.
All is not lost though. There is one key thing that I think we need to look at: the set of conditions that would need to exist to make spatial data a necessity, not just a nice-to-have as it is in many places today. This post will dive into the conditions that saw data and analytics boom over the past 10 years and what needs to happen in spatial to make the same thing happen.
How the modern data stack was formed
There have only ever been a few sectors or industries where spatial data has been a firm need or requirement. Agriculture takes place on the land, and spatial factors play a huge role in farming and in analyzing that data at scale. Defense is another, for many reasons that I don't think need to be listed here, but needless to say, GIS was basically born out of the defense sector. And the list goes on: transportation, government, construction, planning, and more.
Analytics, and what I mean by that is business-facing analytics, has historically not been one of these. By analytics I mean all the practice areas and fields that exploded in the 2010s and after, driven by the increase in data produced mainly by tech firms, but increasingly by all companies, allowing them to understand deeper insights about their products and customers.
This wave of data gave rise to practice areas like data science, data engineering, and analytics engineering, as well as the tools and ecosystems needed to support this volume of data.
What was this data? Well, increasingly companies started to collect a massive amount of data from the interactions customers were having in their products. This can be bucketed up and called log data for simplicity, but that doesn't really do it justice.
Let's take Airbnb as an example. Every time you log in, view a listing, book a listing, search for a listing, or basically take any action in the platform, that action is written to a dataset somewhere. Eventually all that data was used to create insights and value: to better serve you and show you the right things in the platform, and internally by the company to optimize its processes. This disparate data was eventually organized into data tables that could be joined and used, and projects and infrastructure were stood up to support this. Open source tools like Apache Airflow and Apache Superset came out of Airbnb for these specific needs.
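To make that pattern concrete, here's a minimal sketch of what one of those pipelines might look like in Apache Airflow. The pipeline name, table, and transformation are hypothetical; the point is simply raw event logs being rolled up on a schedule into something analysts can join against.

```python
# Minimal, hypothetical Airflow DAG: roll raw event logs up into a
# daily table that analysts can join against listing data.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def aggregate_daily_events(**context):
    # Placeholder for the real transformation, e.g. a SQL job that
    # groups raw "listing_viewed" / "listing_booked" events by day.
    print(f"Aggregating events for {context['ds']}")


with DAG(
    dag_id="daily_event_rollup",       # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="aggregate_daily_events",
        python_callable=aggregate_daily_events,
    )
```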


If you want to dig into this, the Airbnb Tech Blog is an amazing living record, not just of the data infrastructure but of the data science work too. And it's not just Airbnb; companies like Netflix, Facebook, and others were doing many of the same things. Eventually these tools became the backbone of the modern data stack, data science practices, and more.
So because of this, it makes sense to me to look deeper into the rise of analytics to see if geospatial has a place there, and if so, what it looks like.
The playbook, and what it means for spatial data
Over time, as this workflow evolved and expanded beyond some of the large tech companies in Silicon Valley, different companies started to take form around it, creating products out of some of the open source projects that had been created.
Of course, it didn't happen right away. There were certain conditions that preceded these rapid advancements in analytics products. But let's take a look at that workflow and see what exactly happened and what it means for us in the geospatial world.
- Volume of data increases and different data types are made available. Traditional relational databases struggled to handle this volume.
- Tools like Hadoop and Spark showed that distributed computing was possible, but they were difficult to manage in an enterprise setting without the right expertise.
- Cloud services started to gain vast adoption. Storage became cheap, and different tools were able to separate compute and storage while processing at scale and at low cost.
- Teams were demanding data access beyond engineers, and companies such as Databricks and Snowflake capitalized on this by commercializing these concepts or, in the case of Databricks, leveraging open source tools like Spark.
- Companies were happy to adopt this by leveraging easy to onboard tools and infrastructure-as-a-service or software-as-a-service based products.
- Data lakes, and more importantly the data lakehouse, started to gain adoption. Table-based formats for cloud storage started to grow, and a complete ecosystem was built around these tools.
And this is an oversimplification of the process, and there's a lot of detail that I skipped over, but the main point here is that all of this started out of a need to leverage these types of data and insights outside of the traditional means and at large scale.
So how can we look at this process and see if there are any parallels to the geospatial world today?
- The amount of data increases and tooling is built around it to support that data.
- Open source projects start to roll out and form into different companies and products.
- The ability to separate compute and storage becomes commonplace.
- Teams are demanding access to this data and these insights beyond the traditional groups that have used them.
- Demand grows for easy to onboard products.
- Leveraging a cloud storage-based infrastructure with separated compute and table-based formats becomes commonplace.
I think you can see that there are certainly some of these things taking place in our world today, but even at that point, there are still some things that are missing. So let’s break this down one by one and start to understand where we are and where there’s still room to grow.
So when is this coming to geospatial?
Sure, geospatial data is growing
Sometimes it feels like a broken record to say that there are large amounts of geospatial data sitting around waiting to be used, but yes, that is the reality. Actually looking at what this data is can help tell us where we are going to go. So let's take a look at some of the largest datasets that are expanding and will continue to expand over time.
There are really two dimensions that I want to measure here. First is the current existing volume, and second is the potential for it to continuously expand. With that, I think there are really three areas we need to focus on, areas that traditional systems are not able to handle right now.
- Satellite imagery: This one should be a no-brainer, but there are satellites orbiting the Earth that are constantly collecting data. Lots of data, and they're not going to stop doing it. In fact, they're probably going to keep collecting more. This data is, for the most part, sitting around waiting to be used.
- GPS and/or IoT data: Things that move collect data and send it somewhere. You could look at this as vehicles transporting goods on land and on the water, delivery drivers, planes, wearables sending data while you exercise, you name it, anything that is in movement.
- Weather and climate: Weather happens, weather will continue to happen, and we will continue to collect observations and measurements about what's taking place. There's a lot of this data, and it's going to keep coming.
As you can see, I have excluded a lot of different data here. Things like census data, polygon-based datasets, and other data are certainly large in scale, but they don't necessarily require the compute demand that was mirrored in the adoption of the modern data stack.
I also excluded internally generated data, things like raster or vector datasets that are produced in-house and constantly being added to. This is certainly a real case, if more of a one-off, but one that needs to be at least mentioned in this post.
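As a concrete taste of the first bucket, here's a minimal sketch of pulling satellite imagery through a STAC catalog and streaming it as a cloud-optimized GeoTIFF rather than downloading whole scenes. The catalog endpoint, collection, and bounding box are just illustrative choices.

```python
# Minimal sketch: search a public STAC catalog for Sentinel-2 scenes and
# open one band as a cloud-optimized GeoTIFF. Endpoint, collection, and
# bbox are illustrative assumptions.
from pystac_client import Client
import rioxarray

catalog = Client.open("https://earth-search.aws.element84.com/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-122.6, 37.6, -122.3, 37.9],      # roughly the San Francisco area
    datetime="2024-06-01/2024-06-30",
    max_items=1,
)
item = next(search.items())

# Open just the red band; the COG layout means only the requested
# windows get fetched over the network, not the whole file.
red = rioxarray.open_rasterio(item.assets["red"].href)
print(red.shape, red.rio.crs)
```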
Projects are being formed to work with this data
Now when you look at this data, this is one of the biggest gaps in the geospatial space currently. I think there is massive potential in working with this type of data and integrating it into different business and analytics workflows.
On top of that, this is some of the toughest data to work with. If you look at a lot of portfolio projects, many of them ignore these types of datasets. You can also see a lot of data providers spinning up just to handle some of these discrete problems and building companies centered around these types of datasets.
And on top of this, there are also projects that are being spun up to specifically manage and work with this data.
- Cloud-native geospatial formats
- Apache Sedona
- cuSpatial
- Dask
Each of these has different focus areas and different data types, but for the most part, they work on and solve the same kinds of problems that the early data engineering innovations at Silicon Valley companies looked to address.
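To give a flavor of what that looks like in practice, here's a minimal sketch using Apache Sedona's Python API to run a distributed point-in-polygon join with Spark doing the heavy lifting. The file paths and column names are hypothetical.

```python
# Minimal sketch: a distributed point-in-polygon join with Apache Sedona.
# Bucket paths and column names are hypothetical.
from sedona.spark import SedonaContext

config = SedonaContext.builder().appName("spatial-join-sketch").getOrCreate()
sedona = SedonaContext.create(config)

# Sedona's GeoParquet reader registers the geometry column automatically.
pings = sedona.read.format("geoparquet").load("s3://my-bucket/gps_pings/")
zones = sedona.read.format("geoparquet").load("s3://my-bucket/zones/")
pings.createOrReplaceTempView("pings")
zones.createOrReplaceTempView("zones")

# Count pings per zone with a spatial predicate; Sedona plans this as a
# distributed spatial join across the cluster.
counts = sedona.sql("""
    SELECT z.zone_id, COUNT(*) AS ping_count
    FROM zones z, pings p
    WHERE ST_Contains(z.geometry, p.geometry)
    GROUP BY z.zone_id
""")
counts.show()
```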
And of course we’re seeing companies form around these different projects. From Coiled with Dask to Wherobots with Apache Sedona, to Earthmover for climate data, there are a number of companies that are starting to form around these different toolkits.
The big players are just catching on but may not be the right fit
Over the next few weeks and months, you are likely to see a slew of announcements from some of the larger players from the last wave of the data and analytics space. Databricks and Snowflake have adopted Apache Iceberg as an open table format. Increasingly, they're also starting to add more spatial functionality to their platforms.
And Apache Iceberg has adopted geospatial data types that will be coming in the V3 release of the spec, thanks to the hard work of many people listed on this GitHub issue: https://github.com/apache/iceberg/issues/10260
Even Apache Spark is planning to add geospatial data type support into its core library. But there’s one tricky truth that we have to take a look at, and that is that these large data platforms that were grown out of this last round of innovation are primarily focused on tabular data, and in the geospatial world, that means vector data.
While there are plenty of use cases for large distributed processing of vector data, the three core data types that I listed above are fundamentally not vector data types. Getting them into that shape would require new formats or data transformations, which is not ideal given that those data sources will continue to churn out data in their native formats.
Side note: Yes, I know that GPS data is effectively a latitude/longitude ping, but the processing required to work with that data, create trajectories, and do a number of other things requires specialized techniques.
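For a sense of what those specialized techniques look like, here's a small sketch using MovingPandas, one library built for exactly this, to turn raw pings into trajectory objects. The file and column names ("vehicle_id", "t", "lon", "lat") are hypothetical.

```python
# Minimal sketch: turn raw GPS pings into trajectories with MovingPandas.
# File name and column names are hypothetical.
import geopandas as gpd
import movingpandas as mpd
import pandas as pd

df = pd.read_csv("pings.csv", parse_dates=["t"])
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.lon, df.lat),
    crs="EPSG:4326",
)

# Group pings by vehicle and timestamp into trajectory objects.
trajs = mpd.TrajectoryCollection(gdf, traj_id_col="vehicle_id", t="t")
print(len(trajs), "trajectories")

first = trajs.trajectories[0]
print(first.get_start_time(), "->", first.get_end_time())
```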
So while these platforms will support geospatial data, I think the other wave of libraries and tools that are already in this space, or the ones that have yet to be created, are better suited for these problems.
People want this data
I know geospatial has always been one of those things that you have to advocate for, but we may be reaching a point in time where that is no longer the norm, especially with these data types.
If there is one thing to take away from this article here it is:
Many companies big and small want to leverage this type of data that they’re collecting. They need to use satellite imagery to collect and analyze features or understand the way the world is changing. They need climate data to inform their operations and make better decisions. And they need to leverage the GPS and mobility data that they’re collecting to understand how things are moving in real time and how to optimize them better in the future.
If you don't believe me, take a look around: Google some of the large tech companies that are working on optimization problems or looking to integrate climate and weather data into the things that they're doing. I think you'll start to see some interesting patterns.
And increasingly they're going to want it in the formats and tools that they already use. That is why open table formats like Apache Iceberg, Delta Lake, and the newly announced DuckLake will be so important.
And if history tells us anything, it's that companies will want to leverage these tools in easy-to-onboard infrastructure-as-a-service and software-as-a-service platforms. The one caveat here is that the future has already been written to some degree, both in how they will want this to integrate and in the tools they will want to use: cloud-based storage, separated compute, and open table formats, if that wasn't clear already.
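Here's a minimal sketch of what that pattern looks like today: DuckDB acting as the separated compute, object storage holding GeoParquet, queried in place. The bucket path and column names are hypothetical, and the same idea carries over to tables managed through Iceberg or Delta Lake.

```python
# Minimal sketch: query GeoParquet sitting in object storage directly,
# with compute running wherever this script runs. Assumes a recent DuckDB
# where the spatial extension reads GeoParquet geometry columns as GEOMETRY,
# and that S3 credentials are already configured. Paths are hypothetical.
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("INSTALL httpfs; LOAD httpfs;")

result = con.execute("""
    SELECT ST_GeometryType(geometry) AS geom_type, COUNT(*) AS n
    FROM read_parquet('s3://my-bucket/buildings/*.parquet')
    GROUP BY geom_type
    ORDER BY n DESC
""").df()
print(result)
```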
So what can we learn from this?
So what does that mean for our world and what should we be looking at as focus areas for the things that we’re working on?
- Large-scale companies are smart, and they don’t need us to democratize geospatial for them. In fact, many of them have already been doing it due to the ecosystem that is already in place.
- However, some of the formats and data types listed above are particularly tricky and require different techniques than the existing tooling provides.
- If the demand is there, and I do believe that it is, then this will become a business-critical operation, and there will need to be systems and tools to support it.
- Open source is where this stuff starts, and I think that has already happened and will continue to happen. So let’s look there as the first place to see innovation and productization take place.
- Tooling, infrastructure, and techniques to extract insights from raw imagery, to work with weather and climate data, and to effectively process and draw insights from mobility data at scale are three key areas to focus on (there's a small sketch of the weather and climate piece after this list).
- The market for data scientists boomed around the time these tools first took off. If that's any indication of what's going to take place, I will keep my eyes on the demand for roles focused on working with imagery and turning it into proper insights by extracting features, leveraging weather and climate data to integrate into existing data stacks, and building tools and systems that can effectively handle the sheer volume of GPS and mobility data being created.
- I didn't mention AI, but that will be a massive consumer of this data as well. The first time around it was primarily humans; this time around it will be humans and machines that need to work with this data.
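As one example of what that tooling looks like on the weather and climate side, here's a minimal sketch of opening a cloud-hosted Zarr store lazily with xarray and pulling out a small slice. The store path and variable name are hypothetical, and reading from S3 assumes fsspec/s3fs are installed.

```python
# Minimal sketch: open a cloud-hosted Zarr store lazily with xarray and
# pull a small slice. Store path and variable name are hypothetical.
import xarray as xr

ds = xr.open_zarr("s3://my-bucket/era5-sample.zarr", consolidated=True)

# Nothing is loaded yet; this just builds a lazy view of the arrays.
subset = ds["2m_temperature"].sel(
    time="2024-06-01",
    latitude=slice(50, 40),    # many reanalysis grids store latitude descending
    longitude=slice(0, 10),
)
print(subset.mean().compute())  # only now are chunks fetched and reduced
```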
To sum it up, I think there will be a lot of winners coming out of this boom. The playbook has already been written to some degree. Those that can build and scale solutions around these types of problems, and bring them to many organizations to use effectively and quickly, can help paint the future of the way we work with geospatial data.
This was a pretty high level post, but I want to dive into some of these topics in more detail. Here’s a short list of some of the things that I’m going to be working on, both in blog format and in video format.
- Cloud storage and why it matters
- Cloud native formats
- GeoParquet
- COG and STAC
- Zarr
- Managing data pipelines is Step 1
- Data Lakes, the Lakehouse, Iceberg, and table formats
- Consuming data in query engines (DuckDB, PostGIS, Apache Sedona, Trino, etc.)
- Pros and cons of adopting the modern data stack for geospatial
- What about Google Earth Engine?
- The open source systems leading the way
- Cloud: When and Why
- The extensible Iceberg catalog (Polaris)
- The three things you need: Processing, Query Engine, Application Layer
- Making machine learning and deep learning easy
- Connecting with LLMs