🚫 Hype vs. reality: The big data debate in geospatial

Was big data in GIS ever a thing?

When we talk about big data in geospatial and GIS, it tends to open up quite the debate.

On one hand, you have the promise of detailed insights and the ability to understand our world in ways we’ve never imagined.

On the other, there’s the stark reality that not every geospatial use case requires the heavy artillery of big data solutions.

In the data space, geospatial and otherwise, many have been swept up in the wave of big data hype. There’s a prevailing notion that large datasets are the key to solving all analytical challenges.

Yet, as Jordan Tigani eloquently pointed out in his piece “Big Data is Dead,” this narrative often misses the mark. Not every organization is drowning in data so vast that it necessitates distributed systems like Hadoop or Spark.

In fact, a surprising number of businesses have data needs that fall comfortably within the capabilities of more traditional architectures. Plenty of businesses handle data well under 100GB for their usual workloads (workloads being analytical use cases, not total data stored).

Sure, there are outliers with petabytes stored away, but even these giants seldom process more than a small subset for their queries.

The allure of big data often overshadows the practicality of these traditional systems. Why fix what isn’t broken? Many organizations are better off leveraging a simple architecture for their geospatial data needs.

What is changing is that everyone can now take advantage of the best of both worlds – the architecture behind the “big data systems” can benefit even modest data volumes. More on this later…


At its core, geospatial data can be categorized into two primary types: raster and vector (many of you know this but it bears repeating for new readers). Each comes with its own set of characteristics and challenges.

Raster data, such as satellite imagery, is pixel-based and often involves large datasets given its high-resolution detail. When I say pixels, I mean literal pixels, as these are in fact images. IMO, raster data at a planetary level is the original big data in the spatial world. And images are surprisingly efficient at storing this data.

On the other hand, vector data deals with points, lines, and polygons to represent features like roads, boundaries, and cities. As a data type this can actually get quite large (more on that later).

It is here that we see one major issue: integrating these two types of data to extract valuable insights is no small feat. Sure, clipping a raster by a polygon is a trivial task, but zonal stats across many polygons with data that changes over time? That can get tricky.

For those of you outside of geospatial, this is akin to joining an image to rows in a table. Imagine you had to identify a dog in an image and then join it to a dog in your table. Doing that once is easy, but how do you do it over and over? Or identify the breed of the dog? This is more or less the same thing, except the relationship is spatial rather than semantic.
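To make that a bit more concrete, here is a minimal sketch of that kind of raster-to-vector join – zonal statistics over a set of polygons – using geopandas and rasterstats. The file names are placeholders, not a real dataset.

```python
# A minimal sketch of a raster-to-vector join: zonal statistics over polygons.
# The file names are placeholders, not a real dataset.
import geopandas as gpd
from rasterstats import zonal_stats

# Vector side: a set of polygons (tracts, fields, watersheds, etc.)
polygons = gpd.read_file("polygons.geojson")

# Summarize the raster's values inside each polygon
stats = zonal_stats(polygons, "elevation.tif", stats=["mean", "max"])

# Attach the results back onto the vector table
polygons["raster_mean"] = [s["mean"] for s in stats]
polygons["raster_max"] = [s["max"] for s in stats]
print(polygons.head())
```

That is the easy, single-time-slice version. Repeat it across hundreds of dates or thousands of tiles and you start to see where the pain comes from.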

On top of this raster data often requires high-performance computing power, utilizing GPUs and specialized hardware to handle the sheer volume and complexity. Meanwhile, vector data can sometimes be managed with less demanding resources, but it still requires robust analytical tools for effective processing at scale.

I think it’s critical to acknowledge these inherent challenges. When you’re working with geospatial data, you’re often dealing with large file sizes as the data grows in complexity or length. A single complex polygon can consume several megabytes, dwarfing typical data types that remain in the realm of bytes.

But why does this matter? Well, the integration of raster and vector data provides a fuller picture, enabling more comprehensive analyses. Imagine you are in construction, urban planning, or insurance and you want to understand which buildings are prone to flash flood risk. You would need to use raster data that contains this information (this dataset) and join it to however many buildings you need. If you are curious you can see how I did that here:
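As a rough illustration of the idea (not the walkthrough linked above), here is a sketch that tags each building with the flood-risk value under its centroid. It assumes a flood-risk GeoTIFF and a buildings layer that share the same CRS; both file names are placeholders.

```python
# Rough sketch: tag each building with the flood-risk value under its centroid.
# Assumes flood_risk.tif and buildings.geojson share the same projected CRS;
# both file names are placeholders.
import geopandas as gpd
import rasterio

buildings = gpd.read_file("buildings.geojson")

with rasterio.open("flood_risk.tif") as src:
    # sample() takes (x, y) pairs and yields one array of band values per point
    centroids = [(pt.x, pt.y) for pt in buildings.geometry.centroid]
    buildings["flood_risk"] = [vals[0] for vals in src.sample(centroids)]

print(buildings["flood_risk"].describe())
```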

The catch is that as analytical tools mature and results come back faster and faster, that speed is going to become an expectation for any data process, geospatial included. Nobody wants to wait hours for a query to finish or for data to move from one system to another. This is where modern data stacks come into play, streamlining the process and making real-time analysis more feasible.


Why do we even care about big data in geospatial analytics? Imagine trying to understand climate patterns without access to comprehensive datasets, or attempting urban development without detailed maps and property data. That’s where big data steps in – often the data we really want lives inside these larger datasets, so we have to be able to use them even for a more modestly sized project.

Take, for example, property parcels and building footprints. These datasets are massive, not just in size but in the granularity of information they offer. We’re talking about millions of parcels, each with unique attributes like zoning regulations, land use, and ownership details. Analyzing these requires not just storage but processing power that can handle complex queries and deliver results efficiently.

Here is a challenge for you: take a set of property parcels and try to calculate, for every parcel, the average land value of its k nearest neighbors.
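If you want a feel for why that is harder than it sounds, here is an in-memory sketch using geopandas and scikit-learn. The parcels file, the land_value column, and k = 5 are placeholder assumptions.

```python
# In-memory sketch of the parcel challenge: for every parcel, average the land
# value of its k nearest neighbors. The parcels file, the land_value column,
# and k = 5 are placeholder assumptions.
import geopandas as gpd
import numpy as np
from sklearn.neighbors import NearestNeighbors

parcels = gpd.read_file("parcels.gpkg")

# Represent each parcel by its centroid (assumes a projected CRS in meters)
coords = np.column_stack([parcels.geometry.centroid.x, parcels.geometry.centroid.y])

k = 5
# k + 1 because each parcel is its own nearest neighbor; drop that column below
nn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
_, idx = nn.kneighbors(coords)

values = parcels["land_value"].to_numpy()
parcels["knn_land_value"] = values[idx[:, 1:]].mean(axis=1)
```

This is comfortable for a county’s worth of parcels; at tens of millions of rows it is exactly the kind of workload where the architecture choices discussed below start to matter.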

Moving on to raster data, the size and constant change of that data is another challenge, not to mention the multitude of ways the source data is stored (I recently had a fun experience with the SNODAS snow depth dataset from the National Snow and Ice Data Center, which is stored on an FTP server as gzipped .dat files). Google Earth Engine made its name by solving this issue: not only storing the data but also providing simple tools to access any time slice of it.
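For a flavor of what “gzipped .dat files on an FTP” means in practice, here is a rough sketch of unpacking one of those flat-binary files with numpy. The grid size, data type, byte order, and nodata value below are assumptions you would confirm against the header file that accompanies each .dat file.

```python
# Rough sketch: unpack a gzipped flat-binary .dat file into a 2D array.
# The grid size, dtype, byte order, and nodata value are assumptions you would
# confirm against the header file that ships alongside each .dat file.
import gzip
import numpy as np

ROWS, COLS = 3351, 6935   # assumed masked-CONUS grid dimensions
NODATA = -9999            # assumed nodata flag

with gzip.open("snow_depth.dat.gz", "rb") as f:
    raw = np.frombuffer(f.read(), dtype=">i2")  # assumed big-endian 16-bit ints

grid = raw.reshape(ROWS, COLS).astype("float32")
grid[grid == NODATA] = np.nan
print(np.nanmean(grid))
```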

But let’s imagine a world where you don’t need to move this data around. You can use multiple platforms depending on your issue and bring your computational framework to the data, not the other way around.

I want to be really clear here – this is what we mean when we say cloud-native geospatial. It’s not just big data, but the way we interact with it and ultimately making it easier to access.

The modern data stack, with its ability to integrate various tools and platforms seamlessly, makes it realistic to pick off just the parts that are causing you the most headaches.

Technologies like Apache Iceberg, DuckDB, and Trino have revolutionized how organizations store and query data. They facilitate efficient data management, allowing organizations to perform complex analyses without the traditional bottlenecks of data movement and processing delays.

But here’s the kicker: the shift towards cloud-native solutions has further enhanced this capability. With data increasingly residing in the cloud, the need to physically move data from one system to another diminishes. Instead, data can be accessed and processed where it lives, reducing latency and improving efficiency.


Let’s take a look at how this evolved with a short case study of the modern data stack. It’s a paradigm shift in how data is managed, processed, and analyzed, offering a streamlined approach that minimizes the friction often encountered with traditional systems.

Take Apache Iceberg, for instance. It’s a high-performance table format for huge analytics datasets, designed to handle petabyte-scale data with ease. Iceberg simplifies the management of large datasets, providing a clear structure and efficient querying capabilities, which is crucial for handling the vast amounts of geospatial data generated today.

The best part is that it is all just files in a cloud storage bucket. Nothing more, nothing less.
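To give a flavor of what that looks like from the analyst’s side, here is a minimal PyIceberg sketch. The catalog name, table identifier, and filter are placeholders, and the catalog connection details would come from PyIceberg’s config file or environment variables.

```python
# Minimal sketch of reading an Iceberg table with PyIceberg.
# The catalog name, table identifier, and filter are placeholders; the catalog
# connection details come from PyIceberg's config file or environment variables.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog")
table = catalog.load_table("analytics.parcels")

# The filter is pushed down to the scan, so only matching files are read
arrow_table = table.scan(row_filter="land_value >= 0").to_arrow()
print(arrow_table.num_rows)
```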

As for DuckDB, what I love about it is its ability to execute complex queries directly at the source, reducing the need for data movement. It’s fast, scalable, and can handle substantial analytical queries without the overhead typically associated with larger systems.
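For example, here is a rough sketch of DuckDB querying Parquet files sitting in object storage without a download step. The bucket path and column names are placeholders, and it assumes either a public bucket or credentials already configured for the httpfs extension.

```python
# Rough sketch: DuckDB querying Parquet files directly in object storage.
# The bucket path and column names are placeholders; assumes a public bucket
# or credentials already configured for the httpfs extension.
import duckdb

con = duckdb.connect()
for ext in ("httpfs", "spatial"):
    con.install_extension(ext)
    con.load_extension(ext)

# No download step: DuckDB reads only the byte ranges it needs
result = con.sql("""
    SELECT county, count(*) AS buildings
    FROM read_parquet('s3://my-bucket/buildings/*.parquet')
    WHERE ST_Area(ST_GeomFromWKB(geometry)) > 100  -- assumes WKB-encoded geometry
    GROUP BY county
    ORDER BY buildings DESC
""").df()
print(result.head())
```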

With Apache Sedona and Wherobots, Sedona provides the functional layer for working with spatial data, while the compute layer, be it Spark, Wherobots Cloud, Snowflake, Apache Flink, Databricks, etc., brings the processing to the storage location. That storage could be your own cloud buckets or public data such as that stored in AWS Earth.
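Here is a minimal Sedona-on-Spark sketch of the same idea, assuming the Sedona packages are already on the Spark classpath; the input path, column names, and the polygon are placeholders.

```python
# Minimal sketch of Apache Sedona running spatial SQL on Spark.
# Assumes the Sedona packages are already on the Spark classpath; the input
# path, column names, and the polygon below are placeholders.
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

# Read GeoParquet straight from object storage and query it with plain SQL
parcels = sedona.read.format("geoparquet").load("s3a://my-bucket/parcels/")
parcels.createOrReplaceTempView("parcels")

result = sedona.sql("""
    SELECT parcel_id
    FROM parcels
    WHERE ST_Contains(
        ST_GeomFromText('POLYGON((-105.3 39.9, -105.1 39.9, -105.1 40.1, -105.3 40.1, -105.3 39.9))'),
        geometry)
""")
result.show(5)
```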

Take a look at the graphic below.

The Modern Geospatial Data Stack

What is great is that once you define your approach to any one of these areas, you can then choose to scale those parts up or down as needed. Or, if you choose, change the approach entirely. You are never locked into any one thing, since you have created a modular approach across your systems.

For those in the geospatial field, embracing this philosophy means staying ahead of the curve, ready to tackle the challenges and opportunities that big data, or any data, presents.


Yeah, but how do you adapt?

Big data analytics, especially in geospatial contexts, demands a unique blend of expertise. You need individuals who understand not only the intricacies of geospatial data but also the complexities of advanced data analytics tools and systems. You are in fact asking someone to be a jack of all trades. Yes people like this exist, but should that be the way we force people to learn?

Many organizations find themselves stuck in a catch-22. They have the data and the desire to leverage it, but they lack the personnel with the necessary skills to fully utilize these cloud-native tools.

Yes you can pair your geospatial teams with cloud engineers, but you also want your teams to be nimble so as not to need infrastructure support on a frequent basis.

The other tricky part is that the modern data stack has not evolved to meet the needs of geospatial data, resulting in a lot of what I like to call “duct tape” fixes – for big data or otherwise.

For example, consider a workflow involving BigQuery, Earth Engine, and Cloud Functions. You might start by conducting an analysis in Earth Engine, moving the vectorized results into BigQuery, automating the process with Cloud Run and Cloud Functions, and finally serving the results. Sure, five steps sound easy now, but then you are the one who has to make sure it is all working, all the time.

To overcome these challenges, organizations need to invest in both people and technology. Training and development programs are essential to equip teams with the skills needed to navigate the modern data landscape.

Meanwhile, choosing where you want your teams to spend their time is also key. Can you find managed services or managed infrastructure that can solve some of these headaches (aka remove the duct tape) and let your team do the work they want to while taking advantage of a cloud-native architecture? (hint hint)

I think the key here is to view these challenges not as insurmountable obstacles but as opportunities for growth and innovation. Invest in and understand the value of a cloud-native infrastructure. Pick the areas where you want your team to be nimble and the areas you want to farm out to make life easier. This helps your team do more and take advantage of all the technology shifts I mentioned above.


Okay let’s bring it home here:

  1. Yes, there is a ton of hype around data volume
  2. But that isn’t the important part; the technology that supports big data can, and often does, make everyone’s life easier
  3. Cloud-native geospatial ≠ big geospatial data
  4. If you have ever had issues moving data around, or processes that run for hours or days, then you are in the right place.

If this sounds like the approach for you then what are the next steps you should be thinking about?

  1. Pick and commit to cloud storage as the backbone of your data architecture, along with a catalog system for your data, which will likely be Apache Iceberg
  2. Pick a processing platform for your data – decide which infrastructure and approach (along with the functional tools inside that platform) make the most sense for how you actually process your data.
  3. Don’t throw out your desktop GIS systems, but pick one that works well on the web too.
  4. Figure out what your biggest headaches and pain points are and look for a platform that can solve them. This is where you should focus, and you can add on later as you need.