
Spatial Data Lakehouse Architecture: Beyond H3 and Vector-Only Data

Storage and formats are the foundation. Pipelines make them practical. But storage and processing architectures determine what’s actually possible at scale.

We’ve talked about cloud storage enabling modern geospatial workflows and the formats that make data queryable. We’ve covered the pipelines that get your data into those formats. But there’s a bigger architectural question that determines whether you can actually build production geospatial systems that work with the three data types I outlined in the first post: satellite imagery, GPS/IoT data, and weather/climate data.

That architectural question is whether you’re building on data lake foundations, modern lakehouse architectures, or something else entirely.

The data lake and lakehouse concepts have been game-changers for analytics workloads. But as geospatial data moves toward these architectures, we need to be clear about what makes spatial data different and what a true spatial data lakehouse should actually look like.

What data lakes and lakehouses enable

Before we dive into spatial requirements, let’s establish what these architectures accomplish and why they matter for geospatial workloads.

Data lakes solved the problem of storing massive amounts of diverse data cheaply. Instead of forcing everything into predefined database schemas, you could dump raw data into cloud storage and figure out the structure later. This was perfect for the volume and variety of data that modern applications generate.

Lakehouses solved the data lake’s reliability and performance problems. Tools like Apache Iceberg, Delta Lake, and Apache Hudi brought ACID transactions, schema evolution, and time travel to data stored in cloud storage. You could get the flexibility of a data lake with the reliability and performance characteristics of a data warehouse.

Table formats like Iceberg make this possible by creating a metadata layer on top of data lake storage. Instead of just having a bunch of Parquet files in S3, you have a table that knows which files belong to it, how they’re partitioned, what schema they follow, and how they’ve changed over time.

This enables some powerful capabilities that traditional databases struggle with:

Governance and compliance. You can see exactly what data looked like at any point in time, audit who accessed what, and implement fine-grained access controls across petabyte-scale datasets.

Efficient partitioning. You can partition data by time, geography, or any other dimension and query engines will automatically read only the partitions they need.

Time travel and versioning. You can roll back to previous versions of data, compare datasets across time periods, or create reproducible analysis environments.

A concrete example: Satellite imagery with Iceberg

Let’s look at how this works with a specific geospatial example. Imagine you’re managing a table of Landsat imagery that gets updated daily with new scenes.

With traditional approaches, you might store each image as a separate GeoTIFF file in S3 with some naming convention like landsat_8_2025_01_15_path_123_row_456.tif. Finding all images for a specific location and date range requires listing thousands of files and parsing filenames.

With Iceberg, you create a table schema that includes spatial and temporal metadata:

CREATE TABLE landsat_scenes (
  scene_id STRING,
  acquisition_date DATE,
  path INT,
  row INT,
  cloud_cover FLOAT,
  geometry BINARY,  -- scene footprint as Well-Known Binary (WKB)
  image_location STRING  -- S3 path to the actual COG file
) USING iceberg
PARTITIONED BY (
  year(acquisition_date),
  path
)

Now you can query this table like any other table, but Iceberg handles the complexity of finding the right files:

SELECT scene_id, acquisition_date, image_location
FROM landsat_scenes
WHERE acquisition_date >= '2024-01-01'
  AND path BETWEEN 120 AND 130
  AND cloud_cover < 10

Iceberg’s metadata layer knows which partitions contain data matching your query and will only scan those files. The query might touch 50 Parquet files instead of listing 10,000 GeoTIFF files.
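
If you want to sanity-check that pruning, Iceberg exposes its own metadata as queryable tables. A minimal sketch, assuming a Spark session with an Iceberg catalog (metadata tables usually need the fully qualified catalog.database.table name):

-- Data files and the partition values Iceberg tracks for each of them
-- (qualify the name, e.g. my_catalog.geo.landsat_scenes.files, if needed)
SELECT file_path, partition, record_count
FROM landsat_scenes.files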

This isn’t just conjecture, either: you can do this today using the STAC reader in Wherobots or in Apache Sedona.

Time travel means you can see exactly what imagery was available at any point in time:

SELECT * FROM landsat_scenes
TIMESTAMP AS OF '2024-12-01 00:00:00'
WHERE path = 123 AND row = 456

Schema evolution means you can add new columns (like atmospheric correction parameters) without rewriting existing data.
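
As a sketch of what that looks like in Iceberg’s DDL, adding those columns is a metadata-only operation; the column names here are just illustrative:

-- Add columns without rewriting any existing Parquet files;
-- old rows simply read the new columns as NULL
ALTER TABLE landsat_scenes ADD COLUMNS (
  atmospheric_correction STRING,
  processing_baseline STRING
)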

Governance means you can track who accessed which imagery and implement access controls based on geographic regions or data sensitivity.

This is what makes Iceberg and other table formats powerful for geospatial data: they provide database-like capabilities on top of cloud storage-native formats.

The Databricks approach and its limitations

Databricks has written extensively about building geospatial lakehouses (the first part is here and the second part here), and their approach has some real strengths. They’ve built solid infrastructure for processing vector data at scale, with strong support for H3 spatial indexing and integration with Apache Sedona for distributed spatial processing.

But their approach makes a fundamental trade-off: they’ve optimized for discrete spatial analytics using H3 hexagonal indexing rather than preserving native geometries. H3 converts spatial data into a grid system that enables fast joins and aggregations but sacrifices spatial precision – I recently wrote about some of the issues with this on the Apache Sedona blog.

As Databricks themselves acknowledge: “Working exclusively with H3 comes with some trade-offs. Precision is dependent on your index resolution, and the conversion of polygons to index space can produce non-indexed gaps along geometry boundaries. These gaps determine the accuracy of a spatial join.” (Source)
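
To make that trade-off concrete, here’s a sketch in Apache Sedona’s SQL dialect comparing an exact spatial join against an H3 cell join. The trips and zones tables are hypothetical, and ST_H3CellIDs assumes a recent Sedona release:

-- Exact join: preserves full geometry precision
SELECT t.trip_id, z.zone_id
FROM trips t
JOIN zones z
  ON ST_Intersects(t.geom, z.geom)

-- Approximate join: explode each geometry into H3 cells at resolution 9
-- and join on cell id. Fast, but cells straddling zone boundaries produce
-- false matches or misses depending on resolution and coverage mode.
SELECT DISTINCT t.trip_id, z.zone_id
FROM (SELECT trip_id, explode(ST_H3CellIDs(geom, 9, true)) AS cell FROM trips) t
JOIN (SELECT zone_id, explode(ST_H3CellIDs(geom, 9, true)) AS cell FROM zones) z
  ON t.cell = z.cell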

This approach works well for certain use cases – analyzing mobility patterns, retail site selection, or supply chain optimization where approximate spatial relationships are sufficient. But it breaks down when you need precise spatial operations or when you’re working with the three data types that are driving modern geospatial growth.

For satellite imagery and earth observation data, H3 indexing doesn’t help. You can’t convert a multispectral raster into hexagonal cells without losing the spectral and spatial information that makes it valuable.

For GPS and mobility data, H3 works for aggregated analysis but fails for trajectory analysis, map matching, or routing algorithms that need precise coordinate sequences.

For weather and climate data, the multidimensional array structure (time, lat, lon, elevation, variables) doesn’t map cleanly to H3’s two-dimensional grid system.

More fundamentally, Databricks’ approach doesn’t unify these different data types. Their geospatial lakehouse is primarily designed for vector data analysis, with imagery and climate data handled as separate, specialized workflows.

What a spatial data lakehouse should be

A true spatial data lakehouse needs to handle all three core geospatial data types – vector geometries, raster imagery, and multidimensional arrays – within a unified architecture. Here’s what that requires:

Native spatial data type support

The lakehouse should support spatial data types natively, not just as approximations. This means:

  • Geometry columns that preserve exact coordinate sequences, not just H3 approximations
  • Raster columns that reference imagery with spatial metadata (coordinate systems, pixel size, band information)
  • Array columns that handle multidimensional data with spatial and temporal dimensions
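
A sketch of what that could look like as table DDL today, where the raster and array “columns” are really references plus the metadata needed to interpret them (all names here are illustrative):

CREATE TABLE observations (
  id STRING,
  geom BINARY,               -- exact vector geometry as WKB
  raster_uri STRING,         -- COG location in object storage
  raster_crs STRING,         -- coordinate reference system of the raster
  raster_pixel_size DOUBLE,  -- ground sampling distance in CRS units
  zarr_uri STRING,           -- Zarr store holding the multidimensional array
  zarr_variable STRING,      -- variable within the store (e.g. temperature)
  observed_at TIMESTAMP
) USING iceberg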

Spatial-aware partitioning strategies

Standard partitioning by date or categorical values isn’t sufficient for spatial data. A spatial lakehouse needs:

  • Geographic partitioning that can organize data by spatial extents, not just administrative boundaries
  • Multi-resolution partitioning that can efficiently handle data at different spatial scales
  • Temporal-spatial partitioning for time series data that varies by location
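
One way to approximate this with today’s table formats is to combine a temporal partition transform with a precomputed coarse spatial key. A sketch, assuming the ingestion pipeline writes a geohash prefix for each record:

CREATE TABLE gps_pings (
  device_id STRING,
  observed_at TIMESTAMP,
  geom BINARY,
  geohash4 STRING  -- first four characters of the point's geohash
) USING iceberg
PARTITIONED BY (
  month(observed_at),
  geohash4
)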

Cross-format query capabilities

The lakehouse should be able to join vector, raster, and array data in the same query. This means:

  • Spatial predicates that work across data types (point-in-raster, geometry-intersects-array-cell)
  • Temporal joins that can align data with different temporal resolutions
  • Coordinate system handling that can reproject data on-the-fly during queries
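
Some of this is expressible today. For example, Apache Sedona’s raster SQL functions can sample a raster under a point geometry. This sketch assumes a stations table of point geometries and a rasters table whose tile column holds loaded raster tiles (both tables are hypothetical):

-- For each station, read the pixel value of the raster tile it falls in
SELECT s.station_id, RS_Value(r.tile, s.geom) AS pixel_value
FROM stations s
JOIN rasters r
  ON RS_Intersects(r.tile, s.geom)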

Proper spatial indexing

Spatial queries need spatial indices; sorting and partitioning on non-spatial keys isn’t enough. This requires:

  • R-tree indexing or space-filling-curve ordering (such as Hilbert curves) for geometry data
  • Spatial chunking strategies for raster and array data
  • Multi-level indexing that can handle data at different scales efficiently
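
With Iceberg specifically, you can get part of the way there by ordering writes on a space-filling-curve-style key. A sketch that reuses the geohash column from the partitioning example above (engines with native Hilbert or Z-order clustering can do better):

-- Cluster data files so spatially nearby records land in the same files,
-- letting min/max column statistics prune files for spatial queries
ALTER TABLE gps_pings WRITE ORDERED BY geohash4, observed_at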

Governance for spatial data

Spatial data has unique governance requirements:

  • Lineage tracking that includes coordinate system transformations and spatial operations
  • Access controls based on geographic boundaries or data sensitivity
  • Audit trails that track spatial queries and data extractions
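
Part of the audit story already falls out of the table format: Iceberg’s metadata tables record every commit to a table. A sketch (who-ran-which-query auditing still lives in the catalog or query engine):

-- Commit-level history: when each change landed and what kind of
-- operation produced it (append, overwrite, delete, ...)
SELECT committed_at, snapshot_id, operation
FROM landsat_scenes.snapshots

-- Which snapshot was current at any given time, useful for reproducing
-- exactly what data an earlier analysis saw
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM landsat_scenes.history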

PostGIS vs. Lakehouse architectures

This brings up a question I get frequently: when should you use PostGIS versus a lakehouse architecture for spatial data?

PostGIS excels when:

  • You need precise spatial operations and topology
  • Your datasets fit on a single powerful machine
  • You have complex spatial relationships that require referential integrity
  • You need real-time spatial queries with low latency
  • Your team is comfortable with SQL and database administration

Lakehouse architectures excel when:

  • You’re working with petabyte-scale datasets
  • You need to join spatial data with large business datasets
  • You want to leverage cloud-native tooling and elastic compute
  • You need time travel and schema evolution capabilities
  • You’re building ML pipelines that consume spatial features

PostGIS gives you a mature, full-featured spatial database with decades of spatial algorithm development. But it’s fundamentally a single-node solution that requires careful capacity planning.

Lakehouse architectures give you unlimited scale and integration with modern data tooling. But spatial capabilities are still evolving, and you may need to implement spatial operations that PostGIS provides out of the box.

The emerging pattern is using both: PostGIS for operational systems that need low-latency spatial queries, and lakehouse architectures for analytical workloads that need to process massive datasets.

The path forward

The geospatial industry is still figuring out what spatial data lakehouses should look like. The current solutions – whether from Wherobots, Databricks, or others – are early attempts that solve parts of the problem but don’t yet provide a complete solution.

What we need is an architecture that combines the best elements of data lakes and data warehouses for spatio-temporal data without forcing trade-offs between spatial precision and scale.

This means:

Native support for cloud-native geospatial formats. GeoParquet, Zarr, and COG should be first-class citizens, not afterthoughts.

Unified query interfaces that can process vector, raster, and array data in the same analytical workflows.

Spatial-aware optimizations throughout the stack, from storage layout to query planning to caching strategies.

Integration with existing geospatial tooling so you don’t have to choose between modern data architectures and mature spatial capabilities.

The organizations that figure this out first, the ones that can build production systems handling petabytes of diverse geospatial data with the reliability and governance capabilities of modern data platforms, will have a significant competitive advantage.

What’s next

Lakehouse architectures provide the foundation for scalable geospatial analytics, but they’re only as good as the query engines that can actually process the data efficiently.

In my next post, I’ll dive into the query engines and analytics tools that can consume cloud-native geospatial data at scale. How do you choose between DuckDB, Apache Sedona, Trino, and specialized geospatial query engines? What are the performance characteristics of different approaches? And how do you build analytical workflows that can actually take advantage of lakehouse capabilities for geospatial data?

The formats make it possible. The pipelines make it practical. The lakehouses make it governable. The query engines make it fast.