GeoParquet Explained: The Cloud-Native Spatial Format Transforming Modern GIS

If you have been working in GIS for a while, you probably know shapefiles and GeoJSON. But there is a new format reshaping how we store, share, and analyze spatial data. It is called GeoParquet, and unlike traditional GIS formats, it was built for the modern data stack rather than desktop mapping workflows.

GeoParquet is already powering some of the largest open spatial datasets in the world, including Overture Maps. It allows analysts, data engineers, and GIS professionals to work with massive vector datasets efficiently, stream them from cloud storage, and query them with engines such as DuckDB, Apache Sedona, Spark, and BigQuery.

If you want to handle big spatial data, run fast queries, or build pipelines that scale, GeoParquet is a format you need to understand. This guide breaks down exactly what it is, why it matters, how it works, when to use it, and how to get started.


What GeoParquet Actually Is

At its core, GeoParquet is Parquet with a defined way to store geometry. Parquet itself was created in 2013 by Twitter and Cloudera for the Hadoop ecosystem. It is a columnar storage format designed for fast analytics, compressed storage, and distributed processing.

GeoParquet takes that foundation and adds:

  • A dedicated geometry column
  • Spatial metadata, including CRS and bounding boxes
  • A consistent, open specification so different tools interpret geometry the same way

Importantly, GeoParquet is not a new GIS file type. It is Parquet extended to support geometry natively. That small change creates major benefits, because now spatial data behaves exactly like every other analytical dataset in the modern data stack.
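
To make "a defined way to store geometry" more concrete, here is a minimal sketch that uses only Python's standard library. It assumes the layout described in the GeoParquet specification: geometry values encoded as WKB in a binary column, plus a JSON document stored under the file-level metadata key "geo". The coordinates and bounding box are illustrative values, not taken from a real dataset.

```python
import json
import struct


def point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB, the bytes a geometry column holds."""
    # 1 byte for byte order (1 = little-endian), a uint32 geometry type
    # (1 = Point), then two float64 coordinates.
    return struct.pack("<BIdd", 1, 1, x, y)


# File-level metadata, serialized as JSON under the Parquet key "geo".
# Values are illustrative. "crs" is left out, which the specification
# treats as longitude/latitude (OGC:CRS84).
geo_metadata = {
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["Point"],
            "bbox": [-122.5, 37.7, -122.3, 37.9],
        }
    },
}

wkb = point_wkb(-122.4, 37.8)
geo_json = json.dumps(geo_metadata)
print(len(wkb), "bytes of WKB;", len(geo_json), "bytes of metadata")
```

Everything else in the file is ordinary Parquet, which is why generic Parquet tools can still open it and simply see the geometry as a binary column.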


Why Columnar Storage Matters for Spatial Data

Most GIS formats store data by row. Each feature is a row containing attributes and geometry. A columnar format like Parquet flips that model. It stores the values of each column (geometry, attributes, numeric values, strings) together in separate blocks.

This structure enables:

  • Reading only the columns you need
  • Far more efficient queries
  • Better compression
  • The ability to skip huge parts of the dataset based on metadata
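
The difference between the two layouts can be sketched in a few lines of plain Python. This is only an in-memory illustration with made-up features; a real Parquet reader gets the same effect by skipping bytes on disk or over the network.

```python
# Row-oriented layout: every feature keeps all of its fields together.
rows = [
    {"id": 1, "name": "park", "area_ha": 12.5, "geometry": b"<wkb>"},
    {"id": 2, "name": "lake", "area_ha": 3.25, "geometry": b"<wkb>"},
    {"id": 3, "name": "forest", "area_ha": 40.0, "geometry": b"<wkb>"},
]

# Column-oriented layout: one contiguous list of values per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Summing one attribute touches a single column; names and geometry stay unread.
total_area = sum(columns["area_ha"])
print(total_area)
```

Values of the same type sitting next to each other is also what makes compression more effective.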

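In addition, Parquet datasets are typically split into partitions (separate files or directories), and each file is split into row groups. Every row group stores min/max statistics for its columns, so query engines can skip chunks that cannot match a filter without loading them into memory. When a file also carries per-feature bounding box columns (the optional bbox covering added in GeoParquet 1.1), those same statistics give you very fast spatial filtering.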
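
Here is a rough sketch of that skipping logic in plain Python. The row groups, their extents, and the row counts are invented for illustration; a real engine reads the extents from Parquet statistics rather than from a hand-written list.

```python
from dataclasses import dataclass

BBox = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class RowGroup:
    name: str
    bbox: BBox      # extent of all features in the group, taken from statistics
    num_rows: int


def intersects(a: BBox, b: BBox) -> bool:
    """True when two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


row_groups = [
    RowGroup("rg0", (-10.0, 35.0, 0.0, 45.0), 100_000),
    RowGroup("rg1", (0.0, 45.0, 10.0, 55.0), 100_000),
    RowGroup("rg2", (100.0, -10.0, 110.0, 0.0), 100_000),
]

query = (2.0, 48.0, 3.0, 49.0)  # a small window around Paris

# Only row groups whose extent overlaps the query window need to be read.
to_read = [rg for rg in row_groups if intersects(rg.bbox, query)]
skipped = sum(rg.num_rows for rg in row_groups if rg not in to_read)
print([rg.name for rg in to_read], "rows skipped:", skipped)
```

The tighter each row group's extent, the more groups a query can rule out, which is why the order in which features are written matters so much.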

This is why GeoParquet performs so well with distributed engines and cloud-native workflows.


Why GeoParquet Matters for Modern GIS

Most GIS tools were built for local files and interactive editing. GeoParquet was built for large-scale analytics and cloud storage. That distinction changes everything.

1. You can work with massive datasets directly from the cloud

Overture Maps is one of the best examples. Millions of features can be filtered, streamed, and queried without downloading the entire dataset. Desktop GIS users can read only the features inside a bounding box rather than loading the world.

2. It enables interoperability across GIS and data engineering

GeoParquet works seamlessly with:

  • GeoPandas
  • GDAL
  • DuckDB
  • Apache Sedona
  • BigQuery
  • Snowflake
  • PyArrow

Multiple teams can work with the same dataset using completely different tools.

3. It unlocks true cloud-native spatial data lakes

Because GeoParquet can be stored in object storage and read efficiently over HTTP, it fits perfectly with data lake and lakehouse architectures. Distributed engines can push down spatial filters, skip irrelevant partitions, and process only what’s needed.
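
Reading over HTTP works well because a Parquet file keeps its index at the end: the footer metadata, followed by a 4-byte little-endian footer length and the magic bytes PAR1. A reader can fetch those last 8 bytes with a range request, work out where the footer starts, and then request only the column chunks it needs. The sketch below shows the first step with the standard library; the file size and footer length are made-up numbers, and encrypted footers are not handled.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def footer_byte_range(file_size: int, tail: bytes) -> tuple[int, int]:
    """Given the last 8 bytes of a Parquet file, return the (start, end)
    byte offsets of the footer metadata, with end exclusive."""
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a Parquet file tail")
    (footer_len,) = struct.unpack("<I", tail[:4])
    end = file_size - 8  # the footer sits just before the length and magic
    return end - footer_len, end


# Pretend tail of a 1,000,000-byte file whose footer is 4,096 bytes long.
tail = struct.pack("<I", 4096) + PARQUET_MAGIC
start, end = footer_byte_range(1_000_000, tail)
print(f"Range: bytes={start}-{end - 1}")
```

After two small requests the reader knows the schema, the row group offsets, and their statistics, without having downloaded any feature data.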

4. It is becoming the default format for vector data at scale

As more tools adopt read and write support, GeoParquet is moving from emerging standard to mainstream infrastructure.


When You Should Use GeoParquet

Not every GIS workflow requires GeoParquet. But the following situations are strong fits.

Personal Projects

If you are learning modern spatial workflows, experimenting with large datasets, or working with cloud-hosted vector data, GeoParquet is a great choice.

If you are simply digitizing a few features in desktop GIS, you probably do not need it.

One-Off Projects

If your data is already in a workable format (shapefile, GeoJSON, GeoPackage), converting just for the sake of it adds no value.

However, GeoParquet is excellent if your one-off project involves:

  • Millions of features
  • Repeated filtering
  • Storing the final dataset efficiently
  • Using modern query engines

Small Teams

GeoParquet shines when GIS and data engineering teams need to collaborate. It provides an interoperable, compressed, cloud-friendly way to store and share spatial data. This makes it ideal for analytics-heavy environments where datasets grow quickly.

Enterprise Workflows

For organizations running cloud platforms, distributed processing, and data lake architectures, GeoParquet is a strong long-term choice for vector data. It supports scaling, efficient storage, shared access, and modern compute patterns.


When You Should Not Use GeoParquet

GeoParquet is not designed for every workflow.

You should not use it for:

  • Interactive editing workflows
  • CAD or topology-heavy datasets
  • Web map tiling (use PMTiles or MVTs)
  • Traditional GIS server publishing
  • Tools that do not yet support Parquet
  • Simple CSV or shapefile projects with small datasets

If you are doing classic desktop GIS, GeoPackage or shapefiles remain perfectly fine.


How to Get Started with GeoParquet

The easiest entry point is Python.

1. Try GeoPandas

Read any GIS file and write it to GeoParquet:

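import geopandas as gpd

# Read any vector format GeoPandas supports (shapefile, GeoJSON, GeoPackage, ...)
gdf = gpd.read_file("data.shp")
# Write GeoParquet; this needs the pyarrow package installed
gdf.to_parquet("data.parquet")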

2. Try DuckDB

DuckDB can read Parquet and run fast local SQL queries. It is perfect for learning columnar patterns.
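
A minimal sketch of what that looks like from Python, assuming the duckdb package is installed. The file name and the column names (name, population) are hypothetical, and the path is interpolated directly into the SQL, which is fine for a local experiment but not for untrusted input.

```python
def build_query(path: str, columns: list[str], where: str = "") -> str:
    """Build a SELECT that reads only the listed columns from a Parquet file."""
    sql = f"SELECT {', '.join(columns)} FROM read_parquet('{path}')"
    if where:
        sql += f" WHERE {where}"
    return sql


def run_query(sql: str):
    """Run the query in DuckDB's default in-memory database and return all rows."""
    import duckdb  # third-party; imported here so the sketch loads without it

    return duckdb.sql(sql).fetchall()


sql = build_query("data.parquet", ["name", "population"], "population > 100000")
print(sql)
# rows = run_query(sql)  # uncomment once data.parquet exists with these columns
```

Because the query names only two columns, DuckDB does not need to read the geometry column at all. For geometry functions you would also load DuckDB's spatial extension; check its documentation for how your version handles GeoParquet geometry columns.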

3. Try Apache Sedona or Spark

If you want to work at scale or write large volumes of GeoParquet, distributed engines are the right fit.

4. Use cloud-native storage

Store your data in object storage (S3, GCS, Azure) and read it directly without downloading.
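
With GeoPandas this can look like the sketch below. It assumes GeoPandas 1.0 or later, where read_parquet accepts a bbox argument, and a file that was written with the bbox covering metadata; without that metadata the filter is not usable. The bucket, path, and column names are hypothetical, and reading from S3 additionally needs the s3fs package.

```python
def bbox_around(lon: float, lat: float, half_size: float) -> tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) for a square window centred on a point."""
    return (lon - half_size, lat - half_size, lon + half_size, lat + half_size)


def read_window(url: str, bbox, columns=None, storage_options=None):
    """Read only the features whose bounding box intersects `bbox`."""
    import geopandas as gpd  # third-party; needs GeoPandas 1.0+ for `bbox`

    return gpd.read_parquet(
        url, columns=columns, bbox=bbox, storage_options=storage_options
    )


window = bbox_around(10.0, 50.0, 0.5)
print(window)

# Hypothetical bucket and path, shown for illustration only:
# gdf = read_window(
#     "s3://example-bucket/places.parquet",
#     window,
#     columns=["name", "geometry"],
#     storage_options={"anon": True},  # anonymous access, passed through to s3fs
# )
```

If you write the file yourself with GeoPandas, passing write_covering_bbox=True to to_parquet adds the bounding box column this filter relies on.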

5. Use QGIS

QGIS can read GeoParquet through GDAL's Parquet driver, provided your QGIS build ships a GDAL compiled with that driver. Where it does, you can add a .parquet file like any other vector layer.


Common Pitfalls and How to Avoid Them

1. Not all tools can write GeoParquet yet

Some tools only support reading. Check tool documentation before building a pipeline.

2. Partitioning matters

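Spatially partitioning and sorting your data can significantly improve query performance. When features that are close on the map land in the same files and row groups, their bounding boxes stay tight and engines can skip more data. Poor partitioning, such as rows in arbitrary order or thousands of tiny files, can leave a spatial filter reading almost everything.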
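
One common way to get that locality, offered here as an illustration rather than the only option, is to sort features along a space-filling curve before writing. The sketch below computes a Z-order (Morton) key from a point such as a feature's centroid. The city coordinates are approximate, and many tools prefer a Hilbert curve, which follows the same idea.

```python
def morton_key(x: float, y: float, bounds, bits: int = 16) -> int:
    """Interleave the bits of a point's grid cell to get a Z-order key."""
    xmin, ymin, xmax, ymax = bounds
    cells = (1 << bits) - 1
    # Snap the point to a grid of 2**bits by 2**bits cells over the bounds.
    col = int((x - xmin) / (xmax - xmin) * cells)
    row = int((y - ymin) / (ymax - ymin) * cells)
    key = 0
    for i in range(bits):
        key |= ((col >> i) & 1) << (2 * i)      # x bits go to even positions
        key |= ((row >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return key


WORLD = (-180.0, -90.0, 180.0, 90.0)
points = {
    "Paris": (2.35, 48.86),
    "Sydney": (151.21, -33.87),
    "Brussels": (4.35, 50.85),
    "Melbourne": (144.96, -37.81),
}

# Sorting by the key puts nearby places next to each other in the output.
ordered = sorted(points, key=lambda name: morton_key(*points[name], WORLD))
print(ordered)
```

Writing rows in this order means the two Australian cities share a row group and the two European cities share another, instead of every row group spanning half the globe.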

3. Metadata matters

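Bounding boxes must be written for engines to skip data, and the CRS and geometry metadata must be correct for tools to interpret your geometry properly. Missing or wrong metadata costs you either performance or correctness.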
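
A quick sanity check can catch the most common gaps. The sketch below inspects the raw "geo" metadata value and reports missing keys, based on the fields the GeoParquet specification marks as required. How you obtain the raw bytes depends on your tooling; with pyarrow it is typically pyarrow.parquet.read_schema(path).metadata[b"geo"]. The two sample documents are invented for illustration.

```python
import json

REQUIRED_FILE_KEYS = ("version", "primary_column", "columns")
REQUIRED_COLUMN_KEYS = ("encoding", "geometry_types")


def check_geo_metadata(raw: bytes) -> list[str]:
    """Return a list of problems found in a GeoParquet "geo" metadata value."""
    problems = []
    meta = json.loads(raw)
    for key in REQUIRED_FILE_KEYS:
        if key not in meta:
            problems.append(f"missing file-level key: {key}")
    columns = meta.get("columns", {})
    primary = meta.get("primary_column")
    if primary is not None and primary not in columns:
        problems.append(f"primary column {primary!r} not described in 'columns'")
    for name, col in columns.items():
        for key in REQUIRED_COLUMN_KEYS:
            if key not in col:
                problems.append(f"column {name!r} missing key: {key}")
        if "bbox" not in col:
            # bbox is optional in the spec, but engines use it for filtering.
            problems.append(f"column {name!r} has no bbox")
    return problems


good = json.dumps({
    "version": "1.1.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["Polygon"],
                             "bbox": [0, 0, 1, 1]}},
}).encode()
bad = json.dumps({
    "version": "1.1.0",
    "columns": {"geometry": {"encoding": "WKB"}},
}).encode()

print(check_geo_metadata(good))
print(check_geo_metadata(bad))
```

This does not validate the CRS itself or the geometry bytes, so treat it as a first pass rather than a full conformance check.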

4. Query engines behave differently

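Sedona, DuckDB, Spark, and warehouse engines each optimize queries differently. Writing efficient files means knowing how your engine reads them: the row group size, sort order, and partition layout that suit one engine may not suit another.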


Final Takeaways

GeoParquet is more than a new file format. It represents a shift in how GIS interacts with the modern data ecosystem. It enables scalable spatial analysis, cloud-native processing, and shared workflows between GIS teams, analysts, and data engineers.

You do not have to replace your existing formats to get started. But understanding GeoParquet prepares you for where the industry is moving. As datasets get larger and organizations adopt more cloud-native patterns, GeoParquet becomes an essential part of the spatial data stack.

If you want to explore this further, you can check the official GeoParquet specification, cloud-native geospatial guides, or my courses where we walk step-by-step through building full spatial pipelines.