How to Run Scalable Geospatial Analysis with Apache Sedona – Right From Your Laptop

I spend a lot of time talking about scaling geospatial analysis—massive datasets, remote sensing archives, distributed computation—but sometimes the best way to start is right from your laptop.

In this tutorial, we’ll set up a local Apache Spark environment using Apache Sedona, a powerful extension for scalable geospatial processing. We’ll connect to remote cloud-hosted data, run spatial SQL functions, and explore the power of distributed computing—even if you’re just on your MacBook.

If you’ve been struggling to scale your GeoPandas workflows, or if your laptop keeps choking on large vector files, this is your off-ramp to something better.

Before getting started, you can access the code and notebook in this repository:


🧱 Why Apache Sedona?

GeoPandas is great—until it isn’t. If you’ve tried spatial joins on millions of geometries or attempted to run raster ops on huge datasets, you’ve probably hit a wall.

Apache Sedona is a distributed geospatial analytics engine built on top of Apache Spark. It brings spatial indexing, geometry processing, and spatial SQL to the world of parallel computing. And best of all? You can get started locally.

With Sedona, you can:

  • Query GeoParquet and GeoTIFF data, or read GeoJSON, Shapefiles, NetCDF, and GeoPackages at scale
  • Run spatial joins across hundreds of millions of rows
  • Integrate with cloud data lakes via S3-compatible endpoints

Let’s get into it.

⚙️ Step 1: Local Setup (Java, Sedona, and Spark Without the Pain)

Setting up Java and PySpark locally can be…tricky. SDKMAN! makes it easier, and it’s the method I recommend.

🧪 Option 1: Python and SDKMAN! (Recommended)

Start here if you want a clean, fast workflow that mimics how this runs in production.

Step 1: Install SDKMAN!

curl -s "https://get.sdkman.io" | bash

Follow the instructions to restart your terminal, then:

sdk install java 17.0.13-zulu

Confirm it worked:

java -version
#> openjdk version "17.0.13" ...
echo $JAVA_HOME
#> /Users/<yourname>/.sdkman/candidates/java/current

Step 2: Install Python Dependencies

pip install "apache-sedona[spark]" geopandas setuptools

Or, for more precision:

pip install pyspark==3.5.3 apache-sedona==1.7.0 ipykernel pytest chispa geopandas
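
Either way, a quick import check confirms the pieces are in place. This is just a minimal smoke test, nothing Sedona-specific runs yet:

# Smoke test: these imports should all succeed after the install above
import pyspark
import geopandas
from sedona.spark import SedonaContext  # raises ImportError if Sedona is missing

print("pyspark:", pyspark.__version__)      # e.g. 3.5.3 if you pinned it above
print("geopandas:", geopandas.__version__)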

🐍 Option 2: Conda (Good for Isolated Environments)

conda create --name sedonafun
conda activate sedonafun
conda install -c conda-forge pyspark=3.5.3 apache-sedona=1.7.0 ipykernel pytest pip
pip install chispa

⚡ Option 3: Using uv for Python Projects (Fast and Reproducible)

curl -LsSf https://astral.sh/uv/install.sh | sh

You may also need to install GeoPandas, in which case:

uv pip install geopandas

Inside a Sedona project folder:

uv run pytest tests
uv run ipython kernel install --user --name=sedonaexamples
uv run --with jupyter jupyter lab

🧪 Optional: Clone the Sedona Examples Repo

git clone https://github.com/MrPowers/sedona-examples.git
cd sedona-examples

This is a great starting point to play with test cases and notebook samples.


🛰️ Step 2: Connecting to Remote Cloud Data

Now that Spark and Sedona are running, let’s connect to a remote S3-compatible bucket. This simulates real-world workflows where your data lives in a data lake, not on disk.

from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()

    # Pull in the JAR (Java ARchive) packages Sedona and Hadoop need
    .config(
        "spark.jars.packages",
        ",".join([
            "org.apache.sedona:sedona-spark-3.5_2.12:1.7.0",
            "org.datasyslab:geotools-wrapper:1.7.0-28.5",
            "org.apache.hadoop:hadoop-aws:3.3.2"
        ])
    )
    .config("spark.jars.repositories", "https://artifacts.unidata.ucar.edu/repository/unidata-all")

    # Connect to remote data on Source Cooperative - you will need to sign up
    # for an account and get your access and secret keys
    .config("spark.hadoop.fs.s3a.endpoint", "https://data.source.coop")
    .config("spark.hadoop.fs.s3a.access.key", "SOURCE_COOP_S3_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SOURCE_COOP_S3_SECRET_KEY")

    # Enable S3A access over HTTPS with path-style addressing
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    # You can add this if you want to use public S3 data
    # .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    #         "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

    # Even locally, give Spark plenty of memory and keep shuffle partitions small
    .config("spark.executor.memory", "12G")
    .config("spark.driver.memory", "12G")
    .config("spark.sql.shuffle.partitions", "2")
    .getOrCreate()
)

sedona = SedonaContext.create(config)
sedona.sparkContext.setLogLevel("ERROR")

Here’s what’s happening:

  • Spark JARs: We pull in Sedona and Hadoop AWS packages so we can handle spatial ops and remote files.
  • S3 endpoint: We’re connecting to a custom HTTPS-based S3-compatible endpoint, not AWS. Sedona doesn’t care—it just needs to know how to talk to it.
  • Memory tuning: Even locally, bumping driver/executor memory can help when working with larger datasets.
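
Before moving on, it's worth a quick sanity check that the S3A wiring works. A minimal sketch, assuming your Source Cooperative credentials are in place, is to point Spark at the Citi Bike GeoParquet we'll use later in this post and inspect only its schema:

# Sanity check: read the schema of a remote Parquet dataset (no full scan is triggered)
test_df = sedona.read.format("parquet") \
    .load("s3a://zluo43/citibike/new_schema_combined_with_geom.parquet/*/*/*.parquet")
test_df.printSchema()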

💡 Pro Tip: Spark isn’t just for big clusters. Running it locally gives you access to its optimizer, lazy evaluation, and multi-threaded parallelism. That alone can supercharge your workflows.
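
To see that laziness for yourself, here's a small sketch: transformations return immediately, explain() prints the optimized plan, and nothing actually executes until an action like show() runs.

# Transformations are lazy: building this query plan returns instantly
df = sedona.sql("SELECT ST_Point(1.0, 2.0) AS geom")
buffered = df.selectExpr("ST_Buffer(geom, 0.1) AS geom_buffered")

buffered.explain()   # prints the optimized physical plan; still no execution
buffered.show()      # an action finally triggers the computation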


🛰️ Step 3: Why Remote Storage Matters (Even Locally)

When you’re building scalable pipelines, remote storage is the default. Your data is rarely on the same machine as your compute environment, and that’s by design.

In our setup, we connect to a bucket hosted at https://data.source.coop, which isn’t AWS but is fully S3-compatible. Spark can read from these kinds of sources natively as long as it’s configured properly.

This setup mimics what you’d do in the cloud: decouple compute from storage, scale them independently, and process where it makes sense.


🧮 Step 4: Running Your First Spatial SQL Query

Once Sedona is set up, you can run geospatial SQL just like you would in PostGIS:

sql = """
SELECT ST_AreaSpheroid(
    ST_GeomFromWKT('Polygon ((34 35, 28 30, 25 34, 34 35))')
) as result
"""

sedona.sql(sql).show(truncate=False)

This returns the geodesic area of the polygon in square meters, computed on the WGS84 spheroid, with Sedona’s spatial SQL engine doing the work under the hood.
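
For contrast, plain ST_Area computes the planar area in the units of the input coordinates (square degrees for lon/lat data), which is rarely what you want here. A quick sketch of the same polygon:

# Planar area in coordinate units (square degrees for lon/lat input)
sedona.sql("""
SELECT ST_Area(
    ST_GeomFromWKT('Polygon ((34 35, 28 30, 25 34, 34 35))')
) AS planar_area
""").show(truncate=False)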

You can just as easily use:

  • ST_Contains
  • ST_Intersects
  • ST_Distance
  • ST_Transform

The list of spatial functions Sedona supports is extensive, and for many, it’s a direct lift-and-shift from PostGIS.
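
To give a flavor, here is a small illustrative query (with made-up coordinates) exercising a few of the functions above:

# Illustrative only: a square contains its center, two diagonals cross,
# and the 3-4-5 triangle gives a planar distance of 5.0
sedona.sql("""
SELECT
  ST_Contains(ST_GeomFromWKT('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'),
              ST_Point(5.0, 5.0)) AS contains_point,
  ST_Intersects(ST_GeomFromWKT('LINESTRING (0 0, 10 10)'),
                ST_GeomFromWKT('LINESTRING (0 10, 10 0)')) AS lines_cross,
  ST_Distance(ST_Point(0.0, 0.0), ST_Point(3.0, 4.0)) AS planar_distance
""").show(truncate=False)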


🛠️ Why Sedona + Spark Makes Sense

If you’re coming from a GIS background, here’s why Apache Sedona is worth learning:

Feature                      | GeoPandas | Sedona + Spark
Handles large datasets       | ⚠️        | ✅
Runs on distributed clusters | ❌        | ✅
Reads from S3/cloud storage  | ❌        | ✅
Supports spatial SQL         | ❌        | ✅
Indexing & partitioning      | ⚠️        | ✅

Even on a single machine, Spark handles memory more gracefully than most desktop tools. Add spatial indexing and parallel ops from Sedona, and you’ve got a powerful setup.


Final Project: Spatial Join Between Citi Bike Trips and NYC Neighborhoods

Let’s finish this tutorial with something practical: a spatial join between 53 million Citi Bike trips and NYC neighborhood boundaries.

Why does this matter? Because this is the kind of analysis that crashes your machine with GeoPandas—but runs cleanly with Sedona + Spark, even locally.

🔍 The Data

  • Bike Trips: Stored in GeoParquet on a remote S3 bucket
  • Neighborhood Boundaries: Local GeoParquet file (could be remote too!)

# Read Citi Bike trips from the remote S3 bucket
bikes = sedona.read.format('parquet') \
    .load('s3a://zluo43/citibike/new_schema_combined_with_geom.parquet/*/*/*.parquet')

# Read NYC neighborhood geometries from a local GeoParquet file
neighborhoods = sedona.read.format('geoparquet') \
    .load('custom-pedia-cities-nyc-Mar2018.parquet')

# Register both DataFrames as temporary views so the SQL below can reference them
bikes.createOrReplaceTempView('bikes')
neighborhoods.createOrReplaceTempView('neighborhoods')

At this point, both datasets are registered as temporary views in Spark SQL. Now we can run a spatial join using ST_Contains.

# Execute the spatial join in Spark SQL
data = sedona.sql("""
select count(b.ride_id) as rides, n.neighborhood, n.geometry
from neighborhoods n
join bikes b on st_contains(n.geometry, st_geomfromwkb(b.start_geom))
where n.geometry is not null
and b.start_geom is not null
group by n.neighborhood, n.geometry
""")

Even on a local machine, this runs efficiently because:

  • Sedona indexes geometries under the hood
  • Spark parallelizes the join operation across partitions
  • We’re filtering nulls early to avoid wasted computation
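
One thing to keep in mind: Spark is lazy, so the query above hasn't actually executed yet. A small sketch of how you might trigger it and persist the result (the output path here is just an example):

# An action triggers the distributed join; sort to see the busiest neighborhoods first
data.orderBy("rides", ascending=False).show(10, truncate=False)

# Persist the aggregated result as GeoParquet (example output path)
data.write.format("geoparquet").mode("overwrite").save("citibike_rides_by_neighborhood.parquet")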

🌐 What This Teaches You

This single example touches on multiple skills:

✅ Reading remote data from cloud storage

✅ Performing spatial joins in SQL

✅ Working with Parquet and GeoParquet efficiently

✅ Using Spark as a local development environment that scales

And if you’re thinking “Can this scale beyond my laptop?”—absolutely. The exact same code can run on a cloud cluster, a Wherobots environment, or even in a containerized Airflow DAG.


📦 Wrap-Up: Local Spark, Global Scale

So what did we build?

  • A local Apache Spark + Sedona dev environment
  • Cloud storage integration via S3-compatible endpoints
  • Spatial SQL queries that replicate PostGIS workflows
  • A real-world join across millions of records and complex geometries

And we did it all without setting up a cluster or writing low-level geometry code.

This is the modern geospatial workflow I wish I had years ago.


Final Thoughts

This notebook may look simple, but it introduces powerful ideas: scalable spatial analysis, remote data integration, and the flexibility of running Spark locally.

You’re not just running a polygon area function; you’re prepping for a future where terabytes of spatial data live in data lakes and your local dev environment mirrors the cloud.


Want to see how this scales? In a future video, I’ll show you how to:

  • Read GeoParquet directly from remote storage
  • Run distributed spatial joins on millions of records
  • Push your Sedona code to a Wherobots environment