How to Run Scalable Geospatial Analysis with Apache Sedona – Right From Your Laptop

I spend a lot of time talking about scaling geospatial analysis—massive datasets, remote sensing archives, distributed computation—but sometimes the best way to start is right from your laptop.

In this tutorial, we’ll set up a local Apache Spark environment using Apache Sedona, a powerful extension for scalable geospatial processing. We’ll connect to remote cloud-hosted data, run spatial SQL functions, and explore the power of distributed computing—even if you’re just on your MacBook.

If you’ve been struggling to scale your GeoPandas workflows, or if your laptop keeps choking on large vector files, this is your off-ramp to something better.

Before getting started, you can access the code and notebook in this repository:


🧱 Why Apache Sedona?

GeoPandas is great—until it isn’t. If you’ve tried spatial joins on millions of geometries or attempted to run raster ops on huge datasets, you’ve probably hit a wall.

Apache Sedona is a distributed geospatial analytics engine built on top of Apache Spark. It brings spatial indexing, geometry processing, and spatial SQL to the world of parallel computing. And best of all? You can get started locally.

With Sedona, you can:

  • Query GeoParquet and GeoTIFF data, or read GeoJSON, Shapefiles, NetCDF, and GeoPackages at scale
  • Run spatial joins across hundreds of millions of rows
  • Integrate with cloud data lakes via S3-compatible endpoints

Let’s get into it.

⚙️ Step 1: Local Setup (Java, Sedona, and Spark Without the Pain)

Setting up Java and PySpark locally can be…tricky. SDKMAN! makes it easier, and it’s the method I recommend.

🧪 Option 1: Python and SDKMAN! (Recommended)

Start here if you want a clean, fast workflow that mimics how this runs in production.

Step 1: Install SDKMAN!

curl -s "https://get.sdkman.io" | bash

Follow the instructions to restart your terminal, then:

sdk install java 17.0.13-zulu

Confirm it worked:

java -version
#> openjdk version "17.0.13" ...
echo $JAVA_HOME
#> /Users/<yourname>/.sdkman/candidates/java/current

Step 2: Install Python Dependencies

pip install "apache-sedona[spark]" geopandas setuptools

Or, for more precision:

pip install pyspark==3.5.3 apache-sedona==1.7.0 ipykernel pytest chispa geopandas
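
Either way, a quick import check confirms the pieces are in place. This is just a minimal smoke test, nothing Sedona-specific runs yet:

# Smoke test: these imports should all succeed after the install above
import pyspark
import geopandas
from sedona.spark import SedonaContext  # raises ImportError if Sedona is missing

print("pyspark:", pyspark.__version__)      # e.g. 3.5.3 if you pinned it above
print("geopandas:", geopandas.__version__)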

🐍 Option 2: Conda (Good for Isolated Environments)

conda create --name sedonafun
conda activate sedonafun
conda install -c conda-forge pyspark=3.5.3 apache-sedona=1.7.0 ipykernel pytest pip
pip install chispa

⚡ Option 3: Using uv for Python Projects (Fast and Reproducible)

curl -LsSf https://astral.sh/uv/install.sh | sh

You may also need to install GeoPandas, in which case:

uv pip install geopandas

Inside a Sedona project folder:

uv run pytest tests
uv run ipython kernel install --user --name=sedonaexamples
uv run --with jupyter jupyter lab

🧪 Optional: Clone the Sedona Examples Repo

git clone https://github.com/MrPowers/sedona-examples.git
cd sedona-examples

This is a great starting point to play with test cases and notebook samples.


🛰️ Step 2: Connecting to Remote Cloud Data

Now that Spark and Sedona are running, let’s connect to a remote S3-compatible bucket. This simulates real-world workflows where your data lives in a data lake, not on disk.

from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()

    # Pull in the JAR (Java ARchive) packages Sedona and Hadoop need
    .config(
        "spark.jars.packages",
        ",".join([
            "org.apache.sedona:sedona-spark-3.5_2.12:1.7.0",
            "org.datasyslab:geotools-wrapper:1.7.0-28.5",
            "org.apache.hadoop:hadoop-aws:3.3.2"
        ])
    )
    .config("spark.jars.repositories", "https://artifacts.unidata.ucar.edu/repository/unidata-all")

    # Connect to remote data on Source Cooperative - you will need to sign up
    # for an account and get your access and secret keys
    .config("spark.hadoop.fs.s3a.endpoint", "https://data.source.coop")
    .config("spark.hadoop.fs.s3a.access.key", "SOURCE_COOP_S3_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SOURCE_COOP_S3_SECRET_KEY")

    # Enable S3A access over HTTPS with path-style addressing
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "true")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

    # You can add this if you want to use public S3 data
    # .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    #         "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

    # Even locally, give Spark plenty of memory and keep shuffle partitions small
    .config("spark.executor.memory", "12G")
    .config("spark.driver.memory", "12G")
    .config("spark.sql.shuffle.partitions", "2")
    .getOrCreate()
)

sedona = SedonaContext.create(config)
sedona.sparkContext.setLogLevel("ERROR")

Here’s what’s happening:

  • Spark JARs: We pull in Sedona and Hadoop AWS packages so we can handle spatial ops and remote files.
  • S3 endpoint: We’re connecting to a custom HTTPS-based S3-compatible endpoint, not AWS. Sedona doesn’t care—it just needs to know how to talk to it.
  • Memory tuning: Even locally, bumping driver/executor memory can help when working with larger datasets.
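
Before moving on, it's worth a quick sanity check that the S3A wiring works. A minimal sketch, assuming your Source Cooperative credentials are in place, is to point Spark at the Citi Bike GeoParquet we'll use later in this post and inspect only its schema:

# Sanity check: read the schema of a remote Parquet dataset (no full scan is triggered)
test_df = sedona.read.format("parquet") \
    .load("s3a://zluo43/citibike/new_schema_combined_with_geom.parquet/*/*/*.parquet")
test_df.printSchema()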

💡 Pro Tip: Spark isn’t just for big clusters. Running it locally gives you access to its optimizer, lazy evaluation, and multi-threaded parallelism. That alone can supercharge your workflows.
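
To see that laziness for yourself, here's a small sketch: transformations return immediately, explain() prints the optimized plan, and nothing actually executes until an action like show() runs.

# Transformations are lazy: building this query plan returns instantly
df = sedona.sql("SELECT ST_Point(1.0, 2.0) AS geom")
buffered = df.selectExpr("ST_Buffer(geom, 0.1) AS geom_buffered")

buffered.explain()   # prints the optimized physical plan; still no execution
buffered.show()      # an action finally triggers the computation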


🛰️ Step 3: Why Remote Storage Matters (Even Locally)

When you’re building scalable pipelines, remote storage is the default. Your data is rarely on the same machine as your compute environment, and that’s by design.

In our setup, we connect to a bucket hosted at https://data.source.coop, which isn’t AWS but is fully S3-compatible. Spark can read from these kinds of sources natively as long as it’s configured properly.

This setup mimics what you’d do in the cloud: decouple compute from storage, scale them independently, and process where it makes sense.


🧮 Step 4: Running Your First Spatial SQL Query

Once Sedona is set up, you can run geospatial SQL just like you would in PostGIS:

sql = """
SELECT ST_AreaSpheroid(
    ST_GeomFromWKT('Polygon ((34 35, 28 30, 25 34, 34 35))')
) as result
"""

sedona.sql(sql).show(truncate=False)

This returns the geodesic area of the polygon in square meters, computed on the WGS84 spheroid, with Sedona’s spatial SQL engine doing the work under the hood.
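
For contrast, plain ST_Area computes the planar area in the units of the input coordinates (square degrees for lon/lat data), which is rarely what you want here. A quick sketch of the same polygon:

# Planar area in coordinate units (square degrees for lon/lat input)
sedona.sql("""
SELECT ST_Area(
    ST_GeomFromWKT('Polygon ((34 35, 28 30, 25 34, 34 35))')
) AS planar_area
""").show(truncate=False)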

You can just as easily use:

  • ST_Contains
  • ST_Intersects
  • ST_Distance
  • ST_Transform

The list of spatial functions Sedona supports is extensive, and for many, it’s a direct lift-and-shift from PostGIS.
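
To give a flavor, here is a small illustrative query (with made-up coordinates) exercising a few of the functions above:

# Illustrative only: a square contains its center, two diagonals cross,
# and the 3-4-5 triangle gives a planar distance of 5.0
sedona.sql("""
SELECT
  ST_Contains(ST_GeomFromWKT('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))'),
              ST_Point(5.0, 5.0)) AS contains_point,
  ST_Intersects(ST_GeomFromWKT('LINESTRING (0 0, 10 10)'),
                ST_GeomFromWKT('LINESTRING (0 10, 10 0)')) AS lines_cross,
  ST_Distance(ST_Point(0.0, 0.0), ST_Point(3.0, 4.0)) AS planar_distance
""").show(truncate=False)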


🛠️ Why Sedona + Spark Makes Sense

If you’re coming from a GIS background, here’s why Apache Sedona is worth learning:

Feature                      | GeoPandas | Sedona + Spark
Handles large datasets       | ⚠️        | ✅
Runs on distributed clusters | ❌        | ✅
Reads from S3/cloud storage  | ❌        | ✅
Supports spatial SQL         | ❌        | ✅
Indexing & partitioning      | ⚠️        | ✅

Even on a single machine, Spark handles memory more gracefully than most desktop tools. Add spatial indexing and parallel ops from Sedona, and you’ve got a powerful setup.


Final Project: Spatial Join Between Citi Bike Trips and NYC Neighborhoods

Let’s finish this tutorial with something practical: a spatial join between 53 million Citi Bike trips and NYC neighborhood boundaries.

Why does this matter? Because this is the kind of analysis that crashes your machine with GeoPandas—but runs cleanly with Sedona + Spark, even locally.

🔍 The Data

  • Bike Trips: Stored in GeoParquet on a remote S3 bucket
  • Neighborhood Boundaries: Local GeoParquet file (could be remote too!)

# Read Citi Bike trips from the remote S3 bucket
bikes = sedona.read.format('parquet') \
    .load('s3a://zluo43/citibike/new_schema_combined_with_geom.parquet/*/*/*.parquet')

# Read NYC neighborhood geometries from a local GeoParquet file
neighborhoods = sedona.read.format('geoparquet') \
    .load('custom-pedia-cities-nyc-Mar2018.parquet')

# Register both DataFrames as temporary views so the SQL below can reference them
bikes.createOrReplaceTempView('bikes')
neighborhoods.createOrReplaceTempView('neighborhoods')

At this point, both datasets are registered as temporary views in Spark SQL. Now we can run a spatial join using ST_Contains.

# Execute the spatial join in Spark SQL
data = sedona.sql("""
select count(b.ride_id) as rides, n.neighborhood, n.geometry
from neighborhoods n
join bikes b on st_contains(n.geometry, st_geomfromwkb(b.start_geom))
where n.geometry is not null
and b.start_geom is not null
group by n.neighborhood, n.geometry
""")

Even on a local machine, this runs efficiently because:

  • Sedona indexes geometries under the hood
  • Spark parallelizes the join operation across partitions
  • We’re filtering nulls early to avoid wasted computation
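
One thing to keep in mind: Spark is lazy, so the query above hasn't actually executed yet. A small sketch of how you might trigger it and persist the result (the output path here is just an example):

# An action triggers the distributed join; sort to see the busiest neighborhoods first
data.orderBy("rides", ascending=False).show(10, truncate=False)

# Persist the aggregated result as GeoParquet (example output path)
data.write.format("geoparquet").mode("overwrite").save("citibike_rides_by_neighborhood.parquet")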

🌐 What This Teaches You

This single example touches on multiple skills:

✅ Reading remote data from cloud storage

✅ Performing spatial joins in SQL

✅ Working with Parquet and GeoParquet efficiently

✅ Using Spark as a local development environment that scales

And if you’re thinking “Can this scale beyond my laptop?”—absolutely. The exact same code can run on a cloud cluster, a Wherobots environment, or even in a containerized Airflow DAG.


📦 Wrap-Up: Local Spark, Global Scale

So what did we build?

  • A local Apache Spark + Sedona dev environment
  • Cloud storage integration via S3-compatible endpoints
  • Spatial SQL queries that replicate PostGIS workflows
  • A real-world join across millions of records and complex geometries

And we did it all without setting up a cluster or writing low-level geometry code.

This is the modern geospatial workflow I wish I had years ago.


Final Thoughts

This notebook may look simple, but it introduces powerful ideas: scalable spatial analysis, remote data integration, and the flexibility of running Spark locally.

You’re not just running a polygon area function; you’re prepping for a future where terabytes of spatial data live in data lakes and your local dev environment mirrors the cloud.


Want to see how this scales? In a future video, I’ll show you how to:

  • Read GeoParquet directly from remote storage
  • Run distributed spatial joins on millions of records
  • Push your Sedona code to a Wherobots environment