One of the top questions I get asked is what my favorite resources are for learning any given topic, and spatial SQL is always near the top of that list
It’s no secret that I am a big fan of spatial SQL (see my past posts on the topic). It’s speed, scale, and ability to analyze data is incredible and it has helped me solve complex problems that I never could have imagined before. All while giving me complete flexibility over my workflow.
One of the challenges is that when you are learning a new toolkit in your geospatial stack, not only do you need to understand the specific geospatial components, but you also need to understand the details of the tool or language just like everyone else. So with that in mind, I put together some of my favorite resources to help you learn spatial SQL and SQL in general as well.
Even with all these resources, there is no substitute for working with these languages hands-on, trial and error, and applying them to your own work. I hope you find these helpful and encourage you to get out there and try it yourself!
So, why should I even learn spatial SQL?
1. “Spatial SQL – GIS without the GIS” – The MapScaping Podcast with Paul Ramsey
If you are curious about why you should learn spatial SQL, what you can do with it, and how you can use it with just about any geospatial stack, start here. The MapScaping Podcast, which is a must for anyone in geospatial, dives deep into spatial SQL with Paul Ramsey, one of the creators of PostGIS.
2. “It’s not called Big Data anymore, it’s just data” – Location Matters Podcast with Chad Jennings and Matt Forrest
At the end of 2020, I had the opportunity to join Chad Jennings (Google BigQuery Product Manager), Dion Fleming (GIS Analyst at NGIS), and Sarah Butler (Marketing Manager at NGIS) to discuss spatial SQL in BigQuery and how SQL can be a common spatial language, and how you can use the power of data warehouses to scale to massive amounts of data.
3. “Why should you care about PostGIS? — A gentle introduction to spatial databases” – Topi Tjukanov
This is a great article that talks about PostGIS and why you should use it, the speed, data visualization, and how to get started.
4. “The case for spatial SQL” – Matt Forrest
This post ended up kicking off a whole series of posts on the topic of spatial SQL, and it summarizes my thoughts and experiences with spatial SQL.
Okay, so how can I get started?
To get started you will need to set up or start using a spatial database. And for that, you have a few different choices, but my recommendations would be either PostGIS or a spatially enabled data warehouse. Here are some resources to get started on either.
A quick note about choosing a database
Personally, the two choices I would recommend would be PostGIS or BigQuery. There are varied and valid reasons for choosing one over another, but both will provide you with great spatial functionality and performance.
I was primarily a PostGIS user for many years, and only recently started using BigQuery, namely because the size of the data I was working with made more sense for BigQuery. This is not to say that I think it’s the only route, as PostGIS is fast and functional, and of course free and open source.
Some reasons for choosing PostGIS might be:
- It is completely free and open source
- If you are used to a desktop interface like QGIS, PostGIS has some great options to work with QGIS
- You want to build and maintain your database locally on your machine
- You want to use a cloud based service to host your database
- You want a rich set of spatial functions that cover many different use cases (geometry management, spatial relationships, transformations, etc.)
- If you work with 3D data
- Using other open source tools to manage and query against your database
- You want to build map tiles from your database with PostGIS functions with accompanying API services LINK
- You want to tap into a large community of users around the world
Some reasons for choosing BigQuery might be:
- You are using very large data (generally large file sizes and number of rows)
- Want to take advantage of serverless computing (low storage cost, and pay as you query)
- Comfortable with a pay as you go model
- You also want accompying cloud services (storage, VMs, server storage, Apache Beam, etc.)
- Use machine learning models directly in the data warehouse in SQL
- Use cloud based tools to ingest geospatial data into BigQuery
- Have preloaded public datasets to start working with
- You want to work with a UI from the start
Two of the most popular open-source geospatial tools, together. This video will walk you through using a PostGIS database with QGIS. You will need to have a PostGIS database already, but keep going for some options for setting one up.
5. Using a spatially enabled data warehouse
The next option is to use a data warehouse. There are lots of definitions of what’s data warehouse is but generally this article from Google Cloud does a good job of explaining it.
For our purposes, data warehouses let us query and analyze spatial data, but at much larger scales than a traditional database. For example, we could do a spatial join on millions of rows, in just a few seconds.
- You are working with large data
- Use or want to use a cloud provider
- Want a solution with preloaded data
There are several data warehouse options for spatial SQL including BigQuery (Google), Redshift (AWS), Snowflake, Azure, and others.
My personal recommendation for getting started would be BigQuery. It comes with a generous free tier of 10GB of monthly storage and 1TB of query compute. You can manage and set limits in your account so you won’t exceed the free tier.
To get started with BigQuery here are four great videos and a podcast to get you going:
6. Getting started with spatial SQL
I also put together this guide to help you get started with some basic SQL statements and resources for taking your first steps with SQL and spatial SQL.
7. Setting up PostGIS
PostGIS is a great option if you don’t have massive scale needs (and PostGIS can certainly scale as needed) and you want more functions for managing geometries, 3D, spatial relationships, and more.
There are many different options for setting up PostGIS:
- PostGIS Install Instructions
- Set up a local database with Postgres.app and query/load data using Postico for Mac users
- Use the Docker image of PostGIS
- Video walkthrough using the Windows installer
Adding data to your database
So you have set up your database or are ready to start using your data warehouse. First thing you will need is some data! Getting data into your database isn’t as simple is opening a file and dragging it into your desktop app, but it is easy enough with a few simple steps.
8. Loading data to BigQuery
There are a few different options for loading data into BigQuery:
First, you can use the CSV loader for BigQuery to load a CSV, and translate your spatial data using GDAL’s ogr2ogr command to change basically any file type to CSV. This post outlines the exact steps to go through.
Next, you can load GeoJSON data directly into BigQuery using their command-line tool:
9. Loading data to PostGIS – Crunchy Data
There are a lot of different tools to load data into a PostGIS instance, and this post from Crunchy Data does a great job of summarizing them. There are tools for those who prefer the command line, those who want to use desktop tools (QGIS), and more.
Read the docs!
The number one thing that is going to make you self-sufficient in spatial SQL, or really any programming language, is learning to use the docs. PostGIS and BigQuery are both excellently documented and have clear docs, many times with illustrations, and sample queries. When you start out learning spatial SQL, there are lots of great resources (yes Stack Overflow can be a massive help and you should use it), but as you progress using the docs will become your go to more and more.
If you are new to using docs they can sometimes be a bit overwhelming since the examples may seem like they don’t immediately apply to your use case or day-to-day work. But being able to search, navigate, and use docs is critical.
A few tips for using the docs:
First, read what the function does in detail
The first thing to do when looking at the docs is to see what the function does and also what it returns. For this example we can take a look at ST_Intersects in PostGIS:
As we can see, the function ST_Intersects accepts two arguments, two geometries or geographies, and returns a boolean if they intersect.
Always make sure to look at the arguments, their order, and the data types for each argument. Also, focus on what the function returns. A function could return a number, string, boolean, geometry, or more.
Look at any notes or caveats
These sections will help you see if there are any gotchas or things to watch out for. In this case, we can see that the function can make use of spatial indices and that there is a very small distance tolerance for intersections.
Check the examples and similar functions
This type of information can help you see what other functions you should be interested in that are similar or related to what you are currently looking at.
Make sure to review the examples as well. Oftentimes they are greatly oversimplified but they will provide the most basic example to show how the function can be used.
11. PostGIS Docs
The PostGIS docs are incredibly detailed and have lots of information for developers, DBAs, analysts, engineers, and more. In my opinion, once you learn to navigate the docs and what is relevant for you, they have the most detailed information related to spatial SQL.
Additionally almost every other spatial SQL tool uses the same or similar function structure to PostGIS so regardless of the spatial database you are using, PostGIS docs are still relevant.
The examples above are from the PostGIS database but here are a few specific sections I would make sure to focus on (if you are focused on spatial analysis).
- 5. Reference – All the functions listed out by section
- 4.2. Spatial Queries
- 9.1. Functions Index – Functions with details
The next would be the BigQuery Docs. The docs are a little less detailed than PostGIS but they have some great tutorials to get started, and I find the way they organize the functions into different sections is helpful and can help you understand the scope of what can be accomplished with BigQuery.
Finally, the CARTO Spatial Extension provides great “extended” spatial capabilities to BigQuery. Some highlights are:
- H3, S2, and Quadkey aggregation and geometry methods
- BigQuery Tiler
- Enrichment with data from the CARTO Data Observatory
Of course, the challenge for using spatial SQL is that you not only have to know spatial SQL but also have strong knowledge of SQL itself.
There are a few ways to approach learning SQL for using spatial SQL. I have seen some people try to go deep and learn everything they can from the beginning. Not only is that going to be overwhelming, but there may be parts you may not use or will only use sparsely. Personally, I think it is best to practice some key fundamentals, then find what applies to your use case and build upon it from there. The key areas to start to focus on (in my opinion) are:
- WHERE, AND/OR, (NOT) IN, BETWEEN
- Handing NULL data
- Using SQL functions – spatial and non-spatial
- Aggregations – MIN/MAX/AVG/SUM/COUNT and GROUP BY
- CTEs or Common Table Expressions
If you are just getting started with SQL, here are some great resources for getting started.
Having a handy cheat sheet is something that is always handy to have around to reference different concepts and LearnSQL produced a great one here for basic SQL.
15. PostgreSQL Tutorial and W3 Schools
Two excellent free resources to reference and review specific concepts in both the standard SQL and PostgreSQL flavors of SQL. Both sites provide in-depth examples and quick references to different functions and concepts, almost like docs but a bit more accessible. I’ve used these time and time again to quickly check functions or concepts against simple and easy-to-follow examples.
If you are looking for a simple step-by-step tutorial, Mode provides a great set of exercises and examples that can be used in their free tier with data and a simple user interface ready for you to start practicing.
This has to be one of my favorite resources for actually practicing SQL concepts. It has interactive free lessons for beginner concepts and more advanced concepts as well. You can go through each lesson with sample data right from the browser with no additional setup or programming. Their lessons and tasks for each section are super clear and you can type and see how you are doing. It’s such a simple and easy-to-use tool to get started right now.
If you are looking for some different guides and resources to take your SQL to the next level, these are my top resources that cover some more advanced concepts such as joins, window functions, CTEs, and more.
More cheat sheets! These are very handy for joins and window functions as it really helps to have a quick and visual reference when you are working through a complex join. As these are more advanced concepts I really enjoy the visual guides to help you think through them in detail.
This resources guide from Danny Ma (Data by Danny) has tons of tutorials, videos, and use cases for different SQL problems.
These are some of my favorite tutorials for these concepts as they provide really clear step-by-step use cases and guides, as well as calling out areas that could trip you up. Such great detail and really well laid out. Below is the last post, and you can access the rest from that page.
Okay, so this is one of my favorite things about BigQuery. You can actually build and run many machine learning models right in BigQuery using just SQL. If you have ever worked on a machine learning project, you may know some of the complexity:
- Feature engineering with your data
- Moving data down to a ML environment like Jupyter
- Scaling your data
- Running and testing different models
- Feature selection
- Hyperparameter tuning
And what if your data changes? You have to start all over again. BigQuery ML does all of this and more using just SQL. Anything you can create with a SQL query you can now turn into a machine learning model.
It may seem complex at first but BigQuery has really great tutorials to help you get started:
Some of my favorites:
- Building and using a classification model on census data
- Using the BigQuery ML Hyperparameter Tuning to improve model performance
- Multiple time-series forecasting with a single query for NYC Citi Bike trips
- Creating a k-means model to cluster London bicycle hires dataset
22. 10 Advanced SQL Concepts – Towards Data Science
One final article shows some really advanced concepts such as recursive CTEs, temporary functions, pivots, rankings, lags, and running totals.
23. Key SQL Topics for Data Science – Nick Singh
If you are interested in learning more about data science, I would definitely recommend following Nick Singh. He has lots of great posts and content, including this one with lots of participation around key SQL topics for data scientists.
Spatial SQL Basics
Now that you have spent some time learning SQL in general, it’s time to dive into the spatial side of SQL. There are a few key concepts to know before getting started:
Know the difference between a GEOMETRY and GEOGRPAHY
PostGIS has two spatial data types, a GEOMETRY, which assumes your data lives on a Cartesian plane, and GEOGRAPHY, which is a point in the earth’s surface.
Make sure you know what your function is returning
When you use a spatial SQL function pay attention to what it is returning! It could return a new geometry, a number, string, or boolean.
Store your geometries
Whenever possible, store your geometries in a column instead of calculating them on the fly with a function. Your queries will be much much faster.
24. PostGIS Workshop
One of the best, again from the PostGIS Docs. The PostGIS tutorial has tons of Example queries, reviews almost every available function, and gives amazing tips and tricks. There is a TON of content here so take it in pieces and sections, but this is one of the definitive and most detailed spatial SQL guides around.
I recommend sections 6-29 if you are focused on analysis and the rest of you are a developer or more advanced user.
25. BigQuery GIS Training at Geo for Good 2019 – Sean Wohltman
This video from the Geo for Good 2019 conference by Sean Wohltman provides a great overview of spatial functionality and use cases in BigQuery, including using BigQuery GeoViz and other tools in GCP.
26. PostGIS Intro at STL PostGIS Day 2019 – Paul Ramsey
This is a detailed, yet approachable video introducing PostGIS, how it was developed, various use cases, and more advanced techniques as well.
27. Spatial Data Science with PostGIS – Abdishakur Hassan
If you don’t follow Abdishakur Hassan on Medium I highly recommend doing so now. He has some of the best content around all things spatial data science as well as a great course for geospatial Python. This article gives a great overview of setting up PostGIS and some various starter queries.
Advanced Spatial SQL
Jumping into some more advanced concepts, these guides and articles will focus on more advanced use cases with some really interesting problems you can solve all with spatial SQL.
Understanding which spatial relationship function to use and why is an important piece of knowledge. Many of them look similar but you can get different results depending on the function you use. This tutorial from the PostGIS workshop does a great job explaining the different functions, providing visual explainers, and use case examples.
29. PostGIS Tips and Tricks – Abel Vásquez Montoro
Everything from spatial indexes, arrays, lateral joins, and more. This guide goes deep into some more advanced SQL and spatial SQL concepts for PostGIS.
30. Bulk Nearest Neighbor Joins – Paul Ramsey
I can’t tell you how many times I have used this query. It may look a bit daunting at first, but in essence what it does is creates a for loop for each row of the table, where you can grab value(s) from another query.
In this case, each row of the query returns the nearest ID and the distance to that point (between properties and fire hydrants). There are so many use cases for this:
- Distance to a neighbor
- Average values of polygons touching other polygons
- Array of IDs touching within a certain distance to a target
- Percentage overlap of touching polygons
I also shared how you can do the same in standard SQL, however using the LATERAL keyword in PostgreSQL will be more efficient.
31. Michael Entin on Medium
If you are a BigQuery user then you definitely need to check out all of the posts by Michael Entin on his Medium. Even if you don’t use BigQuery there are some excellent tips and ideas in the posts. Some of my favorites are:
- Call PostgreSQL from BigQuery for extra GIS powers
- Nearest neighbor in BigQuery GIS
- Did my query get optimized spatial join?
- BigQuery GIS performance tips
- Given a zip code, find zip codes immediately bordering it
32. Spatial Interpolation of Census Data – Mathias Schläffer
Downscaling is, in my opinion, a critical skill and step in the process of many types of spatial analysis. There are plenty of documented side effects of using the wrong spatial scale (see the very real-world problems using zip codes in this old post of mine) and downscaling helps to alleviate that in some cases.
This is a great tutorial on how to do this in SQL, specifically with BigQuery and H3 cells. Using underlying data, both OpenStreetMap and building footprints, you can downscale Census data to H3 cells to see a more accurate view of how population is distributed.
33. Spatial SQL Joins and Aggregations
Joins and aggregations are tricky enough in standard SQL, but knowing how to use them and when with spatial SQL can be even more challenging. I put together two posts to help with those two areas, focusing specifically on spatial use cases.
Spatial SQL is just code, unless you apply it to some interesting use cases! Here are some of my favorite and most interesting ways to use spatial SQL.
34. Flight Lines with SQL – Paul Ramsey
This is a cool one and also shows some of the different challenges you might face, like handling data that crosses the antimeridian. This is a great first use case to get started with, maybe even using plain point and route data to start with.
35. 10,000 Dutch Bridges – Kyle Pennell
Accessing OpenStreetMap data can be very difficult despite the range of options you have to access it. There are numerous APIs, extracts, Python libraries, and other tools. Even then, once you have the days you still need to query or filter it to get what you want.
One of my favorite things about BigQuery is that they have all the OSM data in their public datasets for you ready to be queried. The first time around this can seek daunting since, if you haven’t queried against array data before, the queries can look confusing.
This post explains exactly how to do this using a simple example of extracting bridges in the Netherlands by searching through tags and country locations to get the right data.
36. Creating Isovists in PostGIS – Abel Vásquez Montoro
This one is really cool – creating a polygon that represents what is within a person’s line of sight in a specific direction and location. This contains a lot of different PostGIS functions to create a really practical example that can be used for lots of different use cases in many industries.
37. Creating map tiles in SQL
So far everything we have talked about allows you to query data, but you still have to use a different tool to visualize that data.
The good news is that you can actually generate map tiles from spatial SQL databases. Both PostGIS and BigQuery with CARTO offer this functionality. This means you can make tiles out of any SQL query.
With PostGIS, you can use the ST_AsMVT function to generate the tiles. To serve these to a web app or other service you will likely want to build a microservice to handle that, but there are already lots of great options. Below are some details on creating tiles in PostGIS and two options for serving tile services.
Building tiles in PostGIS
Additionally, in BigQuery, CARTO has developed a tiler that builds tiles directly in SQL, and then serves TileJSON endpoints to consume the tiles in web apps, or directly within other CARTO tools like Builder or CARTO for React.
The aggregated tile function allows you to aggregate millions or billions of points into common aggregations like Quadkeys. Below are the docs and some different examples using the BigQuery Tiler.
Did you know you can also calculate routes at scale in SQL? If you need to build a customized routing service (such as bike routes, indoor routes, or trails) or you need to calculate lots and lots of routes for analysis doing this in SQL is highly efficient.
PostGIS has an extension called pgRouting and their docs have great information on setting up the extension and some use cases that have been built with it. Similarly, CARTO launched a routing service for BigQuery as a part of its Spatial Extension. You can use any network within BigQuery such as OpenStreetMap to build and calculate routes at scale.
39. cleverelephant.ca – Paul Ramsey
If you have made it this far you have definitely seen Paul Ramsey’s name a few times already, and as one of the creators of PostGIS, his blog has tons of content on how to use and optimize PostGIS, new features that are coming, and many more topics.
40. Finding the most remote building in the UK – Simon Wrigley
This is a super unique use case taking building footprints data from the Ordinance Survey and finding the most remote building in the UK all with SQL and using a variety of different methods and analysis. It’s a really practical way to see how you can use just spatial SQL to provide very relevant analysis.