Mastering SQL fundamentals to get started with spatial SQL
So you are interested in getting started with spatial SQL — that’s great! Getting started can seem overwhelming but in this post and others, I hope to take you through the different use cases for spatial SQL bit by bit.
One of the biggest challenges if you are working in geospatial technology is that you have to have a little knowledge in a lot of areas: SQL, Python or R, front-end, back-end, data science/machine learning, and many more.
It can seem overwhelming, especially looking at how deep you can go in any one of these topics (try looking for a Python course and you can see all the possible avenues you can go down).
In my last post, I discussed why you should use spatial SQL, and my goal is to explain some of the key concepts in SQL and spatial SQL that will help you get the most out of the language while also providing resources to go deeper as you see fit.
I’ll generally start with some general SQL concepts, then show how they apply to spatial data using spatial SQL.
For these posts, I will assume that the readers have no SQL knowledge so I will try to go step by step and provide some practical examples as well as challenges.
How can I follow along?
There are a few ways you can follow along:
- Using Google BigQuery and BigQuery GeoViz or CARTO
- PostgreSQL and PostGIS (locally or in GCP) with QGIS
Google Cloud Platform (GCP) provides a free tier with (1TB of queries a month which you can use), and all the data we will be using in these examples is already available in BigQuery via the BigQuery public datasets.
Compared to PostGIS, BigQuery has fewer spatial functions but contains many core spatial functions, but at times I may point to a PostGIS-specific function to highlight.
Join other geospatial professionals in my once a month newsletter
BigQuery will also allow us to work with much larger datasets as well. Below are a few videos to familiarize yourself with BigQuery GIS.
Additionally, PostgreSQL has its own “flavor” of SQL, whereas BigQuery uses standard SQL.
If you prefer to use PostgreSQL and PostGIS you can create a PostgreSQL instance in GCP (or your cloud provider of choice). PostGIS will already be installed in GCP and it has a great process for loading CSVs (you will need to create a geometry column from a WKT geometry using ST_GeomFromText if you go this route). You can also use Postgres.app for Mac with the Postico app to provide a query interface. Finally, to visualize data you can connect to QGIS to visualize results.
A note about table names and code styling
Table names will look different in BigQuery and PostGIS. In PostGIS you will see simple table names such as us_states whereas in BigQuery you will see names like this: bigquery-public-data.geo_us_boundaries.states which represent:
- bigquery-public-data: This is the project in GCP
- geo_us_boundaries: This is the dataset, which in BigQuery contains many different tables in it
- states: The actual table we are querying
In the examples, I will use the BigQuery table naming but feel free to replace that if you are using PostGIS.
Finally, in many SQL examples you will see lots of capitalization for words like SELECT, WHERE, GROUP BY, etc. I use all lowercase when writing SQL — it is faster to type (you don’t need to hold down shift all the time) and just as legible.
The goal for this walkthrough
By the end of this post, you should understand:
- How to create a basic SQL query
- An intro to data types in SQL
- Limiting and ordering your data
- Functions in SQL and how to use them
- An overview of different types of spatial SQL functions
- Using spatial SQL for some basic measurements and transformations
Use SELECT and FROM to access your data
To query your data, you will use FROM to pick the table you want to query, and SELECT to select the data from that table. To get all the available columns you can use * to select everything.
If you only want a few columns of data, you can list the column names you want to see, separated by columns.
Use LIMIT and OFFSET to restrict the number of records
Let’s say you only want to grab a few records, maybe the top 10. You can do that using LIMIT:
And if you want to grab records 11 to 20, you can add in OFFSET to grab the next batch:
We can use LIMIT in cases such as limiting the results of nearest neighbors, features within a certain distance, or other spatial relationships.
WHERE allows you to filter records based on data in columns
WHERE allows you to specify conditional statements in your query based on attributes in your data. You could query to find a specific state by name:
Or use a function to find states that start with the letter M:
Or for spatial queries, find states with a total area larger than a specified amount:
WHERE will be used in many of your spatial queries: to find features that intersect one another, a feature within a specific distance, and more.
You can also combine WHERE statements using the AND and OR operators depending on how you want to filter your data.
Use mathematical operators in your queries
As shown above, you can use any number of operators in your WHERE clauses as well as performing mathematical operations within your query:
You can use ORDER to order your data
If you want to order the results of your query, you can use ORDER BY to order your data by a specific value in the data. You can sort these in different ways using DESC and ASC depending on the direction you want to go.
For spatial queries, ORDER BY allows us to sort data based on spatial features or relationships (amount of features within a certain distance)
SQL has different data types and you can change between them
SQL provides for many different data types as well and it is important to know about them as well as some of the nuances in different types of SQL. For BigQuery this includes
PostgreSQL/PostGIS includes even more.
A common problem found in spatial data is with postal codes. Postal codes in the US include some states with a leading 0, and if that is stored in a numeric data type those leading 0s will be lost — so 00123 becomes 123.
SQL allows us to cast between data types, so we can turn strings into numbers and many other combinations.
For spatial SQL we can turn our GEOGRAPHY or GEOMETRY columns into other formats such as WKT or GeoJSON — which uses specific functions which we will focus on next.
SQL uses functions to transform and work with data
SQL provides different functions that are used to transform, aggregate, and manage data (among many other things. Generally speaking, these functions take any number of arguments and will return some result.
We saw one above — LEFT()
It is important when using these functions to read the documentation to see what the specific inputs are as well as the value(s) that are returned. For the function LEFT() takes two arguments, first a value and second a length, and it will return:
…a STRING or BYTES value that consists of the specified number of leftmost characters or bytes from value
So if we wanted the first letter of the state name we can run this query:
You can use functions to add more columns to your dataset, in WHERE statements and ORDER BY (with some limitations), and more. When you use a function as a new column you can easily add an alias after function to give it a new column name as seen above.
Spatial functions (almost) always start with ST
I have shown a few different spatial SQL functions here, and as you will see they all begin with ST, which stands for spatial type. This helps distinguish SQL functions like UNION from spatial functions with similar names — ST_Union.
There are some helper functions and shorthand functions in PostGIS that don’t start with ST, but for the vast majority, this rule holds true.
Practicing some spatial SQL basics
Now that we have covered the basics let’s run a few different spatial SQL queries to practice what we have covered.
First, let’s build on some of the queries from earlier:
Let’s now order our states by the size of their border using ST_Perimeter:
Next, let’s transform our state polygon geometries to centroids:
And now, by combining two functions, we can get the latitude and longitude coordinates of those centroids:
And using another package, the Spatial Extension from CARTO, we can also create buffers from those same centroids:
We can also see, for a specific state, which other states are touching that state:
Finally, for those working in PostGIS, a quick note on projections, which can be changed by ST_Transform (BigQuery only uses ESPG:4326):
The great thing about BigQuery is that any of these queries can scale up to large data, such as for every Census Block in the US:
Spatial SQL: Challenge
Using the Mapzen Who’s on First dataset in BigQuery, let’s look for airports located in Canada, and order them by their airport name and their south to north location
The elements you should think about are
- Filtering results that are in Canada
- Ordering by south to north (hint — we showed a function earlier in the post that will help you achieve this)
- Ordering by the airport code
First, you can create a table that contains the information you need using this query (don’t worry we will cover this in future posts)
You can also try this yourself! Using a dataset of your choice, measure the geometry using any function of your choice. Try to use a WHERE clause to select data based on a specific filter — maybe based on a text or numeric column. Share your queries and results by adding a comment in the post.
I will post the answer to this challenge in the next post where we will look at advanced WHERE statements, aggregations, and more!