
The 3 critical skills you need for modern GIS (that aren’t Python or SQL)

In GIS we love to collect skills the way sports card collectors chase the 2009-10 Panini National Treasures Stephen Curry Logoman Autograph Rookie Card. I admit to this too, both falling victim to it myself and focusing a lot of my content on it.

(By the way, the Spatial Lab, my new membership community, is open and we are starting our first shared learning sprint focusing on geospatial data engineering. Learn more here.)

It is true that skills can and do help your career: they increase your salary, let you work with new tools, give you more flexibility and creativity in your spatial workflows, and help you scale to work with more data. Without tools like SQL and Python a lot of these things wouldn't be possible.

But there is a big difference between knowing a tool and knowing how to use and scale it effectively. Take SQL as an example. Being able to write and execute effective SQL statements is one important component. But being able to scale that work efficiently, in your queries, in how you implement the systems around them, and most importantly in how you automate them so that you are not the one pressing the button every time, is what significantly increases your productivity and effectiveness with the tool.

In this newsletter I will highlight the three skill areas that I think matter most for scaling and truly making use of the skills you already know. Many of them come down to foundational concepts you might learn early in a computer science track, so with that, let's jump into the first one.

Data first

If you are building any analytical workflow, data makes up all of your building blocks. And as with any building blocks, they come in different shapes and sizes. Some fit together well and others just don't.

Think of your data, each and every entry, not just the rows and columns, as building blocks. Within a single column most of them will look more or less the same; in other columns they will be bigger or differently shaped, and that varies from column to column as well.

Now picture in your mind one column of data that holds BOOLEANs, or true and false values. All the blocks are the same size, but they come in a different color, red or green, depending on whether they are true or false.

Now do the same with a column of numbers, let's say integers (whole numbers without decimals) in this case. They are all roughly the same shape, but some are small, some are a bit bigger, and others are bigger still; they grow as the number of digits stored increases. You could do the same with floats, or numbers with decimals, as well.

These are your fundamental data types, and they give you the most storage efficiency. In the Lego world they are the pieces with maybe 1 to 4 dots on top. Small, efficient, scalable.

Now imagine you have a bunch of pieces that are strings, or text data. These vary a lot in size since they could be one word or many words, plus they need 4 dots just to get started! Many colors, many sizes. Strings are not as efficient since they require a base number of bytes (dots in our Lego analogy) plus more data for each additional character. These are your bigger pieces, large and bulky, and harder to work with once they are part of whatever you are building.

Then we have our favorite: geometries. As you know, these take on many different types, shapes, and scales, from a single point to a polygon with many vertices. The range is endless, big and small, detailed and not, and it is hard to tell at a glance. Apart from points, which are just a latitude and longitude (as long as they are 2D), geometries can vary enormously. These are the pieces that might fit together perfectly, or might be those odd pieces you never know how to use.

Then you have your collections of data: lists or arrays, dictionaries or JSON, sets, and tuples. These hold lots of data, and while the structure of the container is consistent, what is inside can be super compact and organized, or not (sometimes that choice is up to you).

To give you a better picture of this, in case the Legos didn't do it for you:

Here is that list in concrete terms, including the GeoPandas geometry type, sorted by storage efficiency from smallest to largest footprint (a short snippet for checking these sizes yourself follows the list):

  1. Boolean (bool) – 1 bit (typically represented as 1 byte due to alignment).
  2. Integer (small) (int) – Small integers typically take up 28 bytes, but this can vary.
  3. Float (float) – Typically 24 bytes.
  4. Complex number (complex) – Typically 32 bytes (two floats).
  5. Short Strings (str) – 49 bytes for an empty string; size grows with the length of the string.
  6. Tuple (tuple) – 48 bytes (empty tuple). Size increases with the number of elements.
  7. List (list) – 64 bytes (empty list). Size increases with the number of elements.
  8. Set (set) – 224 bytes (empty set). Size increases with the number of elements.
  9. Dictionary (dict) – 240 bytes (empty dictionary). Size increases with the number of keys and values.
  10. GeoPandas Geometry (geopandas.GeoSeries or shapely.geometry) – Varies widely based on the complexity and number of vertices in the geometry. A simple point geometry can be lightweight, but complex polygons with many vertices can have a significantly larger footprint.
  11. Large Strings (str) – Large strings consume more memory as they grow.
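If the Legos didn't fully land, you can check these footprints yourself. Below is a minimal sketch using Python's sys.getsizeof for built-in types and Shapely's WKB serialization to show how geometry size grows with vertex count; exact byte counts will differ across Python versions and platforms.

```python
import sys

from shapely.geometry import Point, Polygon

# Rough per-object sizes in CPython; exact numbers vary by version and platform.
print(sys.getsizeof(True))                       # bool
print(sys.getsizeof(42))                         # small int
print(sys.getsizeof(3.14))                       # float
print(sys.getsizeof(""))                         # empty string
print(sys.getsizeof("a longer piece of text"))   # grows with length
print(sys.getsizeof(()))                         # empty tuple
print(sys.getsizeof([]))                         # empty list
print(sys.getsizeof(set()))                      # empty set
print(sys.getsizeof({}))                         # empty dict

# Geometries scale with vertex count: compare the serialized (WKB) size of a
# single point against a polygon with a thousand vertices.
point = Point(0, 0)
polygon = Polygon([(i, i % 7) for i in range(1_000)])
print(len(point.wkb), len(polygon.wkb))
```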

Now why did I go on a rant about Legos and data types? When you are creating or working with data, you want the most efficient data structure you can get. You want it to be consistent and structured, especially when you are slicing, grouping, filtering, and modifying the data. In short, you want smaller pieces rather than bigger pieces when you are doing that, especially if your building (or dataset) is very large.

Lesson 1: use smaller data types and only the data you need to make operations efficient.
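Here is a hedged sketch of what that lesson looks like in practice with GeoPandas; the file name and column names are made up for illustration. Keep only the columns you need and downcast the rest to smaller types before doing any heavy work.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical parcels layer; swap in your own file and column names.
gdf = gpd.read_file("parcels.gpkg")
before = gdf.memory_usage(deep=True).sum()

# Keep only the columns the analysis actually needs.
gdf = gdf[["parcel_id", "land_use", "assessed_value", "geometry"]]

# Downcast to smaller, more efficient types.
gdf["parcel_id"] = pd.to_numeric(gdf["parcel_id"], downcast="integer")
gdf["assessed_value"] = pd.to_numeric(gdf["assessed_value"], downcast="float")
gdf["land_use"] = gdf["land_use"].astype("category")  # repeated text becomes small integer codes

after = gdf.memory_usage(deep=True).sum()
print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```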

Using compute strategically

If data represents your building blocks, then the next most important component is your tools. Picking those tools is one important piece. You wouldn't use a chainsaw to cut a twig in half, and you wouldn't use a hammer to cut down a tree. For example, I wouldn't use Spark and Apache Sedona to process a Shapefile with 10,000 features, just as I wouldn't use QGIS to process a dataset with 1,000,000,000 features.

The other part of choosing your tools is understanding the energy they consume, and how they consume it. In this case, the energy I am referring to is compute.

Let’s imagine this scenario. I have to cut down a tree in my backyard. I have a few options to choose from:

  • Push it down with my bare hands
  • Use a hand saw
  • Use an axe
  • Use a chainsaw
  • Call a tree specialist

Each of these requires a few considerations. First, how much time is each one going to take? Pushing the tree down is something I can do right away. Getting a tool takes time, and some tools are faster than others. And calling a tree specialist will cost time (and money).

Each one takes a certain amount of energy as well. The specialist only requires picking up the phone (but at a higher cost), whereas pushing, using a saw, or swinging an axe requires more energy from you.

In this analogy, energy is your compute and your tools are, well, your tools. Different tools will demand different amounts of energy from your compute resources, be that on your laptop or in a cloud service. And different tools use that energy in different ways.

On top of that, the tree size can vary. A small tree that is only a few feet tall can be pulled out quite easily, whereas a much bigger tree warrants different tools. And some tools use energy more efficiently, like a chainsaw, which makes your operation more scalable.

The lesson here is three-fold:

  • Pick the right tool for the right size tree (dataset)
  • You can control the size of the tree (dataset) based on what we learned in the prior section, which lets you get away with tools that need less energy (compute)
  • Use scalable tools that increase your productivity by using energy appropriately (compute)

Your compute is an important resource. Deploy it with the right tools for the right size job. If you do this well you can scale efficiently and effectively. Now, what are the parallels for all those examples? Well, the larger the tree, the larger the dataset. For the rest (a quick sketch of the chainsaw option follows the list):

  • Push it down with my bare hands (QGIS)
  • Use a hand saw (Python)
  • Use an axe (SQL)
  • Use a chainsaw (Parallel processing like DuckDB or Apache Sedona)
  • Call a tree specialist (Using a ready to use cloud service)
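To make the chainsaw option a bit more concrete, here is a rough sketch using DuckDB's spatial extension to aggregate a large file with a parallel engine instead of loading every feature into Python objects first. The file path and the land_use column are placeholders, not a real dataset.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# ST_Read streams the file through DuckDB's engine, so the aggregation runs
# in parallel without building a GeoDataFrame in memory first.
result = con.execute(
    """
    SELECT land_use,
           COUNT(*)           AS parcels,
           SUM(ST_Area(geom)) AS total_area
    FROM ST_Read('parcels.gpkg')
    GROUP BY land_use
    ORDER BY parcels DESC
    """
).df()

print(result)
```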

Automation

All of this will help you scale to new problems just by making a few different choices in how you construct your data and how you choose your tools. But what if, on top of that, you could set up your data and analysis pipelines to run without any intervention from you?

This is where automation comes in and it helps you deliver value without any extra work from you. I talked about delivering value in a previous newsletter and automation is one of the biggest ways you can do this.

There are quite a few tools for this: simple cron jobs on your computer or in the cloud; data pipeline orchestration in Python with tools like Airflow, Prefect, or Dagster; job pipeline management in tools like Apache Sedona or Wherobots; and visual workflow builders like QGIS Model Builder or CARTO Workflows.
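As one hedged example using Prefect (the task bodies, file names, and column names here are placeholders), a pipeline becomes a flow of tasks that a scheduler can run for you every day:

```python
import geopandas as gpd
from prefect import flow, task


@task
def extract() -> gpd.GeoDataFrame:
    # Placeholder source; in practice this might be a database table or an API.
    return gpd.read_file("daily_permits.geojson")


@task
def transform(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    # Keep it small and typed, per Lesson 1.
    gdf = gdf[["permit_id", "status", "geometry"]]
    return gdf[gdf["status"] == "approved"]


@task
def load(gdf: gpd.GeoDataFrame) -> None:
    gdf.to_parquet("approved_permits.parquet")


@flow
def daily_permit_pipeline() -> None:
    load(transform(extract()))


if __name__ == "__main__":
    # Run it once locally; a schedule (cron, a Prefect deployment, etc.) takes it from here.
    daily_permit_pipeline()
```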

But the real magic is in the delivery of the data and the value. Imagine that when you share a dataset, dashboard, or map, the end consumer of that analysis knows and can trust that the results are consistent and up to date. You are delivering value to them without any extra work, and without them having to come back to you to ask for more analysis or pose another question.

This is the final part of the equation. And while it might be tempting to simply bolt automation onto your existing workflow, having an efficient process with the right tools makes all the difference. One example of this was a data pipeline at Airbnb that produced a file with all of the booking data for that day. A Parquet file of roughly 10 GB was created every day, and hundreds of downstream tools, analyses, and dashboards relied on it.

Needless to say, querying that data took a long time, until Zach Wilson took a look at it. By making some simple optimizations (more in this video), he brought the same daily data down to roughly 500 MB. And by using the right tools to query it, every downstream process became about 20x more efficient from the data change alone, and even more so with the correct tools on top.
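I won't claim this is the exact change from the video, but as a hedged illustration of the kind of optimization that shrinks a file like that without dropping a single record: sorting on low-cardinality columns before writing Parquet lets its dictionary and run-length encoding do far more work. The file and column names below are made up.

```python
import pandas as pd

# Hypothetical daily bookings extract; column names are for illustration only.
bookings = pd.read_parquet("bookings_2024-01-01.parquet")

# Sorting groups repeated values together, which Parquet's dictionary and
# run-length encoding can compress far more effectively.
bookings = bookings.sort_values(["country", "city", "property_type"])

bookings.to_parquet(
    "bookings_2024-01-01_sorted.parquet",
    compression="zstd",  # a stronger codec than the default snappy
    index=False,
)
```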

This is the end goal, consistent value delivered quickly. And languages and tools alone won’t get you there. But focusing on these three areas will.