- GIS functions
- Spatial joins
Geospatial data is so widespread that no data science software would be complete without ability to manipulate it efficiently. In this example we will take a look how GIS data can be used in
Apache Spark engine that comes with Apache Sedona extension.
To follow this example checkout
kamu-cli repository and navigate into examples/housing_prices sub-directory.
We will be using following root datasets:
ca.vancouver.opendata.property.tax-reports- contains property assessment reports from the city of Vancouver
ca.vancouver.opendata.property.parcel-polygons- contains geographical outlines of properties
ca.vancouver.opendata.property.block-outlines- contains geographical block outlines
ca.vancouver.opendata.property.local-area-boundaries- contains outlines and names of Vancouver’s districts
For geospatial data
kamu supports several input formats including ESRI Shapefile and GeoJson. Take a look at dataset definitions to understand how data is being ingested.
Quite often when ingesting GIS data you will need to deal with different projections. Example below shows how you can use a pre-processing step to convert data into another projection:
preprocess: kind: sql engine: spark query: > SELECT ST_Transform(geometry, "epsg:3157", "epsg:4326") as geometry FROM input
Note: epsg:4326 is one of the more commonly used projections which is expected by many visualization tools.
ST_Transform function here comes from Apache Sedona extension for the Apache Spark engine for working with GIS data. We will be relying on many of its function throughout this example.
Lets add all these datasets to our workspace and ingest them:
# Create a kamu workspace in the example folder $ kamu init # Add all dataset manifests found in the current directory $ kamu add . --recursive $ kamu pull --all # Alternatively: Pull datasets in ODF format from Kamu's repository $ ./init-s3.sh
tax-reportsdataset is quite large (~400MB in CSV format) so if you don’t want to wait too long - use the
This may take 10+ minutes as the tax report dataset is quite large. You can leave the ingest running and move on to the next step meanwhile.
Creating MapBox Account
For visualizing the GIS data we will be using the awesome Mapbox SDK. To follow the example you will need to create a free account and get your own access token. The token should look something like
Visualizing Housing Prices
Once you have downloaded all datasets we can start with a simple notebook to get a feel for GIS visualization.
Start the notebook server and pass it your Mapbox token via environment variable:
kamu notebook -e MAPBOX_ACCESS_TOKEN=<your mapbox token>
Now you can open the
heatmap.ipynb and run through it executing each step. The notebook is pretty straightforward - it joins together the
tax-reports dataset containing assessment values with
parcel-polygons dataset containing land geometries in order to produce a heatmap.
%%sql CREATE OR REPLACE TEMP VIEW lot_tax AS ( SELECT t.*, l.geometry FROM lots as l JOIN tax as t ON l.tax_coord = t.land_coordinate WHERE t.legal_type = 'LAND' AND t.tax_assessment_year = 2019 AND t.current_land_value is not null )
It then pulls the result from Spark into the notebook itself using this query:
%%sql -o df -n 100000 SELECT land_coordinate, ST_AsGeoJSON(geometry) as geometry, current_land_value + current_improvement_value as current_total_value FROM lot_tax
-n 100000 parameter here overrides the default limit on the dataframe size, while
ST_AsGeoJSON(geometry) also converts the internal binary
geometry type into the GeoJson format. We later use custom-written
df_to_geojson to combine individual geometries into
FeatureCollection objectm understood by the
We can then use the resulting GeoJson data and render the heatmap using the mapboxgl-jupyter library which comes pre-installed with
kamu’s notebook server image.
Let’s take a look how datasets can be joined using the geometries. The
local-area-boundaries contains the names of the Vancouver’s neighbourhoods along with their geographical boundaries. We also have the
block-outlines dataset containing boundaries of individual blocks, but there’s an issue - there is no identifier that would link a block to a specific neighbourhood. For us to associate a block with the neighbourhood we have to see if block’s boundaries are contained or intersect the boundaries of some neighbourhood.
Apache Sedona used by
kamu optimizes such spatial joins.
You can open up the
spatial_join.ipynb notebook and run through it step by step. The most important part ther is the join itself:
CREATE OR REPLACE TEMP VIEW blocks_by_hood AS ( SELECT h.name, b.geometry FROM blocks as b INNER JOIN hoods as h ON ST_Intersects(b.geometry, h.geometry) )
The rest of the visualization steps should be familiar to you from the previous example.
As an exercise - see if you can replace the
median_value column that we have populated with
RAND() for every block with an actual median property price. Just like before, there is unfortunately no identifier that would tie the property to a specific block, but see if
Sedona can handle a massive spatial join between blocks and individual property outlines.