this is aaronland

things I have written about elsewhere #2022129

A global point-in-polygon service using a static 8GB data file

This was originally published on the SFO Museum Mills Field weblog, in December 2022.

Here's the punchline: A global point-in-polygon service that returns Who’s On First records and costs a few dollars a month to run. To explain how that's possible let's start with a video.

If your browser is making the video very small in order to fit the viewport you can see a full-size version here.

This video shows some in-progress work we're doing to develop interfaces for geotagging objects in our collection. This example shows an aerial photo of Yerba Buena Island and the San Francisco Bay taken by Stanley Henry Page in 1922. Note the absence of the Bay Bridge which wouldn't be completed for another 14 years! All of this work builds on the many geotagging-related blog posts written in 2020 and, in particular, our use of the New York Public Library's Leaflet.GeotagPhoto interface control to define not just where an image was taken but also the field of view emcompassed by that image.

Recently, we've been working on the tools to reverse-geocode those two points. Reverse-geocoding assigns stable, permanent identifiers for the named places represented by a pair of latitude and longitude coordinates. We've written about this before in the Reverse-Geocoding in Time at SFO Museum and Geotagging at SFO Museum: Protomaps, search and reverse-geocoding blog posts. In both cases we were working with location data and databases scoped to SFO and the San Francisco Bay Area but SFO Museum's collection spans the entire planet so the challenge has been to think about inexpensive and low-overhead ways to make reverse-geocoding possible for any place on Earth. The results of those efforts are what you're seeing in the video above.

The video starts by showing the image of Yerba Buena next to a map. The default starting point for objects that haven't been geotagged is the SFO campus so the first thing you see is a demonstration of what happens as the camera and field of view controls are moved around. Each time a control stops moving its latitude and longitude is sent to a reverse-geocoding server and the list of places associated with that point are used to populate a drop-down menu. This allows us to say that an image (not this image) was taken from the Super Bay Hangar building and that it depicts Runway 01R/19L.

Next the camera icon is moved up to the City of San Francisco, and positioned where we think the photograph (of Yerba Buena island) was taken, and the field of view control is extended to stretch across the San Francisco Bay. Downtown San Francisco and the City of Berkeley are selected as the locations where the image was taken and what it depicts, respectively. Is this correct? The photo wasn't taken from downtown San Francisco so much as above it and the focus of the photo is really Yerba Buena more than Berkeley. But, as you can see from the video, there are a number of places to choose from and we can move the field of view control to better suit our understanding of an image. Failing a consensus about the exact locations we could always say the photo depicts San Francisco and Alameda counties or, if nothing else, at least the state of California.

Afterwards, the map is zoomed out to display the Americas. The camera is moved up to Canada and the field of view control to the Roraima region in Brazil and then across the ocean to Africa and finally to Central Europe where there are both contemporary and historical records for the countries that made up the former Yugoslavia. These unrealistic camera views are just to demonstrate that it's possible to perform point-in-polygon queries anywhere in the world. As a place is selected from the results of a point-in-polygon query its unique ID, and all the unique IDs of its ancestors, is appended to the coordinate data for the camera and the field of view controls. These are data we then append to the object records in our collection which enables us to search for objects associated with not just a specific location but all the other places that parent it.

These point-in-polygon requests are being made to a server that has been deployed using the API Gateway in front of a Lambda function pattern. That's a pretty common pattern these days and not terribly exciting (or at least novel, I think it's still pretty exciting). The exciting part is that the server is executing the point-in-polygon queries against a static 8GB datafile stored in an S3 bucket rather than a traditional spatial database. That static datafile contains the entirety of the Who's On First administrative corpus in addition to the catalog of architectural features at the San Francisco International Airport and was created using openly-licensed data and open-source software.

A screenshot of the AWS Cost Estimate Calculator showing estimated costs for hosting an 8GB file using their S3 storage service.

This datafile can be stored and served from an AWS S3 bucket for as little as $0.18/month to as much as $9/month if, unlikely as it may be, you were to serve 8000GB worth of data out of that datafile in a month. So, really, a global point-in-polygon service that returns Who's On First records for about $1/month. All of this builds on work we wrote about in the Reverse-Geocoding in Time at SFO Museum blog post. In fact the server piece described in that blog post is the same one being used in this example. The difference is that instead of storing data and performing spatial queries in a SQLite database we are storing data in a remote Protomaps datafile and reading small slices of that file, rather than the whole thing, in order to extract just the information necessary to perform a spatial query in memory.

So, what is a Protomaps database? Protomaps describes itself as a serverless system for planet-scale maps. It is a suite of tools and a database format for efficiently storing, retrieving and rendering map tile data at a fraction of the size and cost of other map-related infrastructures. The Protomaps database format employs clever means to organize and efficiently retrieve map tile data but that data is, ultimately, just a collection of compressed GeoJSON features. This means we don't have to store data designed specifically for rendering cartographies but can store other geographic data like Who's On First place records. For any given tile there are only a finite number of places and their geometries will have been trimmed to the boundary of the tile so, assuming data has been stored at zoom 12 tiles or higher, the amount of data to transfer out of a Protomaps database will be small (or at least small enough). All of this makes it possible, given a latitude and longitude coordinate, to:

This approach is not as fast as other approaches. Notably it is nowhere near as fast as an actual spatial database. However, when you factor in the cost and the overhead of running and maintaining a dedicated spatial database, combined with a variety of strategies for caching results, it is very often fast enough. This is not the only way we could have solved the problem of a global-scale reverse-geocoding service but we are happy to be able to have done so in a way that puts this kind of functionality within the reach of people and organizations who might not otherwise be in a position to deploy, or maintain, the traditional infrastructure used to solve these kinds of problems.

Postcard: Eastern Air Lines, Douglas. Paper, ink. Gift of Thomas G. Dragges, SFO Museum Collection. 2015.166.0636

How does it work?

This section documents the different data and software packages which make all of this possible. This is the Very Technical part of the blog post so if that's not relevant to your interests you can stop here knowing that something which, at the scale of every place on Earth, used to be hard and expensive is now cheap, easy and possible.

The rest of this blog post will document each of the steps used to create a Protomaps database full of Who's On First records and the query it using both a command line tool and an HTTP web service.

Who's On First data

Poster: various airlines. Paper, ink. Gift of Robert McCrory, SFO Museum Collection. 2010.157.003

Who's On First data is the raw geographic data that we're going to encode, for example one or more data repositories from the whosonfirst-data or the sfomuseum-data GitHub organizations.

whosonfirst/go-whosonfirst-tippecanoe

Photograph: San Francisco International Airport (SFO), Central Terminal construction. Photograph. Transfer from San Francisco International Airport, SFO Museum Collection. 2005.137.050.002

go-whosonfirst-tippecanoe is a software package that will iterate through one or more Who's On First data repositories, filter out records whose geometry is not a polygon, trim the record properties to a minimum, common set of properties (optionally appending additional properties you choose to specify) and emitting everything as line-separated GeoJSON records.

For example, here's how you might create line-seperated GeoJSON records for features in the sfomuseum-data-whosonfirst and sfomuseum-data-architecture respositories:

$> bin/features \
	-as-spr -require-polygons \
	-spr-append-property wof:hierarchy \
	-writer-uri 'constant://?val=featurecollection://?writer=stdout://' \
	/usr/local/data/sfomuseum-data-whosonfirst/ /usr/local/data/sfomuseum-data-architecture \
	> sfomuseum.jsonl

Do you see the -spr-append-property wof:hierarchy flag that's being included in the command above? This tells the tool to append the wof:hierarchy property of each feature to the final output. We'll revisit that later in this blog post.

felt/tippecanoe

Swizzle stick: American Airlines. Plastic. Gift of Thomas G. Dragges, SFO Museum Collection. 2001.116.189

tippecanoe is a software package that consumes GeoJSON records and encodes their data as map tiles in an MBTiles database. For example, here's how you might do that for the sfomuseum.jsonl file created in the last example, encoding all the data in tiles at zoom level 13.

$> tippecanoe -P -z 13 -pf -pk -o sfomuseum.mbtiles sfomuseum.jsonl

tippecanoe is also able to read data from STDIN to rather than creating an intermediary sfomuseum.jsonl file you can also just pipe the output of the features tool straight to tippecanoe. For a good discussion of how tippecanoe works take a look at Erica Fischer's How We Make Your Data Look Great At Every Scale with Tippecanoe blog post.

protomaps/go-pmtiles

Playing cards: Western Airlines. Cardboard, plastic, ink. Gift of Thomas G. Dragges, SFO Museum Collection. 2002.095.044

go-pmtiles is a software package that will convert a MBTiles database of map tiles in to a Protomaps database (of map tiles). Here's how you would convert the 'sfomuseum.mbiles' database created in the last example in to a Protomaps database.

$> pmtiles convert sfomuseum.mbtiles sfomuseum.pmtiles

whosonfirst/go-whosonfirst-spatial-pmtiles

Brochure: Northeast Airlines. Paper, metal, ink. Gift of the William Hough Collection, SFO Museum Collection. 2008.055.992

go-whosonfirst-spatial-pmtiles is a software package for performing point-in-polygon queries against a Protomaps database (of map tiles). In addition to library code for performing those queries it also provides tools for building a Protomaps database of Who's On First data in a Docker container and a command-line tool for doing point-in-polygon queries against a local, or remote, Protomaps database.

docker

The Dockerfile included with the package wraps all the steps described above and can be run locally or in a hosted container service. For example, here's how you might build the same Protomaps database, that we've created in the examples above, storing the output locally, in a folder named /usr/local/data.

$> docker build -t whosonfirst-spatial-pmtiles .

$> docker run -v /usr/local/data:/usr/local/data whosonfirst-spatial-pmtiles  \   
	/usr/local/bin/build.sh \
	-n sfomuseum \
	-z 13 \
	-s 'sfomuseum-data://?prefix=sfomuseum-data-whosonfirst sfomuseum-data://?prefix=sfomuseum-data-architecture'

Note that this particular example might take a long time depending on things like bandwidth speeds and processing power; anything map-related over zoom 12 starts to take a long time. It is also written specifically to read source data from repositories stored in a GitHub organization, rather than data stored locally. There are also example configuration files for running this container using Amazon's Elastic Container Service.

query

The query tool is a command-line tool for performing point-in-polygon queries against a Protomaps database. Here's how you would query the sfomuseum database, created in examples, above for a point at SFO. In this example we're also asking that the results be sorted first by placetype, then by name and finally by the inception date. The output of that query is run through the jq tool to produce a simple list containing each place's name, inception date and placetype.

$> ./bin/query \
	-spatial-database-uri 'pmtiles://?tiles=file://usr/local/data&database=sfomuseum' \
	-latitude 37.615761 \
	-longitude -122.389154 \
	-sort-uri placetype:// \
	-sort-uri name:// \
	-sort-uri inception:// \
	
	| jq '.places[] | "\(."wof:name") \(."edtf:inception") (\(."wof:placetype"))"'
	
"North America  (continent)"
"United States  (country)"
"California  (region)"
"San Mateo  (county)"
"San Francisco International Airport 1948~ (campus)"
"SFO Terminal Complex 2000~ (building)"
"SFO Terminal Complex 2006~ (building)"
"SFO Terminal Complex 2011~ (building)"
"SFO Terminal Complex 2014~ (building)"
"SFO Terminal Complex 2017~ (building)"
"SFO Terminal Complex 2019-07-23 (building)"
"SFO Terminal Complex 2020-~05 (building)"
"SFO Terminal Complex 2021-05-25 (building)"
"SFO Terminal Complex 2021-11-09 (building)"
"International Terminal 2000~ (wing)"
"International Terminal 2006~ (wing)"
"International Terminal 2011~ (wing)"
"International Terminal 2020-~05 (wing)"
"International Terminal 2021-05-25 (wing)"
"International Terminal 2021-11-09 (wing)"
"International Terminal 2000~ (wing)"
"International Terminal 2006~ (wing)"
"International Terminal 2011~ (wing)"
"International Terminal 2014~ (wing)"
"International Terminal 2017~ (wing)"
"International Terminal 2020-~05 (wing)"
"International Terminal 2021-05-25 (wing)"
"International Terminal 2021-11-09 (wing)"
"International Terminal 2000~ (wing)"
"International Terminal 2006~ (wing)"
"International Terminal 2011~ (wing)"
"International Terminal 2014~ (wing)"
"International Terminal 2017~ (wing)"
"International Terminal 2019-07-23 (wing)"
"International Terminal 2020-~05 (wing)"
"International Terminal 2021-05-25 (wing)"
"International Terminal 2021-11-09 (wing)"
"International Terminal Main Hall 2017~ (concourse)"
"International Terminal Main Hall 2019-07-23 (concourse)"
"International Terminal Main Hall 2020-~05 (concourse)"
"International Terminal Main Hall 2021-05-25 (concourse)"
"International Terminal Main Hall 2021-11-09 (concourse)"

As you can see any given point on the SFO campus could have been one of any number of different concourses or buildings over time. We've written about this in detail in the Who's On First at SFO Museum and the Reverse-Geocoding in Time at SFO Museum blog posts. Although it's not shown in this example it's also possible to filter point-in-polygon queries using fuzzy or ambiguous dates encoded using the Library of Congress's Extended DateTimeFormat. You can also see how we are able to pair SFO Museum's bespoke data alongside the core Who's On First dataset.

whosonfirst/go-whosonfirst-spatial-www-pmtiles

Postcard: Pan American World Airways. Paper, ink. Gift of Thomas G. Dragges, SFO Museum Collection. 2015.166.1213

go-whosonfirst-spatial-www-pmtiles is a software package for running a web service that allows point-in-polygon queries to be performed against a Protomaps database using HTTP requests. This server can be deployed as a standalone HTTP service or using the API Gateway in front of a Lambda function pattern. Here's how you might start that service, running locally, using the Protomaps database created in the examples above.

$> bin/server \
   	-server-uri http://localhost:8080 \
	-spatial-database-uri 'pmtiles://?tiles=file:///usr/local/data&database=sfomuseum&enable-cache=true' \
	-properties-reader-uri '{spatial-database-uri}' \

The service is queried by sending it a JSON-encoded point-in-polygon request to its /api/point-in-polygon endpoint. For example:

{
   "latitude" : 37.615761,
   "longitude" : -122.389154,
   "properties" : [
      "wof:hierarchy"
   ],
   "sort" : [
      "placetype://",
      "name://",
      "inception://"
   ]
}

Before going any further there are a few things to note:

Eventually, all of these steps will be consolidated but they are necessary today. Putting it all together we end up with something like this:

$> curl -X POST http://localhost:8080/api/point-in-polygon \
	-d '{"latitude": 37.615761, "longitude": -122.389154, "properties": [ "wof:hierarchy" ], "sort": [ "placetype://", "name://", "inception://" ] }'


{
  "places": [
    {
      "edtf:cessation": "",
      "edtf:inception": "",
      "mz:is_ceased": -1,
      "mz:is_current": 1,
      "mz:is_deprecated": -1,
      "mz:is_superseded": 0,
      "mz:is_superseding": 0,
      "mz:latitude": 0,
      "mz:longitude": 0,
      "mz:max_latitude": -122.433347,
      "mz:max_longitude": -122.342023,
      "mz:min_latitude": 37.578044,
      "mz:min_longitude": 0,
      "wof:belongsto": [],
      "wof:country": "",
      "wof:id": "102191575",
      "wof:lastmodified": 1652218109,
      "wof:name": "North America",
      "wof:hierarchy":[
        {
            "continent_id":102191575
        }
      ],
      "wof:parent_id": "-1",
      "wof:path": "102/191/575/102191575.geojson",
      "wof:placetype": "continent",
      "wof:repo": "sfomuseum-data-whosonfirst",
      "wof:superseded_by": [],
      "wof:supersedes": []
    },
    ...and so on

Reading remote datafiles

Serving tray: United Airlines; McDonnell Douglas DC-10. Plastic, paint. Gift of the McAfee family in memory of Robert E. McAfee, SFO Museum Collection. 2015.126.121

So far all the examples have demonstrated reading data from a local Protomaps database. It is also possible to read data from a Protomaps database hosted on a remote storage service like Amazon's S3. Under the hood the value of the ?tiles= query parameter in the -spatial-database-uri flag is just a gocloud.dev/blob URI so data can be read from any source that implements that package's Blob interfaces. The default supported URIs are file:// for reading data from the local filesystem and s3:// and s3blob:// for reading data from an S3 bucket. The difference between s3:// and s3blob:// is that the latter allows AWS credentials to be defined as a query parameter. For example:

s3blob://{S3_BUCKET}?region={AWS_REGION}&prefix={PREFIX}&credentials={CREDENTIALS}

A concrete example, reading data from an S3 bucket called example in a folder called tiles/ in the AWS us-east-1 region, specifying an AWS credentials profile called example would be:

s3blob://example?region=us-east-1&prefix=tiles/&credentials=example

Unfortunately, because the URI contains its own query parameters, it needs to be URL-encoded before it is included in the final -spatial-database-uri URI. Putting it altogether we end up with this:

-spatial-database-uri 'pmtiles://?tiles=s3blob%3A%2F%2Fexample%3Fregion%3Dus-east-1%26prefix%3Dtiles%2F%26credentials%3Dexample&database=sfomuseum'

Which is definitely not pretty but it does the job, provides a simple declarative syntax to toggle between local and remote data sources and is something that can be improved on in future releases.

Postcard: San Francisco International Airport (SFO), The International Room. Paper, ink. Gift of Thomas G. Dragges, SFO Museum Collection. 2015.166.2287

A few final words about cached data

It is possible to tell the server tool in the go-whosonfirst-spatial-www-pmtiles package to return results as a set of GeoJSON records. This functionality is not really suited for use with a Protomaps database. That's because the geoemtries for any given place are trimmed to the boundaries of the map tile that contains it. Large areas, like regions or countries or continents, will simply be squares unless you happen to be doing a point-in-polygon query near a coastline and even smaller places will be trimmed in they happen to span multiple tiles. Compounding the issues, as of this writing, is that the local caching mechanism is storing feature records using only their Who's On First identifiers independent of the tile currently being queried. As you can see in these screenshots that yields some puzzling results. An improved caching strategy is on the to-do list but, ultimately, if you are using a Protomaps database for doing point-in-polygon requests and you need complete geometries for the places returned in a query request you will need to fetch them in separate requests from https://data.whosonfirst.org/geojson/{ID} or https://static.sfomuseum.org/geojson/{ID} or an equivalent data source.

Update

Since publishing this blog post there have been some questions about whether or not SFO Museum is operating a public point-in-polygon service. At this time we are not. However, in the spirit of generousity we have produced and published a public Protomaps database file containing all of the administrative records in Who's On First at zoom level 12. The file itself is 3.8GB and unfortunately we need to reserve the right to remove it, at any time, if the bandwidth costs become an issue. Maybe it won't but, sadly, in 2022 it's always a possibility. For the time being here's how you can download and use the file locally:

First, start by downloading the file using a tool like wget or equivalent. In these example it is assumed the local file is saved to /usr/local/data/whosonfirst.pmtiles.

$> mkdir -p /usr/local/data
$> wget -O /usr/local/data/whosonfirst.pmtiles https://static.sfomuseum.org/pub/point-in-polygon/whosonfirst.pmtiles

If you just want to query the database file from the command line download a copy of the go-whosonfirst-spatial-pmtiles package. The only thing you'll need to install to compile the query tool is a copy of the Go compiler which will build the tool for any of the Windows, Apple or Linux operating systems.

$> cd go-whosonfirst-spatial-pmtiles
$> go build -mod vendor -o bin/query cmd/query/main.go

And then query the database just like you would in the examples above:

$> ./bin/query \
	-spatial-database-uri 'pmtiles://?tiles=file:///usr/local/data/&enable-cache=true&database=whosonfirst' \
	-latitude 37.615761 \
	-longitude -122.389154 \
	-sort-uri placetype:// \
	| jq '.places[]["wof:name"]'
	
2022/12/20 11:28:35 time to index paths (0) 20.53µs
2022/12/20 11:28:35 GET tile at /whosonfirst/12/655/1585.mvt
2022/12/20 11:28:35 fetching whosonfirst 0-16384
2022/12/20 11:28:35 Prune tile cache older that 2022-12-20 11:23:35.357902 -0800 PST m=-299.971377764
2022/12/20 11:28:40 fetched whosonfirst 0-0
2022/12/20 11:28:40 fetching whosonfirst 3003051-11907
2022/12/20 11:28:40 fetched whosonfirst 3003051-11907

"North America"
"United States of America"
"United States"
"California"
"San Mateo"
"San Francisco International Airport"
"San Francisco-Oakland-San Jose"
"America/Los_Angeles"

If you want to query the database file from a web service download the go-whosonfirst-spatial-www-pmtiles package. Like the previous example, the only thing you'll need to build a copy of the server tool is the Go compiler.

$> cd go-whosonfirst-spatial-www-pmtiles
$> go build -mod vendor -o bin/server cmd/server/main.go

$> ./bin/server \
	-spatial-database-uri 'pmtiles://?tiles=file:///usr/local/data&enable-cache=true&database=whosonfirst' \

The default point-in-polygon endpoint for the server tool is http://localhost:8080/api/point-in-polygon. For example:

$> curl -s http://localhost:8080/api/point-in-polygon \
	-d '{ "latitude": 37.615761, "longitude": -122.389154, "sort": [ "placetype://" ] }' \
	| jq '.places[]["wof:name"]'
	
"North America"
"United States of America"
"United States"
"San Francisco International Airport"
"America/Los_Angeles"
"San Mateo"
"San Francisco-Oakland-San Jose"
"California"