this is aaronland

things I have written about elsewhere #20180102

Who's On First, Chapter Two

This was originally published on the Who's On First weblog, in January 2018. I've included it here because it is a fitting end to 2017.

The short version

We heard you liked 2017 so much that we decided to start 2018 with a bang...

The longer version

So that's a bummer, yeah?

In many ways everything about the manner in which Who's On First has been designed has been done with this day in mind. We all endeavour to achieve the sort of escape velocity that immunizes us from circumstance, but that is rare indeed and there was always a chance this day would come. So while success was the goal, in many ways preventing what I call the reset to zero has always been of equal importance.

The reset to zero means that one day (like today) the Who's On First you use and pay for as a service on the Internet goes away... and then what?

Typically it's meant that anyone who's wanted to use an alternative to one of the Big Dominant Geo-Services or API Providers (BDGSAP) and has built their applications or services using that alternative has had little choice but to go fishing around in their desk drawer for their BDGSAP account information and think about re-tooling everything... all over again.

That's the reset to zero.

I would like to think that we have, even just a little bit, prevented this from happening again.

It would be nice to believe that it's only been a reset to five (on a scale of ten) but realistically it's probably closer to a reset to three.

That does not mean that people will be able to turn around and build or offer all of the Who's On First services in a day or even a week, but if we've done our jobs well I'd like to think that there is nothing that couldn't be rebuilt (by us or by someone else) in a couple of months. It means that while things are not literally better than yesterday (since yesterday you didn't have to read this blog post), they are hopefully better than the yesterday of the last time a service you came to depend on had to shutter its doors.

We'll see, right? The dust hasn't settled yet so there are likely gotchas that remain to be discovered.

The long version

In practical terms what this means for Who's On First in the short-term is:

What's next

A lot of people have been asking me what's next and I've generally answered by saying that we'll take some time to lick our wounds (and our bruised egos) and then we'll pick things up again and figure out how to keep it going. But it is work that will have to be done in layers, rebuilding things piece by piece, which is to say:

  1. Make sure the raw data is available, period. This is mostly done, albeit in human-unfriendly forms.

  2. Make sure each record is available as an addressable resource (forgive me, I just said resource...) on the web. Again, we have one instance of this (data.whosonfirst.org) but the more the merrier.

  3. Spin up the Spelunker again along with its great big Elasticsearch index.

  4. Build something akin to the recently announced AWS S3 Select API which allows you to extract individual, specific properties from S3 documents. This is meant to be a bridge between the Spelunker and the formal API (below). I have a working prototype for this, today.

  5. Spin up the API servers, without any spatial functionality enabled.

  6. Spin up the point-in-polygon services and re-enable them in the API.

  7. Figure out where and how to rebuild the other spatial queries blah blah blah databases blah blah blah scale blah blah blah computrons blah blah blah... and re-enable them in the API.

  8. Rebuild the editorial tools (Boundary Issues) and figure out how to improve upon the batch/cascading updates. This might also be the right time, or opportunity, to think about how we open up the editorial process more broadly to people on the Internet.

  9. It feels like there is enough work in Who's On First, past, present and future, to warrant a 3-4 year grant-funded project (especially if we aim to make a dent in the historical places problem) but that's not anything that will happen right away.

In my mind everything up to and including the Spelunker is near-term, the API and the spatial services are medium-term and the editorial stuff is longer-term. It's not ideal but it seems the most realistic given whatever world of new everyone involved in the project will be negotiating during the coming year.
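To make item 4 in the list above a little more concrete, here is a minimal sketch of the general idea: pulling a handful of named properties out of a Who's On First GeoJSON document instead of returning the whole thing. It is not the prototype mentioned above. The function name `select_properties` and the abbreviated sample document are assumptions made for illustration.

```python
import json

def select_properties(feature, wanted):
    """Return only the requested keys from a GeoJSON Feature's properties.

    Hypothetical helper for illustration; keys that are absent from the
    document are silently skipped.
    """
    props = feature.get("properties", {})
    return {key: props[key] for key in wanted if key in props}

# An abbreviated, hand-written stand-in for a real WOF document.
raw = """
{
  "type": "Feature",
  "properties": {
    "wof:id": 85977539,
    "wof:name": "New York",
    "wof:placetype": "locality"
  },
  "geometry": null
}
"""

feature = json.loads(raw)
print(select_properties(feature, ["wof:id", "wof:name", "wof:missing"]))
```

A service built along these lines would only need the raw files to be reachable, which is why it can sit between the static data and the full API.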

Aside from the near-term work we're doing ourselves (which we'll talk more about below, I promise) we've been in touch with a handful of institutions to see whether they are interested in helping out. In the short term what this has meant is asking whether they would be interested in hosting a static copy of the data (2) and donating a spare computer or two to run the Spelunker and Elasticsearch (3).

Simply hosting the data alone would be fantastic since it would mean that people already relying on existing Who's On First documents will only need to update their URLs to point to whosonfirst.INSTITUTION.org/data (or whatever).

Sidenote: In case you've ever wondered why Who's On First defaults to publishing relative URLs for things, this is why...
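As a rough illustration of why relative URLs help here, the sketch below joins the same relative path to two different base URLs. The `resolve` helper and the `whosonfirst.example.org` host are hypothetical. Only the relative path and the data.whosonfirst.org host come from this post.

```python
from urllib.parse import urljoin

def resolve(base_url, rel_path):
    """Join a host-specific base URL with a relative Who's On First path."""
    # urljoin replaces the last path segment unless the base ends in a slash,
    # so normalize the base first.
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, rel_path)

rel_path = "859/775/39/85977539.geojson"

# The second base URL is a made-up mirror, used only for illustration.
for base in ("https://data.whosonfirst.org", "https://whosonfirst.example.org/data"):
    print(resolve(base, rel_path))
```

Because a record only carries the relative path, switching to a new host means changing one base URL in your own configuration rather than rewriting every stored link.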

Running a copy of the Spelunker would be icing on the cake. Everything else seems premature, right now. If you or your institution (or your company) is interested in helping out, the dataset consists of 26.5M files and requires approximately 60GB of storage without any Git history data. With Git history data the storage requirements are approximately 300GB. The Elasticsearch index is about 130GB today so let's just round it up to a nice and easy 200GB.

./bin/wof-du-stats -mode repo /usr/local/data/whosonfirst-data* | python -mjson.tool
{
    "stats": {
        "0-10k": 26336534,
        "1-10m": 428,
        "10-100k": 170186,
        "10-100m": 30,
        "100-500k": 14563,
        "100m-BIG": 1, <-- oh New Zealand...
        "500k-1m": 827
    },
    "totals": {
        "bytes": 58514102879,
        "files": 26522569
    }
}
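
If you want to produce similar numbers for your own copy of the data without the wof-du-stats tool, something like the following Python sketch would get you close. It is not how wof-du-stats is implemented. The bucket labels are copied from the output above, but the exact boundaries (1024-based, upper bound exclusive) are an assumption.

```python
import os
from collections import Counter

K = 1024
M = 1024 * K

# (exclusive upper bound in bytes, label). Boundaries are assumed, not
# taken from the real tool.
BUCKETS = [
    (10 * K, "0-10k"),
    (100 * K, "10-100k"),
    (500 * K, "100-500k"),
    (1 * M, "500k-1m"),
    (10 * M, "1-10m"),
    (100 * M, "10-100m"),
]

def bucket_for(size):
    """Return the label of the size bucket that a file of `size` bytes falls in."""
    for upper, label in BUCKETS:
        if size < upper:
            return label
    return "100m-BIG"

def du_stats(root):
    """Walk `root` and count files per size bucket, plus overall totals."""
    stats = Counter()
    total_bytes = 0
    total_files = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            size = os.path.getsize(os.path.join(dirpath, name))
            stats[bucket_for(size)] += 1
            total_bytes += size
            total_files += 1
    return {
        "stats": dict(stats),
        "totals": {"bytes": total_bytes, "files": total_files},
    }
```

Calling `du_stats` on a local checkout and passing the result to `json.dumps` would print a structure shaped like the one above.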
	      

I tend to think that keeping more copies of things (2,3) in more places is a good thing and also serves as an opportunity to figure out how to make them play nicely together across multiple institutions. That's a larger meta-project separate from Who's On First but equally important in its own right.

In the meantime

Here's what we've been doing in the meantime:

curl -s -I https://id.whosonfirst.org/101736545 | grep Location
	      Location: https://data.whosonfirst.org/101/736/545/101736545.geojson
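
The redirect above reflects a simple mapping from an ID to a relative path: the digits of the ID are split into groups of up to three, and the groups become nested directories. A minimal Python sketch of that mapping follows. The function name `id_to_rel_path` is made up for this example, and the logic is inferred from the two paths shown in this post rather than copied from the Who's On First libraries.

```python
def id_to_rel_path(wof_id):
    """Derive the relative .geojson path for a Who's On First ID.

    The digits are chunked into groups of three (the final group may be
    shorter) and the filename is the full ID plus the .geojson extension.
    """
    digits = str(wof_id)
    parts = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "/".join(parts + [digits + ".geojson"])

print(id_to_rel_path(101736545))
print(id_to_rel_path(85977539))
```

The first call yields 101/736/545/101736545.geojson, matching the Location header above. The second yields 859/775/39/85977539.geojson, which is the path used for New York City.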
	      

This is simultaneously an effort to reduce costs, minimize maintenance and overhead and generally make sure that Who's On First plays nicely with all the tools and services. In the same way that Who's On First tries to actively not-care about what sort of database you want to use, we try equally to have no opinion about what sort of infrastructure stack you're using. It is early days so it remains to be seen how many of those goals are actually met.

There is also in-progress work to build an on-demand service for publishing static renderings of Who's On First documents. These include HTML, SVG and standard places response (or SPR) JSON renderings. They are meant to serve as a bridge between the raw data files and services like the Spelunker or the API, and they live under the places.whosonfirst.org domain.

For example, the URL for New York City is https://places.whosonfirst.org/id/85977539 and to fetch the SPR response you would do:

curl -s https://places.whosonfirst.org/id/85977539.spr | python -mjson.tool
{
    "mz:is_ceased": -1,
    "mz:is_current": 1,
    "mz:is_deprecated": 0,
    "mz:is_superseded": 0,
    "mz:is_superseding": 1,
    "mz:latitude": 40.68295,
    "mz:longitude": -73.9708,
    "mz:max_latitude": 40.915532777005,
    "mz:max_longitude": -73.700009063871,
    "mz:min_latitude": 40.496133987612,
    "mz:min_longitude": -74.255591363152,
    "mz:uri": "https://whosonfirst.mapzen.com/data/859/775/39/85977539.geojson",
    "wof:country": "US",
    "wof:id": 85977539,
    "wof:lastmodified": 1511825453,
    "wof:name": "New York",
    "wof:parent_id": -4,
    "wof:path": "859/775/39/85977539.geojson",
    "wof:placetype": "locality",
    "wof:repo": "whosonfirst-data",
    "wof:superseded_by": [],
    "wof:supersedes": [
        1125397311
    ]
}
	      

A few things to note about this example:

  1. It probably makes more sense to use a .json extension, rather than .spr.
  2. See the mz:uri property? Depending on when you read this that might have been replaced with a wof:uri property already. I was (still am) always ambivalent about including that property in the so-called standard places response for just this reason.
  3. It would be nice to update the id.whosonfirst.org service to redirect you to the correct HTML, SVG, etc. endpoint based on content headers or equivalent hints.
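
Given the second point, client code reading an SPR response probably shouldn't hard-code either property name. Here is one hedged way to cope, in Python: prefer wof:uri if it is present and fall back to mz:uri otherwise. The `spr_uri` helper is hypothetical, and the wof:uri value in the example is made up for illustration.

```python
def spr_uri(spr):
    """Return the absolute URI from an SPR dict, or None if neither key is set.

    Prefers the newer wof:uri property and falls back to mz:uri.
    """
    for key in ("wof:uri", "mz:uri"):
        value = spr.get(key)
        if value:
            return value
    return None

# A response as shown above, carrying only mz:uri.
old_style = {
    "mz:uri": "https://whosonfirst.mapzen.com/data/859/775/39/85977539.geojson",
    "wof:path": "859/775/39/85977539.geojson",
}

# A possible future response. This wof:uri value is an assumption.
new_style = {
    "wof:uri": "https://data.whosonfirst.org/859/775/39/85977539.geojson",
    "wof:path": "859/775/39/85977539.geojson",
}

print(spr_uri(old_style))
print(spr_uri(new_style))
```

If neither property is present, the relative wof:path is still there and can be joined to whichever host you trust.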

I mentioned there are also SVG versions of all the geometries. This is still experimental work so we haven't formalized how it will be operationalized or sorted out all the kinks. My hope is that, in time, we can generate canned PNG versions from these SVG renderings for use in applications that just need a picture of a place rather than the raw data.

That's as far as we've gotten, today. As you might imagine it's been a little crazy around here and there's been a lot of other stuff to deal with.

How you can help (with the software)

If you'd like to help out on the software side of things, here's a short list of things (in no particular order) that I've been thinking about:

How you can help (with the data)

I asked Stephen Epps to outline some ways that people can help out with the data side of things going forward. This is what he said:

Geometry fixes

Data additions

Property edits

Other issues

Tools

To all of that I would add:

Inspirational conclusion

It's a disappointing day for sure, but it has been both a luxury and a privilege to work 40 hours a week on Who's On First for this long. Importantly, things are just a little bit better in January 2018 than they were in July 2015 when we started the project.

We all know that a comprehensive, high-quality and openly licensed gazetteer is a need and a benefit to all. Most people, though, have no idea how difficult a problem those three things are to tackle simultaneously, and nor should they. There is a lot of work left to do but my hope is that we have contributed enough to make a dent in the problem, one at least deep enough for the next person or persons to get a toe-hold and carry things forward without having to start from scratch.