this is aaronland

generative dependencies

go-marc

Flight information packet: American Airlines. Collection of SFO Museum. 2013.088.002 a l.

A couple of weeks ago I attended a presentation about the navPlace extension to the IIIF APIs. The extension defines the semantics for appending location information to IIIF resources, typically old historical maps, as GeoJSON FeatureCollection elements. At one point during the presentation someone pointed out that there are a lot of institutions with maps lacking geographic information or that only have geographic information stored in MARC (or machine-readable cataloging) records. Usually that information is stored in 034 - Coded Cartographic Mathematical Data fields that look like this:

1#$aa$b80000$dW0825500$eW0822000$fN0273000$gN0265000

That's the MARC 034 encoding for a bounding box containing the area spanning Bradenton and Venice along Florida's Gulf coast and I was reminded that I had actually written code for working with these data.

In 2017 I spent a few days with the DX Labs crew at the State Library of New South Wales in Sydney. Mapzen was still a going concern then and I was actively working on the Who's On First gazetteer project. The Library has an extensive collection of maps of the greater Sydney area. Many of these maps aren't geographically accurate, taking artistic license in their depiction of space, but over the years have been assigned a geographic extent (a bounding box) based on what is visible in the map.

Because I was working on a gazetteer at the time my first thought was that it would be interesting to find all the places (the neighbourhoods, microhoods and localities) that intersected the extent of any given map. That would allow the library to assign persistent identifiers for specific places and enable spatial lookups of their map collection without necessarily having to perform spatial operations or manage the database of spatial information needed to perform those queries.

Traveler information brochure: Air France. Collection of SFO Museum. 2011.001.069

In order to do this it was first necessary to account for the fact that the library uses MARC 034 fields for storing all its location information. It is surprisingly easy to find reasons to get exasperated by MARC records but they are also a reflection of the technological constraints that people were operating under when they were created. More importantly they worked well enough, and continue to work well enough to this day, that nothing else has superseded them.

MARC itself has grown to become a large and daunting specification. There is a small handful of general purpose tools for working with MARC records but lacking the time I chose to write a single, purpose-fit tool for parsing MARC 034 records in to bounding boxes with decimal coordinates and from there in to GeoJSON Feature records.
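As a rough illustration of that last step, here is a minimal sketch in Go of wrapping a decimal bounding box in a GeoJSON Feature using only the standard library. It is not the go-marc code: the `feature` and `geometry` types and the `bboxToFeature` function are names invented for this example, and the `name` property is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// geometry and feature model just enough of GeoJSON (RFC 7946) to describe
// a bounding box. They are illustrative types, not part of go-marc.
type geometry struct {
	Type        string         `json:"type"`
	Coordinates [][][2]float64 `json:"coordinates"`
}

type feature struct {
	Type       string                 `json:"type"`
	BBox       [4]float64             `json:"bbox"`
	Geometry   geometry               `json:"geometry"`
	Properties map[string]interface{} `json:"properties"`
}

// bboxToFeature builds a Polygon Feature from decimal extents. The ring is
// closed (first and last positions match) and wound counterclockwise, and
// positions are longitude first, latitude second.
func bboxToFeature(minX, minY, maxX, maxY float64, props map[string]interface{}) feature {
	ring := [][2]float64{
		{minX, minY},
		{maxX, minY},
		{maxX, maxY},
		{minX, maxY},
		{minX, minY},
	}
	return feature{
		Type:       "Feature",
		BBox:       [4]float64{minX, minY, maxX, maxY},
		Geometry:   geometry{Type: "Polygon", Coordinates: [][][2]float64{ring}},
		Properties: props,
	}
}

func main() {
	// Decimal extents for the Bradenton to Venice example.
	f := bboxToFeature(-82.91666666666667, 26.833333333333332, -82.33333333333333, 27.5,
		map[string]interface{}{"name": "another example"})

	enc, err := json.MarshalIndent(f, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(enc))
}
```

Once a record is a plain GeoJSON Feature it can be handed to more or less any contemporary spatial tool, which is the whole point of the exercise.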

My thinking was that to ask the library to change its existing practice, and by extension update its entire catalog retroactively, was both a non-starter and presumptuous. The library had settled on MARC 034. Few, if any, contemporary spatial tools work with MARC 034. What was necessary was something simple to install and easy to use that could bridge those two facts. Which is how the go-marc tools came to exist. The tools consist of a software library for parsing MARC 034 records and command line tools, built on that library, for working with documents containing MARC 034 records.

There is a tool for converting arbitrary MARC 034 strings in to decimal bounding boxes. For example:

$> ./bin/marc-034 '1#$aa$b22000000$dW1800000$eE1800000$fN0840000$gS0700000'
-70,-180,84,180
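To make the conversion concrete, here is a minimal sketch in Go of how a string like the one above could be turned in to a decimal bounding box. It is not the go-marc implementation: the `BoundingBox` type and the `parseCoordinate` and `parse034` functions are names made up for this example. It assumes the `$d`, `$e`, `$f` and `$g` subfields (west, east, north and south) are always present and always use the eight character `hdddmmss` form. MARC 034 permits other coordinate forms that this sketch ignores.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// BoundingBox holds decimal-degree extents. Illustrative only.
type BoundingBox struct {
	MinX, MinY, MaxX, MaxY float64
}

// parseCoordinate converts one "hdddmmss" value, such as "W0825500", in to
// decimal degrees. South and west are returned as negative numbers.
func parseCoordinate(raw string) (float64, error) {
	if len(raw) != 8 {
		return 0, fmt.Errorf("expected 8 characters (hdddmmss), got %q", raw)
	}
	deg, err := strconv.ParseUint(raw[1:4], 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid degrees in %q: %w", raw, err)
	}
	min, err := strconv.ParseUint(raw[4:6], 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid minutes in %q: %w", raw, err)
	}
	sec, err := strconv.ParseUint(raw[6:8], 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid seconds in %q: %w", raw, err)
	}
	value := float64(deg) + float64(min)/60 + float64(sec)/3600
	switch raw[0] {
	case 'N', 'E':
		return value, nil
	case 'S', 'W':
		return -value, nil
	}
	return 0, fmt.Errorf("unknown hemisphere in %q", raw)
}

// parse034 splits a 034 string on "$", collects the subfields by their
// one-letter code and converts $d, $e, $f and $g in to a BoundingBox.
func parse034(raw string) (BoundingBox, error) {
	subfields := map[string]string{}
	parts := strings.Split(raw, "$")
	// parts[0] holds the indicators (for example "1#") and is skipped.
	for _, p := range parts[1:] {
		if len(p) < 2 {
			continue
		}
		subfields[p[:1]] = p[1:]
	}

	var bbox BoundingBox
	targets := []struct {
		code string
		dest *float64
	}{
		{"d", &bbox.MinX}, // westernmost longitude
		{"e", &bbox.MaxX}, // easternmost longitude
		{"f", &bbox.MaxY}, // northernmost latitude
		{"g", &bbox.MinY}, // southernmost latitude
	}
	for _, t := range targets {
		value, ok := subfields[t.code]
		if !ok {
			return BoundingBox{}, fmt.Errorf("missing subfield $%s", t.code)
		}
		coord, err := parseCoordinate(value)
		if err != nil {
			return BoundingBox{}, err
		}
		*t.dest = coord
	}
	return bbox, nil
}

func main() {
	bbox, err := parse034("1#$aa$b22000000$dW1800000$eE1800000$fN0840000$gS0700000")
	if err != nil {
		panic(err)
	}
	// Same ordering as the tool output above: min_y,min_x,max_y,max_x.
	fmt.Printf("%v,%v,%v,%v\n", bbox.MinY, bbox.MinX, bbox.MaxY, bbox.MaxX)
	// prints -70,-180,84,180
}
```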

There is a tool for reading one or more CSV files containing MARC 034 column data and appending the corresponding decimal bounding box information to a new CSV file. For example, given a CSV file that looks like this:

$> cat test.csv
id,marc_034,name
123,1#$aa$b22000000$dW1800000$eE1800000$fN0840000$gS0700000,example
456,1#$aa$b80000$dW0825500$eW0822000$fN0273000$gN0265000,another example

The tool in question would produce the following output:

$> ./bin/marc-034-convert -to-stdout ./test.csv
id,marc_034,max_x,max_y,min_x,min_y,name
123,1#$aa$b22000000$dW1800000$eE1800000$fN0840000$gS0700000,180,84,-180,-70,example
456,1#$aa$b80000$dW0825500$eW0822000$fN0273000$gN0265000,-82.3333333333333,27.5,-82.91666666666667,26.833333333333332,another example

Finally there is a tool that launches a simple web application where MARC 034 data can be entered and its bounding box will be displayed on a map.

There's nothing special about the web application. It follows the same pattern that I've been pursuing with similarly-sized map-based applications at SFO Museum. Writing about a bespoke web application built around the Placeholder geocoder I said:

It's a relatively simple web application that depends on three separate JavaScript frameworks, two different services and a whole other cartography component. They are:

  • Placeholder – The coarse geocoder we've been discussing so far.
  • Bootstrap – A framework for developing responsive and mobile websites.
  • Leaflet – A framework for web-based mapping applications.
  • Tangram – A framework for styling and rendering data in web-based mapping applications.
  • Nextzen – A free and open provider of geographic data for web-based mapping applications.
  • Tangram map styles – The cartography and styles used by Tangram to render data provided by Nextzen. These are also referred to as "scene files".

For most of the last decade the conventional wisdom has been to load each of these pieces remotely. Usually these pieces are hosted by some sort of third-party content distribution network (or CDN) to help speed things up and the guts of an application are often little more than the glue that binds all these remote bits of functionality together. It's a fine way of doing things but it's not the way that I wanted to set up a Placeholder service for the museum.

Instead I wanted:

  1. To run a local copy of Placeholder.
  2. For all of the Javascript dependencies (Bootstrap, Leaflet, Tangram) to be bundled with the application and served locally.
  3. To be able to read and write cached tiles from Nextzen.
  4. To be able to do all of this both locally, on a remote server and in the cloud.

The bounding box converter application does not need to run a local copy of Placeholder but otherwise all the concerns are the same. With the bounding box application I've added the ability to use locally bundled copies of the Nextzen geographic data (vector tiles) so that the application can be entirely self-contained and run without any dependencies on third-party services.

These locally bundled data are called tilepacks, which are just MBTiles databases containing vector tile data compiled using the tilezen/go-tilepacks package. To use a local tilepack with the bounding box application pass the path to your database to the -nextzen-tilepack-database flag. For example:

$> ./bin/marc-034d -nextzen-tilepack-database tiles/nextzen-world-2019-1-10.db 
2021/10/23 14:27:33 listening on http://localhost:8080

The application won't perform any differently but if you look "under the hood" you'll see that the vector tile data is being served from the application itself.

Locally bundled tilepacks aren't necessarily the solution for every map-based web application. A tilepack with global coverage spanning zoom levels 1 through 10 is a 1.8GB file. Every subsequent zoom level roughly doubles in size. A tilepack with global coverage for zoom level 16, alone, will be over 100GB. Even accounting for optimizations like excluding tiles that are just water the tilepack will be so large that it may start to present a number of operational challenges.

For applications that don't need to account for global coverage, though, or that operate without the need for street-level tiles, this opens up more avenues for simple-yet-sophisticated applications (again, remember the layer-cake of functionality described above) that can be deployed with the guarantee that they'll just keep running independent of third-party services. At SFO Museum I've been working on a similar bundling of map tile data using the Protomaps toolchain, as well as other spatial data, to create a self-contained geotagging application.

All of these applications have been written in Go which, I will be the first to admit, is not always the most pleasant language for developing web applications. There is much about the language that makes writing complex web applications difficult and time consuming. But there is a whole other class of simpler applications, of small focused tools, where Go shines: it can embed an application's static assets (its templates, web pages and other resources like CSS or JavaScript files) in to a single binary that runs its own web server and can be pre-compiled, inclusive of all its dependencies, for multiple operating systems.

It shines because it makes a meaningful effort at removing so much of the operational hassle of other approaches whether it's a steep dependency chain, a steeper learning curve to master the tools to handle that dependency chain, the fickleness or short-term approaches adopted by one or more parts of a dependency chain or all of the above. It's not that other approaches don't have their merits so much as their requirements can be prohibitive.

This is especially true in an environment like the cultural heritage sector which is lacking the practice of developing a toolkit of small, simple-to-use and simpler-still-to-install-and-maintain tools. The sector is littered with battleship-sized projects that in their attempt to do everything only succeed in accomplishing nothing.

These tools for parsing MARC 034 records do very little but my hope is that, precisely because they only try to do one thing well, they will allow people to get other things done faster.

These tools also don't solve the actual problem of finding intersecting Who's On First records for bounding boxes. If you are familiar and comfortable with the dependency chain of spatial databases this becomes a trivial problem once your MARC 034 records look and feel like decimal bounding boxes. If not it remains tomorrow's problem. Importantly it remains a separate problem.

https://github.com/aaronland/go-marc/