this is aaronland

peanuts in red sauce

webster-cli

For a long time I have wanted a command line tool for rendering PDF files from URLs complete with support for print CSS. Something like Paul Hammond's webkit2png but for PDF files. Apparently this is about to become very easy in iOS 14 and MacOS 11 but until now it’s been a nuisance. Over the weekend I managed to get the starter code for a MacOS command-line tool to generate PDF files from websites, written by Matt Smollinger a former colleague at Mapzen, working again.

It works like this:

$> webster-cli https://aaronland.info/weblog/2020/08/24/peanuts --height 9 --width 6

And it produces a file that looks like this:

You can see the actual PDF file here.

That's all it does. I was asked by a friend why this tool was necessary and this is what I said:

Mostly for making print-on-demand-ready editions of personal websites so that there is a copy of that work that doesn’t require electricity to work. I’ve done the same for work stuff like baking a PDF copy of the Mapzen weblog (and uploading it to the Internet Archive) when it wasn’t clear what would survive the shutdown of the company.

The goal is never to supplant websites or plain vanilla HTML but to use them as the seed for the PDF file, hence the use of WebKit which has proper support for print CSS.

In a museum context it becomes an interesting way to think about generating custom on-demand “catalogs” for people. This is especially interesting for a place like SFO Museum since people will be getting on planes with limited or garbage internet access.

Examples of objects from the Cooper Hewitt collection rendered using print stylesheets.

We did some work around print stylesheets for museum objects at the Cooper Hewitt. In a 2013 blog post titled cmd-P Katie Shelly wrote:

What we realized in looking at all the printouts, though, is that the simplified view of a collection record resembles a gallery wall label. And we’re currently knee-deep in the wall label discussion here at the Museum as we re-design the galleries (what does it need? what doesn’t it need? what can it do? how can it delight? how can it inform?).

webster-cli is not the first tool to generate PDF files from webpages. I used to use a tool called wkpdf until it stopped being maintained. There is also the wkhtmltopdf tool which has been around for ages. I haven't tried using it recently but in the past I've found the support for print stylesheets to be lacking and it has a long list of dependencies that make it difficult to install. Unlike these tools webster-cli is a native MacOS application. My hope is that this will give it a little more stability over the long run while also ensuring access to all the latest features when it comes to rendering web pages.

I've taken Matt's original tool, called webster, and broken it in to two pieces:

One of the reasons for breaking things up this way is that, while thinking about how SFO Museum might use these tools to generate a catalog of web pages on demand I realized that this is code that will only work on a MacOS or iOS device. Even though Swift, the programming language used to write the tool, has been ported to Linux, and can even be used to write Lambda functions in AWS, the WebKit framework which handles all the work rendering web pages is only available on operating systems produced by Apple.

In order to automate the production of PDF files it made sense to separate the code that does all the work from the code that triggers that work. That trigger might be a command line application running on my laptop or, eventually, a web service so that a process running on another computer can trigger the creation of a PDF file.

I mention this because one of the arguments I have been trying to advance recently is that the cultural heritage sector needs to revisit how it produces software, and other technologies, used by and for the community. An abbreviated version of that argument is:

In as much as there are organizations with small, or limited, technical staff I think that there needs to be a shift in the way we produce the software and tools used to service our respective digital initiatives. Although there is often a commonality of needs between two or more cultural heritage institutions there is rarely any overlap in how those needs are delivered or implemented.

In fairness the library and archives sector has had a better success, building and delivering shared tools, but the same can not be said of museums. Common tooling that aims to address the specific quirks of anything but a single museum have, historically, become an unweildly burden that no one wants to shoulder.

Instead I believe that we should undertake the effort to break apart the tools we do build in to layers of generic functionality that aren't specific a given organization or project. These small, focused tools are the things that we should be sharing rather than finished projects. These should be the common layer of plumbing and a kit of parts that we can all reach to, that underpin the different and varied initiatives we undertake.

Implicit in this argument is the idea that there is someone in an organization to reassemble this kit of parts in concert with the specific needs and technologies bespoke to that organization. You have to believe that it's possible to have an in-house technical staff. Conversely, that in-house technical staff needs to ensure that their work is economically viable and one way to do that is by sharing tools that can be used, and improved upon, across the cultural heritage sector.

webster-cli, for example, doesn't address what your print stylesheet looks like, nor does it say anything about which web pages it renders. It doesn't even handle combining multiple PDF files in to a single catalog. It does exactly one thing: It automates the production of a PDF file from a URL. The many different uses for that PDF file are left up to individuals using webster-cli to decide and to see through to completion.

The cultural heritage sector has never been able to afford the 400-person software engineering staffs that are the norm in the private sector. Nor can it continue to afford the cost of third-party client-services technology providers. I believe it can, should and needs to afford something at the intersection of those two extremes and nothing.

Fostering a practice of small, focused tools is one part of how we can make that possible.