this is aaronland

make things out of pixels

papernet.css

I am one of the six or seven people left in the world who write and maintain their own blogging software. Way back in 2005 I rewrote the toolchain that produces this weblog. It was not the first time. I did this for a few reasons. By that time I had tried almost every hare-brained scheme under the sun for generating a dynamic site and watched as they all failed and ate up hours and hours of fiddling and debugging time. For about six months I was using some insane mod_perl setup that failed so often I ended up writing blog posts in the error template that the server would send when something went wrong. So my first interest in yet-another publishing system was: Nothing more complicated than flat files.

Second, in 2005 it seemed, just for a moment, like native XSLT processing was going to be a thing worth considering in web browsers. I might have been drunk when I thought this. Anyway it was just one symptom of that same fever-dream we've always been chasing: That there's some magic way of writing things down which is simultaneously easy to produce and to read with the naked eye but still sufficiently structured to allow for reuse and complex conditional display. Pure journey-not-the-destination stuff so there's really no point in arguing about the details. Eight years ago I chose XML and XSLT and next year you will choose something that will stand the test of time no better.

Specifically I chose XHTML, that stricter-than-strict reformulation of HTML that was going to save us all, until a lot of people, quite rightly, pointed out that web browsers had long ago learned to deal with shoddy markup and less than perfect documents and maybe we all had better things to do than worrying about that stuff. But I went down that rabbit hole and came out with a way to publish blog posts that looks something like this:

$> xsltproc --novalid \
    -o ${LOCAL_CONTENT}/${1}/index.html \
    ${LOCAL_WEBLOG}/lib/xsl/bland/bland-noprefixes.xsl \
    ${LOCAL_CONTENT}/${1}/index.xml

The part where I am both invoking the --novalid flag and using an explicit no-prefixes (as in XML namespaces) stylesheet should give you some indication of why trying to use the standard XML toolchain to convert HTML in to, well, HTML is generally a waste of time. I got it to work for myself but I wouldn't recommend the exercise to anyone else. It's not XML per se nor is it HTML. It's that the two ways of doing things came in to the world at roughly the same time and have a fundamentally split-brain relationship with one another, like some awful sibling rivalry that can never be resolved.
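I won't reproduce the actual bland-noprefixes.xsl here, and the sketch below isn't it, but the namespace gymnastics mostly boil down to this: match everything in the XHTML namespace and re-emit it with no namespace at all so that the output reads as plain old HTML.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

    <xsl:output method="html"/>

    <!-- re-create every XHTML element with no namespace at all -->
    <xsl:template match="xhtml:*">
        <xsl:element name="{local-name()}">
            <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
    </xsl:template>

    <!-- attributes come through as-is -->
    <xsl:template match="@*">
        <xsl:copy/>
    </xsl:template>

</xsl:stylesheet>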

The stylesheet itself doesn't really do much other than wrap the contents of index.xml (the words you're reading now) in a fancy header and footer and include links to neighbouring blog posts and things like CSS stylesheets. I haven't bothered changing much about the design and structure of this weblog in the last eight years but I have in the past and I might in the future. I like the idea of preserving past designs but more than that I like the idea that it's something enforced by not re-rendering the source files (the words you're reading now) rather than tattooing the present on the future.

For example, this is index.xml and this is index.html after it's been massaged by the stylesheet. So that's the setup. Remember that part of the fever-dream where there's a magic rendering fairy that will turn your eggs in to chickens and vice versa? Don't worry. This is actually the good part of the story. This is the part where things feel like they're getting a little bit better.

I recently added the following CSS instruction to the (XSLT) stylesheet that renders blog posts:

<link rel="stylesheet" href="http://www.aaronland.info/css/weblog/print.css" type="text/css" media="print"/>
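Because the link element is scoped with media="print" the stylesheet itself is just plain CSS. These aren't the exact rules in my print.css, and the class names below are invented, but a print stylesheet for a weblog like this one amounts to a handful of declarations along these lines:

/* hide the screen-only chrome (the class names here are illustrative) */
.nav,
.sidebar {
    display: none;
}

/* black text on white paper, in a serif, at a size that reads well in print */
body {
    background: #fff;
    color: #000;
    font-family: Georgia, serif;
    font-size: 11pt;
    line-height: 1.4;
}

/* links are useless on paper so don't dress them up */
a {
    color: #000;
    text-decoration: none;
}

/* keep images inside the page */
img {
    max-width: 100%;
}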

Since then I've slowly been working my way backwards through the archives re-rendering things and adjusting the markup where necessary. It would be nice to tell you that I didn't need to tweak things but among this weblog's many sins was the decision to wrap individual blog posts in <ins> tags because ... stupid? Also, all those presentation slides that need rescuing from Slideshare. Live and learn and somewhere during all of that the lovely Christian Plessl wrote a command-line tool called wkpdf that uses the WebKit rendering engine to convert HTML documents in to PDF documents.

Although there are plenty of browsers available for Mac OS X, I could not find a command-line tool that allows for downloading a website and storing the rendered website as PDF. This was my motivation for creating wkpdf. The application uses Apple WebKit for rendering the HTML pages, thus the result should look similar to what you get when printing the webpage with Safari.

Since wkpdf is based on the state-of-the-art WebKit HTML rendering framework, it provides high-quality web standard compliant HTML rendering with support for advanced CSS2/CSS3 styling.

Because I've added a custom print.css stylesheet to all the rendered blog posts I can generate a handy PDF version simply by typing this on the command-line:

$> wkpdf --no-caching -n --stylesheet-media print -p custom:432x648 -m 48 54 72 48 --output index.pdf --source index.html
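If you're wondering about those numbers: the page size is given in points, 72 to the inch, so custom:432x648 works out to a 6 x 9 inch page (a pretty standard trim size for a printed book) and the margins are measured in points as well.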

I am not creating a PDF version of every blog post, which really means sets of blog posts since there are often multiple posts that live under a single URL. Instead I've been bundling up all the blog posts for a single year and generating a single big honking PDF file. 2012 is 393 pages long; 2011 has 379 pages; 2010 has 188 pages; 2009 has 466 pages and so on. The page length is mostly to account for talks and presentations — remember Operation Escape from Slideshare? — which increasingly are delivered with novel-length notes and are formatted a bit differently in print than they are online.

In talks of late I've been quick to mention that most of the Internet-based awesome we've all come to depend on is predicated on the power grid. I don't say these sorts of things to spook people but because I grew up in a place that is brutally-like-you-die cold most of the time and so you become keenly aware of the infrastructure that keeps you warm. Electricity has become like oxygen, for most people, but it is not, so it's worth contemplating its absence and having some vague conceptual framework, besides panic, for dealing with it.

If the power went out or the company that hosts this weblog went broke or every disk that keeps a copy of the source files suffered a catastrophic failure and the last eight years (fourteen really) went *poof* I would be sad. There might be more pressing matters to attend to if that happened but, you know, in principle and all that.

So this — the ability to take a weblog (a creature born-digital if there ever was one) and dress it up as a book — feels like a way to think about those larger problems in small and tangible steps. If the End Days befall us then I don't know that I will choose to take these stacks of paper that outline the last couple decades but at least it will be a choice. The Papernet all the way down, right?

In the meantime it also means that I have something (a PDF file per year of blog posts) more useful than a bag of raw HTML files to share with the Internet Archive. I still need to finish the years 2006 through 2008 but I am hoping that with everything I've learned doing 2009-2013 it won't take too much longer. Like the physical books, it feels like a more concrete way to think about the problem of digital preservation, sheltered, even just a little bit, from the actual enormity of the problem.

I like that.

There's a link above to the Internet Archive search results page for ?query=thisisaaronland but that endpoint seems to be a little flakey these days (as in: it returns no results) so I've also included links to the individual PDF files here. I'll add the rest as they get done.

I've also included the script I use to generate the annuals, below, as a reference. It uses wkpdf (mentioned above) to render individual HTML files and PDFBox to stitch them all together. You should treat this code as nothing more than pseudo-code. It works but it is very much specific to the corner(s) I've painted myself in to; the basic principles should apply regardless.
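For what it's worth, the script takes a source directory and an optional output file, so a run over a year of posts looks something like this (the path and the script name are both stand-ins):

$> sh mk-papernet.sh ${LOCAL_CONTENT}/weblog/2012 weblog-2012.pdf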

#!/bin/sh

# http://plessl.github.io/wkpdf/
# https://pdfbox.apache.org/

WKPDF=`which wkpdf`
JAVA=`which java`

PDFBOX="${JAVA} -jar /usr/local/bin/pdfbox.jar"

ROOT=$1
OUT=$2

PARENT=`basename $ROOT`

if [ -z "$OUT" ]
then
    OUT=${ROOT}'.pdf'
fi

i=0

for f in `find $ROOT -name 'index.xml' -print`
do
    i=$(($i+1))

    num=`printf "%02d" $i`

    # see what we're doing here: swapping the index.xml path for the
    # index.html next door and resolving the real path with python...
    # it's kind of stupid...
    # oh well...

    f=`dirname $f`'/index.html'
    f=`python -c 'import os, sys; print os.path.realpath(sys.argv[1])' $f`

    r=`dirname $f`
    r=`basename $r`

    TMP='/tmp/pdf-book-'$PARENT'-'$num'-'$r'.pdf'
    echo "make ${TMP} from ${f}"
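    # (this is the same wkpdf invocation as the one-off example above,
    # just spelled with the short option names)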

    ${WKPDF} --no-caching -y print -p custom:432x648 -m 48 54 72 48 -o $TMP -n -s $f

done

# Because I got bored of trying to figure out arrays in (ba)sh
# Patches are definitely welcome...

PDFDOCS=''
COUNT_DOCS=0
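
# the zero-padded counter in the file names means this list comes back
# sorted in the same order the posts were rendered (for up to 99 posts, anyway)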

for f in `ls -a /tmp/pdf-book-${PARENT}-*.pdf`
do
    COUNT_DOCS=$(($COUNT_DOCS+1))

    if [ -z "$PDFDOCS" ]
    then
        PDFDOCS=$f
    else
        PDFDOCS=$PDFDOCS' '$f
    fi
done

if [ $COUNT_DOCS -gt 1 ]
then
    echo "make ${OUT}"
    echo "${PDFBOX} PDFMerger $PDFDOCS ${OUT}"

    ${PDFBOX} PDFMerger $PDFDOCS ${OUT}
else
    echo "cp ${PDFDOCS} to ${OUT}"
    cp ${PDFDOCS} ${OUT}
fi

# Apparently passing more than 6 documents to pdfbox makes it sad...

if [ -f "$OUT" ]
then
    for f in `ls -a /tmp/pdf-book-${PARENT}-*.pdf`
    do
        echo "remove ${f}"
        rm "$f"
    done
fi

# Also, this works but sucks:
# gm convert -density 150 -scale 864x1296 pdf-book-2012-*.pdf wtf.pdf

echo "done"
exit