In a spinny bar
And then I said...
See also: the short version.
py-wsclustr.php
Where place matters more and space matters less...
I will be at Museums and the Web, this week, to talk about the work we've been doing at Flickr around geotagging photos, reverse-geocoding and shapefiles and more broadly notions of bias in and the interpretation of place. Plus, I get to speak alongside the Philly History crew which is extra-exciting!
The other thing that I'm excited about is being able to talk about how Clustr, the open source tool we use to generate shapefiles, is now bundled as part of the Maps From Scratch Amazon EC2 AMI. There's a long and detailed blog post about all that on the code.flickr blog but the short version is:
We expressly chose to make Clustr an open-source project to share some of the tools we’ve developed with the community but it has also always had a relatively high barrier to entry. Building and configuring a Unix machine is often more than most people are interested in, let alone compiling big and complicated maths libraries from scratch. Clustr on EC2 is not a magic pony factory but hopefully it will make the application a little friendlier.
In that post I talked about wanting to be able to use Clustr by calling a simple web service so eventually I wrote the quickest and dirtiest implementation I could think of: a PHP script that simply shells out to the Clustr application and then returns the output (compressed). I encourage anyone who wants to get hung up on the lack of elegance in that approach to port CGAL to PHP. Your efforts will be amply rewarded, I'm sure, but in the meantime this already works:
$> curl -H 'x-clustr-alpha:0.00001' -v --data-binary '@/path/to/points.txt' \
    http://ec2-xxxxxxxx.compute-1.amazonaws.com/ws-clustr/ > ~/path/to/shapefile.tar.gz
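For anyone who would rather stay in Python than shell out to curl, the same request can be sketched with nothing but the standard library. This is a minimal Python 3 sketch, not part of ws-clustr itself: the function names are made up for illustration, the endpoint is whatever your own EC2 instance answers on, and the only thing taken from the example above is the x-clustr-alpha header and the raw POST body.

```python
import urllib.request


def build_clustr_request(endpoint, points, alpha):
    """Build (but do not send) a POST request carrying the raw points data.

    `alpha` is passed as a string so it reaches the server exactly as
    written (str(0.00001) would otherwise become '1e-05').
    """
    return urllib.request.Request(
        endpoint,
        data=points,  # raw bytes of the points file, like curl --data-binary
        headers={"x-clustr-alpha": alpha},
        method="POST",
    )


def fetch_shapefile(endpoint, points_path, out_path, alpha="0.00001"):
    """POST a points file to a ws-clustr endpoint and save the response."""
    with open(points_path, "rb") as fh:
        req = build_clustr_request(endpoint, fh.read(), alpha)
    with urllib.request.urlopen(req) as rsp, open(out_path, "wb") as out:
        out.write(rsp.read())
    return out_path
```

Calling fetch_shapefile('http://ec2-xxxxxxxx.compute-1.amazonaws.com/ws-clustr/', 'points.txt', 'shapefile.tar.gz') should then behave roughly like the curl command, minus the verbose output.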
ws-clustr.php is available for anyone to download on GitHub, along with a handy README file for getting it to work with the Maps From Scratch AMI. Which is all good but you still need something to make shapes of. How about all the geotagged photos uploaded to Flickr on March 24, 2009:
$> python flickr-tools/geotagged.for_day.py -c /path/to/flickr.cfg -d '2009-03-24' --clustr
That yields a file with 54,673 points that I can ask ws-clustr to plot. By passing those points to ws-clustr with a variety of alpha sizes (11 times to be exact) I was able to generate the following image in QGIS:
The geotagged.for_day.py script is one of several Flickr-related helper tools available for download on GitHub as part of the flickr-tools package.
So now what? Or rather: What if my mapfromscratch/ws-clustr AMI isn't already up and running and I want to generate hawt shapefile action? EC2 servers are great for doing short-fast tasks but if left running for days or weeks on end they start to incur noticeable fees. Fortunately, starting and stopping EC2 can be done programmatically so I wrote a client-side interface, in Python, to (ws) Clustr that starts a new EC2 instance, exchanges a points file for a (compressed) shapefile and then shuts the server down again. The code also checks to see if there is already a running instance of the AMI you want to use and simply uses that one if available.
Like this:
import time
from wsclustr import wsclustr

wsc = wsclustr('amz_access_key', 'amz_secret_key')
wsc.startup('ami-xxxxx')

# poll until the EC2 instance is up and answering
while not wsc.ready() :
    time.sleep(5)

shpfile = wsc.clustr('2009-03-24-geotagged.txt')
wsc.shutdown()
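Since the whole point is not to leave a server running (and billing) by accident, it may be worth wrapping that sequence so the instance is shut down even when something goes wrong along the way. Here is a hedged sketch of one way to do it. The helper name, the timeout and the injectable sleep function are all assumptions for illustration; the only things it relies on from py-wsclustr are the startup, ready, clustr and shutdown methods shown above.

```python
import time


def clustr_once(client, ami, points_path, interval=5, timeout=300, sleep=time.sleep):
    """Start an instance, wait for it, build one shapefile, always shut down.

    `client` is any object exposing startup(), ready(), clustr() and
    shutdown(), such as the wsclustr instance created above.
    """
    client.startup(ami)
    try:
        waited = 0
        while not client.ready():
            if waited >= timeout:
                raise RuntimeError("instance not ready after %s seconds" % timeout)
            sleep(interval)
            waited += interval
        return client.clustr(points_path)
    finally:
        # runs on success, on timeout and on any other error
        client.shutdown()
```

Note that this shuts the server down unconditionally, just like the snippet above, so it is not what you want if you are deliberately sharing a long-running instance.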
Which was great, except for the part where I sent the same 1.3MB file across the wire 11 times in order to create all the shapefiles for the image above. EC2 is pretty cheap as far as these things go but sooner or later all that data and traffic is going to add up and Amazon won't hesitate to send you a bill for it. So, now both ws-clustr and py-wsclustr support an equally bare-bones caching layer for the data the client sends to the server. As far as the Python side of things goes, it looks and acts like this:
shpfile1 = wsc.clustr('2009-03-24-geotagged.txt', alpha=0.001, try_cache=1)
shpfile2 = wsc.clustr('2009-03-24-geotagged.txt', alpha=0.01, try_cache=1)
shpfile3 = wsc.clustr('2009-03-24-geotagged.txt', alpha=0.1, try_cache=1)
If the cached version exists on the server then the shapefile will be generated using that without the client having to send all that data again. If the cached version does not exist then the server will return an HTTP 404 error and the client will re-try the request with the data. Caches are stored and referenced with identifiers generated from the contents of the data file. Specifically: the string clustr- followed by the value of md5sum(2009-03-24-geotagged.txt). If you look behind the curtain, what's actually being sent to the server is something like this:
$> curl -H 'x-clustr-alpha:0.01' -H 'x-clustr-cache: clustr-c77cae39a4f7e506a9cc8205176f1239' \
    http://ec2-xxxxxxxx.compute-1.amazonaws.com/ws-clustr/ > ~/path/to/shapefile.tar.gz
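The cache dance described above (ask for the cached copy, fall back to sending the data on a 404) can be sketched in a few lines of Python 3. This is an illustration of the idea rather than the actual py-wsclustr code: cache_key and clustr_with_cache are hypothetical names, the send callable stands in for whatever performs the HTTP POST, and whether the retry should still carry the x-clustr-cache header is a guess on my part.

```python
import hashlib
import urllib.error


def cache_key(points_path):
    """Return 'clustr-' + the hex MD5 digest of the file's contents."""
    md5 = hashlib.md5()
    with open(points_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            md5.update(chunk)
    return "clustr-" + md5.hexdigest()


def clustr_with_cache(points_path, alpha, send):
    """Try the server-side cache first; on HTTP 404 resend with the data.

    `send(headers, body)` is a hypothetical transport callable that does
    the POST and raises urllib.error.HTTPError on an error response.
    """
    headers = {
        "x-clustr-alpha": alpha,
        "x-clustr-cache": cache_key(points_path),
    }
    try:
        # first attempt: no body, just the cache identifier
        return send(headers, None)
    except urllib.error.HTTPError as err:
        if err.code != 404:
            raise
    # cache miss: send the points themselves (keeping the cache header
    # here is an assumption, not something the server is known to need)
    with open(points_path, "rb") as fh:
        return send(headers, fh.read())
```

Hashing the contents rather than the filename means two differently named copies of the same points file share a cache entry, and an edited file never picks up a stale one.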
The Housekeeping Department would like me to remind you that
it is left as an exercise to people running their own
ws-clustr servers to take care of cleaning up their
system's temporary directories, where the cache files are
stored. ws-clustr was built to run on an EC2
instance where it is expected that the server, along with all its
data, will be torn down long before disk space becomes an issue
but since it's just a PHP script there's nothing to prevent it
from being used outside of Amazon's cloud castle. Just something
to keep in mind.
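If you do end up running ws-clustr somewhere long-lived, a cron-able cleanup along these lines might be enough. It is only a sketch under stated assumptions: that the cache files live in the system temporary directory and that their names start with clustr- (the identifier prefix mentioned above). Check where your own install actually writes them before pointing this at anything.

```python
import glob
import os
import tempfile
import time


def prune_cache(directory=None, max_age=86400, prefix="clustr-"):
    """Delete cache files older than max_age seconds; return removed paths.

    The directory and the filename prefix are assumptions about how the
    cache is laid out on disk, so adjust them to match your server.
    """
    directory = directory or tempfile.gettempdir()
    now = time.time()
    removed = []
    for path in glob.glob(os.path.join(directory, prefix + "*")):
        # only touch plain files whose last modification is old enough
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
            os.remove(path)
            removed.append(path)
    return removed
```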
Likewise with caching the output, or supporting something like If-Modified tags, neither of which is done yet, for two reasons. The first is that Clustr is just Really Fast so I'd rather spend my time solving other problems than caching for caching's sake. The second is that there's no (automatic) expectation that the EC2 server running ws-clustr will ever be running long enough to warrant caching shapefiles by their alpha number and the contents of their data. Again, if people start to use the server outside of EC2 then it might be warranted but until then there are problems better solved sooner.
Now that you've sucked down shapefiles in Python it would be useful to do something with them. I like using Zachary Forest Johnson's shpUtils.py library to do the actual parsing (though the ESRI shapefile spec is actually pretty simple if you need to write a specialized one-off). Here is some sample code to parse a shapefile returned by ws-clustr and munge it into a list of Shapely polygon objects. Shapely is useful for doing all sorts of hairy geometry and head-scratchy math but the shorter way to think about it is that it's basically Just Awesome.
The complete code listing is included in the examples directory of the py-wsclustr project on GitHub.
import tarfile
import shpUtils
from shapely.geometry import Polygon

t = tarfile.open(shpfile)
t.extractall()

# Because the tarfile.getnames method always seems to
# return the list of files in random order...
shp = shpfile.replace(".tar.gz", "")
shp = "%s/%s.shp" % (shp, shp)

polys = []

for record in shpUtils.loadShapefile(shp) :
    for part in record['shp_data']['parts'] :
        poly = []
        for pt in part['points'] :
            # skip any point that is missing a coordinate
            if 'x' in pt and 'y' in pt :
                poly.append((pt['x'], pt['y']))
        # one Shapely polygon per shapefile part
        poly = tuple(poly)
        p = Polygon(poly)
        polys.append(p)
Or, if you're like me you'll want to display all those shapes using ModestMaps. Here is the code used to generate the image below, modulo the part where the modestMMarkers package is not public yet. This is code still under active development to display the turkishMMap (remember that?) cluster-y bits but that's not really the point. The point is that there are now a few more nubby bits in the toolbox with which to build things. I happen to have a bit of a map fetish.
import tarfile
import PIL.Image, PIL.ImageChops, PIL.ImageEnhance
import ModestMaps
import shpUtils
import modestMMarkers    # not released yet, see below

alphas = (100, 25, 10, 5, 1, .1, .01, .05, .001, .0005)

swlat = None
swlon = None
nelat = None
nelon = None
shapes = []

for a in alphas :
    # clustr here is a wsclustr client instance (called wsc in the earlier snippets)
    shpfile = clustr.clustr('2009-03-24-geotagged.txt', alpha=a, try_cache=True)
    t = tarfile.open(shpfile)
    t.extractall()
    shp = shpfile.replace(".tar.gz", "")
    shp = "%s/%s.shp" % (shp, shp)
    records = shpUtils.loadShapefile(shp)
    polys = []
    for record in records :
        # this is a bit redundant since it only
        # needs to be calculated once but you get
        # the idea...
        data = record['shp_data']
        # test against None so that a coordinate of exactly 0 still counts as set
        if swlat is None :
            swlat = data['ymin']
        else :
            swlat = min(swlat, data['ymin'])
        if swlon is None :
            swlon = data['xmin']
        else :
            swlon = min(swlon, data['xmin'])
        if nelat is None :
            nelat = data['ymax']
        else :
            nelat = max(nelat, data['ymax'])
        if nelon is None :
            nelon = data['xmax']
        else :
            nelon = max(nelon, data['xmax'])
        for part in record['shp_data']['parts'] :
            poly = []
            for pt in part['points'] :
                if 'x' in pt and 'y' in pt :
                    poly.append({'longitude':pt['x'], 'latitude':pt['y']})
            polys.append(poly)
    shapes.append(polys)

w = 6000
h = 4000

# a base map covering the combined extent of every shapefile
pr = ModestMaps.builtinProviders['BLUE_MARBLE']()
sw = ModestMaps.Geo.Location(swlat, swlon)
ne = ModestMaps.Geo.Location(nelat, nelon)
dims = ModestMaps.Core.Point(w, h)
mm_obj = ModestMaps.mapByExtent(pr, sw, ne, dims)
map_img = mm_obj.draw()

shp_img = PIL.Image.new('RGBA', (w, h), 'white')

# Hey look! This is modestMMarkers.py; it has not been released yet!!
poly = modestMMarkers.polylines.polyline(mm_obj)
for polys in shapes :
    shp_img = poly.draw_polylines(shp_img, polys, color=(0,0,0))

# turn the drawn shapes into a mask and paste the map through it
mask = shp_img.convert('L')
enh = PIL.ImageEnhance.Contrast(mask)
mask = enh.enhance(2.5)
mask = PIL.ImageChops.invert(mask)
cnv = PIL.Image.new('RGBA', (w, h), 'white')
cnv.paste(map_img, (0, 0), mask)
No, really.
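As the comment in that listing admits, the bounding-box bookkeeping is a bit redundant. One possible tidy-up, sketched here in plain Python with no extra dependencies, is to fold it into a helper that takes the records from shpUtils.loadShapefile and returns the combined extent. The function name is made up; the shp_data, xmin, ymin, xmax and ymax keys are the ones used in the listing. It uses min and max directly, so a coordinate of exactly 0 is treated like any other value.

```python
def merge_bounds(records):
    """Return (swlat, swlon, nelat, nelon) covering every record, or None.

    Each record is expected to look like the dictionaries used above:
    record['shp_data'] with 'xmin', 'ymin', 'xmax' and 'ymax' keys.
    """
    boxes = [record['shp_data'] for record in records]
    if not boxes:
        return None
    return (
        min(box['ymin'] for box in boxes),  # south-west latitude
        min(box['xmin'] for box in boxes),  # south-west longitude
        max(box['ymax'] for box in boxes),  # north-east latitude
        max(box['xmax'] for box in boxes),  # north-east longitude
    )
```

Collecting the records for every alpha into one list and calling merge_bounds once would give the same four numbers that get handed to ModestMaps.Geo.Location.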
Like everything else, py-wsclustr is available for anyone to play with on GitHub. At some point in the near future I will make sure that all these packages are also given a home on aaronland.info, filed under Just In Case.
As an aside, I finally made my peace with EC2 and Amazon on the grounds that, at the end of the day, it's just a plain old Unix box with tailored build instructions that can be backed up and re-created like any other server and if you're not already backing up your machines then you've got bigger problems than whether or not Jeff Bezos wants all your base. Compare this to Google's AppEngine which looks really interesting but for some reason requires that you give them your fucking phone number to sign up for a developer's account. It's like a whole new and perverted twist on the honeypot some days...
Meanwhile, come May I will be speaking about Clustr and shapefiles and communities of authority at Where 2.0, in San Jose. In the talk-is-cheap-always-try-to-have-working-code department I had sort of imagined not being able to get the HTTP client libraries for Clustr working so soon; now I'll just have to dream up something new to share with people! If you've been thinking about attending but needed a little more coaxing, the nice folks at O'Reilly have given me a 25% discount code (for the registration fee) to pass along: WHR09FSP.
In July, I am looking forward to returning to Vancouver and speaking at GeoWeb 2009 about the idea of nearby, and history boxes and trying to encourage a more nuanced understanding of place that can be read and traveled like a contour map of meaning. Or something like that. There's a lot of twisty in that one so I am pleased to have the chance to try and give a little more form to the idea. Indeed, there are still long and twisty blog posts about nearby and history boxes and the importance of artifacts and the Papernet to be written, each of which will surely feed the talk.
But not tonight.



