this is aaronland

Mining for Pynchonite

Writing filters for Namazu, in a nutshell

Some time last year I gave up trying to use Movable Type to archive my email. Instead, I've been using MHonArc and Namazu, a full-text indexer with both command-line and web interfaces, plus some fancy-pants templates and bookmarklets to make searching easier and prettier.

As usual, at some point I will get around to making all of that stuff public. But not today. If for no other reason than that it is written to expect that you store your email in nested (YYYY/MM/DD) directories which apparently the rest of the world thinks is weird.

Which is context for saying that you can use Namazu to index just about anything, especially if you can write a custom filter to extract, infer or generate data from an arbitrary file. Somewhere on my TODO list is a proper search interface for this weblog.

Except for the part where what little documentation there is for writing filters generally leaves you asking more questions than when you started.

So, in an effort to give something back to the Interweb here is the quick and dirty guide to writing Namazu filters. This is by no means a complete reference and it may not contain all the information you need but it does contain everything I needed this morning (and am likely to forget by the end of the week).

  1. Filters are written in Perl. Deal with it, or at least learn enough to use the open function.
  2. The stuff that finally gets indexed by Namazu is stored in the $cont_ref variable which is, well, a scalar reference.
  3. You can override the list of default fields that a user can query on by (re)setting $conf::META_TAGS (and using the -M flag when building your index) and storing their values in $fields, which is a hash reference.
  4. Do yourself a favour and define an x-type for whatever you're trying to index and force it, using the -t option, when you are building your index. This is especially useful when you are crunching XML files since they seem to be handled by the HTML widget by default.
  5. If you return anything other than undef from the filter function, Namazu will assume it is an error string, spew it to STDERR and not index the file.

That's pretty much it. An actual filter package looks something like this :

package bucketz;
require 'util.pl';

# A bunch of hooks that at
# least need to be present...

sub status { "yes"; }
sub recursive { 0; }
sub pre_codeconv { 0; }
sub post_codeconv { 0; }
sub add_magic { undef; }

sub mediatype {
        return ("application/xml; x-type=BUCKET");
}

sub filter {
        my $orig_cfile = shift;
        my $cont_ref = shift;
        my $weight_str = shift;
        my $headings = shift;       
        my $fields = shift;

        my $cfile = defined $orig_cfile ? $$orig_cfile : '';

        if (! &icanhas($cfile)){
        	return "FAIL!!!!!!";
        }

        my $extra = join("|", ("bucket", "cheezburger", "meme"));

        if ($conf::META_TAGS !~ /$extra/){
        	# META_TAGS is a "|"-separated list, so keep the separator
        	$conf::META_TAGS .= "|" . $extra;
        }

        # DO STUFF WITH $cfile, $cont_ref and $fields here...
        # DO STUFF WITH the other variables; that's your business

        $$cont_ref = "OH HAI";

        # Anything other than undef is treated as an error string
        return undef;
}

return 1;
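For the "DO STUFF" part, here is a minimal sketch of what a filter body might do, written so that it runs on its own without Namazu. The extract_place function, the <place> element and the 'place' field name are all made up for illustration; the only things borrowed from the filter above are the scalar reference for the content, the hash reference for the fields and the undef-means-success convention.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Namazu: pull <place>...</place>
# elements out of an XML string, store them in the fields hash and
# replace the content with the plain text that is left over.
sub extract_place {
    my ($cont_ref, $fields) = @_;

    my @places = ($$cont_ref =~ m{<place>([^<]*)</place>}g);

    if (! @places) {
        # An error string: the caller should skip this file
        return "no <place> elements found";
    }

    $fields->{'place'} = join(" ", @places);

    # Strip the markup so that only the text gets indexed
    my $text = $$cont_ref;
    $text =~ s/<[^>]+>/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;

    $$cont_ref = $text;

    # undef means success
    return undef;
}

my $content = "<story><place>russia</place><body>The Kremlin flexes</body></story>";
my %fields  = ();

my $err = extract_place(\$content, \%fields);

if (defined $err) {
    print "error: $err\n";
}
else {
    print "place: $fields{'place'}\n";
    print "content: $content\n";
}
```

Inside a real filter you would call something like this with the $cont_ref and $fields that Namazu hands you, and return whatever it returns.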
                    

Once you've installed the filter in a place where Namazu can find it you need to rebuild your index like so :

$> mknmz -M -a -t 'application/xml; x-type=BUCKET' -O /yer/index /yer/filez

By default, Namazu is set up to index plain text and only allows you to define email-style fields: From, Subject, etc. The plain text part isn't going to change but the ability to write your own converter and to also add custom limiting agents (fields) allows you to [insert obligatory Flickr tags hack here].

Or :

$> namazu '+place:russia' ~/news/nytimes/.namazu/
Results:

References:  [ +place:russia: 2 ] 

 Total 2 documents matching your query.

1. The Kremlin Flexes, and a Tycoon Reels (score: 1)
Author: Andrew Kramer
Date: Sun, 08 Jul 2007 10:44:11 -0800
The end of a partnership in the world's largest nickel producer illuminates how
the Kremlin and ambitious Russian businessmen do business together.
http://www.nytimes.com/2007/07/08/business/yourmoney/08nickel.html (21,957 bytes)

2. Youth Groups Created by Kremlin Serve Putin's Cause (score: 1)
Author: Steven Lee myers
Date: Sun, 08 Jul 2007 10:44:10 -0800
A youth movement seeks the ideological cultivation, some say indoctrination, of
the first generation to come of age in post-Soviet Russia.
http://www.nytimes.com/2007/07/08/world/europe/08moscow.html (12,788 bytes)
                    

Now you know.

Net::Flickr::Geo->tease()

You know the thing with the maps and the tiles? And the other thing with the math? And then the part where you have to keep track of who's on first because no one does it the same way and every provider does one map version better than the others?

Well, Mike Migurski loves you :

sub fetch_modest_map_image {
        my $self = shift;
        my $lat  = shift;
        my $lon  = shift;
        my $acc  = shift;

        my $path_composer = $self->{'cfg'}->param("modestmap.composer");
        my $path_python   = $self->{'cfg'}->param("modestmap.python");

        my $provider = $self->{'cfg'}->param("modestmap.provider");

        $provider =~ /^([^_]+)_/;
        my $short = lc($1);

        # Reuse $acc rather than re-declaring it with a second "my"
        $acc = $self->mk_flickr_accuracy($short, $acc);
        my $out = $self->mk_tempfile(".png");

        my $h = $self->pinwin_map_dimensions("height");
        my $w = $self->pinwin_map_dimensions("width");

        my $cmd = "$path_python $path_composer -d $h $w -p $provider -c $lat $lon $acc -o $out";

        # system() reports failure through its return value (and $?);
        # $! only means something if the command could not be started
        if ((system($cmd) != 0) || (! -f $out)){
                $self->log()->error("Failed to create modest map, exit status $?");
                return undef;
        }

        return $out;
}
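One variation worth considering, sketched here under a few assumptions: build the command as a list and hand it to system() directly, so that nothing passes through a shell and nobody has to worry about quoting paths with spaces in them. The mk_compose_cmd helper, the file paths and the YAHOO_ROAD provider name are all placeholders for illustration; the -d, -p, -c and -o flags are just the ones used in the command string above.

```perl
use strict;
use warnings;

# Hypothetical helper: build the compose.py invocation as a list so
# that it can be passed to system() without going through a shell.
sub mk_compose_cmd {
    my %args = @_;

    return ($args{'python'}, $args{'composer'},
            '-d', $args{'height'}, $args{'width'},
            '-p', $args{'provider'},
            '-c', $args{'lat'}, $args{'lon'}, $args{'zoom'},
            '-o', $args{'out'});
}

# All of these values are placeholders
my @cmd = mk_compose_cmd(
    'python'   => '/usr/bin/python',
    'composer' => '/path/to/compose.py',
    'height'   => 1024,
    'width'    => 1024,
    'provider' => 'YAHOO_ROAD',
    'lat'      => '37.7749',
    'lon'      => '-122.4194',
    'zoom'     => 14,
    'out'      => '/tmp/map.png',
);

print join(" ", @cmd), "\n";

# To actually run it:
# system(@cmd) == 0 or warn "compose.py failed, exit status $?";
```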
                    

Ideally, I would like to write a ws-compose.py endpoint, to be run locally or on a remote server, so that users can simply point to something starting in http:// in their config file. There's also the part where I haven't actually tested placing thumbnails for any provider except Yahoo!

I am going to read the paper, now, but I will try to push something out later today...

Net::Flickr::Geo.pm 0.3

FIREBAGEL!!!!!!

There's lots to write about the trip to Europe but that will have to wait for a lazy morning, with more coffee. In the meantime, I've written Perl bindings for the FireEagle location service, recently announced at the HackDay UK event.

FireEagle is not a magic bullet but it's an especially good toe-hold in a world where any sort of automagic tracking of your location is either a pain in the ass or just plain creepy. Usually both. Speaking of which :

Does FireEagle keep track of my whereabouts?

Fire Eagle only remembers your latest location updates, and does not keep your location history. If you allow other applications to read your location, these applications may build a history of your locations. We have no control over that - and recommend that you choose your trusted applications wisely. If you are unsure about applications you sign up for, you can always revoke their permission to access your account.

That's a bit of an artful dodge but it's also the only way to do it, in the short-term, given the quicksand surrounding privacy and location.

So, that's it. You tell FireEagle where you are, to whatever level of granularity you are comfortable with. And then you can ask FireEagle where you most recently were. And more importantly, third-party applications can ask where you were on your behalf at a more or less precise level of granularity.

Then, you know, you do stuff with that information. Here's some example code that uses my stored location to generate (machine) tags via the handy Geonames database :

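use strict;
use warnings;

use Getopt::Std;
use Readonly;
use URI;
use HTTP::Request;
use Net::FireEagle;

Readonly::Scalar my $GEONAMES_API_SCHEME => "http";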
Readonly::Scalar my $GEONAMES_API_HOST   => "ws.geonames.org";
Readonly::Scalar my $GEONAMES_API_SEARCH => "/search";

sub main {
        my %opts = ();
        getopts('c:', \%opts);
        
        my $fe  = Net::FireEagle->new($opts{'c'});
        $fe->update_location("San Francisco CA");
        my $res = $fe->query_location();
        
        my $locality = $res->findvalue("/ResultSet/Result/city");
        my $region   = $res->findvalue("/ResultSet/Result/state");
        my $co       = $res->findvalue("/ResultSet/Result/countrycode");

        my $uri = URI->new();
        $uri->scheme($GEONAMES_API_SCHEME);
        $uri->host($GEONAMES_API_HOST);
        $uri->path($GEONAMES_API_SEARCH);

        $uri->query_form(style => "full",
                         country => $co,
                         fcode => 'PPL',
                         maxRows => 1,
                         name => "$locality, $region");
        
        my $req = HTTP::Request->new(GET => $uri->as_string());
        # $res already holds the FireEagle response, so use a new name
        my $geo_res = $fe->request($req);

        my $xml = $fe->parse_response($geo_res);
        my $geocode = $xml->findvalue("/geonames/geoname/geonameId");

        my @tags = ("geonames:locality=$geocode",
                    "geo:locality=\"$locality\"",
                    "geo:region=\"$region\"",
                    "geo:country=$co",
                   );

        return \@tags;
}
                    

Which returns :

$VAR1 = [
          'geonames:locality=5391959',
          'geo:locality="San Francisco"',
          'geo:region="California"',
          'geo:country=US'
        ];
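If you end up generating a lot of these, the namespace:predicate=value pattern is easy enough to wrap in a function. What follows is a rough sketch and not part of Net::FireEagle: mk_machine_tag is a made-up name, and unlike the example above it only quotes a value when the value contains whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: build a namespace:predicate=value machine tag,
# wrapping the value in double quotes when it contains whitespace.
sub mk_machine_tag {
    my ($ns, $pred, $value) = @_;

    if ($value =~ /\s/) {
        $value = qq{"$value"};
    }

    return "$ns:$pred=$value";
}

my @tags = (
    mk_machine_tag('geonames', 'locality', '5391959'),
    mk_machine_tag('geo', 'locality', 'San Francisco'),
    mk_machine_tag('geo', 'country', 'US'),
);

print join("\n", @tags), "\n";
```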
                    

Ladies and gentlemen, Net::FireEagle.pm

Net::Flickr::Geo

For as long as we've had the ability to send (Flickr) photos to printing services I've wanted to make my own maps.

Which is tricky because there are almost no non-web interfaces to do the sort of API magic you can do with any of the big name (online) mapping services. Those that do exist all use the same tiles, namely low-resolution versions designed for the web. There's also the aggressively boring part (and despite any appearance to the contrary, I am pretty lazy) of having to keep up with whatever magic is necessary to stitch a bunch of tiles into a single background image.

Mike tells me that ModestMaps has just this sort of compositing magic built-in so I look forward to finding the time to be proven wrong. Beyond that, there is the Yahoo! Map Image API which is so close to being what I want (read: stupid-dumb easy) and yet so far away (read : no markers, maps that don't lend themselves to printing).

When we got back from Europe last month, I finally sat down to make something using the Yahoo! APIs even if it's the sort of thing that I will look back and laugh at in a couple of years.

This is what I got :

use Getopt::Std;
use Config::Simple;
use Net::Flickr::Geo;

my %opts = ();
getopts('c:', \%opts);

my $cfg  = Config::Simple->new($opts{'c'});
my $fl   = Net::Flickr::Geo->new($cfg);

my @maps = $fl->mk_pinwin_maps_for_set('72157600321286227', 'upload'); 
                    

This is how it works :

  1. For every geotagged photo in a set :
  2. Fetch the thumbnail and place it on a blank pinwin.
  3. Fetch a map, using the Yahoo! Map Image API, corresponding to the photo's latitude and longitude and accuracy.
  4. Place the newly created pinwin over the default map marker.
  5. Upload the photo to Flickr, adding it to the same set as the original, updating the set's ordering such that the map always appears before the photo.
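The re-ordering in the last step is the only part that involves any real bookkeeping, so here is a stripped-down sketch of it. This is not the code in Net::Flickr::Geo; order_maps_before_photos and the photo IDs are invented for illustration, and it assumes you have already recorded which map image belongs to which photo.

```perl
use strict;
use warnings;

# Hypothetical helper: given the current ordering of photo IDs in a set
# and a hash mapping each geotagged photo ID to the ID of its map image,
# return a new ordering where every map sits directly before its photo.
sub order_maps_before_photos {
    my ($photo_ids, $map_for) = @_;

    my @ordered = ();

    foreach my $id (@$photo_ids) {
        # Photos without a map (not geotagged) are left where they are
        push @ordered, $map_for->{$id} if exists $map_for->{$id};
        push @ordered, $id;
    }

    return \@ordered;
}

my $ordered = order_maps_before_photos(
    ['p1', 'p2', 'p3'],
    {'p1' => 'm1', 'p3' => 'm3'},
);

print join(",", @$ordered), "\n";
```

The resulting list is the sort of thing you would then join with commas and send back to Flickr as the new ordering for the set.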

In the end, your set looks something like this :

Which isn't that exciting except that it is enough to send off to QOOP for printing in a book (modulo boring details like a blank first image to ensure that maps are always printed on even-numbered pages).

Which is kind of exciting because the maps help put the photo in context and make for better story-telling.

Barely related, at all, I've been reading Against the Day which has its ups and downs but also includes this lovely passage : ...pausing to gaze at ruined frescoes as if they were maps in which the parts worn away by time were the oceans...

I guess the next step is to add the ability to place arbitrary markers on a map image. Then you could also query Flickr for other photos (yours or your contacts' or everyone's) taken within the bounding box of the map and some limited window of time and see from a distance what else was happening when you took a photo.
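As a sketch of what that query might look like: the flickr.photos.search API method accepts a bbox argument (minimum longitude, minimum latitude, maximum longitude, maximum latitude) along with min_taken_date and max_taken_date. The mk_nearby_search_args function, the coordinates and the one-hour window below are all invented for illustration; it only builds the arguments and leaves the actual API call to whatever Flickr client you prefer.

```perl
use strict;
use warnings;

# Hypothetical helper: build arguments for flickr.photos.search from
# a bounding box, the time a photo was taken (a Unix timestamp) and
# a window, in seconds, to search on either side of that time.
sub mk_nearby_search_args {
    my ($bbox, $taken, $window) = @_;

    return {
        # min_lon,min_lat,max_lon,max_lat
        'bbox'           => join(",", @$bbox),
        'min_taken_date' => $taken - $window,
        'max_taken_date' => $taken + $window,
    };
}

# Placeholder values: a box around San Francisco, an hour either way.
# The coordinates are strings so that they are passed along untouched.
my $args = mk_nearby_search_args(
    ['-122.52', '37.70', '-122.35', '37.83'],
    1183900000,
    3600,
);

foreach my $key (sort keys %$args) {
    print "$key: $args->{$key}\n";
}
```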

Or begin to stitch together the map images to create maps of arbitrary size (or, more specifically, big enough to fit any collection of photos at a given zoom level) to have printed as posters. By which I mean : big sheets of paper which you can fold into proper guides (even if the paper these things are printed on still doesn't lend itself to that sort of thing...)

I'd also like to try all of this with other map providers, specifically the OpenStreetMap stuff, where possible. At a time when people are complaining about the lack of detail on maps I find myself wanting simpler, and more stylized, renderings; something to act as background music, or a soundtrack, to a photo rather than smothering it with an encyclopedic knowledge of urban minutiae.

So, anyway, there you go : Net::Flickr::Geo.pm

If it's not already on the CPAN by the time you read this, it will be soon.

Meanwhile, back at the Ranch

The lovely Dan Catt and I will be in London this week, for Hack Day, to bring you the machine tag love. I will also be around for the London 24 Hours of Flickr event as well as those in Paris and Montréal. Come say hello!