Subject: Glossaries - XPath, SAX and benchmarks
Date: Sun, 8 Sep 2002 15:35:43 -0400 (EDT)
From: Aaron Straup Cope
To: Karl Dubost
Cc: Steph
So, I sat down and did some tests this morning per our conversation
about glossaries and XBEL and XPath.
It's a bit depressing given the nature of the XPath query you need to pull
stuff out of an XBEL document:
"/xbel//bookmark[title=\"$keyword\"]/\@href"
Since the <bookmark> element can be either next to the root <xbel> element
or contained in an arbitrary number of nested <folder> elements, there
isn't much to do except sniff around every node until you find what
you're looking for.
Which takes a long time. Longer than you'd normally want anyway...
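A toy sketch of why the "//" step is expensive, assuming a hand-built Perl data structure in place of a parsed XBEL file. The tree, the example.com links and the find_href name are illustrative only, and this is not how XML::XPath is implemented internally; it only shows that a bookmark nested at an unknown depth forces a walk over every node that comes before it:

```perl
use strict;
use warnings;

# Toy stand-in for a parsed XBEL document: folders may nest to any depth.
# The titles and links are made up for the example.
my $xbel = {
    name     => 'xbel',
    children => [
        { name => 'bookmark', title => 'aaronland', href => 'http://example.com/a' },
        { name => 'folder', children => [
            { name => 'folder', children => [
                { name => 'bookmark', title => 'Schematron', href => 'http://example.com/s' },
            ] },
        ] },
    ],
};

my $visited = 0;

# Depth-first search for a bookmark by title, counting every node touched.
sub find_href {
    my ($node, $title) = @_;
    $visited++;
    if ($node->{name} eq 'bookmark' && $node->{title} eq $title) {
        return $node->{href};
    }
    foreach my $child (@{ $node->{children} || [] }) {
        my $href = find_href($child, $title);
        return $href if defined $href;
    }
    return undef;
}

my $href = find_href($xbel, 'Schematron');
print "$href after visiting $visited nodes\n";
```

The deeper and later the bookmark sits in the file, the more nodes are touched before it is found, and a title that is not present at all costs a full traversal.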
On the other hand, if you just use a plain old SAX widget to find the
keyword, it takes roughly 1/4 to 1/5 of the time to do a lookup.
Below are benchmarks for 100 iterations of a subroutine that does 5
keyword lookups against an XBEL file.
Note that the XPath query doesn't even instantiate a new object; the same
object is shared across all 500 calls to 'find'. The SAX query, on the
other hand, instantiates a new filter and a new parser for each lookup.
Obviously, some clever caching of lookups would speed things up as well.
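A minimal sketch of the caching idea, assuming a hypothetical slow_lookup sub that stands in for either the XPath or the SAX query. The %cache hash, the cached_lookup name and the example.com link are illustrative and are not part of the scripts in this message:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real XPath or SAX lookup. It counts its
# calls so the effect of the cache is visible.
my $calls = 0;
my %fake_xbel = ('aaronland' => 'http://example.com/aaronland');

sub slow_lookup {
    my $keyword = shift;
    $calls++;
    return $fake_xbel{$keyword};
}

my %cache;

# Return the link for a keyword, running the slow lookup only the first
# time a given keyword is seen. Misses (undef) are cached as well.
sub cached_lookup {
    my $keyword = shift;
    $cache{$keyword} = slow_lookup($keyword) unless exists $cache{$keyword};
    return $cache{$keyword};
}

cached_lookup('aaronland') for (1 .. 100);
print "slow lookups performed: $calls\n";
```

A cache like this would have to be thrown away whenever the XBEL file changes, for instance by comparing the file's modification time before trusting a cached entry.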
****
101 ->./debug.xbel
Benchmark: timing 100 iterations of xpathquery...
xpathquery: 765 wallclock secs (645.73 usr + 13.66 sys = 659.38 CPU) @ 0.15/s (n=100)
101 ->./debug.xbel
Benchmark: timing 100 iterations of saxquery_pureperl...
saxquery_pureperl: 171 wallclock secs (148.23 usr + 0.62 sys = 148.86 CPU) @ 0.67/s (n=100)
102 ->./debug.xbel
Benchmark: timing 100 iterations of saxquery_expat...
saxquery_expat: 171 wallclock secs (148.17 usr + 0.20 sys = 148.38 CPU) @ 0.67/s (n=100)
****
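package Foo;
use strict;
use base qw (XML::SAX::Base);

# Set the bookmark title to look for.
sub keyword {
    my $self = shift;
    $self->{'__keyword'} = $_[0];
}

# Return the href of the matching bookmark, or undef if nothing matched.
# (Without the __match check this would return the href of whichever
# bookmark was seen last, even when no title matched.)
sub link {
    my $self = shift;
    return $self->{'__match'} ? $self->{'__link'} : undef;
}

sub start_element {
    my $self = shift;
    my $data = shift;

    return if ($self->{'__match'});

    if ($data->{Name} eq "bookmark") {
        # Remember the href until we know whether the title matches.
        $self->{'__bookmark'} = 1;
        $self->{'__link'}     = $data->{Attributes}->{'{}href'}->{Value};
    }

    return if (! $self->{'__bookmark'});

    if ($data->{Name} eq "title") {
        $self->{'__title'} = 1;
        $self->{'__text'}  = "";
    }
}

sub end_element {
    my $self = shift;
    my $data = shift;

    return if ($self->{'__match'});

    if ($data->{Name} eq "title") {
        # Compare only once the whole title has been collected.
        if ($self->{'__title'} && ($self->{'__text'} eq $self->{'__keyword'})) {
            $self->{'__match'} = 1;
            return;
        }
        $self->{'__title'} = 0;
    }

    if ($data->{Name} eq "bookmark") {
        $self->{'__bookmark'} = 0;
    }
}

sub characters {
    my $self = shift;
    my $data = shift;

    return if ($self->{'__match'});
    return if (! $self->{'__bookmark'});
    return if (! $self->{'__title'});

    # A parser may deliver one text node in several chunks, so
    # accumulate the text instead of comparing each chunk.
    $self->{'__text'} .= $data->{Data};
}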
package main;

my $file = "/usr/home/asc/aaronland.net/asc/webdev.xbel";

use XML::SAX::ParserFactory;
# Pick the parser the factory hands back (Expat for this run).
$XML::SAX::ParserPackage = "XML::SAX::Expat";

use Benchmark;
my $count = 100;

my @keywords = (
    'FilterProxy Home Page',
    "REX XML Shallow Parsing with Regular Expressions",
    "aaronland",
    "Schematron - XML Validation Language",
    ">RE ActivePerl mod_perl ppd available",
);

timethese($count, {
    saxquery_expat => sub {
        foreach my $kw (@keywords) {
            # A new filter and a new parser for every lookup.
            my $filter = Foo->new();
            $filter->keyword($kw);
            my $parser = XML::SAX::ParserFactory->parser(Handler=>$filter);
            $parser->parse_uri($file);
        }
    },
});
****
use XML::XPath;
use Benchmark;

my $file  = "/usr/home/asc/aaronland.net/asc/webdev.xbel";
my $count = 100;

# One XML::XPath object, shared by every call to find().
my $xbel = XML::XPath->new(filename=>$file);

my @keywords = (
    'FilterProxy Home Page',
    "REX XML Shallow Parsing with Regular Expressions",
    "aaronland",
    "Schematron - XML Validation Language",
    ">RE ActivePerl mod_perl ppd available",
);

timethese($count, {
    xpathquery => sub {
        foreach my $title (@keywords) {
            my $query = "/xbel//bookmark[title=\"$title\"]/\@href";
            my $r = $xbel->find($query);
        }
    },
});