XML::Generator::NYTimes - generate DocBook SAX2 events for one or more New York Times articles.
    use IO::AtomicFile;
    use XML::SAX::Writer;
    use XML::Generator::NYTimes;

    my $output  = IO::AtomicFile->open("/tmp/times.dbk","w");
    my $writer  = XML::SAX::Writer->new(Output=>$output);
    my $handler = XML::Generator::NYTimes->new(Handler=>$writer);

    $handler->set_cookies_file("mozilla","/path/to/cookies");
    $handler->fetch();

    $output->close();
Generate DocBook SAX2 events for one or more New York Times articles.
There is no deep magic happening here. Just some good old-fashioned scraping, which means it may break at a later date.
This package generates DocBook SAX2 events with either a book or an article root element.
Book widgets contain the following elements :
    + book
      + bookinfo
        - title
        - pubdate
      + part (+)
        + partinfo
          - title
        + xref (+)
          @linkend
        + article (+)
Article widgets contain the following elements :
    + article @id
      + articleinfo
        - title
        + keywordset (or not)
          - keyword
        + indexterm (or not) @type
          - primary
        + authorgroup
          - corpauthor
          OR
          + author
            - surname
            - firstname
            - othername (or not)
        + abstract
          - para
        - pubdate
        + copyright
          - year
          - holder
      - para (+)
      + mediaobject (+, or not) @id
        + imageobject
          - imagedata @fileref
new()
This package inherits from XML::SAX::Base. Please consult the docs for details.
Returns an object. Woot!
set_cookies()
This function has not been implemented yet.
set_cookies_file($browser,$path_to_cookies)
Load a file containing your New York Times cookies.
$browser must name a valid subclass of HTTP::Cookies, for example "mozilla" as in the synopsis.
set_log_dispatcher($obj)
Set one or more Log::Dispatch thingies for reporting errors and general verbosity.
set_callback($name,\&callback)
Valid callbacks are identified by name. The following example registers one for "image" :
    use File::Path;
    use File::Spec;

    my $nyt = XML::Generator::NYTimes->new(...);
    $nyt->set_callback("image",\&img_callback);

    sub img_callback {
        # $res is an HTTP::Response object
        my $res = shift;
        my $src = $res->request()->url();

        # Pull the date and file name out of the image URL
        $src =~ m!(\d{4})/(\d{2})/(\d{2})/(.*)$! || return undef;
        my ($yyyy,$mm,$dd,$fname) = ($1,$2,$3,$4);

        my $img_root = File::Spec->catdir("/path/to/nyt",$yyyy,$mm,$dd);
        my $img_name = sprintf("%04d%02d%02d-%s",$yyyy,$mm,$dd,$fname);
        $img_name =~ s/\//-/g;

        my $path = File::Spec->catfile($img_root,$img_name);

        if (! -d $img_root) {
            mkpath([$img_root],1,0755);
        }

        # Parenthesize open() so that || applies to its return value
        open(FH, ">$path") || return undef;
        binmode(FH);
        print FH $res->content();
        close FH;

        return $path;
    }
If the callback returns a value, the following SAX2 events will be generated :
    + mediaobject @id (the URI of the image)
      + imageobject
        - imagedata @fileref (the return value of the callback)
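The path-building step of the image callback can be pulled out into a small function and checked without touching the network. This is only a sketch: image_path() is a hypothetical helper, not part of XML::Generator::NYTimes, and the example.com URL is a made-up stand-in for a real image URL.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of the module): map an image URL onto
# a (directory, file path) pair under $root. Returns an empty list if
# the URL does not contain a yyyy/mm/dd component.
sub image_path {
    my ($root, $src) = @_;

    my ($yyyy, $mm, $dd, $fname) = $src =~ m!(\d{4})/(\d{2})/(\d{2})/(.*)$!
        or return;

    # Flatten any remaining sub-directories into the file name
    (my $img_name = "$yyyy$mm$dd-$fname") =~ s!/!-!g;

    my $img_root = File::Spec->catdir($root, $yyyy, $mm, $dd);
    return ($img_root, File::Spec->catfile($img_root, $img_name));
}

# Made-up URL, for illustration only
my ($dir, $path) = image_path(
    "/path/to/nyt",
    "http://example.com/images/2004/09/19/arts/photo.jpg",
);
print "$path\n";
```

A callback could call such a helper, create the directory if needed, write the response content to the returned path, and return that path so that it ends up in @fileref.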
fetch($date)
String.
Dates should be passed as "yyyy/mm/dd". If no date is specified, the current date (as returned by localtime) is used.
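If you need to build the date string yourself, something along these lines should do. It is a minimal sketch using the core POSIX module; nyt_date() is a hypothetical helper and not part of this package.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper (not part of XML::Generator::NYTimes): format
# localtime-style fields as "yyyy/mm/dd". With no arguments, the
# current local date is used.
sub nyt_date {
    my @t = @_ ? @_ : localtime();
    return strftime("%Y/%m/%d", @t);
}

# September 19, 2004: the month is zero-based and the year is an
# offset from 1900, as with localtime.
my $date = nyt_date(0, 0, 12, 19, 8, 104);
print "$date\n";    # 2004/09/19
```

The result can then be handed to fetch() or fetch_section(), for example $handler->fetch(nyt_date(localtime(time - 86400))) for roughly yesterday's paper, assuming $handler was created as in the synopsis.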
fetch_section($section,$date)
String, string.
Generate a DocBook book document containing (1) part and (n) articles for $section.
Dates should be passed as "yyyy/mm/dd". If no date is specified, the current date (as returned by localtime) is used.
fetch_article($url)
String.
Generate a DocBook article document for $url.
sections()
Returns an array ref of sections listed by the New York Times.
This software is meant for personal use only, in accordance with the New York Times terms of usage :
"You may download or copy the Content and other downloadable items displayed on the Service for personal use only, provided that you maintain all copyright and other notices contained therein. Copying or storing of any Content for other than personal use is expressly prohibited without prior written permission from The New York Times Rights and Permissions Department, or the copyright holder identified in the copyright notice contained in the Content."
http://www.nytimes.com/ref/membercenter/help/agree.html#sect2
Play nicely.
1.0
$Date: 2004/09/19 15:59:00 $
Aaron Straup Cope
One day a friend wrote :
"You know, what I would really like in a bot, though I'm sure there are reasons this doesn't exist. is a bot that you can send a NYTimes URL to and it will spit back [or email you? does this break some rule?] the text of the article, no banners, no headlines, no cloying logins etc. I have been dealing with a particularly zealous fan who has been trying to tell me that linking to NYT articles is Bad Form since not everyone has a login, even though I do offer my own login to them ... So, I'm thinking I should learn perl in order to be able to do this."
And that was all it took.
http://aaronland.info/weblog/archive/2868
http://aaronland.info/weblog/archive/3266
http://aaronland.info/weblog/archive/4124
Copyright (c) 2002-2004, Aaron Straup Cope. All Rights Reserved.
This is free software; you may use it and distribute it under the same terms as Perl itself.