NAME
SYNOPSIS
DESCRIPTION
DOCUMENT STRUCTURE
PACKAGE METHODS

$pkg = __PACKAGE__->new()

OBJECT METHODS

$pkg->set_cookies()
$pkg->set_cookies_file($browser,$path_to_cookies)
$pkg->set_log_dispatcher($obj)
$obj->set_callback($name,\&callback)
$pkg->fetch($date)
$pkg->fetch_section($section,$date)
$pkg->fetch_article($url)
$pkg->sections()

TERMS OF USAGE
VERSION
DATE
AUTHOR
BACKGROUND
SEE ALSO
TO DO
BUGS
LICENSE

NAME

XML::Generator::NYTimes - generate DocBook SAX2 events for one or more New York Times articles.

SYNOPSIS

 use IO::AtomicFile;
 use XML::SAX::Writer;
 use XML::Generator::NYTimes;

 my $output  = IO::AtomicFile->open("/tmp/times.dbk","w");
 my $writer  = XML::SAX::Writer->new(Output=>$output);
 my $handler = XML::Generator::NYTimes->new(Handler=>$writer);

 $handler->set_cookies_file("mozilla","/path/to/cookies");
 $handler->fetch();

 $output->close();

DESCRIPTION

Generate DocBook SAX2 events for one or more New York Times articles.

There is no deep magic happening here. Just some good old-fashioned scraping, which means it may break whenever the New York Times changes its markup.

DOCUMENT STRUCTURE

This package generates DocBook SAX2 events with either a book or an article root element.

Book widgets contain the following elements :

 + book
   + bookinfo
     - title
     - pubdate
   + part (+)
     + partinfo
       - title
     + xref (+)
       @linkend
     + article (+)

Article widgets contain the following elements :

 + article
   @id
   + articleinfo
     - title
     + keywordset (or not)
       - keyword
     + indexterm (or not)
       @ type
       - primary
     + authorgroup
       - corpauthor
       OR
       + author
        - surname
        - firstname
        - othername (or not)
     + abstract
       - para
     - pubdate
     + copyright
       - year
       - holder
   - para (+)
   + mediaobject (+, or not)
     @id
     + imageobject
       - imagedata
         @fileref

PACKAGE METHODS

$pkg = __PACKAGE__->new()

This package inherits from XML::SAX::Base. Please consult its documentation for details.

Returns an object. Woot!

OBJECT METHODS

$pkg->set_cookies()

This function has not been implemented yet.

$pkg->set_cookies_file($browser,$path_to_cookies)

Load a file containing your New York Times cookies.

$browser must name a valid sub-class of HTTP::Cookies (the SYNOPSIS passes "mozilla").

$pkg->set_log_dispatcher($obj)

Set one or more Log::Dispatch thingies for reporting errors and general verbosity.

$obj->set_callback($name,\&callback)

Valid callbacks are

image

 use File::Path qw(mkpath);
 use File::Spec;

 my $nyt = XML::Generator::NYTimes->new(...);
 $nyt->set_callback("image",\&img_callback);

 sub img_callback {

    # $res is a HTTP::Response object

    my $res = shift;
    my $src = $res->request()->uri();

    # Skip images whose URL has no yyyy/mm/dd path component
    $src =~ m!(\d{4})/(\d{2})/(\d{2})/(.*)$!
        or return undef;
    my ($yyyy,$mm,$dd,$fname) = ($1,$2,$3,$4);

    my $img_root = File::Spec->catdir("/path/to/nyt",
                                      $yyyy,$mm,$dd);

    my $img_name = sprintf("%04d%02d%02d-%s",$yyyy,$mm,$dd,$fname);
    $img_name =~ s/\//-/g;

    my $path = File::Spec->catfile($img_root,$img_name);

    if (! -d $img_root) {
        mkpath([$img_root],1,0755);
    }

    # Use "or" rather than "||" so that a failed open really returns
    open(my $fh, ">", $path)
        or return undef;

    # Image data is binary
    binmode($fh);
    print $fh $res->content();
    close($fh);

    return $path;
 }

If the callback returns a value, the following SAX2 events will be generated :

 + mediaobject
   @id                    (the URI of the image)
   + imageobject
     - imagedata
       @fileref           (the return value of the callback)

$pkg->fetch($date)

String.

Dates should be passed as "yyyy/mm/dd". If no date is specified then the date returned by localtime is assumed.
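If you are building the date string yourself, something like the following may help. It is only a sketch: nyt_date and looks_like_nyt_date are hypothetical helpers, not part of this package, and they only check the shape of the string, not whether the New York Times has anything published for that day.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper (not part of XML::Generator::NYTimes):
# format an epoch time as "yyyy/mm/dd" using localtime.
# Defaults to the current time when called without arguments.
sub nyt_date {
    my $epoch = @_ ? shift : time();
    return strftime("%Y/%m/%d", localtime($epoch));
}

# Hypothetical helper: true if $date has the "yyyy/mm/dd" shape.
# This is a format check only; it does not validate the calendar date.
sub looks_like_nyt_date {
    my $date = shift;
    return (defined $date && $date =~ m!^\d{4}/\d{2}/\d{2}$!) ? 1 : 0;
}
```

Yesterday's edition could then be requested with $pkg->fetch(nyt_date(time() - 86400)), assuming that edition is still available.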

$pkg->fetch_section($section,$date)

String, string.

Generate a DocBook book document containing one part and one or more articles for $section.

Dates should be passed as "yyyy/mm/dd". If no date is specified then the date returned by localtime is assumed.

$pkg->fetch_article($url)

String.

Generate a DocBook article document for $url.

$pkg->sections()

Returns an array ref of sections listed by the New York Times.
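One possible use for this list is to check a section name before handing it to fetch_section. The sketch below assumes the array ref holds plain strings that can be compared with eq, which these docs do not spell out; has_section is a hypothetical helper, not part of this package.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Generator::NYTimes):
# return 1 if $name appears in the array ref returned by
# $nyt->sections(), 0 otherwise. Assumes the entries are plain
# strings; the comparison is exact and case-sensitive.
sub has_section {
    my ($nyt, $name) = @_;
    my $sections = $nyt->sections() || [];
    return scalar(grep { $_ eq $name } @$sections) ? 1 : 0;
}
```

With that in place, a call such as $pkg->fetch_section($name) can be guarded by has_section($pkg, $name).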

TERMS OF USAGE

This software is meant for personal use only, in accordance with the New York Times terms of usage :

 "You may download or copy the Content and other downloadable items displayed on the 
  Service for personal use only, provided that you maintain all copyright and other 
  notices contained therein. Copying or storing of any Content for other than personal 
  use is expressly prohibited without prior written permission from The New York Times 
  Rights and Permissions Department, or the copyright holder identified in the copyright 
  notice contained in the Content."

  http://www.nytimes.com/ref/membercenter/help/agree.html#sect2

Play nicely.

VERSION

1.0

DATE

$Date: 2004/09/19 15:59:00 $

AUTHOR

Aaron Straup Cope

BACKGROUND

One day a friend wrote :

 "You know, what I would really like in a bot, though I'm sure there are
  reasons this doesn't exist. is a bot that you can send a NYTimes URL
  to and it will spit back [or email you? does this break some rule?] the
  text of the article, no banners, no headlines, no cloying logins etc.
  I have been dealing with a particularly zealous fan who has been trying
  to tell me that linking to NYT articles is Bad Form since not everyone
  has a login, even though I do offer my own login to them ... So, I'm
  thinking I should learn perl in order to be able to do this."

And that was all it took.

TO DO

set_cookies()

BUGS

LICENSE

This is free software; you may use it and distribute it under the same terms as Perl itself.