
A rose is a rose, except when you call it XHTML

For a variety of reasons, I've started baking all the web pages for a few sites. Any bells and whistles are just that; all the really important stuff is right there in the source file on the assumption that sooner or later the magic server tools will bork out. This means that things like copyright notices need to be updated by 'hand'. I knew this when I started and today I sat down to write a little tool to do the heavy lifting for me. This is, after all, the magic lala-land we call XML/XHTML, right? I am generally a SAX-guy but decided this was a good opportunity to introduce myself to the DOM-ishness of XML::LibXML. You know, something simple like :



use XML::LibXML;

my $parser = XML::LibXML->new();
my $doc = $parser->parse_html_file("/path/to/file");
my ($span) = $doc->findnodes("/xpath/to/span/with/curyear");

# I happen to know that the first child
# is a text node containing the year
my $oldyear = $span->firstChild();
$oldyear->replaceNode(XML::LibXML::Text->new("2003"));

# Write changes to disk



Simple, right? Keen observers will have already noticed that I had to read in the document using the parse_html_file method. If your root element says "html", regardless of whether or not the DOCTYPE says XHTML, it is seemingly impossible to parse a document using the standard parse_* methods. Which means that by the time you get around to writing your DOM to disk you can't do things like...drumroll...include an XML declaration. And if you decide to simply write the declaration out by hand, don't bother trying to call any of the encoding methods. Maybe there is some deeper magic I have yet to learn, but all I was able to do was make the Perl interpreter dump core.

Still with me? Okay, so we're going to write the declaration by hand and just assume we know what we're dealing with when it comes to encodings. Now remember the bit about being in an HTML context? Logically, the thing to do is follow the docs, call the toStringHTML method and hope for the best. The best in this case is a terrible journey back to 1998, because all the singletons (the self-closed empty elements) in the well-formed documents you've laboured over are suddenly left open and dangling.

And don't bother throwing caution to the wind and just calling toString, not if you use the clever C-style comments hack to hide the <![CDATA[]]> blocks so that your XHTML can validate and still be understood by a web browser. Who the fuck knows why, but libxml will turn it into this:



<![CDATA[
 /* <![CDATA[ */
 @import url(some.css)
 /* ]]> */
]]>



Oh yeah, and it will include an XML declaration for you. So long as you don't mind that it doesn't specify the encoding. Ugh.
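
For what it's worth, here is a minimal sketch of the "write the declaration by hand" route described above. It is an illustration under assumptions, not a tested fix: the sample markup, the id="curyear" attribute and the hard-coded UTF-8 encoding are all made up for the example, and it serializes only the root element with the node-level toString rather than the whole document. How the result treats empty elements and CDATA blocks may vary between libxml2 builds, so check the output before trusting it.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for a baked page on disk.
my $html = <<'EOT';
<html><head><title>example</title></head>
<body><p>Copyright <span id="curyear">2002</span></p></body></html>
EOT

my $parser = XML::LibXML->new();
my $doc    = $parser->parse_html_string($html);

# The id attribute is an assumption made for this example.
my ($span) = $doc->findnodes('//span[@id="curyear"]');
die "no curyear span found\n" unless $span;

# Swap the text node holding the old year for a new one.
$span->firstChild()->replaceNode(XML::LibXML::Text->new("2003"));

# Write the declaration by hand and simply assume the encoding is UTF-8,
# then serialize the root element instead of the whole document.
my $out = qq(<?xml version="1.0" encoding="UTF-8"?>\n)
        . $doc->documentElement()->toString()
        . "\n";

# Node-level toString returns a character string, so encode on output.
binmode(STDOUT, ':encoding(UTF-8)');
print $out;
```

To write to a file instead, open the target with the same ':encoding(UTF-8)' layer and print $out to that handle.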
