December 20, 2008

Summarizing my WordPress posts using XSLT; 2008 as a PDF

December 20, 2008 - Categories: blogging, geek, wordpress

It’s the time of the year for annual updates. I was thinking of reviewing all the blog posts I’d written this year. My weekly and monthly posts are incomplete, though, and I want to make sure I cover everything. I also know a few people who are slowly working their way through my archives. So I thought I’d export all of my posts from 2008 into something that people can read with fewer clicks.

If you want to skip past all the geek details, you can get the files here: 2008 blog (4.6 MB, 307 pages(!)), 2008 mostly nongeek entries (3.8 MB, 195 pages).

After some tinkering around with wptex and other modules that are supposed to make this easier, I gave up and decided to do it myself. I toyed with the idea of writing a short Ruby program that either parsed the XML or read the database, but I ended up taking it as an excuse to learn XSLT, a language for transforming XML. WordPress can export posts and comments as XML. After I scrubbed the spam out of my WordPress install and raised PHP’s execution time limit, I downloaded the XML file and started figuring out how to get it into the form I wanted: a document organized by month, with a table of contents listing all the posts.

Here’s the main stylesheet I used:

 <xsl:stylesheet version="1.0"
                 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 xmlns:content="http://purl.org/rss/1.0/modules/content/"
                 xmlns:wp="http://wordpress.org/export/1.0/">
   <xsl:output method="html"/>
   <xsl:template match="/">
     <html><body>
       <h0>January 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Jan 2008') and wp:status='publish']"/>
       <h0>February 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Feb 2008') and wp:status='publish']"/>
       <h0>March 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Mar 2008') and wp:status='publish']"/>
       <h0>April 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Apr 2008') and wp:status='publish']"/>
       <h0>May 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'May 2008') and wp:status='publish']"/>
       <h0>June 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Jun 2008') and wp:status='publish']"/>
       <h0>July 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Jul 2008') and wp:status='publish']"/>
       <h0>August 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Aug 2008') and wp:status='publish']"/>
       <h0>September 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Sep 2008') and wp:status='publish']"/>
       <h0>October 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Oct 2008') and wp:status='publish']"/>
       <h0>November 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Nov 2008') and wp:status='publish']"/>
       <h0>December 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Dec 2008') and wp:status='publish']"/>
     </body></html>
   </xsl:template>
   <!-- One section per post: linked title, permalink, date, then the post body. -->
   <xsl:template match="item">
     <h1><a href="{link}"><xsl:value-of select="title"/></a></h1>
     <div class="link"><xsl:value-of select="link"/></div>
     <div class="date"><xsl:value-of select="pubDate"/></div>
     <div class="content">
       <!-- content:encoded holds the post's HTML; emit it as markup, not escaped text. -->
       <xsl:value-of select="content:encoded" disable-output-escaping="yes"/>
     </div>
   </xsl:template>
 </xsl:stylesheet>
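As a sketch only (this is not the stylesheet used to build the PDFs), the twelve near-identical month blocks could be driven from a small lookup table embedded in the stylesheet itself. The `my:` namespace, the `my:months` table, and the `posts` variable are made-up names for illustration. The `document('')` call reads the stylesheet file, so this assumes the stylesheet is loaded from a file rather than piped in.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:content="http://purl.org/rss/1.0/modules/content/"
                xmlns:wp="http://wordpress.org/export/1.0/"
                xmlns:my="urn:example:months"
                exclude-result-prefixes="content wp my">
  <xsl:output method="html"/>

  <!-- Hypothetical lookup table: pubDate abbreviation plus display name. -->
  <my:months>
    <my:month abbr="Jan">January</my:month>
    <my:month abbr="Feb">February</my:month>
    <my:month abbr="Mar">March</my:month>
    <my:month abbr="Apr">April</my:month>
    <my:month abbr="May">May</my:month>
    <my:month abbr="Jun">June</my:month>
    <my:month abbr="Jul">July</my:month>
    <my:month abbr="Aug">August</my:month>
    <my:month abbr="Sep">September</my:month>
    <my:month abbr="Oct">October</my:month>
    <my:month abbr="Nov">November</my:month>
    <my:month abbr="Dec">December</my:month>
  </my:months>

  <!-- Global variables are evaluated against the source document, so the
       published posts stay reachable while the loop walks the stylesheet. -->
  <xsl:variable name="posts" select="/rss/channel/item[wp:status='publish']"/>

  <xsl:template match="/">
    <html><body>
      <xsl:for-each select="document('')/*/my:months/my:month">
        <!-- Capture the abbreviation; inside the predicate below the
             context node is an item, not the month entry. -->
        <xsl:variable name="abbr" select="@abbr"/>
        <h0><xsl:value-of select="."/> 2008</h0>
        <xsl:apply-templates select="$posts[contains(pubDate, concat($abbr, ' 2008'))]"/>
      </xsl:for-each>
    </body></html>
  </xsl:template>

  <xsl:template match="item">
    <h1><a href="{link}"><xsl:value-of select="title"/></a></h1>
    <div class="link"><xsl:value-of select="link"/></div>
    <div class="date"><xsl:value-of select="pubDate"/></div>
    <div class="content">
      <xsl:value-of select="content:encoded" disable-output-escaping="yes"/>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

With xsltproc, either version would be run along the lines of `xsltproc blog.xsl export.xml > 2008.html`, where both file names are placeholders.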

For the non-geek version, I replaced the template with:

   <!-- Same output per post, but skip anything filed under emacs, drupal, or geek. -->
   <xsl:template match="item">
     <xsl:if test="not(category[@nicename='emacs' or @nicename='drupal' or @nicename='geek'])">
       <h1><a href="{link}"><xsl:value-of select="title"/></a></h1>
       <div class="link"><xsl:value-of select="link"/></div>
       <div class="date"><xsl:value-of select="pubDate"/></div>
       <div class="content">
         <xsl:value-of select="content:encoded" disable-output-escaping="yes"/>
       </div>
     </xsl:if>
   </xsl:template>
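The stylesheets shown here emit the posts but not the table of contents mentioned earlier. One way it could be generated in the same pass is with a second template in a separate mode. This is a rough sketch, not the method actually used for the PDFs. It assumes each exported item carries a `wp:post_id` element, the `toc` mode and `post-` id prefix are invented for illustration, and the month headings are left out to keep it short.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:content="http://purl.org/rss/1.0/modules/content/"
                xmlns:wp="http://wordpress.org/export/1.0/"
                exclude-result-prefixes="content wp">
  <xsl:output method="html"/>

  <!-- All published posts whose pubDate falls in 2008. -->
  <xsl:variable name="posts"
                select="/rss/channel/item[wp:status='publish' and contains(pubDate, ' 2008 ')]"/>

  <xsl:template match="/">
    <html><body>
      <!-- First pass: one linked entry per post. -->
      <ul class="toc">
        <xsl:apply-templates select="$posts" mode="toc"/>
      </ul>
      <!-- Second pass: the posts themselves, each with a matching id. -->
      <xsl:apply-templates select="$posts"/>
    </body></html>
  </xsl:template>

  <!-- Table-of-contents entry pointing at the anchor generated below. -->
  <xsl:template match="item" mode="toc">
    <li><a href="#post-{wp:post_id}"><xsl:value-of select="title"/></a></li>
  </xsl:template>

  <xsl:template match="item">
    <h1 id="post-{wp:post_id}"><a href="{link}"><xsl:value-of select="title"/></a></h1>
    <div class="date"><xsl:value-of select="pubDate"/></div>
    <div class="content">
      <xsl:value-of select="content:encoded" disable-output-escaping="yes"/>
    </div>
  </xsl:template>
</xsl:stylesheet>
```

Whether the in-page links survive conversion to PDF depends on the tool doing the conversion.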

I didn’t want to figure out how to demote all the headings in my blog posts from within XSLT (I have a few), so I used <h0> as a placeholder top-level heading for the month names. I used xsltproc to transform the XML file I got from WordPress. Then I adjusted all the headings with the following bit of Emacs Lisp:

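 (defun sacha/demote-all-headings ()
   "Demote every HTML heading after point by one level, so <h0> becomes <h1>."
   (interactive)
   ;; Replace only the digit (group 1); the range starts at 0 so <h0> is included.
   (while (re-search-forward "</?h\\([0-6]\\)>" nil t)
     (replace-match (number-to-string (1+ (string-to-number (match-string 1)))) nil t nil 1)))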

It’s all held together with bubblegum and string, really.

2008 blog (4.6 MB, 307 pages(!)), 2008 mostly nongeek entries (3.8 MB, 195 pages)

I haven’t looked at these files much yet – I just scrolled through them quickly. No, don’t worry, I’m not going to send my 2008 update as 307 pages in the mail. ;) But it’s there so that we can flip through it or you can borrow the code, and someday I’ll even figure out how to format the output neatly and everything.

Next step: I need to read all of that and highlight a couple of things that made my year.

(307 pages! Wow.)