Summarizing my WordPress posts using XSLT; 2008 as a PDF

It’s the time of the year for annual updates. I was thinking of reviewing all the blog posts I’d written this year. My weekly and monthly posts are incomplete, though, and I want to make sure I cover everything. I also know a few people who are slowly working their way through my archives. So I thought I’d export all of my posts from 2008 into something that people can read with fewer clicks.

If you want to skip past all the geek details, you can get the files here: 2008 blog (4.6 MB, 307 pages(!)), 2008 mostly nongeek entries (3.8 MB, 195 pages).

After some tinkering around with wptex and other modules that are supposed to make this easier, I gave up and decided to do it myself. I toyed with the idea of writing a short Ruby program that either parsed the XML or read the database, but I eventually ended up taking it as an excuse to learn XSLT, a language for transforming XML. WordPress can export posts and comments as XML. After I scrubbed my WordPress of spam and raised my PHP execution times, I downloaded the XML file and started figuring out how to get it into the form I wanted: a document organized by month, with a table of contents listing all the posts.

Here’s the main stylesheet I used:

 <xsl:stylesheet version="1.0"
                 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 xmlns:content="http://purl.org/rss/1.0/modules/content/"
                 xmlns:wp="http://wordpress.org/export/1.0/">
   <xsl:output method="html"/>
   <xsl:template match="/">
     <html><body>
       <h0>January 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Jan 2008') and wp:status='publish']"/>
       <h0>February 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Feb 2008') and wp:status='publish']"/>
       <h0>March 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Mar 2008') and wp:status='publish']"/>
       <h0>April 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Apr 2008') and wp:status='publish']"/>
       <h0>May 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'May 2008') and wp:status='publish']"/>
       <h0>June 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Jun 2008') and wp:status='publish']"/>
       <h0>July 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Jul 2008') and wp:status='publish']"/>
       <h0>August 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Aug 2008') and wp:status='publish']"/>
       <h0>September 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Sep 2008') and wp:status='publish']"/>
       <h0>October 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Oct 2008') and wp:status='publish']"/>
       <h0>November 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Nov 2008') and wp:status='publish']"/>
       <h0>December 2008</h0>
       <xsl:apply-templates select="/rss/channel/item[contains(pubDate, 'Dec 2008') and wp:status='publish']"/>
   </body></html>
   </xsl:template>
   <xsl:template match="//item">
     <h1><a>
       <xsl:attribute name="href">
         <xsl:value-of select="link"/>
       </xsl:attribute>
       <xsl:value-of select="title"/></a></h1>
     <div class="link"><xsl:value-of select="link"/></div>
     <div class="date"><xsl:value-of select="pubDate"/></div>
     <div class="content">
       <xsl:value-of select="content:encoded" disable-output-escaping="yes" />
     </div>
   </xsl:template>
 </xsl:stylesheet>

For the non-geek version, I replaced the template with:

   <xsl:template match="//item">
     <xsl:if test="not(category[@nicename='emacs']) and not(category[@nicename='drupal']) and not(category[@nicename='geek'])">
     <h1><a>
       <xsl:attribute name="href">
         <xsl:value-of select="link"/>
       </xsl:attribute>
       <xsl:value-of select="title"/></a></h1>
     <div class="link"><xsl:value-of select="link"/></div>
     <div class="date"><xsl:value-of select="pubDate"/></div>
     <div class="content">
       <xsl:value-of select="content:encoded" disable-output-escaping="yes" />
     </div>
     </xsl:if>
   </xsl:template>

I didn’t want to figure out how to demote all the headings in my blog posts (I have a few), so I used <h0> as my root element. I used xsltproc to transform the XML file I got from WordPress. Then I adjusted all the headings with the following bit of Emacs Lisp:

 (defun sacha/demote-all-headings ()
  (interactive)
   (while (re-search-forward "</?h\\([1-7]\\)>" nil t)
    (replace-match (number-to-string (1+ (string-to-number (match-string 1)))) nil t nil 1)))

It’s all held together with bubblegum and string, really.

2008 blog (4.6 MB, 307 pages(!)), 2008 mostly nongeek entries (3.8 MB, 195 pages)

I haven’t looked at these files much yet – I just scrolled through them quickly. No, don’t worry, I’m not going to send my 2008 update as 307 pages in the mail. ;) But it’s there so that we can flip through it or you borrow the code, and someday I’ll even figure out how to format the output neatly and everything.

Next step: I need to read all of that and highlight a couple of things that made my year.

(307 pages! Wow.)

  • flavio

    Hi Sacha, I downloaded your 2008 posts and right now I am reading some of them. I am reading Org-mode and emacs articles.

    Now I found lots of other articles very interesting too, particularly some of them that you describe how positive you see life… that’s great.

    What is the secret? How do you handle negative people, specially in your job?

    By the way, I have a suggestion to make finding posts easier : index with pdf links to the pages.

    Bye
    Flávio.

  • Walter

    Hi !

    Your post saved my day. I am new to xml / xslt, however i tried you sample and tested it with a wordpress export hosted for our intranet and it worked. However i have on issue. It seems that the line break is missing in the content section.

    Now i wonder why your line breaks are ok any mine not.

    When i have a look at my exported xml file it reads like that

    line1
    line2

    When i enter the link for the xslt file it looks like

    line1line2.

    Any ideas ?

    i downloaded you pdf file and started to read it.
    Good stuff !!

    Regards,
    Walter

  • http://andypiper.co.uk Andy Piper

    Hey Sacha, this is awesome! Couple of questions though – the XSLT you’ve given here outputs HTML, so how did you get the table of contents and what did you use to create the PDF?

    BTW this works great with exports from WP.com hosted blogs too.

    • http://sachachua.com Sacha Chua

      Oh, that was the easy part. OpenOffice.org, insert table of contents, then export as PDF. =)

      I lost a number of the linebreaks and I haven’t figured out a good way to do those without running into problems with the posts that do have linebreaks, but I guess people can figure things out and not assume that I don’t know about paragraphs. ;)

  • http://andypiper.co.uk Andy Piper

    OK so here’s a suggestion – I found that a bunch of characters were not translated nicely unless I added encoding=”iso-8859-1″ onto the xsl:output tag.

    OO.o 3.x doesn’t easily let me generate a TOC for an HTML file, unless I create a text document first, copy and paste the HTML from a Writer/Web document, and do it there. Kind of annoying.