Making my Emacs-related blog posts available for offline reading

Posted: - Modified: | emacs

Deepak Tripathi wanted to know how to download all of my Emacs-related posts for offline reading. It makes sense to put together something like that. Xah Lee even charges for an organized ZIP copy of his site (and he’s put together a lot of resources). I like putting together free/pay-what-you-want things, so I figured I’d add my blog posts to my git repository of Emacs notes.

My blog runs on WordPress. It has a whole bunch of other posts in it, so a straightforward Jekyll import wouldn’t do the trick. I had previously modified my WordPress theme to add a special ?dump=1 parameter for single post pages so that I could use it to archive pages. The first thing I needed to do was to come up with a list of Emacs-related blog posts.

Fortunately, the Emacs category page on my blog already lists all the posts in that category. I copied the HTML source for the list and started tinkering with the text. It was a good excuse to try the visual-regexp package – the vr/replace command made working with match groups much easier. Eventually I ended up with a list that looked like this:

wget -p
wget -p
wget -p
wget -p

This downloaded the posts along with the images they included. Next, I wanted to process the downloaded HTML pages and turn them into Org files, since I’m more familiar with Org than with Markdown. Pandoc to the rescue! I took the output of find -name \*.html\* and processed it with lots more vr/replace:

dos2unix ./; (echo "<html><body>"; cat ./; echo "</body></html>") > test.html; pandoc test.html -o
dos2unix ./; (echo "<html><body>"; cat ./; echo "</body></html>") > test.html; pandoc test.html -o
dos2unix ./; (echo "<html><body>"; cat ./; echo "</body></html>") > test.html; pandoc test.html -o
dos2unix ./; (echo "<html><body>"; cat ./; echo "</body></html>") > test.html; pandoc test.html -o

This is a pretty long command line because I didn’t bother writing a shell script to process the files. When I first tried running pandoc on the HTML snippets, it choked because the files didn’t have html or body tags, so I ended up adding those with echo before processing the files with pandoc.
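The same pipeline could also be driven from Emacs itself instead of from one long command line. Here’s a rough sketch of that idea (the ~/archive directory is made up for illustration; it assumes dos2unix and pandoc are on your path):

```elisp
;; A sketch, not the command line I actually used: run the
;; dos2unix/pandoc pipeline over every downloaded HTML file.
(dolist (file (directory-files-recursively "~/archive" "\\.html"))
  (let ((org-file (concat (file-name-sans-extension file) ".org")))
    (shell-command
     (format "dos2unix %s; (echo '<html><body>'; cat %s; echo '</body></html>') > test.html; pandoc test.html -o %s"
             (shell-quote-argument file)
             (shell-quote-argument file)
             (shell-quote-argument org-file)))))
```

shell-quote-argument keeps filenames with spaces or parentheses from breaking the command.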

Anyway, now I had a lot of Org files. I wanted to rename them and clean up the titles. I used this totally hackish bit of code to process all the files in the directory, changing the header to a title and adding the day to the filename (just in case I want to move to Jekyll someday).

(defun sacha/process-blog-posts ()
  (mapcar (lambda (x)
            (find-file x)
            ;; Turn the first heading into a #+TITLE: line
            (goto-char (point-min))
            (when (looking-at "\\(\\*\\*\\)")
              (replace-match "#+TITLE:"))
            ;; Find the date in the post and add the day to the filename,
            ;; moving the file into a processed/ subdirectory (which must
            ;; already exist); assumes filenames start with year and month
            (when (re-search-forward "\\([0-9]+\\)\\(th\\|nd\\|st\\|rd\\), \\([0-9]+\\)" nil t)
              (let ((day (match-string 1)))
                (save-buffer)
                (rename-file x (replace-regexp-in-string
                                "^\\([0-9]+-[0-9]+\\)-"
                                (concat "processed/\\1-" day "-") x))))
            (kill-buffer))
          (directory-files "." nil ".org$")))

I also wanted to change the image references to use the images that wget had downloaded into my uploads directory. This time I used wgrep, which is awesome. The wgrep package makes grep results editable if you use wgrep-change-to-wgrep-mode (bound to C-c C-p by default, but you can change wgrep-enable-key). If you combine that with vr/replace from visual-regexp or some keyboard macros, you can edit a whole lot of things quickly. Save the changes with C-x C-s (wgrep-finish-edit), and you’re done!
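If you want to tweak wgrep’s behaviour, a minimal setup might look like this (a sketch, assuming you’ve installed wgrep from MELPA; the "r" binding is just an example):

```elisp
(require 'wgrep)
;; Change the key that switches a grep buffer to editable mode
;; (the default is C-c C-p, via wgrep-change-to-wgrep-mode).
(setq wgrep-enable-key "r")
;; Save all modified buffers automatically when finishing a wgrep edit.
(setq wgrep-auto-save-buffer t)
```

With wgrep-auto-save-buffer set, finishing the edit writes every touched file, so you don’t have to save each buffer by hand.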

So now I had a bunch of Org files that were in reasonable shape. I wanted to regenerate the HTML pages so that people could browse them even if they weren’t familiar with Org. I set up the relevant org-publish-project-alist entries in a build-site.el that a Makefile could use, and I published the project.
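For reference, here’s a sketch of the kind of entry that could go into a build-site.el like that. The project name and paths here are made up for illustration:

```elisp
(require 'ox-publish)
(setq org-publish-project-alist
      '(("emacs-notes"
         :base-directory "~/code/emacs-notes"
         :base-extension "org"
         :publishing-directory "~/code/emacs-notes/html"
         :publishing-function org-html-publish-to-html
         :recursive t)))
;; Then publish everything, forcing republication, with:
;; (org-publish-project "emacs-notes" t)
```

A Makefile can then run emacs --batch -l build-site.el to regenerate the HTML.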

Going from HTML to Org to HTML meant losing some information, and lots of my blog posts need reviewing, but it’s a good start. Since I draft many of my Emacs-related blog posts in Org anyway, I can copy the Org source into my emacs-notes repository going forward.

Anyway, there you have it for your grepping pleasure!
