Using Emacs Lisp to batch-demote HTML headings for my static site

| blogging, 11ty, emacs

Assumed audience: People who have lots of HTML files used as input for a static site generator, might need to do a batch operation on them, and are open to doing that with Emacs Lisp. Which might just be me, but who knows? =)

HTML defines a hierarchy of headings going from <h1> to <h6>, which comes in especially handy when people are navigating with a screenreader or converting web pages to Org Mode. I think search engines might use them to get a sense of the page's structure, too. On my blog, the hierarchy usually goes like this:

  • <h1>: site title,
  • <h2>: blog post titles, since I put multiple blog posts on the main page and category pages (ex: blogging)
  • <h3>: blog post's subheadings, if any
  • <h4>: I rarely need subsubheadings in my main blog posts, but they're there just in case

While fiddling with my blog's CSS so that I could try this fluid type scale, I realized that the subheadings in my exported blog entries started at <h2> instead of <h3>. This meant that the outline was this:

  • Site title
    • Blog post 1
    • Subheading 1
    • Subheading 2
    • Blog post 2
    • Subheading 1
    • Subheading 2
    • Blog post 3

I wanted the outline to be this:

  • Site title
    • Blog post 1
      • Subheading 1
      • Subheading 2
    • Blog post 2
      • Subheading 1
      • Subheading 2
    • Blog post 3

This was because I hadn't changed org-html-toplevel-hlevel during my 11ty export process. To solve this for new posts, I added a new option org-11ty-toplevel-hlevel that defaults to 3 in ox-11ty.el, re-exported one of my long blog posts to test it, and confirmed that my headings now started at <h3>.

I still had all my old HTML files with the wrong levels of headings. I wrote some Emacs Lisp to shift the headings downwards (h5 to h6, h4 to h5, h3 to h4, h2 to h3) in a file if it had an <h2> in it. Regular expressions are usually not a good idea when it comes to HTML because there might be exceptions, but I figured it was a pretty small and low-risk change, so I decided not to use the full XML/DOM parsing functions. I saved all the blog posts under version control just in case I messed things up. Here's my function:

(defun my-html-shift-headings (filename)
  "Shift heading tags in FILENAME."
  (interactive "FFile: ")
  (let ((case-fold-search t)) ; make the search case-insensitive
    (with-temp-buffer
      (insert-file-contents filename)
      (goto-char (point-min))
      ;; Only modify the files where we have an h2
      (when (or (search-forward "<h2" nil t)
                (search-forward "</h2>" nil t))
        (goto-char (point-min))
        ;; Handle both opening and closing tags
        (while (re-search-forward "<\\(/\\)?h\\([2-5]\\)\\>" nil t)
          (let* ((closing-tag (match-string 1))
                 (heading-level (string-to-number (match-string 2)))
                 (new-level (1+ heading-level)))
            (replace-match (concat "<" closing-tag "h" (number-to-string new-level)))))
        (write-file filename)
        filename))))

Running it on all the source HTML files in specific subdirectories was easy with directory-files-recursively.

(dolist (dir '("~/proj/static-blog/blog"
               "~/proj/static-blog/content"))
  (mapc 'my-html-shift-headings
        (directory-files-recursively
         dir
         "\\.html\\'")))

Then I could just rebuild my blog and get all the right heading levels. Spot-checks with Inspect Element show that the headings now have the right tags, and org-web-tools-read-url-as-org now picks up the right hierarchy for the page.

Correcting the input files was easier and more efficient than modifying my 11ty template engine to shift the heading levels whenever I build my site (probably by defining a preprocessor). I could've written a NodeJS script to do that kind of file manipulation, but writing it in Emacs Lisp matched how I might think of doing it interactively. Using Emacs Lisp was also easy to test on one or two files, check the list of files matched by directory-files-recursively, and then run it on everything.

Going forward, the new org-11ty-toplevel-hlevel variable should properly modify the behaviour of Org's HTML export to get the headings at the right level. We'll see!