Emacs and dom.el: quick notes on parsing HTML and turning DOMs back into HTML

| elisp

libxml-parse-html-region turns HTML into a DOM (document object model). There's also xml-parse-file and xml-parse-region. xml-parse-string actually parses the character data at point and returns it as a string instead of parsing a string as a parameter. If you have a string and you want to parse it, insert it into a temporary buffer and use libxml-parse-html-region or xml-parse-region.

(let ((s "<span>Hello world</span>")
      dom)
  (setq dom
        (with-temp-buffer
          (insert s)
          (libxml-parse-html-region))))
(html nil (body nil (span nil Hello world)))

Then you can use functions like dom-by-tag, dom-search, dom-attr, dom-children, etc. If you need to make a deep copy of the DOM, you can use copy-tree.

Turning the DOM back into HTML can be a little tricky. By default, dom-print escapes & in attributes, which could mess up things like href:

  (with-temp-buffer
    (dom-print (dom-node 'a '((href . "https://example.com?a=b&c=d"))))
     (buffer-string))
  <a href="https://example.com?a=b&amp;c=d" />

shr-dom-print handles & correctly, but it adds spaces in between elements. Also, you need to escape HTML entities in text, maybe with org-html-encode-plain-text.

  (with-temp-buffer
    (shr-dom-print
      (dom-node 'p nil
                (dom-node 'span nil "hello")
                (dom-node 'span nil "world")
                (dom-node 'a '((href . "https://example.com?a=b&c=d"))
                          (org-html-encode-plain-text "text & stuff"))))
    (buffer-string))
  <p> <span>hello</span> <span>world</span> <a href="https://example.com?a=b&c=d">text &amp; stuff</a></p>

svg-print does the right thing when it comes to href and tags, but you need to escape HTML entities yourself, maybe with org-html-encode-plain-text.

(with-temp-buffer
  (svg-print
   (dom-node 'p nil
             (dom-node 'span nil "hello")
             (dom-node 'span nil "world")
             (dom-node 'a '((href . "https://example.com?a=b&c=d"))
                       (org-html-encode-plain-text "text & stuff"))))
  (buffer-string))
  <p><span>hello</span><span>world</span><a href="https://example.com?a=b&c=d">text &amp; stuff</a></p>

Looks like I'll be using svg-print for more than just SVGs.

Relevant Emacs info pages:

View org source for this post