Emacs and dom.el: quick notes on parsing HTML and turning DOMs back into HTML
libxml-parse-html-region turns HTML into a DOM (document object model). There's also xml-parse-file and xml-parse-region. Despite its name, xml-parse-string does not take a string as a parameter: it parses the character data at point and returns it as a string. If you have a string and you want to parse it, insert it into a temporary buffer and use libxml-parse-html-region or xml-parse-region.
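(let ((s "<span>Hello world</span>")
      dom)
  (setq dom
        (with-temp-buffer
          (insert s)
          ;; Explicit bounds also work on Emacs versions where
          ;; START and END are required arguments.
          (libxml-parse-html-region (point-min) (point-max)))))
(html nil (body nil (span nil "Hello world")))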
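For comparison, here is a small sketch of the xml-parse-region route mentioned above. The helper name my-example-xml-string-to-nodes is made up for illustration. Unlike libxml-parse-html-region, xml-parse-region returns a list of top-level nodes and does not wrap them in html and body elements.

```elisp
(require 'xml)

;; Hypothetical helper (not part of xml.el): parse the XML in STRING
;; by inserting it into a temporary buffer first.
(defun my-example-xml-string-to-nodes (string)
  (with-temp-buffer
    (insert string)
    (xml-parse-region (point-min) (point-max))))

(my-example-xml-string-to-nodes "<span>Hello world</span>")
;; => ((span nil "Hello world"))
```

Because the result is a list, take its car if you want a single node to pass to the dom-* functions.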
Then you can use functions like dom-by-tag, dom-search, dom-attr, dom-children, etc. If you need to make a deep copy of the DOM, you can use copy-tree.
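As a rough illustration of those functions, the sketch below builds a small DOM by hand and queries it. The markup and the my-example-* names are invented for this example. It also uses dom-set-attribute, which is not mentioned above, to show that a copy made with copy-tree can be changed without touching the original.

```elisp
(require 'dom)

;; A hand-built DOM in the same (TAG ATTRIBUTES CHILDREN...) shape
;; that the parsers return. The content is made up.
(defvar my-example-dom
  (dom-node 'div '((class . "links"))
            (dom-node 'a '((href . "https://example.com/one")) "One")
            (dom-node 'a '((href . "https://example.com/two")) "Two")))

;; Collect the href attribute of every <a> element, in document order.
(defun my-example-hrefs (dom)
  (mapcar (lambda (node) (dom-attr node 'href))
          (dom-by-tag dom 'a)))

;; copy-tree copies the nested list structure, so modifying the copy
;; leaves the original DOM alone.
(defun my-example-retarget-first-link (dom new-href)
  (let ((copy (copy-tree dom)))
    (dom-set-attribute (car (dom-by-tag copy 'a)) 'href new-href)
    copy))

(my-example-hrefs my-example-dom)
;; => ("https://example.com/one" "https://example.com/two")

(dom-children (car (dom-by-tag my-example-dom 'a)))
;; => ("One")
```

Calling (my-example-retarget-first-link my-example-dom "https://example.org/") returns a modified copy, and my-example-dom still has its original href values afterwards.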
Turning the DOM back into HTML can be a little tricky. By default, dom-print escapes & in attributes, which could mess up things like href:
(require 'dom)
(with-temp-buffer
  (dom-print (dom-node 'a '((href . "https://example.com?a=b&c=d"))))
  (buffer-string))
<a href="https://example.com?a=b&amp;c=d" />
shr-dom-print leaves & in attribute values alone, but it adds a space before each child element. Also, it inserts text as-is, so you need to escape HTML entities in text yourself, maybe with org-html-encode-plain-text.
(require 'shr)
(require 'ox-html) ; for org-html-encode-plain-text
(with-temp-buffer
  (shr-dom-print
   (dom-node 'p nil
             (dom-node 'span nil "hello")
             (dom-node 'span nil "world")
             (dom-node 'a '((href . "https://example.com?a=b&c=d"))
                       (org-html-encode-plain-text "text & stuff"))))
  (buffer-string))
<p> <span>hello</span> <span>world</span> <a href="https://example.com?a=b&c=d">text &amp; stuff</a></p>
svg-print does the right thing when it comes to href and tags, but you need to escape HTML entities yourself, maybe with org-html-encode-plain-text.
(require 'svg)
(require 'ox-html) ; for org-html-encode-plain-text
(with-temp-buffer
  (svg-print
   (dom-node 'p nil
             (dom-node 'span nil "hello")
             (dom-node 'span nil "world")
             (dom-node 'a '((href . "https://example.com?a=b&c=d"))
                       (org-html-encode-plain-text "text & stuff"))))
  (buffer-string))
<p><span>hello</span><span>world</span><a href="https://example.com?a=b&c=d">text &amp; stuff</a></p>
Looks like I'll be using svg-print for more than just SVGs.
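If that pattern comes up a lot, it could be wrapped in a couple of small helpers. This is only a sketch: my-example-dom-to-string and my-example-link are hypothetical names, and it assumes ox-html is available for org-html-encode-plain-text.

```elisp
(require 'dom)
(require 'svg)
(require 'ox-html) ; for org-html-encode-plain-text

;; Hypothetical helper: serialize DOM to a string with svg-print.
(defun my-example-dom-to-string (dom)
  (with-temp-buffer
    (svg-print dom)
    (buffer-string)))

;; Hypothetical helper: build a link node whose text is escaped up
;; front, since svg-print inserts strings verbatim.
(defun my-example-link (url text)
  (dom-node 'a `((href . ,url))
            (org-html-encode-plain-text text)))

(my-example-dom-to-string
 (dom-node 'p nil
           (my-example-link "https://example.com?a=b&c=d" "text & stuff")))
;; => "<p><a href=\"https://example.com?a=b&c=d\">text &amp; stuff</a></p>"
```

Escaping at node-construction time keeps the printing step simple, but it means text that came from a parser (and is already unescaped) would need the same treatment before printing.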
Relevant Emacs info pages: