Getting an Org link URL from a string; debugging regex groups

| elisp, org

Sometimes I want to get the URL from a string whether the string contains a bare URL (https://example.com) or an Org bracketed link ([[https://example.com]] or [[https://example.com][Example]]), ignoring any extra non-link text (blah https://example.com blah blah). org-link-any-re seemed like the right regular expression to use, but I started to get a little dizzy looking at all the parentheses, and I couldn't figure out which matching group to use. I tried using re-builder. That highlighted the groups in different colours, but I didn't know what the colours meant. All the matching information is in (match-data), but integer pairs can be a little hard to translate back to substrings. So I wrote an Emacs Lisp function to give me the matching groups:

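(defun my-match-groups (&optional object)
  "Return the groups from the last match, good for debugging regexps.
Each entry is (INDEX (START END) TEXT).
If OBJECT is a string, take TEXT from it; otherwise use the current buffer."
  (seq-map-indexed (lambda (entry i)
                     (list i entry
                           (and (car entry)
                                (if object
                                    (substring object (car entry) (cadr entry))
                                  (buffer-substring (car entry) (cadr entry))))))
                   (seq-partition
                    ;; After a buffer match, (match-data t) ends with the
                    ;; buffer itself, so drop it before pairing positions.
                    (seq-remove #'bufferp (match-data t))
                    2)))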

There's probably a standard way to do this, but I couldn't figure out how to find it.
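
For comparison, here is a minimal sketch of what the raw match data looks like next to the substrings it refers to. The regexp and the sample string are made up for illustration (they're not from Org); the point is just that the flat list of integers has to be read two at a time, one pair per group.

```elisp
;; A deliberately simple regexp (an illustrative assumption, not from Org):
;; group 1 is the key, group 2 is the value.
(let ((text "key=value"))
  (when (string-match "\\([a-z]+\\)=\\([a-z]+\\)" text)
    (list (match-data)              ; flat list: start/end for groups 0, 1, 2
          (match-string 1 text)     ; substring for group 1
          (match-string 2 text))))  ; substring for group 2
;; => ((0 9 0 3 4 9) "key" "value")
```

With three groups that's still readable, but with the nine or so groups in org-link-any-re, pairing each position with its substring saves a lot of counting.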

Anyway, if I give it a string with a bracketed link, I can tell that the URL ends up in group 2:

(let ((text "blah [[https://example.com][example]] blah blah"))
  (when (string-match org-link-any-re text)
    (pp-to-string (my-match-groups text))))
((0 (5 37) "[[https://example.com][example]]")
 (1 (5 37) "[[https://example.com][example]]")
 (2 (7 26) "https://example.com")
 (3 (28 35) "example"))

When I use a string with a bare link, I can see that the URL ends up in group 7:

(let ((text "blah https://example.com blah blah"))
  (when (string-match org-link-any-re text)
    (pp-to-string (my-match-groups text))))
((0 (5 24) "https://example.com")
 (1 (nil nil) nil) (2 (nil nil) nil)
 (3 (nil nil) nil) (4 (nil nil) nil)
 (5 (nil nil) nil) (6 (nil nil) nil)
 (7 (5 24) "https://example.com")
 (8 (5 10) "https") (9 (11 24) "//example.com"))

This makes it so much easier to refer to the right capture group. So now I can use those groups to extract the URL from a string:

(defun my-org-link-url-from-string (s)
  "Return the link URL from S."
  (when (string-match org-link-any-re s)
    (or
     (match-string 7 s)
     (match-string 2 s))))
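
A quick way to check this is to map the function over a few sample strings. This is only a sketch: the sample strings are made up, it assumes Org is available so that org-link-any-re is defined, and the function is repeated so that the snippet can be evaluated on its own.

```elisp
(require 'org)  ; assumption: loading Org defines org-link-any-re

;; Repeated from above so this snippet evaluates on its own.
(defun my-org-link-url-from-string (s)
  "Return the link URL from S."
  (when (string-match org-link-any-re s)
    (or
     (match-string 7 s)
     (match-string 2 s))))

;; Illustrative inputs: bare URL, bracketed link with and without a
;; description, and a string with no link at all.
(mapcar #'my-org-link-url-from-string
        '("blah https://example.com blah blah"
          "blah [[https://example.com][example]] blah blah"
          "[[https://example.com]]"
          "no link here"))
;; => ("https://example.com" "https://example.com" "https://example.com" nil)
```

Strings without a link return nil, so callers can use the result directly in a when-let or an or.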

This is handy when I summarize Emacs News links from Mastodon or from my inbox. Sometimes I add extra text after a link that I've captured from my phone, and I don't want that included in the URL. Sometimes I have a bracketed link that I've copied from an org-capture note. Now I don't have to worry about the format. I can just grab the link I want.
