Extract PDF highlights into an Org file with Python

| org

I've been trying to find a good workflow for highlighting interesting parts of PDFs, and then getting that into my notes as images and text in Emacs. I think I've finally figured out something that works well for me that feels natural (marking things.

I wanted to read through Prot's Emacs configuration while the kiddo played with her friends at the playground. I saved the web page as a PDF and exported it to Noteful. The PDF has 481 pages. Lots to explore! It was a bit chilly, so I had my gloves on. I used a capacitative stylus in my left hand to scroll the document and an Apple Pencil in my right hand to highlight the parts I wanted to add to my config or explore further.

Back at my computer, I used pip install pymupdf to install the PyMuPDF library. I poked around the PDF in the Python shell to see what it had, and I noticed that the highlights were drawings with fill 0.5. So I wrote this Python script to extract the images and text near that rectangle:

import fitz
import pathlib
import sys
import os

BUFFER = 5

def extract_highlights(filename, output_dir):
    doc = fitz.open(filename)
    s = "* Excerpts\n"
    for page_num, page in enumerate(doc):
        page_width = page.rect.width
        page_text = ""
        for draw_num, d in enumerate(page.get_drawings()):
            if d['fill_opacity'] == 0.5:
               rect = d['rect']
               clip_rect = fitz.Rect(0, rect.y0 - BUFFER, page_width, rect.y1 + BUFFER)
               img = page.get_pixmap(clip=clip_rect)
               img_filename = "page-%03d-%d.png" % (page_num + 1, draw_num + 1)
               img.save(os.path.join(output_dir, img_filename))
               text = page.get_text(clip=clip_rect)
               page_text = (page_text
                            + "[[file:%s]]\n#+begin_quote\n[[pdf:%s::%d][p%d]]: %s\n#+end_quote\n\n"
                            % (img_filename,
                               os.path.join("..", filename),
                               page_num + 1,
                               page_num + 1, text))
        if page_text != "":
            s += "** Page %d\n%s" % (page_num + 1, page_text)
    pathlib.Path(os.path.join(output_dir, "index.org")).write_bytes(s.encode())

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Usage: list-highlights.py pdf-filename output-dir")
    else:
        extract_highlights(sys.argv[1], sys.argv[2])

After I opened the resulting index.org file, I used C-u C-u C-c C-x C-v (org-link-preview) to make the images appear inline throughout the whole buffer. There's a little extra text from the PDF extraction, but it's a great starting point for cleaning up or copying. The org-pdftools package lets me link to specific pages in PDFs, neat!

2026-04-02-08-14-23.png
Figure 1: Screenshot of Org Mode file with link previews

To set up org-pdftools, I used:

(use-package org-pdftools
  :hook (org-mode . org-pdftools-setup-link))

Here's my quick livestream about the script with a slightly older version that had an off-by-one bug in the page numbers and didn't have the fancy PDF links. =)

View Org source for this post