Plain-data PDF extraction for Clojure. Pull text, words, characters, geometric
objects, and tables out of digitally generated PDFs as EDN/JSON-friendly maps and
vectors — the Clojure counterpart to Python's pdfplumber,
built on Apache PDFBox.
Early release (0.1.1). The extraction API (text, words, chars, objects,
tables, crop) is in place and validated against the Python pdfplumber corpus;
it may still evolve before 1.0.
deps.edn
net.clojars.savya/pdfplumber-clj {:mvn/version "0.1.1"}
Leiningen
[net.clojars.savya/pdfplumber-clj "0.1.1"]
Requires JDK 17+.
(require '[pdfplumber.core :as pdf])
(pdf/with-pdf [doc "statement.pdf"]
  (pdf/text doc {:page 1})) ; => "Account statement\n..."
(pdf/with-pdf [doc "statement.pdf"]
  (pdf/words doc {:page 1})) ; => [{:text "Account" :x0 .. :top .. :x1 .. :bottom ..} ...]
(pdf/with-pdf [doc "invoice.pdf"]
  (pdf/extract-table doc {:page 1 :strategy :lines}))
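Because results are plain maps and vectors, ordinary Clojure sequence functions can post-process them. The sketch below keeps only the word maps that lie entirely inside a bounding box. `within-bbox?`, `words-in`, and the sample data are hypothetical and are not part of pdfplumber-clj; the keys are assumed to match the word shape shown above. It illustrates working with the data and is not a stand-in for the library's own crop function.

```clojure
;; Hypothetical helper, not part of pdfplumber-clj: true when a word map
;; lies entirely inside bbox, given as [x0 top x1 bottom] (top-left origin).
(defn within-bbox?
  [[bx0 btop bx1 bbottom] {:keys [x0 top x1 bottom]}]
  (and (<= bx0 x0) (<= btop top) (<= x1 bx1) (<= bottom bbottom)))

;; Made-up sample data in the shape shown for pdf/words.
(def sample-words
  [{:text "Account"   :x0 72.0  :top 72.0  :x1 120.0 :bottom 84.0}
   {:text "statement" :x0 124.0 :top 72.0  :x1 180.0 :bottom 84.0}
   {:text "Page"      :x0 500.0 :top 760.0 :x1 525.0 :bottom 772.0}])

;; Keep the words that fall inside bbox, preserving their order.
(defn words-in
  [bbox words]
  (filterv #(within-bbox? bbox %) words))

;; Words in the top 100 points of a 612-point-wide page.
(mapv :text (words-in [0 0 612 100] sample-words))
;; => ["Account" "statement"]
```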
Public coordinates use a top-left origin (matching pdfplumber), with bounding
boxes as [x0 top x1 bottom] in PDF user-space points. PDFBox's native bottom-left
coordinates are converted internally.
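As a rough illustration of that conversion, flipping a bottom-left-origin box to the top-left convention only needs the page height. The helper name `bottom-left->top-left` is hypothetical, and the sketch assumes an unrotated page whose origin sits at the page's lower-left corner; the library's internal handling may differ.

```clojure
;; Hypothetical helper illustrating the flip; not the library's code.
;; Input:  page height in points and a bottom-left-origin box [x0 y0 x1 y1],
;;         where y0 is the lower edge and y1 the upper edge.
;; Output: a top-left-origin box [x0 top x1 bottom].
(defn bottom-left->top-left
  [page-height [x0 y0 x1 y1]]
  [x0 (- page-height y1) x1 (- page-height y0)])

;; US Letter page (792 pt tall); the box spans y = 700..720 measured from the bottom.
(bottom-left->top-left 792 [72 700 200 720])
;; => [72 72 200 92]
```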
In scope: text/word/char extraction, page geometry, crop/filter, ruling-line and text-aligned table extraction, deterministic plain-data output.
Out of scope (v1): PDF generation, OCR, scanned/image PDFs, AcroForm extraction, layout ML.
The :text table strategy is heuristic and intended for digitally generated PDFs.
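To give a feel for why text-aligned extraction is heuristic, here is a minimal sketch of one common idea: grouping words into rows when their `:top` values are close together. This is not the library's algorithm; `group-rows`, the `tolerance` parameter, and the sample data are assumptions made for illustration.

```clojure
;; Illustrative only: cluster word maps into rows when a word's :top is
;; within `tolerance` points of the first word in the current row, then
;; order each row left to right by :x0 and return the row texts.
(defn group-rows
  [tolerance words]
  (->> (sort-by :top words)
       (reduce (fn [rows word]
                 (let [row (peek rows)]
                   (if (and row
                            (<= (- (:top word) (:top (first row))) tolerance))
                     (conj (pop rows) (conj row word))   ; same row
                     (conj rows [word]))))               ; start a new row
               [])
       (mapv #(mapv :text (sort-by :x0 %)))))

;; Made-up words whose baselines differ slightly within each row.
(def table-words
  [{:text "Qty"   :x0 200.0 :top 100.0}
   {:text "Item"  :x0 72.0  :top 100.5}
   {:text "Paper" :x0 72.0  :top 120.0}
   {:text "3"     :x0 200.0 :top 119.6}])

(group-rows 2.0 table-words)
;; => [["Item" "Qty"] ["Paper" "3"]]
```

A tolerance that is too small splits one visual row into several, and one that is too large merges neighbouring rows, which is why this kind of approach suits cleanly typeset, digitally generated PDFs.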
Copyright © 2026 Savyasachi.
Distributed under the Eclipse Public License 2.0.