pdfplumber-clj

Plain-data PDF extraction for Clojure. Pull text, words, characters, geometric objects, and tables out of digitally generated PDFs as EDN/JSON-friendly maps and vectors — the Clojure counterpart to Python's pdfplumber, built on Apache PDFBox.

Status

Early release (0.1.1). The extraction API (text, words, chars, objects, tables, crop) is in place and validated against the Python pdfplumber corpus; it may still evolve before 1.0.

Install

deps.edn

net.clojars.savya/pdfplumber-clj {:mvn/version "0.1.1"}

Leiningen

[net.clojars.savya/pdfplumber-clj "0.1.1"]

Requires JDK 17+.

Quickstart

(require '[pdfplumber.core :as pdf])

(pdf/with-pdf [doc "statement.pdf"]
  (pdf/text doc {:page 1}))           ; => "Account statement\n..."

(pdf/with-pdf [doc "statement.pdf"]
  (pdf/words doc {:page 1}))          ; => [{:text "Account" :x0 .. :top .. :x1 .. :bottom ..} ...]

(pdf/with-pdf [doc "invoice.pdf"]
  (pdf/extract-table doc {:page 1 :strategy :lines}))
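Because results are plain maps and vectors, they can be post-processed with ordinary Clojure. The sketch below keeps only the words whose boxes fall inside a region. It assumes word maps shaped like the `pdf/words` output shown above; `bbox-contains?`, `words-in`, and `sample-words` are hypothetical names and made-up data for illustration, not part of the library, and this is not a description of how the library's own crop works.

```clojure
;; Hypothetical helper: true when the word's box lies entirely inside
;; the region, given as [x0 top x1 bottom] with a top-left origin.
(defn bbox-contains?
  [[bx0 btop bx1 bbottom] {:keys [x0 top x1 bottom]}]
  (and (>= x0 bx0) (>= top btop) (<= x1 bx1) (<= bottom bbottom)))

;; Hypothetical helper: keep only the words inside the region.
(defn words-in
  [bbox words]
  (filterv #(bbox-contains? bbox %) words))

;; Made-up sample data mimicking the shape of the word maps above.
(def sample-words
  [{:text "Account"   :x0 72.0  :top 40.0  :x1 120.0 :bottom 52.0}
   {:text "statement" :x0 124.0 :top 40.0  :x1 180.0 :bottom 52.0}
   {:text "Page"      :x0 500.0 :top 760.0 :x1 525.0 :bottom 772.0}])

;; Words in the top 100 points of a 612-point-wide page.
(mapv :text (words-in [0 0 612 100] sample-words))
;; => ["Account" "statement"]
```

The same pattern works on real output by passing the result of `pdf/words` in place of `sample-words`.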

Coordinate system

Public coordinates use a top-left origin (matching pdfplumber), with bounding boxes as [x0 top x1 bottom] in PDF user-space points. PDFBox's native bottom-left coordinates are converted internally.
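To make the conversion concrete, here is a minimal sketch of flipping a bottom-left-origin rectangle into the top-left convention. It assumes an unrotated page whose origin sits at the bottom-left corner; `bottom-left->top-left` is a hypothetical name used for illustration and is not the library's internal code.

```clojure
;; Hypothetical helper: convert [x0 y0 x1 y1] measured from the bottom-left
;; corner (y0 = lower edge, y1 = upper edge) into [x0 top x1 bottom]
;; measured from the top-left corner. Assumes no page rotation.
(defn bottom-left->top-left
  [page-height [x0 y0 x1 y1]]
  [x0 (- page-height y1) x1 (- page-height y0)])

;; A box on a US Letter page (792 points tall):
(bottom-left->top-left 792.0 [72.0 700.0 200.0 720.0])
;; => [72.0 72.0 200.0 92.0]
```

The x values pass through unchanged; only the vertical axis is mirrored, so the upper edge of the original box becomes `top` and the lower edge becomes `bottom`.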

Scope

In: text/word/char extraction, page geometry, crop/filter, ruling-line and text-aligned table extraction, deterministic plain-data output.

Out (v1): PDF generation, OCR, scanned/image PDFs, AcroForm extraction, layout ML. The :text table strategy is heuristic and intended for digitally generated PDFs.

License

Distributed under the Eclipse Public License 2.0.
