pdfplumber.core

Public API for pdfplumber-clj: open PDFs and extract text, words, characters, geometric objects, and tables as plain Clojure data.

pdfplumber.document

Document loading and the error model. The PDFBox parse boundary lives here; higher namespaces work with the returned PDDocument handle.

pdfplumber.geometry

Bounding-box math and coordinate conversion.

The public coordinate system has a top-left origin (matching Python `pdfplumber`): a bounding box is `[x0 top x1 bottom]` in PDF user-space points with `x0 <= x1` and `top <= bottom`. PDFBox works in a bottom-left origin, so conversion happens here and nowhere else. All extraction code must use these helpers rather than open-coding the arithmetic.
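
To make the origin flip concrete, here is a minimal sketch of the arithmetic. `flip-bbox` is a hypothetical name, not necessarily a function in this namespace, and the sketch assumes an unrotated page whose lower-left corner is at `(0, 0)`.

```clojure
;; Hypothetical helper for illustration; not the library's own function.
(defn flip-bbox
  "Convert a bottom-left-origin box [xa ya xb yb] into the top-left-origin
  form [x0 top x1 bottom], given the page height in points."
  [page-height [xa ya xb yb]]
  (let [x0   (min xa xb)
        x1   (max xa xb)
        y-lo (min ya yb)
        y-hi (max ya yb)]
    ;; The upper edge (largest y in PDF space) becomes the smallest `top`,
    ;; which keeps the invariant top <= bottom.
    [x0 (- page-height y-hi) x1 (- page-height y-lo)]))

(flip-bbox 792.0 [72.0 700.0 200.0 720.0])
;; => [72.0 72.0 200.0 92.0]
```

Normalizing with `min`/`max` before flipping means the result satisfies `x0 <= x1` and `top <= bottom` even when the input corners arrive in the opposite order.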

pdfplumber.objects

Geometric object extraction (lines, rectangles, curves) via a PDFGraphicsStreamEngine subclass.

PDFBox delivers path coordinates already transformed by the CTM into page space (bottom-left origin); we collect painted subpaths and flip them to the public top-left coordinate system. Only painted paths (stroked/filled) yield objects; clip-only / no-paint paths are discarded.
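
The filtering and flipping described above can be sketched on plain data. The path shape used here (`:points`, `:stroke?`, `:fill?`) and every function name are assumptions made for illustration; the sketch does not show how the engine subclass collects paths.

```clojure
;; Hypothetical data shape: a collected subpath is a map with :points
;; ([x y] pairs in bottom-left page space) plus :stroke? / :fill? flags.
(defn painted?
  "True when the path was stroked or filled."
  [{:keys [stroke? fill?]}]
  (boolean (or stroke? fill?)))

(defn path->object
  "Reduce a path to its bounding box in the top-left coordinate system."
  [page-height {:keys [points stroke? fill?]}]
  (let [xs (map first points)
        ys (map second points)]
    {:bbox    [(apply min xs)
               (- page-height (apply max ys))
               (apply max xs)
               (- page-height (apply min ys))]
     :stroke? (boolean stroke?)
     :fill?   (boolean fill?)}))

(defn painted-objects
  "Drop clip-only / no-paint paths, then convert the rest."
  [page-height paths]
  (->> paths
       (filter painted?)
       (mapv #(path->object page-height %))))

(painted-objects 200.0
                 [{:points [[10.0 100.0] [110.0 100.0]] :stroke? true}
                  {:points [[0.0 0.0] [50.0 50.0]]}])
;; => [{:bbox [10.0 100.0 110.0 100.0], :stroke? true, :fill? false}]
```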

pdfplumber.page

Lightweight cropped page views. A view carries the document handle, a page number, and a crop bbox; extraction functions accept it in place of a document handle and restrict their output to the bbox. Nothing is copied or translated; a view is just resolved into `:page`/`:bbox` options.
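
One way such a view could be resolved into options is sketched below. The names `crop-view`, `view?`, and `resolve-target`, and the `:view/*` keys, are hypothetical; only the `:page` and `:bbox` option names come from the docstring.

```clojure
;; Hypothetical representation of a view: a plain map, nothing copied.
(defn crop-view
  [doc page bbox]
  {:view/doc doc :view/page page :view/bbox bbox})

(defn view?
  [x]
  (and (map? x) (contains? x :view/doc)))

(defn resolve-target
  "Return [doc opts]. A view contributes its :page and :bbox to the options;
  a plain document handle passes through unchanged."
  [target opts]
  (if (view? target)
    [(:view/doc target)
     (merge opts {:page (:view/page target) :bbox (:view/bbox target)})]
    [target opts]))

(resolve-target (crop-view :some-doc 1 [0 0 100 50]) {:x-tolerance 3})
;; => [:some-doc {:x-tolerance 3, :page 1, :bbox [0 0 100 50]}]
```

In this sketch the view's page and bbox take precedence over any `:page` or `:bbox` already present in the options; whether the library resolves conflicts the same way is not stated in the docstring.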

pdfplumber.table

Table extraction. The `:lines` strategy reconstructs a grid from ruling lines (explicit line objects plus rectangle edges): near-collinear edges are snapped together, grid intersections are found, and cells are the rectangles whose four corners are all intersections. Words are assigned to cells by center.
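
The last two steps (cells from intersections, words to cells) can be sketched as follows. This is a simplified illustration, not the namespace's implementation: it assumes edges have already been snapped, considers only rectangles between adjacent grid coordinates, and uses hypothetical function names throughout.

```clojure
(defn cells
  "Given [x y] intersection points, return the [x0 top x1 bottom] boxes between
  adjacent grid coordinates whose four corners are all intersections."
  [points]
  (let [pts (set points)
        xs  (sort (distinct (map first pts)))
        ys  (sort (distinct (map second pts)))]
    (vec
     (for [[top bottom] (partition 2 1 ys)   ; rows, top to bottom
           [x0 x1]      (partition 2 1 xs)   ; columns, left to right
           :when (every? pts [[x0 top] [x1 top] [x0 bottom] [x1 bottom]])]
       [x0 top x1 bottom]))))

(defn center
  [[x0 top x1 bottom]]
  [(/ (+ x0 x1) 2) (/ (+ top bottom) 2)])

(defn cell-for
  "Return the first cell containing the center of the word's :bbox, or nil."
  [cells word]
  (let [[cx cy] (center (:bbox word))]
    (first (filter (fn [[x0 top x1 bottom]]
                     (and (<= x0 cx) (< cx x1) (<= top cy) (< cy bottom)))
                   cells))))

(def grid (cells [[0 0] [50 0] [100 0] [0 20] [50 20] [100 20]]))
;; grid => [[0 0 50 20] [50 0 100 20]]

(cell-for grid {:text "b" :bbox [60 5 80 15]})
;; => [50 0 100 20]
```

Assigning by center rather than by full containment lets a word that slightly overhangs a ruling line still land in a single cell.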

pdfplumber.text

Character, word, and text extraction over PDFBox's PDFTextStripper.

PDFTextStripper already reports direction-adjusted coordinates in a top-left origin, so char maps are built directly from `getXDirAdj`/`getYDirAdj` without a page-height flip. Words are formed by clustering chars into lines (within `:y-tolerance`) and splitting on horizontal gaps wider than `:x-tolerance`.
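
The word-forming rule can be sketched on plain char maps. The char shape (`:text`, `:x0`, `:x1`, `:top`) and all function names are assumptions for illustration, and comparing each char against the first char of the current line is one possible clustering choice, not necessarily the one this namespace makes.

```clojure
(defn group-lines
  "Sort chars by :top and start a new line whenever a char's :top differs from
  the first char of the current line by more than y-tolerance."
  [chars y-tolerance]
  (reduce (fn [lines ch]
            (let [line (peek lines)]
              (if (and line
                       (<= (Math/abs (double (- (:top ch) (:top (first line)))))
                           y-tolerance))
                (conj (pop lines) (conj line ch))
                (conj lines [ch]))))
          []
          (sort-by :top chars)))

(defn split-words
  "Within one line, sort by :x0 and split where the gap between the previous
  char's :x1 and the next char's :x0 is wider than x-tolerance."
  [line x-tolerance]
  (reduce (fn [words ch]
            (let [word (peek words)]
              (if (and word (<= (- (:x0 ch) (:x1 (peek word))) x-tolerance))
                (conj (pop words) (conj word ch))
                (conj words [ch]))))
          []
          (sort-by :x0 line)))

(defn chars->words
  [chars {:keys [x-tolerance y-tolerance]}]
  (vec (for [line (group-lines chars y-tolerance)
             word (split-words line x-tolerance)]
         {:text (apply str (map :text word))
          :x0   (:x0 (first word))
          :x1   (:x1 (peek word))
          :top  (apply min (map :top word))})))

(def sample-chars
  [{:text "H" :x0 0.0  :x1 5.0  :top 10.0}
   {:text "i" :x0 5.0  :x1 8.0  :top 10.5}
   {:text "y" :x0 20.0 :x1 25.0 :top 10.0}
   {:text "o" :x0 25.0 :x1 30.0 :top 10.0}
   {:text "!" :x0 0.0  :x1 2.0  :top 30.0}])

(mapv :text (chars->words sample-chars {:x-tolerance 3.0 :y-tolerance 3.0}))
;; => ["Hi" "yo" "!"]
```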
