Character, word, and text extraction built on PDFBox's `PDFTextStripper`. `PDFTextStripper` already reports direction-adjusted coordinates relative to a top-left origin, so char maps are built directly from `getXDirAdj`/`getYDirAdj` without a page-height flip. Words are formed by clustering chars into lines (chars whose vertical positions differ by at most `:y-tolerance`) and splitting each line on horizontal gaps wider than `:x-tolerance`.
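The clustering described above can be illustrated with a small, self-contained sketch that works on plain char maps. This is not the library's implementation: the helper names (`cluster-lines`, `split-words`, `->word`, `sketch-words`) are hypothetical, and comparing each char's `:top` against the first char of its line is an assumption about how "within `:y-tolerance`" is measured.

```clojure
;; Hypothetical sketch of line clustering and gap splitting over char maps.
;; Not the library's code; helper names and the line-anchor rule are assumptions.

(defn cluster-lines
  "Groups chars into lines. A char joins the current line when its :top is
  within y-tolerance of the :top of that line's first char."
  [chars y-tolerance]
  (reduce (fn [lines ch]
            (let [line (peek lines)]
              (if (and line
                       (<= (Math/abs (double (- (:top ch) (:top (first line)))))
                           y-tolerance))
                (conj (pop lines) (conj line ch))
                (conj lines [ch]))))
          []
          (sort-by :top chars)))

(defn split-words
  "Splits one line into words wherever the horizontal gap between
  neighbouring chars is wider than x-tolerance."
  [line x-tolerance]
  (reduce (fn [words ch]
            (let [word (peek words)]
              (if (and word
                       (<= (- (:x0 ch) (:x1 (peek word))) x-tolerance))
                (conj (pop words) (conj word ch))
                (conj words [ch]))))
          []
          (sort-by :x0 line)))

(defn ->word
  "Merges a run of chars into one word map with a bounding box."
  [cs]
  {:text   (apply str (map :text cs))
   :x0     (apply min (map :x0 cs))
   :top    (apply min (map :top cs))
   :x1     (apply max (map :x1 cs))
   :bottom (apply max (map :bottom cs))})

(defn sketch-words
  "Chars -> word maps in reading order (top to bottom, left to right)."
  [chars {:keys [x-tolerance y-tolerance]
          :or   {x-tolerance 3.0 y-tolerance 3.0}}]
  (vec (for [line (cluster-lines chars y-tolerance)
             word (split-words line x-tolerance)]
         (->word word))))

;; Sample input: two words on one line, one char on a second line.
(def sample-chars
  [{:text "H" :x0 10.0 :x1 15.0 :top 100.0 :bottom 110.0}
   {:text "i" :x0 15.5 :x1 18.0 :top 100.0 :bottom 110.0}
   {:text "y" :x0 30.0 :x1 35.0 :top 101.0 :bottom 111.0}
   {:text "o" :x0 35.0 :x1 40.0 :top 100.0 :bottom 110.0}
   {:text "!" :x0 10.0 :x1 12.0 :top 120.0 :bottom 130.0}])

(mapv :text (sketch-words sample-chars {}))
```

With the sample input, the 12-point gap between "i" and "y" exceeds the 3.0 tolerance and starts a new word, while the 1-point vertical offset of "y" keeps it on the first line.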
`(chars doc)`

`(chars doc {:keys [page bbox]})`

Vector of character maps `{:text :x0 :top :x1 :bottom :font-name :font-size :page-number}`. Options: `:page` (1-based, limit to one page) and `:bbox` (keep chars whose center falls inside `[x0 top x1 bottom]`).

`(text doc)`

`(text doc {:keys [x-tolerance y-tolerance] :or {x-tolerance default-tolerance y-tolerance default-tolerance} :as opts})`

Reconstructed text: words joined by spaces within a line, lines by newlines. Accepts the same options as `words`.
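As a rough illustration of that joining rule, the sketch below rebuilds text from a vector of word maps. The function name `words->text` is hypothetical, and regrouping words into lines by comparing `:top` values against a tolerance is an assumption; the library may track line membership differently.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sketch: not the library's `text` function.
(defn words->text
  "Joins word maps into text: words on the same line are separated by
  spaces, lines by newlines. Words share a line when their :top values are
  within y-tolerance of the line's first word (an assumed rule)."
  [words y-tolerance]
  (->> (sort-by :top words)
       (reduce (fn [lines w]
                 (let [line (peek lines)]
                   (if (and line
                            (<= (Math/abs (double (- (:top w)
                                                     (:top (first line)))))
                                y-tolerance))
                     (conj (pop lines) (conj line w))
                     (conj lines [w]))))
               [])
       ;; Order each line left to right before joining its words.
       (map (fn [line] (str/join " " (map :text (sort-by :x0 line)))))
       (str/join "\n")))

(words->text [{:text "world" :x0 40.0 :top 100.0}
              {:text "Hello" :x0 10.0 :top 100.5}
              {:text "Bye"   :x0 10.0 :top 120.0}]
             3.0)
;; => "Hello world\nBye"
```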
`(words doc)`

`(words doc {:keys [x-tolerance y-tolerance] :or {x-tolerance default-tolerance y-tolerance default-tolerance} :as opts})`

Vector of word maps `{:text :x0 :top :x1 :bottom :page-number}`, in reading order. Options: `:page`, `:bbox`, `:x-tolerance` (default 3.0), `:y-tolerance` (default 3.0).
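The `:bbox` option keeps chars whose center falls inside `[x0 top x1 bottom]`. A minimal sketch of such a center test follows; `center-in-bbox?` is a hypothetical helper, and treating the box edges as inclusive is an assumption, since the docstring does not say how boundary cases are handled.

```clojure
;; Hypothetical helper: not part of the library's public API.
(defn center-in-bbox?
  "True when the center of a char map lies inside [bx0 btop bx1 bbottom].
  Edges are treated as inclusive here, which is an assumption."
  [{:keys [x0 top x1 bottom]} [bx0 btop bx1 bbottom]]
  (let [cx (/ (+ x0 x1) 2.0)
        cy (/ (+ top bottom) 2.0)]
    (and (<= bx0 cx bx1)
         (<= btop cy bbottom))))

;; Filtering a vector of char maps to a region of the page.
(defn chars-in-bbox [chars bbox]
  (filterv #(center-in-bbox? % bbox) chars))
```

Testing the center rather than the full box means a char that straddles the region's edge is kept or dropped as a whole, so no glyph is counted in two adjacent regions.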