com.blockether.svar.internal.rlm.internal.pageindex.vision

Vision/LLM-based text extraction from documents.

Provides:

  • image->base64 - Convert BufferedImage to base64 PNG string
  • image->bytes - Convert BufferedImage to PNG byte array
  • image->bytes-region - Extract and convert a bounding-box region to PNG bytes
  • extract-image-region - Crop a BufferedImage to a bounding-box region and return it as a base64 PNG string
  • scale-and-clamp-bbox - Scale and clamp bounding box coordinates to image dimensions
  • extract-text-from-image - Extract structured nodes from a single BufferedImage (vision)
  • extract-text-from-pdf - Extract structured nodes from all pages of a PDF (vision)
  • extract-text-from-text-file - Extract from text/markdown file (LLM, no image rendering)
  • extract-text-from-image-file - Extract from image file (vision)
  • extract-text-from-string - Extract from string content (LLM, no image rendering)
  • infer-document-title - Infer a document title from page content using LLM

Configuration is passed explicitly via opts maps. A multimodal LLM handles both image and text extraction. PDF pages are extracted in parallel using core.async channels.

BBOX_COORDINATE_SCALES clj

Bounding box coordinate scale factors by model.

Vision models return bbox coordinates in different formats:

  • Some use normalized coordinates (0-1000, 0-1, etc.)
  • Some use actual pixel coordinates (nil = no scaling needed)

This map defines the normalization scale for each model. If a model returns coords in a 0-N range, set the scale to N. If a model returns actual pixels, set it to nil.
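
The lookup described above can be pictured with a small sketch. The map keys, the `bbox-scale-for` helper, and the `scale-coord` helper are hypothetical names chosen for illustration; the real map's keys and entries may differ.

```clojure
;; Hypothetical model keys; the real map's keys and entries may differ.
(def bbox-coordinate-scales
  {"glm-4.6v" 1000   ; normalized 0-1000 coordinates
   "gpt-4o"   nil})  ; actual pixel coordinates, no scaling

(defn bbox-scale-for
  "Returns the normalization scale for model, or nil for pixel coordinates.
  Unknown models also yield nil in this sketch."
  [model]
  (get bbox-coordinate-scales model))

(defn scale-coord
  "Maps coord from 0-scale space onto 0-dimension pixels.
  With a nil scale the coordinate is returned unchanged."
  [coord dimension scale]
  (if scale
    (Math/round (double (/ (* coord dimension) scale)))
    coord))

(scale-coord 500 800 (bbox-scale-for "glm-4.6v")) ; => 400
(scale-coord 500 800 (bbox-scale-for "gpt-4o"))   ; => 500
```

Using nil rather than a scale of 1 for pixel coordinates keeps "no scaling" distinct from the 0-1 normalized case.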

DEFAULT_VISION_MODEL clj

Default vision model for text extraction.

DEFAULT_VISION_OBJECTIVE clj

Default system prompt for vision-based text extraction.

extract-image-region clj

(extract-image-region image bbox)

Extracts a region from a BufferedImage and returns it as base64.

Params:

  • image - BufferedImage. The source image.
  • bbox - Vector of [xmin, ymin, xmax, ymax] in pixel coordinates (already scaled).

Returns: String. Base64-encoded PNG of the cropped region, or nil if bbox is invalid.
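
As a rough illustration of the crop-and-encode behaviour described above, here is a minimal sketch built only on JDK classes. The name `crop-region->base64` and the exact validity checks are assumptions; the library's own validation rules are not documented here.

```clojure
(import '(java.awt.image BufferedImage)
        '(java.io ByteArrayOutputStream)
        '(java.util Base64)
        '(javax.imageio ImageIO))

(defn crop-region->base64
  "Sketch: crops image to [xmin ymin xmax ymax] (pixels) and returns a
  base64 PNG string, or nil when the box is empty or outside the image."
  [^BufferedImage image [xmin ymin xmax ymax]]
  (let [w (- xmax xmin)
        h (- ymax ymin)]
    (when (and (pos? w) (pos? h)
               (<= 0 xmin) (<= 0 ymin)
               (<= xmax (.getWidth image))
               (<= ymax (.getHeight image)))
      (let [region (.getSubimage image (int xmin) (int ymin) (int w) (int h))
            out    (ByteArrayOutputStream.)]
        ;; Encode the cropped region as PNG, then as base64 text.
        (ImageIO/write region "png" out)
        (.encodeToString (Base64/getEncoder) (.toByteArray out))))))

(def blank (BufferedImage. 10 10 BufferedImage/TYPE_INT_RGB))
(crop-region->base64 blank [5 5 5 9])   ; => nil (zero width)
(crop-region->base64 blank [0 0 20 20]) ; => nil (exceeds image bounds)
```

Returning nil instead of throwing lets a caller skip a bad bounding box without aborting the rest of the page.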

extract-text-from-image clj

(extract-text-from-image image
                         page-index
                         {:keys [model objective timeout-ms config]
                          :or {timeout-ms DEFAULT_VISION_TIMEOUT_MS}})

Extracts document content from a BufferedImage using a vision LLM.

Uses a typed node structure with parent-id references for hierarchy. Sections are logical groupings with AI-generated descriptions. Headings are separate nodes that belong to their Section.

Params:

  • image - BufferedImage. The image to extract from.
  • page-index - Integer. The page index (0-based).
  • opts - Map with:
      • :model - String. Vision model to use.
      • :objective - String. System prompt for OCR.
      • :config - Map. LLM config with :api-key, :base-url (from llm-config-component).
      • :timeout-ms - Integer, optional. HTTP timeout (default: DEFAULT_VISION_TIMEOUT_MS; this docstring gives 360000ms / 6 min).

Returns: Map with:

  • :page/index - Integer. The page index.
  • :page/nodes - Vector of typed document nodes (all fields namespaced as :page.node/X):
      • Section: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/description
      • Heading: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/level, :page.node/content
      • Paragraph: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/level, :page.node/content, :page.node/continuation?
      • ListItem: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/level, :page.node/content, :page.node/continuation?
      • Image: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/kind, :page.node/bbox, :page.node/caption, :page.node/description, :page.node/image-data (bytes)
      • Table: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/kind, :page.node/bbox, :page.node/caption, :page.node/description, :page.node/content (ASCII), :page.node/image-data (bytes)
      • Header: :page.node/type, :page.node/id, :page.node/content
      • Footer: :page.node/type, :page.node/id, :page.node/content
      • Metadata: :page.node/type, :page.node/id, :page.node/content
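
To make the parent-id hierarchy concrete, the sketch below builds a hand-written page map and walks it. The ids, the keyword type values, the integer levels, and the helper names `children-of` and `section-text` are all assumptions for illustration; the actual value formats returned by the extractor are not specified above.

```clojure
(require '[clojure.string :as str])

;; Hypothetical page: ids, type values, levels, and text are illustrative only.
(def sample-page
  {:page/index 0
   :page/nodes
   [{:page.node/type :section :page.node/id "s1" :page.node/parent-id nil
     :page.node/description "Overview of quarterly results"}
    {:page.node/type :heading :page.node/id "h1" :page.node/parent-id "s1"
     :page.node/level 1 :page.node/content "Q3 Results"}
    {:page.node/type :paragraph :page.node/id "p1" :page.node/parent-id "s1"
     :page.node/level 1 :page.node/content "Revenue grew."
     :page.node/continuation? false}
    {:page.node/type :footer :page.node/id "f1"
     :page.node/content "Page 1"}]})

(defn children-of
  "Returns the nodes whose :page.node/parent-id equals parent-id, in page order."
  [page parent-id]
  (filterv #(= parent-id (:page.node/parent-id %)) (:page/nodes page)))

(defn section-text
  "Joins the :page.node/content of a section's direct children."
  [page section-id]
  (->> (children-of page section-id)
       (keep :page.node/content)
       (str/join "\n")))

(section-text sample-page "s1") ; => "Q3 Results\nRevenue grew."
```

Because nodes are stored flat and linked by id, the heading and its paragraphs are recovered with a filter on :page.node/parent-id rather than by traversing nested maps.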

extract-text-from-image-file clj

(extract-text-from-image-file file-path {:keys [model refine?] :as opts})

Extracts document content from an image file using a vision LLM.

When :refine? is true, evaluates extraction quality and refines the result if it falls below the threshold.

Params:

  • file-path - String. Path to the image file (.png, .jpg, etc.).
  • opts - Map with:
      • :model - String. Vision model to use.
      • :objective - String. System prompt for OCR.
      • :config - Map. LLM config with :api-key, :base-url.
      • :timeout-ms - Integer, optional. HTTP timeout.
      • :refine? - Boolean, optional. Enable quality refinement.
      • :refine-model - String, optional. Model for eval/refine (default: gpt-4o).

Returns: Vector with a single map:

  • :page/index - Integer. Always 0.
  • :page/nodes - Vector of document nodes.

extract-text-from-pdf clj

(extract-text-from-pdf pdf-path
                       {:keys [model objective parallel timeout-ms config
                               refine? page-set]
                        :or {parallel 3 timeout-ms DEFAULT_VISION_TIMEOUT_MS}
                        :as opts})

Extracts document content from the pages of a PDF file using a vision LLM. All pages are processed unless :page-set restricts the selection.

Uses node-based document structure extraction. Each page contains a vector of semantic nodes (headings, paragraphs, images, tables, etc.).

Params:

  • pdf-path - String. Path to the PDF file.
  • opts - Map with:
      • :model - String. Vision model to use.
      • :objective - String. System prompt for OCR.
      • :config - Map. LLM config with :api-key, :base-url (from llm-config-component).
      • :parallel - Integer. Max concurrent extractions (default: 3, per the signature above).
      • :timeout-ms - Integer, optional. HTTP timeout per page (default: DEFAULT_VISION_TIMEOUT_MS; this docstring gives 180000ms / 3 min).
      • :page-set - Set of 0-indexed page numbers to extract, or nil for all pages. When provided, only pages in the set are sent to the vision LLM.

Returns: Vector of maps, one per page:

  • :page/index - Integer. The page number (0-based).
  • :page/nodes - Vector of document nodes (see extract-text-from-image for node structure).

Throws: Anomaly (fault) if any page fails to extract.
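
The combination of :page-set filtering and bounded concurrency can be sketched as follows. The namespace is documented as using core.async channels; to stay dependency-free, this sketch swaps in a java.util.concurrent fixed thread pool instead. The `extract-pages` function and its `extract-page` argument are hypothetical stand-ins, not part of the library.

```clojure
(import '(java.util.concurrent Callable ExecutorService Executors Future))

(defn extract-pages
  "Sketch: runs (extract-page i) for each selected page index with at most
  `parallel` calls in flight, returning results in page order.
  A failing page surfaces as an exception from .get."
  [page-count extract-page {:keys [parallel page-set] :or {parallel 3}}]
  (let [indices (cond->> (range page-count)
                  page-set (filter page-set)) ; keep only requested pages
        ^ExecutorService pool (Executors/newFixedThreadPool (int parallel))]
    (try
      (let [futures (mapv (fn [i]
                            (let [^Callable task (fn [] (extract-page i))]
                              (.submit pool task)))
                          indices)]
        ;; Block on each future in submission order to preserve page order.
        (mapv (fn [^Future f] (.get f)) futures))
      (finally
        (.shutdown pool)))))

(mapv :page/index
      (extract-pages 5
                     (fn [i] {:page/index i :page/nodes []})
                     {:parallel 2 :page-set #{0 3}}))
;; => [0 3]
```

Filtering before submitting work means pages outside :page-set never reach the extraction function, which matches the note that only pages in the set are sent to the vision LLM.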

extract-text-from-string clj

(extract-text-from-string content {:keys [model refine?] :as opts})

Extracts document content from a string using an LLM.

Sends text directly to the multimodal LLM (no image rendering). When :refine? is true, evaluates extraction quality and refines the result if it falls below the threshold.

Params:

  • content - String. Text/markdown content to extract from.
  • opts - Map with:
      • :model - String. LLM model to use.
      • :objective - String. System prompt for extraction.
      • :config - Map. LLM config with :api-key, :base-url.
      • :timeout-ms - Integer, optional. HTTP timeout.
      • :refine? - Boolean, optional. Enable quality refinement.
      • :refine-model - String, optional. Model for eval/refine (default: gpt-4o).

Returns: Vector with a single map:

  • :page/index - Integer. Always 0.
  • :page/nodes - Vector of document nodes.
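
The evaluate-then-refine behaviour behind :refine? might be pictured roughly as below. Everything here is an assumption for illustration: the `extract`, `evaluate`, and `refine` functions are caller-supplied stubs, and the 0.8 threshold is invented, since the actual scoring scale and threshold are not documented above.

```clojure
(defn extract-with-refinement
  "Sketch: extracts once, scores the result, and refines only when
  :refine? is set and the score falls below threshold.
  extract, evaluate, refine, and threshold are hypothetical."
  [content {:keys [extract evaluate refine refine? threshold]
            :or {threshold 0.8}}]
  (let [nodes (extract content)]
    [{:page/index 0 ; single-document inputs always map to page 0
      :page/nodes (if (and refine? (< (evaluate nodes) threshold))
                    (refine nodes)
                    nodes)}]))

;; Stub functions standing in for LLM calls.
(extract-with-refinement
  "hello"
  {:extract  (fn [s] [{:page.node/content s}])
   :evaluate (fn [_] 0.5)
   :refine   (fn [nodes] (conj nodes {:page.node/content "refined"}))
   :refine?  true})
;; => [{:page/index 0
;;      :page/nodes [{:page.node/content "hello"}
;;                   {:page.node/content "refined"}]}]
```

With :refine? absent or false, the evaluation step is skipped entirely, so no extra model call is made.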

extract-text-from-text-file clj

(extract-text-from-text-file file-path {:keys [model refine?] :as opts})

Extracts document content from a text or markdown file using an LLM.

Sends text directly to the multimodal LLM (no image rendering). When :refine? is true, evaluates extraction quality and refines the result if it falls below the threshold.

Params:

  • file-path - String. Path to the text/markdown file.
  • opts - Map with:
      • :model - String. LLM model to use.
      • :objective - String. System prompt for extraction.
      • :config - Map. LLM config with :api-key, :base-url.
      • :timeout-ms - Integer, optional. HTTP timeout.
      • :refine? - Boolean, optional. Enable quality refinement.
      • :refine-model - String, optional. Model for eval/refine (default: gpt-4o).

Returns: Vector with a single map:

  • :page/index - Integer. Always 0.
  • :page/nodes - Vector of document nodes.

image->base64 clj

(image->base64 image)

Converts a BufferedImage to a base64-encoded PNG string.

Params:

  • image - BufferedImage. The image to convert.

Returns: String. Base64-encoded PNG data (without the data:image/png;base64, prefix).
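
Since the returned string carries no data-URI prefix, a caller that needs one has to add it. The sketch below shows one way the PNG and base64 steps could look using only JDK classes; `png-bytes`, `png-base64`, and `data-uri` are hypothetical helper names, not the library's functions.

```clojure
(import '(java.awt.image BufferedImage)
        '(java.io ByteArrayInputStream ByteArrayOutputStream)
        '(java.util Base64)
        '(javax.imageio ImageIO))

(defn png-bytes
  "Encodes image as PNG and returns the raw bytes."
  ^bytes [^BufferedImage image]
  (let [out (ByteArrayOutputStream.)]
    (ImageIO/write image "png" out)
    (.toByteArray out)))

(defn png-base64
  "Returns the PNG encoding of image as base64 text, without any prefix."
  ^String [^BufferedImage image]
  (.encodeToString (Base64/getEncoder) (png-bytes image)))

(defn data-uri
  "Prepends the data-URI prefix that the bare base64 string omits."
  [^String b64]
  (str "data:image/png;base64," b64))

;; Round trip: decode the base64 text and read the PNG back.
(let [img     (BufferedImage. 4 3 BufferedImage/TYPE_INT_RGB)
      decoded (ImageIO/read
                (ByteArrayInputStream.
                  (.decode (Base64/getDecoder) (png-base64 img))))]
  [(.getWidth decoded) (.getHeight decoded)])
;; => [4 3]
```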

image->bytes clj

(image->bytes image)

Converts a BufferedImage to raw PNG bytes.

Params:

  • image - BufferedImage. The image to convert.

Returns: byte[]. Raw PNG bytes.

image->bytes-region clj

(image->bytes-region image bbox)

Extracts a region from a BufferedImage and returns it as PNG bytes.

Params:

  • image - BufferedImage. The source image.
  • bbox - Vector of [xmin, ymin, xmax, ymax] in pixel coordinates (already scaled).

Returns: byte[]. PNG bytes of the cropped region, or nil if bbox is invalid.

infer-document-title clj

(infer-document-title pages
                      {:keys [model config timeout-ms] :or {timeout-ms 30000}})

Infers the document title from extracted content using an LLM.

Analyzes the document structure (headings, metadata, first paragraphs) to determine the most appropriate title.

Params:

  • pages - Vector of page maps with :page/nodes.
  • opts - Map with:
      • :model - String. LLM model to use.
      • :config - Map. LLM config with :api-key, :base-url.
      • :timeout-ms - Integer, optional. HTTP timeout (default: 30000ms).

Returns: String. The inferred document title, or nil if it cannot be inferred.

scale-and-clamp-bbox clj

(scale-and-clamp-bbox bbox width height bbox-scale)

Scales a bounding box from model coordinates to pixel coordinates, adds padding, then clamps to valid image dimensions.

Different vision models return bbox in different formats:

  • GLM-4.6V: normalized 0-1000 coordinates
  • GPT-4o/Claude: actual pixel coordinates

Params:

  • bbox - Vector of [xmin, ymin, xmax, ymax] in model coordinates.
  • width - Integer. Image width in pixels.
  • height - Integer. Image height in pixels.
  • bbox-scale - Integer or nil. If set, coords are in 0-N normalized space and will be scaled to pixels. If nil, coords are already pixels.

Returns: Vector of [xmin, ymin, xmax, ymax] in actual pixels with padding, clamped to the valid range, or nil if invalid.
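
The scale, pad, and clamp steps can be worked through with a small sketch. The 5-pixel padding, the rounding rule, and the validity check are assumptions (the real padding amount is not documented above), and `scale-and-clamp` is a stand-in name rather than the library function itself.

```clojure
(def padding-px 5) ; hypothetical padding; the real value is not documented

(defn scale-and-clamp
  "Sketch: scales bbox into pixel space when bbox-scale is set, pads each
  edge by padding-px, clamps to the image, and returns nil for a malformed
  or empty box."
  [bbox width height bbox-scale]
  (when (and (= 4 (count bbox)) (every? number? bbox))
    (let [[xmin ymin xmax ymax] bbox
          ;; 0-N normalized value -> pixels along a dimension of size dim
          ->px  (fn [v dim] (if bbox-scale (/ (* v dim) bbox-scale) v))
          clamp (fn [v hi] (-> (Math/round (double v)) (max 0) (min hi)))
          x0 (clamp (- (->px xmin width) padding-px) width)
          y0 (clamp (- (->px ymin height) padding-px) height)
          x1 (clamp (+ (->px xmax width) padding-px) width)
          y1 (clamp (+ (->px ymax height) padding-px) height)]
      (when (and (< x0 x1) (< y0 y1))
        [x0 y0 x1 y1]))))

;; Normalized 0-1000 input on an 800x1000 image.
(scale-and-clamp [100 200 500 600] 800 1000 1000) ; => [75 195 405 605]
;; Pixel input that spills past the image edges is clamped.
(scale-and-clamp [-10 -10 900 50] 800 1000 nil)   ; => [0 0 800 55]
;; Inverted box stays inverted after padding, so it is rejected.
(scale-and-clamp [500 500 400 600] 800 1000 nil)  ; => nil
```

Padding before clamping means a box near the image edge simply loses the part of the padding that would fall outside the image.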
