com.blockether.svar.internal.rlm.internal.pageindex.vision

Vision/LLM-based text extraction from documents.

Provides:
- `image->base64` - Convert BufferedImage to base64 PNG string
- `image->bytes` - Convert BufferedImage to PNG byte array
- `image->bytes-region` - Extract and convert a bounding-box region to PNG bytes
- `extract-image-region` - Crop a bounding-box region of a BufferedImage and return it as a base64 PNG string
- `scale-and-clamp-bbox` - Scale and clamp bounding box coordinates to image dimensions
- `extract-text-from-image` - Extract structured nodes from a single BufferedImage (vision)
- `extract-text-from-pdf` - Extract structured nodes from all pages of a PDF (vision)
- `extract-text-from-text-file` - Extract from text/markdown file (LLM, no image rendering)
- `extract-text-from-image-file` - Extract from image file (vision)
- `extract-text-from-string` - Extract from string content (LLM, no image rendering)
- `infer-document-title` - Infer a document title from page content using LLM

Configuration is passed explicitly via opts maps.
A multimodal LLM is used for both image and text extraction.
PDF pages are extracted in parallel using core.async channels.

BBOX_COORDINATE_SCALES^clj

Bounding box coordinate scale factors by model.

Vision models return bbox coordinates in different formats:
- Some use normalized coordinates (0-1000, 0-1, etc.)
- Some use actual pixel coordinates (nil = no scaling needed)

This map defines the normalization scale for each model.
If a model returns coords in 0-N range, set scale to N.
If a model returns actual pixels, set to nil.
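
As an illustration of the convention described above, a scale map could look like the following minimal sketch. The model keys, the name `example-bbox-scales`, and the helper `bbox-scale-for` are hypothetical; the actual contents of `BBOX_COORDINATE_SCALES` are not shown on this page.

```clojure
;; Hypothetical model keys and values; the library's actual map may differ.
(def example-bbox-scales
  {"glm-4.6v" 1000   ; coords reported in a 0-1000 normalized space
   "gpt-4o"   nil})  ; coords reported as actual pixels, no scaling

(defn bbox-scale-for
  "Looks up the scale for `model`. Unknown models fall back to nil,
  which this sketch treats as 'already in pixels'."
  [model]
  (get example-bbox-scales model))

(bbox-scale-for "glm-4.6v") ; => 1000
(bbox-scale-for "gpt-4o")   ; => nil
```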

DEFAULT_VISION_MODEL^clj

Default vision model for text extraction.

DEFAULT_VISION_OBJECTIVE^clj

Default system prompt for vision-based text extraction.

extract-image-region^clj

(extract-image-region image bbox)

Extracts a region from a BufferedImage and returns it as base64.

Params:
`image` - BufferedImage. The source image.
`bbox` - Vector of [xmin, ymin, xmax, ymax] in PIXEL coordinates (already scaled).

Returns:
String. Base64-encoded PNG of the cropped region, or nil if bbox is invalid.

extract-text-from-image^clj

(extract-text-from-image image
                         page-index
                         {:keys [model objective timeout-ms config]
                          :or {timeout-ms DEFAULT_VISION_TIMEOUT_MS}})

Extracts document content from a BufferedImage using vision LLM.

Uses typed node structure with parent-id references for hierarchy.
Sections are logical groupings with AI-generated descriptions.
Headings are separate nodes that belong to their Section.

 Params:
 `image` - BufferedImage. The image to extract from.
 `page-index` - Integer. The page index (0-based).
 `opts` - Map with:
   `:model` - String. Vision model to use.
   `:objective` - String. System prompt for OCR.
   `:config` - Map. LLM config with :api-key, :base-url (from llm-config-component).
   `:timeout-ms` - Integer, optional. HTTP timeout (default: 360000ms / 6 min).

 Returns:
 Map with:
   `:page/index` - Integer. The page index.
   `:page/nodes` - Vector of typed document nodes (all fields namespaced as :page.node/X):
     - Section: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/description
     - Heading: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/level, :page.node/content
     - Paragraph: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/level, :page.node/content, :page.node/continuation?
     - ListItem: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/level, :page.node/content, :page.node/continuation?
     - Image: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/kind, :page.node/bbox, :page.node/caption, :page.node/description, :page.node/image-data (bytes)
     - Table: :page.node/type, :page.node/id, :page.node/parent-id, :page.node/kind, :page.node/bbox, :page.node/caption, :page.node/description, :page.node/content (ASCII), :page.node/image-data (bytes)
     - Header: :page.node/type, :page.node/id, :page.node/content
     - Footer: :page.node/type, :page.node/id, :page.node/content
     - Metadata: :page.node/type, :page.node/id, :page.node/content
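
To make the parent-id hierarchy concrete, here is a small sketch that groups node ids under their parent. The sample page is hypothetical: the node type values (keywords here), the ids, and the `:page.node/level` value are assumptions, and `children-by-parent` is an illustrative helper, not part of this namespace.

```clojure
;; Hypothetical sample data shaped like the return value described above.
(def sample-page
  {:page/index 0
   :page/nodes
   [{:page.node/type :section
     :page.node/id "s1"
     :page.node/parent-id nil
     :page.node/description "Introduction to the report"}
    {:page.node/type :heading
     :page.node/id "h1"
     :page.node/parent-id "s1"
     :page.node/level 1
     :page.node/content "Introduction"}
    {:page.node/type :paragraph
     :page.node/id "p1"
     :page.node/parent-id "s1"
     :page.node/level 1
     :page.node/content "This report covers the first quarter."
     :page.node/continuation? false}]})

(defn children-by-parent
  "Returns a map of parent-id -> vector of child node ids, in page order.
  Top-level nodes are collected under the nil key."
  [page]
  (->> (:page/nodes page)
       (group-by :page.node/parent-id)
       (reduce-kv (fn [m parent nodes]
                    (assoc m parent (mapv :page.node/id nodes)))
                  {})))

(children-by-parent sample-page)
;; => {nil ["s1"], "s1" ["h1" "p1"]}
```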

extract-text-from-image-file^clj

(extract-text-from-image-file file-path {:keys [model refine?] :as opts})

Extracts document content from an image file using vision LLM.

When :refine? is true, evaluates extraction quality and refines if below threshold.

Params:
`file-path` - String. Path to the image file (.png, .jpg, etc.).
`opts` - Map with:
  `:model` - String. Vision model to use.
  `:objective` - String. System prompt for OCR.
  `:config` - Map. LLM config with :api-key, :base-url.
  `:timeout-ms` - Integer, optional. HTTP timeout.
  `:refine?` - Boolean, optional. Enable quality refinement.
  `:refine-model` - String, optional. Model for eval/refine (default: gpt-4o).

Returns:
Vector with single map:
  `:page/index` - Integer. Always 0.
  `:page/nodes` - Vector of document nodes.

extract-text-from-pdf^clj

(extract-text-from-pdf pdf-path
                       {:keys [model objective parallel timeout-ms config
                               refine? page-set]
                        :or {parallel 3 timeout-ms DEFAULT_VISION_TIMEOUT_MS}
                        :as opts})

Extracts document content from all pages of a PDF file using vision LLM.

Uses node-based document structure extraction. Each page contains a vector of
semantic nodes (headings, paragraphs, images, tables, etc.).

Params:
`pdf-path` - String. Path to the PDF file.
`opts` - Map with:
  `:model` - String. Vision model to use.
  `:objective` - String. System prompt for OCR.
  `:config` - Map. LLM config with :api-key, :base-url (from llm-config-component).
  `:parallel` - Integer. Max concurrent extractions (default: 3).
  `:timeout-ms` - Integer, optional. HTTP timeout per page (default: DEFAULT_VISION_TIMEOUT_MS).
  `:page-set` - Set of 0-indexed page numbers to extract, or nil for all pages.
                When provided, only pages in the set are sent to the vision LLM.

Returns:
Vector of maps, one per page:
  `:page/index` - Integer. The page number (0-based).
  `:page/nodes` - Vector of document nodes (see extract-text-from-image for node structure).

Throws:
Anomaly (fault) if any page fails to extract.
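
The interplay of `:parallel` and `:page-set` can be sketched as below. This is not the library's implementation: the namespace uses core.async, whereas this dependency-free sketch swaps in a fixed-size thread pool from `java.util.concurrent`. The names `extract-pages-sketch` and `extract-fn` (a stand-in for the per-page vision call) are hypothetical, and a failing page simply propagates its exception here rather than being wrapped in an anomaly.

```clojure
(import '(java.util.concurrent Callable Executors Future))

(defn extract-pages-sketch
  "Runs `extract-fn` on each selected page index, with at most
  `parallel` calls in flight. Results are returned in page order."
  [page-count extract-fn {:keys [parallel page-set] :or {parallel 3}}]
  (let [;; nil page-set means all pages; otherwise keep only listed indices.
        indices (cond->> (range page-count)
                  page-set (filter page-set))
        pool    (Executors/newFixedThreadPool parallel)]
    (try
      (let [futures (mapv (fn [i]
                            (.submit pool ^Callable (fn [] (extract-fn i))))
                          indices)]
        ;; Dereference in submission order so results stay sorted by page.
        (mapv (fn [^Future f] (.get f)) futures))
      (finally
        (.shutdown pool)))))

(extract-pages-sketch 5
                      (fn [i] {:page/index i :page/nodes []})
                      {:parallel 2 :page-set #{0 3}})
;; => [{:page/index 0, :page/nodes []} {:page/index 3, :page/nodes []}]
```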

extract-text-from-string^clj

(extract-text-from-string content {:keys [model refine?] :as opts})

Extracts document content from string content using LLM.

Sends text directly to the multimodal LLM (no image rendering).
When :refine? is true, evaluates extraction quality and refines if below threshold.

Params:
`content` - String. Text/markdown content to extract from.
`opts` - Map with:
  `:model` - String. LLM model to use.
  `:objective` - String. System prompt for extraction.
  `:config` - Map. LLM config with :api-key, :base-url.
  `:timeout-ms` - Integer, optional. HTTP timeout.
  `:refine?` - Boolean, optional. Enable quality refinement.
  `:refine-model` - String, optional. Model for eval/refine (default: gpt-4o).

Returns:
Vector with single map:
  `:page/index` - Integer. Always 0.
  `:page/nodes` - Vector of document nodes.

extract-text-from-text-file^clj

(extract-text-from-text-file file-path {:keys [model refine?] :as opts})

Extracts document content from a text or markdown file using LLM.

Sends text directly to the multimodal LLM (no image rendering).
When :refine? is true, evaluates extraction quality and refines if below threshold.

Params:
`file-path` - String. Path to the text/markdown file.
`opts` - Map with:
  `:model` - String. LLM model to use.
  `:objective` - String. System prompt for extraction.
  `:config` - Map. LLM config with :api-key, :base-url.
  `:timeout-ms` - Integer, optional. HTTP timeout.
  `:refine?` - Boolean, optional. Enable quality refinement.
  `:refine-model` - String, optional. Model for eval/refine (default: gpt-4o).

Returns:
Vector with single map:
  `:page/index` - Integer. Always 0.
  `:page/nodes` - Vector of document nodes.

image->base64^clj

(image->base64 image)

Converts a BufferedImage to a base64-encoded PNG string.

Params:
`image` - BufferedImage. The image to convert.

Returns:
String. Base64-encoded PNG data (without data:image/png;base64, prefix).
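
A conversion like this can be written with standard JDK classes. The sketch below is one plausible way to do it, not necessarily how this namespace implements it; `image->png-bytes` and `png-bytes->base64` are illustrative names.

```clojure
(import '(java.awt.image BufferedImage)
        '(java.io ByteArrayOutputStream)
        '(java.util Base64)
        '(javax.imageio ImageIO))

(defn image->png-bytes
  "Encodes a BufferedImage as PNG and returns the raw bytes."
  ^bytes [^BufferedImage image]
  (let [out (ByteArrayOutputStream.)]
    (ImageIO/write image "png" out)
    (.toByteArray out)))

(defn png-bytes->base64
  "Base64-encodes PNG bytes. No data-URI prefix is added."
  [^bytes png-bytes]
  (.encodeToString (Base64/getEncoder) png-bytes))

;; A tiny blank image is enough to exercise the round trip.
(def tiny-image (BufferedImage. 2 2 BufferedImage/TYPE_INT_RGB))
(def encoded (png-bytes->base64 (image->png-bytes tiny-image)))
```

Because every PNG starts with the same 8-byte signature, the encoded string always begins with `iVBORw0KGgo`, which is a quick sanity check that the output is bare base64 rather than a `data:` URI.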

image->bytes^clj

(image->bytes image)

Converts a BufferedImage to raw PNG bytes.

Params:
`image` - BufferedImage. The image to convert.

Returns:
byte[]. Raw PNG bytes.

image->bytes-region^clj

(image->bytes-region image bbox)

Extracts a region from a BufferedImage and returns it as PNG bytes.

Params:
`image` - BufferedImage. The source image.
`bbox` - Vector of [xmin, ymin, xmax, ymax] in PIXEL coordinates (already scaled).

Returns:
byte[]. PNG bytes of the cropped region, or nil if bbox is invalid.
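
Cropping by pixel bbox maps naturally onto `BufferedImage.getSubimage`. The following is a minimal sketch under the assumption that an "invalid" bbox means a malformed, empty, or out-of-bounds rectangle; the library's exact validity rules are not documented here, and `crop-region->png-bytes` is a hypothetical name.

```clojure
(import '(java.awt.image BufferedImage)
        '(java.io ByteArrayOutputStream)
        '(javax.imageio ImageIO))

(defn crop-region->png-bytes
  "Crops `image` to the pixel bbox [xmin ymin xmax ymax] and returns PNG
  bytes, or nil when the bbox is malformed, empty, or out of bounds."
  [^BufferedImage image bbox]
  (let [[xmin ymin xmax ymax] bbox]
    (when (and (= 4 (count bbox))
               (every? integer? bbox)
               (<= 0 xmin) (< xmin xmax) (<= xmax (.getWidth image))
               (<= 0 ymin) (< ymin ymax) (<= ymax (.getHeight image)))
      (let [region (.getSubimage image
                                 (int xmin) (int ymin)
                                 (int (- xmax xmin)) (int (- ymax ymin)))
            out    (ByteArrayOutputStream.)]
        (ImageIO/write region "png" out)
        (.toByteArray out)))))

(def source-image (BufferedImage. 10 10 BufferedImage/TYPE_INT_RGB))
(def cropped-bytes (crop-region->png-bytes source-image [2 2 6 5]))
```

Here `cropped-bytes` decodes to a 4x3 image, while a reversed bbox such as `[6 2 2 5]` yields nil.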

infer-document-title^clj

(infer-document-title pages
                      {:keys [model config timeout-ms] :or {timeout-ms 30000}})

Infers document title from extracted content using LLM.

Analyzes the document structure (headings, metadata, first paragraphs)
to determine the most appropriate title.

Params:
`pages` - Vector of page maps with :page/nodes.
`opts` - Map with:
  `:model` - String. LLM model to use.
  `:config` - Map. LLM config with :api-key, :base-url.
  `:timeout-ms` - Integer, optional. HTTP timeout (default: 30000ms).

Returns:
String. The inferred document title, or nil if it cannot be inferred.

scale-and-clamp-bbox^clj

(scale-and-clamp-bbox bbox width height bbox-scale)

Scales bounding box from model coordinates to pixel coordinates,
adds padding, then clamps to valid image dimensions.

Different vision models return bbox in different formats:
- GLM-4.6V: normalized 0-1000 coordinates
- GPT-4o/Claude: actual pixel coordinates

Params:
`bbox` - Vector of [xmin, ymin, xmax, ymax] in model coordinates.
`width` - Integer. Image width in pixels.
`height` - Integer. Image height in pixels.
`bbox-scale` - Integer or nil. If set, coords are in 0-N normalized space
               and will be scaled to pixels. If nil, coords are already pixels.

Returns:
Vector of [xmin, ymin, xmax, ymax] in actual pixels with padding, clamped to valid range,
or nil if invalid.
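
The scale, pad, and clamp steps can be sketched as follows. The padding amount (`example-padding`, 4 pixels), the truncation of fractional coordinates, and the definition of "invalid" (malformed or empty after clamping) are all assumptions made for illustration; `scale-and-clamp-sketch` is not the library function.

```clojure
(def example-padding
  "Hypothetical padding in pixels; the library's real value is not documented here."
  4)

(defn scale-and-clamp-sketch
  "Scales bbox to pixels when `bbox-scale` is set, pads it, and clamps it
  to the image. Returns nil for malformed or empty boxes."
  [[xmin ymin xmax ymax :as bbox] width height bbox-scale]
  (when (and (= 4 (count bbox)) (every? number? bbox))
    (let [;; With a scale N, coords live in 0-N space: multiply by size/N.
          sx    (if bbox-scale (/ (double width) bbox-scale) 1.0)
          sy    (if bbox-scale (/ (double height) bbox-scale) 1.0)
          clamp (fn [v lo hi] (-> v (max lo) (min hi)))
          x0    (clamp (- (long (* xmin sx)) example-padding) 0 width)
          y0    (clamp (- (long (* ymin sy)) example-padding) 0 height)
          x1    (clamp (+ (long (* xmax sx)) example-padding) 0 width)
          y1    (clamp (+ (long (* ymax sy)) example-padding) 0 height)]
      (when (and (< x0 x1) (< y0 y1))
        [x0 y0 x1 y1]))))

;; Normalized 0-1000 coords on a 2000x1000 image.
(scale-and-clamp-sketch [100 200 500 600] 2000 1000 1000)
;; => [196 196 1004 604]

;; Pixel coords (scale nil) that spill outside a 100x80 image.
(scale-and-clamp-sketch [-10 5 50 3000] 100 80 nil)
;; => [0 1 54 80]
```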
