com.blockether.svar.internal.rlm.internal.pageindex

com.blockether.svar.internal.rlm.internal.pageindex.core

Main API for RLM document indexing - extracts structured data from documents.

Primary functions:
- `build-index` - Extract structure from file path or string content
- `index!` - Index and save to EDN + PNG files
- `load-index` - Load indexed document from EDN directory
- `inspect` - Print full document summary with TOC tree
- `print-toc-tree` - Print a formatted TOC tree from TOC entries

Supported file types:
- PDF (.pdf) - Uses vision LLM for node-based extraction
- Markdown (.md, .markdown) - Parses heading structure into pages (no LLM needed)
- Plain text (.txt, .text) - Uses LLM for text extraction
- Images (.png, .jpg, .jpeg, .gif, .bmp, .webp) - Direct vision LLM extraction
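
The extension-to-strategy mapping above can be pictured as a simple lookup. The following is a minimal sketch, not the library's implementation: `extension->strategy`, `file-extension`, `extraction-strategy`, and the strategy keywords are hypothetical names introduced only for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical lookup table derived from the list of supported file types.
;; The strategy keywords are illustrative labels, not values used by the library.
(def extension->strategy
  {"pdf"      :pdf-vision
   "md"       :markdown
   "markdown" :markdown
   "txt"      :text-llm
   "text"     :text-llm
   "png"      :image-vision
   "jpg"      :image-vision
   "jpeg"     :image-vision
   "gif"      :image-vision
   "bmp"      :image-vision
   "webp"     :image-vision})

(defn file-extension
  "Returns the lower-cased extension of a path, or nil when there is none."
  [path]
  (let [filename (last (str/split path #"[/\\]"))
        idx      (str/last-index-of filename ".")]
    ;; idx 0 means a dotfile such as `.gitignore`, which has no extension.
    (when (and idx (pos? idx))
      (str/lower-case (subs filename (inc idx))))))

(defn extraction-strategy
  "Looks up the strategy for a path, falling back to :unsupported."
  [path]
  (get extension->strategy (file-extension path) :unsupported))

(defn needs-llm?
  "Only the markdown strategy is described as working without an LLM."
  [path]
  (not (contains? #{:markdown :unsupported} (extraction-strategy path))))
```

Under these assumptions, `(extraction-strategy "docs/Guide.MD")` resolves to the markdown strategy and `(needs-llm? "manual.pdf")` is true.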

Markdown files are parsed deterministically by heading structure:
- Top-level headings (h1, or first level found) become page boundaries
- Nested headings become section nodes within each page
- No LLM required for structure extraction

Post-processing:
1. Translates local node IDs to globally unique UUIDs
2. If no TOC exists in document, generates one from Section/Heading structure
3. Links TocEntry target-section-id to matching Section nodes
4. Generates document abstract from all section descriptions using Chain of Density
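
Step 1 can be illustrated with a small sketch. It assumes nodes are maps carrying a local `:node/id` (a hypothetical key name; only `:node/parent-id` is named in these docs) and shows one way to swap local IDs for UUIDs while keeping parent references consistent. It is not the library's actual code.

```clojure
(defn translate-ids
  "Replaces each local :node/id with a fresh UUID and rewrites
   :node/parent-id through the same mapping. Root nodes keep a nil parent.
   :node/id is an assumed key name used for illustration."
  [nodes]
  (let [local->uuid (into {}
                          (map (fn [node] [(:node/id node) (java.util.UUID/randomUUID)]))
                          nodes)]
    (mapv (fn [node]
            (-> node
                (update :node/id local->uuid)
                ;; A nil parent is not a key in the mapping, so it stays nil.
                (update :node/parent-id local->uuid)))
          nodes)))

;; Usage: two nodes with local integer IDs, the second a child of the first.
(def translated
  (translate-ids [{:node/id 1 :node/parent-id nil}
                  {:node/id 2 :node/parent-id 1}]))
```

Building the whole mapping before rewriting any node means a child can appear before its parent in the input without breaking the reference.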

Usage:
(require '[com.blockether.svar.internal.rlm.internal.pageindex.core :as pageindex])

;; Index a PDF
(def doc (pageindex/build-index "manual.pdf"))

;; Index and save to EDN + PNG files
(pageindex/index! "manual.pdf")
;; => {:document {...} :output-path "manual.pageindex"}

;; Load and inspect (includes TOC tree)
(pageindex/inspect "manual.pageindex")

com.blockether.svar.internal.rlm.internal.pageindex.markdown

Markdown parsing for RLM - extracts hierarchical structure from markdown files.

Primary functions:
- `markdown->pages` - Main API: convert markdown string to page-based format
- `markdown-file->pages` - Convenience: reads file and calls markdown->pages

Design:
- Top-level headings (h1, or first heading level found) become 'pages'
- Nested headings become nodes within each page
- Code blocks are skipped when parsing headings
- Each section includes text from heading to next heading
- No LLM required - deterministic parsing
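
The design points above can be sketched in a few lines. This is an illustrative approximation, not the source of `markdown->pages`: the function names, the output keys (`:title`, `:headings`, `:lines`), and the choice to treat the level of the first heading as the page level are all assumptions. It handles only ATX (`#`) headings and backtick fences.

```clojure
(require '[clojure.string :as str])

(defn heading-level
  "Returns the ATX heading level (1-6) of a line, or nil for non-headings."
  [line]
  (when-let [[_ hashes] (re-matches #"(#{1,6})\s+.*" line)]
    (count hashes)))

(defn heading-title
  "Strips the leading # markers from a heading line."
  [line]
  (str/trim (str/replace line #"^#+\s*" "")))

(defn tag-lines
  "Pairs each line with its heading level. Lines inside a fenced code
   block, and the fence lines themselves, always get a nil level."
  [lines]
  (:out
   (reduce (fn [{:keys [in-fence?] :as state} line]
             (let [fence? (str/starts-with? (str/triml line) "```")
                   level  (when-not (or fence? in-fence?) (heading-level line))]
               (-> state
                   (assoc :in-fence? (if fence? (not in-fence?) in-fence?))
                   (update :out conj {:line line :level level}))))
           {:in-fence? false :out []}
           lines)))

(defn markdown->pages-sketch
  "Splits markdown into pages at headings of the page level (assumed here
   to be the level of the first heading). Text before the first heading is
   dropped in this sketch."
  [markdown]
  (let [tagged     (tag-lines (str/split-lines markdown))
        page-level (some :level tagged)]
    (->> tagged
         (drop-while #(nil? (:level %)))
         (reduce (fn [pages {:keys [line level]}]
                   (if (= level page-level)
                     ;; A page-level heading opens a new page.
                     (conj pages {:title (heading-title line) :headings [] :lines [line]})
                     ;; Everything else is appended to the current page.
                     (let [i (dec (count pages))]
                       (cond-> (update-in pages [i :lines] conj line)
                         level (update-in [i :headings] conj
                                          {:level level :title (heading-title line)})))))
                 []))))
```

Because the fence state is tracked line by line, a `# comment` inside a code block is kept as page text but never starts a page or a nested node.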

com.blockether.svar.internal.rlm.internal.pageindex.pdf

PDF to images conversion and metadata extraction using Apache PDFBox.

Provides:
- `pdf->images` - Convert PDF file to vector of BufferedImage objects
- `page-count` - Get total page count of a PDF file
- `pdf-metadata` - Extract PDF metadata (author, title, dates, etc.)
- `detect-text-rotation` - Detect content rotation per page using text position heuristics

Uses PDFBox for reliable PDF rendering at configurable DPI.
Handles error cases: encrypted PDFs, corrupted files, file not found.

com.blockether.svar.internal.rlm.internal.pageindex.spec

Comprehensive clojure.spec definitions for RLM data structures.

This namespace centralizes ALL specs for the RLM system to provide
a clear view of the complete data model. Individual namespaces will require
this namespace and use these specs for validation.

Data Model Philosophy:
- FLAT structure with parent references (Datalevin-style)
- All keywords are namespaced (`:node/*`, `:page/*`, `:toc/*`)
- Vector of maps output (not nested trees)
- `:node/parent-id` creates hierarchy (nil for root nodes)
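
To make the flat, parent-referenced model concrete, here is a minimal sketch using `clojure.spec.alpha`. Only `:node/parent-id` is taken from the description above; `:node/id`, the `:sketch/*` spec names, and `children-of` are hypothetical and are not the specs defined in this namespace.

```clojure
(require '[clojure.spec.alpha :as s])

;; Assumed attribute specs: IDs are UUIDs, and a nil parent marks a root node.
(s/def :node/id uuid?)
(s/def :node/parent-id (s/nilable uuid?))

;; Hypothetical spec names, namespaced under :sketch to mark them as illustrative.
(s/def :sketch/node (s/keys :req [:node/id :node/parent-id]))
(s/def :sketch/nodes (s/coll-of :sketch/node :kind vector?))

(defn children-of
  "Returns the nodes whose :node/parent-id equals parent-id.
   Passing nil returns the root nodes."
  [nodes parent-id]
  (filterv #(= parent-id (:node/parent-id %)) nodes))

;; Usage: a flat vector of maps; the hierarchy lives only in the parent references.
(def root-id (java.util.UUID/randomUUID))
(def child-id (java.util.UUID/randomUUID))

(def example-nodes
  [{:node/id root-id  :node/parent-id nil}
   {:node/id child-id :node/parent-id root-id}])

(s/valid? :sketch/nodes example-nodes)
(children-of example-nodes root-id)
```

A flat vector like this can be validated element by element and stored row by row, while a tree view can still be derived on demand by following `:node/parent-id`.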

com.blockether.svar.internal.rlm.internal.pageindex.vision

Vision/LLM-based text extraction from documents.

Provides:
- `image->base64` - Convert BufferedImage to base64 PNG string
- `image->bytes` - Convert BufferedImage to PNG byte array
- `image->bytes-region` - Extract and convert a bounding-box region to PNG bytes
- `extract-image-region` - Crop a BufferedImage to a bounding-box region
- `scale-and-clamp-bbox` - Scale and clamp bounding box coordinates to image dimensions
- `extract-text-from-image` - Extract structured nodes from a single BufferedImage (vision)
- `extract-text-from-pdf` - Extract structured nodes from all pages of a PDF (vision)
- `extract-text-from-text-file` - Extract from text/markdown file (LLM, no image rendering)
- `extract-text-from-image-file` - Extract from image file (vision)
- `extract-text-from-string` - Extract from string content (LLM, no image rendering)
- `infer-document-title` - Infer a document title from page content using LLM

Configuration is passed explicitly via opts maps.
Uses multimodal LLM for both image and text extraction.
Parallel extraction using core.async channels for PDFs.
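
The image helpers listed above can be approximated with JDK classes alone. The sketch below is an assumption-laden illustration, not the library's code: the function names carry a `-sketch` suffix to keep them distinct from the real API, and the bounding-box shape (`[x1 y1 x2 y2]`), the separate scale factors, and the argument order are all guesses.

```clojure
(import '(java.awt.image BufferedImage)
        '(java.io ByteArrayOutputStream)
        '(java.util Base64)
        '(javax.imageio ImageIO))

(defn image->bytes-sketch
  "Encodes a BufferedImage as a PNG byte array."
  [^BufferedImage image]
  (let [out (ByteArrayOutputStream.)]
    (ImageIO/write image "png" out)
    (.toByteArray out)))

(defn image->base64-sketch
  "Encodes a BufferedImage as a base64 PNG string."
  [^BufferedImage image]
  (.encodeToString (Base64/getEncoder) (image->bytes-sketch image)))

(defn scale-and-clamp-bbox-sketch
  "Scales an assumed [x1 y1 x2 y2] box by the given factors, rounds to whole
   pixels, and clamps every coordinate to the image bounds."
  [[x1 y1 x2 y2] scale-x scale-y width height]
  (let [clamp (fn [v hi] (-> v (max 0) (min hi)))
        px    (fn [v scale hi] (clamp (Math/round (double (* v scale))) hi))]
    [(px x1 scale-x width)
     (px y1 scale-y height)
     (px x2 scale-x width)
     (px y2 scale-y height)]))

(defn extract-image-region-sketch
  "Crops an image to an already clamped [x1 y1 x2 y2] box with positive area."
  [^BufferedImage image [x1 y1 x2 y2]]
  (.getSubimage image (int x1) (int y1) (int (- x2 x1)) (int (- y2 y1))))

;; Usage: a blank 500x600 image and a box that overflows it after scaling.
(def blank (BufferedImage. 500 600 BufferedImage/TYPE_INT_RGB))
(def bbox (scale-and-clamp-bbox-sketch [10 20 300 400] 2.0 2.0 500 600))
;; bbox => [20 40 500 600]
(def region (extract-image-region-sketch blank bbox))
```

Clamping before cropping matters because `getSubimage` throws when the requested rectangle extends past the image, which is a likely outcome when coordinates come from a model's estimate.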
