
com.blockether.svar.internal.rlm.internal.pageindex.core

Main API for RLM document indexing - extracts structured data from documents.

Primary functions:

  • build-index - Extract structure from file path or string content
  • index! - Index and save to EDN + PNG files
  • load-index - Load indexed document from EDN directory
  • inspect - Print full document summary with TOC tree
  • print-toc-tree - Print a formatted TOC tree from TOC entries

Supported file types:

  • PDF (.pdf) - Uses vision LLM for node-based extraction
  • Markdown (.md, .markdown) - Parses heading structure into pages (no LLM needed)
  • Plain text (.txt, .text) - Uses LLM for text extraction
  • Images (.png, .jpg, .jpeg, .gif, .bmp, .webp) - Direct vision LLM extraction
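The mapping from file extension to extraction strategy listed above can be sketched as a simple lookup. This is a minimal illustration, not the library's code: the extensions come from the list, while `extension->strategy`, `file-strategy`, and the strategy keywords are hypothetical names.

```clojure
(require '[clojure.string :as str])

;; Hypothetical lookup table. Extensions are taken from the list above;
;; the keyword names are illustrative only.
(def extension->strategy
  {"pdf"      :vision-llm-pdf
   "md"       :markdown-parser
   "markdown" :markdown-parser
   "txt"      :text-llm
   "text"     :text-llm
   "png"      :vision-llm-image
   "jpg"      :vision-llm-image
   "jpeg"     :vision-llm-image
   "gif"      :vision-llm-image
   "bmp"      :vision-llm-image
   "webp"     :vision-llm-image})

(defn file-strategy
  "Returns the strategy keyword for a file path, or nil when the
  extension is missing or not in the table."
  [path]
  (when-let [idx (str/last-index-of path ".")]
    (get extension->strategy (str/lower-case (subs path (inc idx))))))

(file-strategy "manual.pdf")  ;; => :vision-llm-pdf
(file-strategy "notes.docx")  ;; => nil
```

Only the Markdown entry avoids an LLM call, which is why it is worth separating from the plain-text strategy even though both operate on text.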

Markdown files are parsed deterministically by heading structure:

  • Top-level headings (h1, or first level found) become page boundaries
  • Nested headings become section nodes within each page
  • No LLM required for structure extraction

Post-processing:

  1. Translates local node IDs to globally unique UUIDs
  2. If no TOC exists in document, generates one from Section/Heading structure
  3. Links TocEntry target-section-id to matching Section nodes
  4. Generates document abstract from all section descriptions using Chain of Density
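Step 1 of the post-processing above can be sketched as follows. This is an assumption-laden illustration: `translate-ids` is a hypothetical helper, and `:node/id` is an assumed key name (only `:node/parent-id` is named in this documentation).

```clojure
(defn translate-ids
  "Replaces each local :node/id with a fresh UUID string and rewrites
  :node/parent-id through the same mapping. A parent id that does not
  match any node in the input becomes nil."
  [nodes]
  (let [id->uuid (into {}
                       (map (fn [n] [(:node/id n)
                                     (str (java.util.UUID/randomUUID))]))
                       nodes)]
    (mapv (fn [n]
            (-> n
                (update :node/id id->uuid)
                (update :node/parent-id #(when % (id->uuid %)))))
          nodes)))

;; Local ids such as "1" and "2" may repeat across pages;
;; after translation every node carries a globally unique id.
(translate-ids [{:node/id "1" :node/parent-id nil}
                {:node/id "2" :node/parent-id "1"}])
```

Building the full mapping before rewriting any node keeps parent references consistent regardless of the order in which nodes appear.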

Usage:

(require '[com.blockether.svar.internal.rlm.internal.pageindex.core :as pageindex])

;; Index a PDF
(def doc (pageindex/build-index "manual.pdf"))

;; Index and save to EDN + PNG files
(pageindex/index! "manual.pdf")
;; => {:document {...} :output-path "manual.pageindex"}

;; Load and inspect (includes TOC tree)
(pageindex/inspect "manual.pageindex")


com.blockether.svar.internal.rlm.internal.pageindex.markdown

Markdown parsing for RLM - extracts hierarchical structure from markdown files.

Primary functions:

  • markdown->pages - Main API: convert markdown string to page-based format
  • markdown-file->pages - Convenience: reads file and calls markdown->pages

Design:

  • Top-level headings (h1, or first heading level found) become 'pages'
  • Nested headings become nodes within each page
  • Code blocks are skipped when parsing headings
  • Each section includes text from heading to next heading
  • No LLM required - deterministic parsing
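The design above can be sketched with a small deterministic parser. This is not the implementation of `markdown->pages`: the function names, the `:page/title` and `:page/nodes` keys, and the handling of only ATX (`#`) headings are assumptions, as is the reading of "first heading level found" as the level of the first heading in the file.

```clojure
(require '[clojure.string :as str])

;; Fence marker built from characters so this example contains no literal fence.
(def fence (apply str (repeat 3 \`)))

(defn heading-lines
  "Returns a vector of {:level :title :line} maps for ATX headings,
  ignoring any line inside a fenced code block."
  [md]
  (loop [lines (map-indexed vector (str/split-lines md))
         in-fence? false
         acc []]
    (if-let [[i line] (first lines)]
      (cond
        ;; A fence line toggles code-block state and is never a heading.
        (str/starts-with? (str/triml line) fence)
        (recur (rest lines) (not in-fence?) acc)

        in-fence?
        (recur (rest lines) in-fence? acc)

        :else
        (if-let [[_ hashes title] (re-matches #"(#{1,6})\s+(.*)" line)]
          (recur (rest lines) in-fence?
                 (conj acc {:level (count hashes)
                            :title (str/trim title)
                            :line  i}))
          (recur (rest lines) in-fence? acc)))
      acc)))

(defn headings->pages
  "Groups headings into pages. A heading at the level of the first
  heading (or shallower) starts a new page; deeper headings become
  nodes of the current page."
  [headings]
  (if (empty? headings)
    []
    (let [top (:level (first headings))]
      (reduce (fn [pages h]
                (if (<= (:level h) top)
                  (conj pages {:page/title (:title h) :page/nodes []})
                  (update-in pages [(dec (count pages)) :page/nodes] conj h)))
              []
              headings))))
```

The `:line` index recorded for each heading is what a fuller version would use to slice the text from one heading to the next.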

com.blockether.svar.internal.rlm.internal.pageindex.pdf

PDF to images conversion and metadata extraction using Apache PDFBox.

Provides:

  • pdf->images - Convert PDF file to vector of BufferedImage objects
  • page-count - Get total page count of a PDF file
  • pdf-metadata - Extract PDF metadata (author, title, dates, etc.)
  • detect-text-rotation - Detect content rotation per page using text position heuristics

Uses PDFBox for reliable PDF rendering at configurable DPI. Handles error cases: encrypted PDFs, corrupted files, file not found.


com.blockether.svar.internal.rlm.internal.pageindex.spec

Comprehensive clojure.spec definitions for RLM data structures.

This namespace centralizes ALL specs for the RLM system to provide a clear view of the complete data model. Individual namespaces will require this namespace and use these specs for validation.

Data Model Philosophy:

  • FLAT structure with parent references (Datalevin-style)
  • All keywords are namespaced (:node/*, :page/*, :toc/*)
  • Vector of maps output (not nested trees)
  • :node/parent-id creates hierarchy (nil for root nodes)
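A flat, parent-referencing model of this kind might be specified roughly as below. This is a sketch only: `:node/id`, the `:example/*` spec names, and `children-of` are hypothetical and do not reproduce the specs defined in this namespace.

```clojure
(require '[clojure.spec.alpha :as s])

;; Assumed attribute specs. Only :node/parent-id is named in the docs above.
(s/def :node/id string?)
(s/def :node/parent-id (s/nilable string?))

;; A node is a flat map; the output is a vector of such maps.
(s/def :example/node (s/keys :req [:node/id :node/parent-id]))
(s/def :example/nodes (s/coll-of :example/node :kind vector?))

(defn children-of
  "Returns the nodes whose :node/parent-id equals parent-id.
  Passing nil returns the root nodes."
  [nodes parent-id]
  (filterv #(= parent-id (:node/parent-id %)) nodes))

(def nodes
  [{:node/id "a" :node/parent-id nil}
   {:node/id "b" :node/parent-id "a"}
   {:node/id "c" :node/parent-id "a"}])

(s/valid? :example/nodes nodes)       ;; => true
(mapv :node/id (children-of nodes "a")) ;; => ["b" "c"]
```

Keeping the hierarchy in `:node/parent-id` rather than in nesting means the same vector can be transacted into a Datalevin-style store or walked as a tree on demand.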

com.blockether.svar.internal.rlm.internal.pageindex.vision

Vision/LLM-based text extraction from documents.

Provides:

  • image->base64 - Convert BufferedImage to base64 PNG string
  • image->bytes - Convert BufferedImage to PNG byte array
  • image->bytes-region - Extract and convert a bounding-box region to PNG bytes
  • extract-image-region - Crop a BufferedImage to a bounding-box region
  • scale-and-clamp-bbox - Scale and clamp bounding box coordinates to image dimensions
  • extract-text-from-image - Extract structured nodes from a single BufferedImage (vision)
  • extract-text-from-pdf - Extract structured nodes from all pages of a PDF (vision)
  • extract-text-from-text-file - Extract from text/markdown file (LLM, no image rendering)
  • extract-text-from-image-file - Extract from image file (vision)
  • extract-text-from-string - Extract from string content (LLM, no image rendering)
  • infer-document-title - Infer a document title from page content using LLM

Configuration is passed explicitly via opts maps. A multimodal LLM is used for both image and text extraction. PDF pages are extracted in parallel using core.async channels.
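To illustrate what scaling and clamping a bounding box involves, here is a minimal sketch. It does not reproduce the signature of `scale-and-clamp-bbox`: the function name `scale-and-clamp`, the `[x1 y1 x2 y2]` layout, and the explicit source-space dimensions are all assumptions made for the example.

```clojure
(defn scale-and-clamp
  "Scales a [x1 y1 x2 y2] box from a source coordinate space of
  src-w x src-h to an image of img-w x img-h, rounds to whole pixels,
  and clamps each coordinate into the image bounds."
  [[x1 y1 x2 y2] [src-w src-h] [img-w img-h]]
  (let [sx    (/ img-w (double src-w))
        sy    (/ img-h (double src-h))
        clamp (fn [v hi] (min hi (max 0 (Math/round (double v)))))]
    [(clamp (* x1 sx) img-w)
     (clamp (* y1 sy) img-h)
     (clamp (* x2 sx) img-w)
     (clamp (* y2 sy) img-h)]))

;; A box that overshoots the right edge of a 1000x1000 space,
;; mapped onto a 500x250 image.
(scale-and-clamp [100 200 1100 500] [1000 1000] [500 250])
;; => [50 50 500 125]
```

Clamping matters because coordinates reported by a vision model can fall slightly outside the page, and cropping a BufferedImage with an out-of-bounds region throws.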

