Main API for RLM document indexing - extracts structured data from documents.
Primary functions:
- `build-index` - Extract structure from file path or string content
- `index!` - Index and save to EDN + PNG files
- `load-index` - Load indexed document from EDN directory
- `inspect` - Print full document summary with TOC tree
- `print-toc-tree` - Print a formatted TOC tree from TOC entries
Supported file types:
- PDF (.pdf) - Uses vision LLM for node-based extraction
- Markdown (.md, .markdown) - Parses heading structure into pages (no LLM needed)
- Plain text (.txt, .text) - Uses LLM for text extraction
- Images (.png, .jpg, .jpeg, .gif, .bmp, .webp) - Direct vision LLM extraction
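The mapping from file extension to extraction strategy listed above can be pictured as a simple lookup. The following is a minimal sketch only: `extension->strategy`, `file-extension`, `extraction-strategy`, and the strategy keywords are hypothetical names chosen for illustration, not the library's actual implementation.

```clojure
(require '[clojure.string :as str])

;; Hypothetical lookup table transcribed from the list above.
;; The keywords are illustrative labels, not values used by the library.
(def extension->strategy
  {"pdf"      :pdf-vision
   "md"       :markdown
   "markdown" :markdown
   "txt"      :text-llm
   "text"     :text-llm
   "png"      :image-vision
   "jpg"      :image-vision
   "jpeg"     :image-vision
   "gif"      :image-vision
   "bmp"      :image-vision
   "webp"     :image-vision})

(defn file-extension
  "Returns the lower-cased extension of path, or nil when there is none."
  [path]
  (let [fname (or (last (str/split path #"[/\\]")) "")
        idx   (str/last-index-of fname ".")]
    (when (and idx (pos? idx) (< (inc idx) (count fname)))
      (str/lower-case (subs fname (inc idx))))))

(defn extraction-strategy
  "Returns the strategy keyword for path, or :unsupported."
  [path]
  (get extension->strategy (file-extension path) :unsupported))
```

Under these assumptions, `(extraction-strategy "manual.pdf")` selects the vision path, while a `.md` file is routed to the deterministic Markdown parser and never reaches an LLM.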
Markdown files are parsed deterministically by heading structure:
- Top-level headings (h1, or first level found) become page boundaries
- Nested headings become section nodes within each page
- No LLM required for structure extraction
Post-processing:
1. Translates local node IDs to globally unique UUIDs
2. If no TOC exists in document, generates one from Section/Heading structure
3. Links TocEntry target-section-id to matching Section nodes
4. Generates document abstract from all section descriptions using Chain of Density
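Steps 1 to 3 above can be sketched on plain data. This is an illustration under stated assumptions, not the library's code: the keys `:node/id`, `:node/type`, `:node/title`, `:toc/title`, and `:toc/target-section-id`, the function names, and the idea that local IDs are unique within the input are all hypothetical. Step 4 (the abstract) needs an LLM and is omitted.

```clojure
(defn globalize-ids
  "Step 1 (sketch): replaces every local :node/id with a fresh UUID and
  rewrites :node/parent-id references to match. Assumes local IDs are
  unique within `nodes`."
  [nodes]
  (let [local->uuid (into {}
                          (map (fn [n] [(:node/id n) (java.util.UUID/randomUUID)]))
                          nodes)]
    (mapv (fn [n]
            (-> n
                (update :node/id local->uuid)
                ;; A nil parent (root node) stays nil.
                (update :node/parent-id local->uuid)))
          nodes)))

(defn generate-toc
  "Steps 2 and 3 (sketch): builds one TOC entry per section node, in
  document order, already linked to its section by ID."
  [nodes]
  (->> nodes
       (filter #(= :section (:node/type %)))
       (mapv (fn [n]
               {:toc/title             (:node/title n)
                :toc/target-section-id (:node/id n)}))))
```

Generating the TOC after the ID translation means each entry points at the final, globally unique section ID rather than a page-local one.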
Usage:
(require '[com.blockether.svar.internal.rlm.internal.pageindex.core :as pageindex])
;; Index a PDF
(def doc (pageindex/build-index "manual.pdf"))
;; Index and save to EDN + PNG files
(pageindex/index! "manual.pdf")
;; => {:document {...} :output-path "manual.pageindex"}
;; Load and inspect (includes TOC tree)
(pageindex/inspect "manual.pageindex")
Markdown parsing for RLM - extracts hierarchical structure from markdown files.
Primary functions:
- `markdown->pages` - Main API: convert markdown string to page-based format
- `markdown-file->pages` - Convenience: reads file and calls `markdown->pages`
Design:
- Top-level headings (h1, or first heading level found) become 'pages'
- Nested headings become nodes within each page
- Code blocks are skipped when parsing headings
- Each section includes text from heading to next heading
- No LLM required - deterministic parsing
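The design points above can be illustrated with a small, deterministic splitter. This is a sketch, not the real `markdown->pages`: the function names, the output keys (`:page/title`, `:page/text`, `:page/sections`), and the handling of edge cases are assumptions. It recognizes only ATX headings (`#` style) and backtick fences, and it drops any text that appears before the first page heading.

```clojure
(require '[clojure.string :as str])

(defn heading-level
  "Returns the ATX heading level (1-6) of line, or nil if it is not a heading."
  [line]
  (when-let [[_ hashes] (re-find #"^(#{1,6})\s+\S" line)]
    (count hashes)))

(defn heading-lines
  "Returns [index level title] for each heading outside fenced code blocks."
  [lines]
  (:found
   (reduce (fn [{:keys [in-fence?] :as acc} [idx line]]
             (cond
               ;; A line opening or closing a backtick fence toggles the flag.
               (re-find #"^\s*`{3}" line) (update acc :in-fence? not)
               in-fence?                  acc
               :else
               (if-let [level (heading-level line)]
                 (update acc :found conj
                         [idx level (str/trim (str/replace line #"^#+\s*" ""))])
                 acc)))
           {:in-fence? false :found []}
           (map-indexed vector lines))))

(defn markdown->pages-sketch
  "Splits markdown into pages at top-level headings. The top level is h1
  when any h1 exists, otherwise the level of the first heading found."
  [markdown]
  (let [lines    (vec (str/split-lines markdown))
        headings (heading-lines lines)
        top      (if (some #(= 1 (second %)) headings)
                   1
                   (second (first headings)))
        starts   (filterv #(<= (second %) top) headings)
        ends     (concat (map first (rest starts)) [(count lines)])]
    (mapv (fn [[idx _ title] end]
            {:page/title    title
             :page/text     (str/join "\n" (subvec lines idx end))
             :page/sections (vec (for [[i level t] headings
                                       :when (and (< idx i end) (> level top))]
                                   {:title t :level level}))})
          starts ends)))
```

Because the fence flag is tracked line by line, a `# comment` inside a code block stays part of the page text and never starts a new page.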
PDF to images conversion and metadata extraction using Apache PDFBox.
Provides:
- `pdf->images` - Convert PDF file to vector of BufferedImage objects
- `page-count` - Get total page count of a PDF file
- `pdf-metadata` - Extract PDF metadata (author, title, dates, etc.)
- `detect-text-rotation` - Detect content rotation per page using text position heuristics
Uses PDFBox for reliable PDF rendering at configurable DPI. Handles error cases: encrypted PDFs, corrupted files, file not found.
Comprehensive clojure.spec definitions for RLM data structures.
This namespace centralizes ALL specs for the RLM system to provide a clear view of the complete data model. Individual namespaces will require this namespace and use these specs for validation.
Data Model Philosophy:
- FLAT structure with parent references (Datalevin-style)
- All keywords are namespaced (`:node/*`, `:page/*`, `:toc/*`)
- Vector of maps output (not nested trees)
- `:node/parent-id` creates hierarchy (nil for root nodes)
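A flat, parent-referencing model of this kind might be specified as follows. Only `:node/parent-id` is named in the text above; `:node/id`, `:node/content`, the `:example/*` spec names, and `children-of` are hypothetical and stand in for whatever the real namespace defines.

```clojure
(require '[clojure.spec.alpha :as s])

;; Hypothetical specs for illustration; only :node/parent-id comes from the docs.
(s/def :node/id string?)
(s/def :node/parent-id (s/nilable string?)) ; nil marks a root node
(s/def :node/content string?)

(s/def :example/node
  (s/keys :req [:node/id :node/parent-id]
          :opt [:node/content]))

;; The output is a flat vector of maps, not a nested tree.
(s/def :example/nodes (s/coll-of :example/node :kind vector?))

(defn children-of
  "Returns the nodes whose :node/parent-id equals parent-id.
  Pass nil to get the root nodes."
  [nodes parent-id]
  (filterv #(= parent-id (:node/parent-id %)) nodes))

(def sample-nodes
  [{:node/id "a" :node/parent-id nil :node/content "Chapter"}
   {:node/id "b" :node/parent-id "a" :node/content "Paragraph"}])
```

With this shape, the hierarchy is recovered on demand, for example `(children-of sample-nodes "a")`, instead of being baked into nested maps.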
Vision/LLM-based text extraction from documents.
Provides:
- `image->base64` - Convert BufferedImage to base64 PNG string
- `image->bytes` - Convert BufferedImage to PNG byte array
- `image->bytes-region` - Extract and convert a bounding-box region to PNG bytes
- `extract-image-region` - Crop a BufferedImage to a bounding-box region
- `scale-and-clamp-bbox` - Scale and clamp bounding box coordinates to image dimensions
- `extract-text-from-image` - Extract structured nodes from a single BufferedImage (vision)
- `extract-text-from-pdf` - Extract structured nodes from all pages of a PDF (vision)
- `extract-text-from-text-file` - Extract from text/markdown file (LLM, no image rendering)
- `extract-text-from-image-file` - Extract from image file (vision)
- `extract-text-from-string` - Extract from string content (LLM, no image rendering)
- `infer-document-title` - Infer a document title from page content using LLM
Configuration is passed explicitly via opts maps. Uses a multimodal LLM for both image and text extraction. PDF pages are extracted in parallel using core.async channels.
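The image helpers listed above can be approximated with JDK classes alone. The sketch below is not the library's implementation: the function names are deliberately different from the real ones, the `[x y width height]` pixel format for bounding boxes is an assumption, and the scaling half of `scale-and-clamp-bbox` is left out.

```clojure
(import '(java.awt.image BufferedImage)
        '(java.io ByteArrayOutputStream)
        '(java.util Base64)
        '(javax.imageio ImageIO))

(defn image->png-bytes
  "Encodes a BufferedImage as a PNG byte array."
  [^BufferedImage image]
  (with-open [out (ByteArrayOutputStream.)]
    (ImageIO/write image "png" out)
    (.toByteArray out)))

(defn png-bytes->base64
  "Encodes a byte array as a standard base64 string."
  [^bytes bs]
  (.encodeToString (Base64/getEncoder) bs))

(defn clamp-bbox
  "Clamps [x y w h] (assumed pixel units) so the box lies inside a
  width x height image and is at least 1x1."
  [[x y w h] width height]
  (let [x0 (-> x (max 0) (min (dec width)))
        y0 (-> y (max 0) (min (dec height)))
        w0 (-> w (max 1) (min (- width x0)))
        h0 (-> h (max 1) (min (- height y0)))]
    [x0 y0 w0 h0]))

(defn crop-region
  "Crops image to bbox after clamping it to the image bounds."
  [^BufferedImage image bbox]
  (let [[x y w h] (clamp-bbox bbox (.getWidth image) (.getHeight image))]
    (.getSubimage image (int x) (int y) (int w) (int h))))
```

Clamping before cropping matters because `BufferedImage.getSubimage` throws when the requested rectangle extends past the raster, and coordinates returned by a vision model may do exactly that.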