Main API for RLM document indexing - extracts structured data from documents.
Primary functions:
- `build-index` - Extract structure from file path or string content
- `index!` - Index and save to EDN + PNG files
- `load-index` - Load indexed document from EDN directory
- `inspect` - Print full document summary with TOC tree
- `print-toc-tree` - Print a formatted TOC tree from TOC entries
Supported file types:
- PDF (.pdf) - Uses vision LLM for node-based extraction
- Markdown (.md, .markdown) - Parses heading structure into pages (no LLM needed)
- Plain text (.txt, .text) - Uses LLM for text extraction
- Images (.png, .jpg, .jpeg, .gif, .bmp, .webp) - Direct vision LLM extraction
Markdown files are parsed deterministically by heading structure:
- Top-level headings (h1, or first level found) become page boundaries
- Nested headings become section nodes within each page
- No LLM required for structure extraction
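The page-splitting rule above can be illustrated with a minimal sketch. It is not the library's parser: `heading-level` and `split-pages-sketch` are hypothetical names, only ATX-style `#` headings are recognised, and `#` lines inside fenced code blocks are not handled.

```clojure
(require '[clojure.string :as str])

(defn heading-level
  "Returns the ATX heading level (1-6) of a line, or nil for non-heading lines."
  [line]
  (when-let [[_ hashes] (re-matches #"^(#{1,6})\s+.*" line)]
    (count hashes)))

(defn split-pages-sketch
  "Splits markdown text into pages. The level of the first heading found is
  treated as the page boundary; deeper headings stay inside the current page."
  [markdown]
  (let [lines (str/split-lines markdown)
        top   (some heading-level lines)]
    (->> lines
         (reduce (fn [pages line]
                   (if (or (empty? pages)
                           (and top (= top (heading-level line))))
                     (conj pages [line])                           ; start a new page
                     (conj (pop pages) (conj (peek pages) line)))) ; extend current page
                 [])
         (map-indexed (fn [i page-lines] {:page/index i :lines page-lines}))
         vec)))
```

Calling `(split-pages-sketch "# A\ntext\n## A1\n# B\nmore")` yields two pages: the first keeps the nested `## A1` heading, the second starts at `# B`.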
Post-processing:
1. Translates local node IDs to globally unique UUIDs
2. If no TOC exists in document, generates one from Section/Heading structure
3. Links TocEntry target-section-id to matching Section nodes
4. Generates document abstract from all section descriptions using Chain of Density
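As a rough illustration of step 2, a TOC can be derived from heading nodes as sketched below. The `:document.toc/*` keys come from this page; the node keys (`:page.node/type`, `:page.node/id`, `:page.node/level`, `:page.node/content`, `:page.node/description`), the string form of the level, and the function name `generate-toc-sketch` are assumptions made for the example.

```clojure
(defn generate-toc-sketch
  "Builds one TOC entry per heading node, in page order.
  Node keys are assumed; only the :document.toc/* keys are documented."
  [pages]
  (vec
    (for [page pages
          node (:page/nodes page)
          :when (= :heading (:page.node/type node))]
      {:document.toc/type              :toc-entry
       :document.toc/id                (str (java.util.UUID/randomUUID))
       :document.toc/title             (:page.node/content node)
       :document.toc/description       (:page.node/description node)
       :document.toc/target-page       (:page/index page)      ; 0-indexed
       :document.toc/target-section-id (:page.node/id node)
       ;; "l1", "l2", ... ; the concrete representation is an assumption
       :document.toc/level             (str "l" (:page.node/level node))})))
```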
Usage:
(require '[com.blockether.svar.internal.rlm.internal.pageindex.core :as pageindex])
;; Index a PDF
(def doc (pageindex/build-index "manual.pdf"))
;; Index and save to EDN + PNG files
(pageindex/index! "manual.pdf")
;; => {:document {...} :output-path "manual.pageindex"}
;; Load and inspect (includes TOC tree)
(pageindex/inspect "manual.pageindex")

build-index
Builds an index from a document by extracting content as nodes.
Multimethod that dispatches based on input type:
- `:path` - File path (auto-detects type from extension: .pdf, .md, .txt)
- `:string` - Raw string content (requires :content-type in opts)
Supported file types:
- PDF (.pdf) - Uses vision LLM for node-based extraction
- Markdown (.md, .markdown) - Parses headings as heading/paragraph nodes
- Plain text (.txt, .text) - Chunks by paragraphs into paragraph nodes
Post-processing:
- If document has TOC pages, extracts TocEntry nodes and links to Sections
- If no TOC exists, generates one from Section/Heading structure
Params:
`input` - String. File path or raw content.
`opts` - Optional map with:
;; For dispatch (string input)
`:content-type` - Keyword. Required for string input: :md, :markdown, :txt, :text
`:doc-name` - String. Document name (required for string input).
;; For metadata (string input only - PDF extracts from file)
`:doc-title` - String. Document title.
`:doc-author` - String. Document author.
`:created-at` - Instant. Creation date.
`:updated-at` - Instant. Modification date.
;; For processing
`:model` - String. Vision LLM model to use.
`:pages` - Page selector (1-indexed). Limits which pages are included.
Supports: integer, [from to] range, or [[1 3] 5 [7 10]] mixed vector.
nil = all pages (default). Applied after extraction.
;; Quality refinement (opt-in)
`:refine?` - Boolean, optional. Enable post-extraction quality refinement (default: false).
`:refine-model` - String, optional. Model for eval/refine steps (default: "gpt-4o").
`:refine-iterations` - Integer, optional. Max refine iterations per page (default: 1).
`:refine-threshold` - Float, optional. Min eval score to pass (default: 0.8).
`:refine-sample-size` - Integer, optional. Pages to sample for eval (default: 3).
For PDFs, samples first + last + random middle pages.
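The "first + last + random middle" sampling mentioned for `:refine-sample-size` could look roughly like the sketch below. This is an assumption about the shape of the strategy, not the library's code; `sample-page-indices` is a hypothetical name.

```clojure
(defn sample-page-indices
  "Returns up to sample-size 0-indexed page indices: the first page, the
  last page, then randomly chosen middle pages, sorted ascending."
  [page-count sample-size]
  (if (<= page-count sample-size)
    (vec (range page-count))                       ; few pages: evaluate all
    (let [middle (shuffle (range 1 (dec page-count)))]
      (->> (concat [0 (dec page-count)] middle)
           (take sample-size)
           sort
           vec))))
```

With the default sample size of 3, a 10-page document always yields page 0, page 9, and one random page in between.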
Returns:
Map with:
`:document/name` - String. Document name without extension.
`:document/title` - String or nil. Document title from metadata.
`:document/abstract` - String or nil. Document summary generated from section descriptions.
`:document/extension` - String. File extension (pdf, md, txt).
`:document/pages` - Vector of page maps with:
- `:page/index` - Integer (0-indexed)
- `:page/nodes` - Vector of content nodes (heading, paragraph, image, table, etc.)
`:document/toc` - Vector of TocEntry nodes (extracted or generated):
- `:document.toc/type` - :toc-entry
- `:document.toc/id` - UUID string
- `:document.toc/title` - Entry title text
- `:document.toc/description` - Section description (copied from linked Section)
- `:document.toc/target-page` - Page number (0-indexed) or nil
- `:document.toc/target-section-id` - UUID of linked Section node or nil
- `:document.toc/level` - Nesting level (l1, l2, etc.)
`:document/created-at` - Instant. Creation date from metadata or now.
`:document/updated-at` - Instant. Modification date from metadata or now.
`:document/author` - String or nil. Document author from metadata.

(filter-pages page-list page-set)
Filters a page-list by a set of 0-indexed page indices.
If page-set is nil, returns page-list unchanged (all pages). Otherwise returns only pages whose :page/index is in page-set.
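The behaviour described above is small enough to sketch directly. The function name `filter-pages-sketch` is hypothetical; the real implementation may differ in details such as the return collection type.

```clojure
(defn filter-pages-sketch
  "Keeps only pages whose :page/index is in page-set; nil means all pages."
  [page-list page-set]
  (if (nil? page-set)
    page-list
    (filterv #(contains? page-set (:page/index %)) page-list)))

;; Keep the first and third page of a three-page list.
(filter-pages-sketch [{:page/index 0} {:page/index 1} {:page/index 2}] #{0 2})
```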

(group-continuations pages)
Groups continuation nodes across pages by assigning a shared :page.node/group-id.
Walks pages in order. When a visual node (image/table) has continuation?=true, looks back to the last same-type node on the preceding page and assigns both the same group-id UUID. Propagates group-id forward for 3+ page chains.
Params:
pages - Vector of page maps with :page/nodes (must have UUIDs already).
Returns: Updated pages with :page.node/group-id assigned to grouped nodes.
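The walk described above can be sketched as follows. Only `:page/nodes` and `:page.node/group-id` are taken from this page; the node keys `:page.node/type` and `:page.node/continuation?`, the type keywords `:image` and `:table`, and all function names are assumptions for illustration.

```clojure
(def visual-types #{:image :table})

(defn- last-index-of-type
  "Index of the last node of node-type in nodes, or nil if there is none."
  [nodes node-type]
  (->> (map-indexed vector nodes)
       (filter (fn [[_ n]] (= node-type (:page.node/type n))))
       last
       first))

(defn- link-node
  "If the node at idx continues a visual node from the previous page,
  gives both nodes the same group-id (reusing an existing one if present)."
  [[prev-nodes cur-nodes] idx]
  (let [node     (nth cur-nodes idx)
        node-type (:page.node/type node)
        prev-idx (when (and (visual-types node-type)
                            (:page.node/continuation? node))
                   (last-index-of-type prev-nodes node-type))]
    (if prev-idx
      (let [gid (or (get-in prev-nodes [prev-idx :page.node/group-id])
                    (str (java.util.UUID/randomUUID)))]
        [(assoc-in prev-nodes [prev-idx :page.node/group-id] gid)
         (assoc-in cur-nodes [idx :page.node/group-id] gid)])
      [prev-nodes cur-nodes])))

(defn group-continuations-sketch
  "Walks pages in order, linking each continuation node to the preceding page."
  [pages]
  (reduce
    (fn [acc page]
      (if-let [prev (peek acc)]
        (let [cur-nodes (vec (:page/nodes page))
              [prev-nodes linked-nodes]
              (reduce link-node
                      [(vec (:page/nodes prev)) cur-nodes]
                      (range (count cur-nodes)))]
          (conj (pop acc)
                (assoc prev :page/nodes prev-nodes)
                (assoc page :page/nodes linked-nodes)))
        (conj acc page)))
    []
    pages))
```

Because each page is linked against the already-updated previous page, a table spanning three or more pages reuses the group-id assigned at the first link.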

(index! file-path)
(index! file-path
        {:keys [output vision-model config parallel parallel-refine refine?
                refine-model refine-iterations refine-threshold
                refine-sample-size pages]})
Index a document file and save the result as EDN + PNG files.
Takes a file path (PDF, MD, TXT) and runs build-index to extract structure.
The result is saved as a directory alongside the original (or custom path):
document.pageindex/
  document.edn - structured data (EDN)
  images/ - extracted images as PNG files
Params:
`file-path` - String. Path to the document file.
`opts` - Map, optional:
- :output - Custom output directory path (default: same dir, .pageindex extension)
- :config - LLM config override
- :pages - Page selector (1-indexed). Limits which pages are indexed.
Supports: integer, [from to] range, or [[1 3] 5 [7 10]] mixed vector.
nil = all pages (default).
Vision extraction:
- :vision-model - String. Model for vision page extraction (default: DEFAULT_VISION_MODEL).
- :parallel - Integer. Max concurrent vision page extractions for PDFs (default: 3)
Quality refinement (opt-in):
- :refine? - Boolean. Enable post-extraction quality refinement (default: false)
- :refine-model - String. Model for eval/refine steps (default: "gpt-4o")
- :parallel-refine - Integer. Max concurrent eval/refine operations (default: 2)
- :refine-iterations - Integer. Max refine iterations per page (default: 1)
- :refine-threshold - Float. Min eval score to pass (default: 0.8)
- :refine-sample-size - Integer. Pages to sample for eval (default: 3)
Returns:
Map with :document (the indexed document) and :output-path (directory where files were saved).
Throws:
- ex-info if file not found
- ex-info if document fails spec validation
- ex-info if :pages references out-of-bounds or invalid page numbers
Example:
(index! "docs/manual.pdf")
;; => {:document {...} :output-path "docs/manual.pageindex"}
;; Index only pages 1 through 5
(index! "docs/manual.pdf" {:pages [1 5]})
;; Index specific pages: 1-3, 5, and 7-10
(index! "docs/manual.pdf" {:pages [[1 3] 5 [7 10]]})
;; Separate models for vision vs refinement
(index! "docs/manual.pdf" {:vision-model "gpt-4o"
                           :refine? true
                           :refine-model "gpt-4o-mini"
                           :parallel 5
                           :parallel-refine 3})

(inspect doc-or-path)
Load and print a full summary of an indexed document including TOC tree.
Params:
`doc-or-path` - Either a document map or a String path to an EDN file.
Returns: Summary map with document stats.
Throws:
- ex-info if path provided and file not found
- ex-info if document fails spec validation
Example:
(inspect "docs/manual.edn")
(inspect my-document)

(load-index index-path)
Load an indexed document from a pageindex directory (EDN + PNG files).
Also supports loading legacy Nippy files for backward compatibility.
Params:
`index-path` - String. Path to the pageindex directory or legacy .nippy file.
Returns: The RLM document map.
Throws:
- ex-info if path not found
- ex-info if document fails spec validation
Example:
(load-index "docs/manual.pageindex")

(normalize-page-spec pages-spec total-page-count)
Normalizes a page specification into a set of 0-indexed page indices.
Accepts:
- nil → nil (all pages)
- integer n → #{(dec n)} (single 1-indexed page)
- [from to] → set of 0-indexed pages in range (both ints, exactly 2 elements)
- [[1 3] 5 [7 10]] → union of expanded ranges and single pages
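The accepted forms listed above can be sketched as below. This is an illustrative reimplementation under the stated rules, not the library's code; `normalize-page-spec-sketch` is a hypothetical name and the error messages and ex-data keys are invented for the example.

```clojure
(defn normalize-page-spec-sketch
  "Turns a 1-indexed page spec into a set of 0-indexed page indices.
  Throws ex-info on out-of-bounds pages, bad types, or reversed ranges."
  [pages-spec total-page-count]
  (letfn [(check [n]
            (when-not (and (int? n) (<= 1 n total-page-count))
              (throw (ex-info "Invalid page number"
                              {:page n :total total-page-count})))
            n)
          (expand [[from to :as r]]
            (when-not (and (= 2 (count r)) (int? from) (int? to) (<= from to))
              (throw (ex-info "Invalid page range" {:range r})))
            (range from (inc to)))
          (zero-indexed [ns]
            (into #{} (map (comp dec check)) ns))]
    (cond
      (nil? pages-spec) nil
      (int? pages-spec) (zero-indexed [pages-spec])
      ;; exactly two integers: a single [from to] range
      (and (vector? pages-spec)
           (= 2 (count pages-spec))
           (every? int? pages-spec))
      (zero-indexed (expand pages-spec))
      ;; mixed vector: nested ranges and single pages
      (vector? pages-spec)
      (zero-indexed (mapcat #(if (vector? %) (expand %) [%]) pages-spec))
      :else
      (throw (ex-info "Unsupported page spec" {:spec pages-spec})))))

(normalize-page-spec-sketch [[1 3] 5 [7 10]] 10)
;; => #{0 1 2 4 6 7 8 9}
```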
Throws on invalid input (out of bounds, bad types, reversed ranges).

(read-document-edn index-dir)
Reads a document from an EDN file, resolving image paths back to byte arrays.
Image paths in :page.node/image-path are read back as byte arrays into :page.node/image-data.
Params:
index-dir - String. Path to the pageindex directory.
Returns: The PageIndex document map with image bytes restored.

(write-document-edn! output-dir document)
Writes a document to an EDN file, extracting image bytes to separate PNG files.
Image data (byte arrays) in :page.node/image-data are written as PNG files in an 'images' subdirectory. The EDN stores the relative path instead of bytes.
Instants are serialized as #inst tagged literals (EDN native).
Params:
output-dir - String. Path to the output directory (e.g., 'docs/manual.pageindex').
document - Map. The PageIndex document.
Returns: The output directory path.
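A minimal round-trip sketch of the write/read scheme described above follows. It is not the library's implementation: the function names, the numbered image file names, and the conversion of `java.time.Instant` to `java.util.Date` (which Clojure prints as `#inst`) are assumptions. Only the `document.edn` file name, the `images` subdirectory, and the `:page.node/image-data` and `:page.node/image-path` keys come from this page.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io]
         '[clojure.walk :as walk])

(defn write-edn-sketch!
  "Writes document.edn under output-dir, moving image bytes into images/*.png."
  [output-dir document]
  (let [counter     (atom 0)
        externalize (fn [x]
                      (cond
                        (and (map? x) (bytes? (:page.node/image-data x)))
                        (let [rel (str "images/" (swap! counter inc) ".png")]
                          (io/copy (:page.node/image-data x) (io/file output-dir rel))
                          (-> x
                              (dissoc :page.node/image-data)
                              (assoc :page.node/image-path rel)))
                        ;; java.util.Date prints as an #inst tagged literal
                        (instance? java.time.Instant x) (java.util.Date/from x)
                        :else x))]
    (.mkdirs (io/file output-dir "images"))
    (spit (io/file output-dir "document.edn")
          (pr-str (walk/postwalk externalize document)))
    output-dir))

(defn read-edn-sketch
  "Reads document.edn from index-dir and restores image bytes from image paths.
  #inst values come back as java.util.Date in this sketch."
  [index-dir]
  (let [restore (fn [x]
                  (if (and (map? x) (:page.node/image-path x))
                    (assoc x :page.node/image-data
                           (java.nio.file.Files/readAllBytes
                             (.toPath (io/file index-dir (:page.node/image-path x)))))
                    x))]
    (walk/postwalk restore
                   (edn/read-string (slurp (io/file index-dir "document.edn"))))))
```

Keeping bytes out of the EDN keeps the file readable and diffable, at the cost of the directory having to travel as a unit.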