com.blockether.svar.internal.rlm.internal.pageindex.core

Main API for RLM document indexing - extracts structured data from documents.

Primary functions:
- `build-index` - Extract structure from file path or string content
- `index!` - Index and save to EDN + PNG files
- `load-index` - Load indexed document from EDN directory
- `inspect` - Print full document summary with TOC tree
- `print-toc-tree` - Print a formatted TOC tree from TOC entries

Supported file types:
- PDF (.pdf) - Uses vision LLM for node-based extraction
- Markdown (.md, .markdown) - Parses heading structure into pages (no LLM needed)
- Plain text (.txt, .text) - Uses LLM for text extraction
- Images (.png, .jpg, .jpeg, .gif, .bmp, .webp) - Direct vision LLM extraction

Markdown files are parsed deterministically by heading structure:
- Top-level headings (h1, or first level found) become page boundaries
- Nested headings become section nodes within each page
- No LLM required for structure extraction
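
The heading rules above can be illustrated with a small sketch. It is not the library's parser: `heading-level`, `split-into-pages`, and the `:lines` key are hypothetical, "top-level" is assumed to mean the shallowest heading level present, and fenced code blocks are ignored.

```clojure
(require '[clojure.string :as str])

(defn heading-level
  "Returns the ATX heading level (1-6) of a line, or nil for other lines."
  [line]
  (when-let [[_ hashes] (re-find #"^(#{1,6})\s+\S" line)]
    (count hashes)))

(defn split-into-pages
  "Hypothetical helper: opens a new page at every heading of the
  shallowest level found in the text (an assumption about 'top-level')."
  [markdown]
  (let [lines     (str/split-lines markdown)
        levels    (keep heading-level lines)
        top-level (when (seq levels) (apply min levels))]
    (->> lines
         (reduce (fn [pages line]
                   (if (or (empty? pages)
                           (and top-level (= top-level (heading-level line))))
                     (conj pages [line])                           ; open a new page
                     (conj (pop pages) (conj (peek pages) line)))) ; extend current page
                 [])
         ;; pair each page with a 0-based index, as in :page/index
         (mapv (fn [idx page-lines] {:page/index idx :lines page-lines})
               (range)))))
```

Nested headings stay inside the page that contains them; turning them into section nodes is left out of this sketch.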

Post-processing:
1. Translates local node IDs to globally unique UUIDs
2. If no TOC exists in document, generates one from Section/Heading structure
3. Links TocEntry target-section-id to matching Section nodes
4. Generates document abstract from all section descriptions using Chain of Density

Usage:
(require '[com.blockether.svar.internal.rlm.internal.pageindex.core :as pageindex])

;; Index a PDF
(def doc (pageindex/build-index "manual.pdf"))

;; Index and save to EDN + PNG files
(pageindex/index! "manual.pdf")
;; => {:document {...} :output-path "manual.pageindex"}

;; Load and inspect (includes TOC tree)
(pageindex/inspect "manual.pageindex")

build-index (clj, multimethod)

Builds an index from a document by extracting content as nodes.

Multimethod that dispatches based on input type:
- `:path` - File path (auto-detects type from extension: .pdf, .md, .txt)
- `:string` - Raw string content (requires :content-type in opts)

Supported file types:
- PDF (.pdf) - Uses vision LLM for node-based extraction
- Markdown (.md, .markdown) - Parses headings as heading/paragraph nodes
- Plain text (.txt, .text) - Chunks by paragraphs into paragraph nodes

Post-processing:
- If document has TOC pages, extracts TocEntry nodes and links to Sections
- If no TOC exists, generates one from Section/Heading structure

  Params:
  `input` - String. File path or raw content.
  `opts` - Optional map with:
    ;; For dispatch (string input)
    `:content-type` - Keyword. Required for string input: :md, :markdown, :txt, :text
    `:doc-name` - String. Document name (required for string input).
    
    ;; For metadata (string input only - PDF extracts from file)
    `:doc-title` - String. Document title.
    `:doc-author` - String. Document author.
    `:created-at` - Instant. Creation date.
    `:updated-at` - Instant. Modification date.
    
    ;; For processing
    `:model` - String. Vision LLM model to use.
    `:pages` - Page selector (1-indexed). Limits which pages are included.
               Supports: integer, [from to] range, or [[1 3] 5 [7 10]] mixed vector.
               nil = all pages (default). Applied after extraction.
    
    ;; Quality refinement (opt-in)
    `:refine?` - Boolean, optional. Enable post-extraction quality refinement (default: false).
    `:refine-model` - String, optional. Model for eval/refine steps (default: "gpt-4o").
    `:refine-iterations` - Integer, optional. Max refine iterations per page (default: 1).
    `:refine-threshold` - Float, optional. Min eval score to pass (default: 0.8).
    `:refine-sample-size` - Integer, optional. Pages to sample for eval (default: 3).
                            For PDFs, samples first + last + random middle pages.

Returns:
Map with:
  `:document/name` - String. Document name without extension.
  `:document/title` - String or nil. Document title from metadata.
  `:document/abstract` - String or nil. Document summary generated from section descriptions.
  `:document/extension` - String. File extension (pdf, md, txt).
  `:document/pages` - Vector of page maps with:
    - `:page/index` - Integer (0-indexed)
    - `:page/nodes` - Vector of content nodes (heading, paragraph, image, table, etc.)
  `:document/toc` - Vector of TocEntry nodes (extracted or generated):
     - `:document.toc/type` - :toc-entry
     - `:document.toc/id` - UUID string
     - `:document.toc/title` - Entry title text
     - `:document.toc/description` - Section description (copied from linked Section)
     - `:document.toc/target-page` - Page number (0-indexed) or nil
     - `:document.toc/target-section-id` - UUID of linked Section node or nil
     - `:document.toc/level` - Nesting level (l1, l2, etc.)
  `:document/created-at` - Instant. Creation date from metadata or now.
  `:document/updated-at` - Instant. Modification date from metadata or now.
  `:document/author` - String or nil. Document author from metadata.
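
To make the returned shape concrete, here is a minimal sketch that walks a hand-written document map. The map uses only keys documented above, but its values are invented, and `toc-outline` is a hypothetical helper rather than part of the API. It converts the 0-indexed `:document.toc/target-page` to a 1-indexed page for display.

```clojure
;; Invented sample data; a real document comes from build-index.
(def sample-doc
  {:document/name      "manual"
   :document/extension "pdf"
   :document/pages     [{:page/index 0 :page/nodes []}
                        {:page/index 1 :page/nodes []}]
   :document/toc       [{:document.toc/type        :toc-entry
                         :document.toc/title       "Introduction"
                         :document.toc/target-page 0}
                        {:document.toc/type        :toc-entry
                         :document.toc/title       "Appendix"
                         :document.toc/target-page nil}]})

(defn toc-outline
  "Hypothetical helper: returns one map per TOC entry with a 1-indexed page,
  or nil when the entry has no target page."
  [doc]
  (mapv (fn [{:document.toc/keys [title target-page]}]
          {:title title
           :page  (some-> target-page inc)})
        (:document/toc doc)))
```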

filter-pages (clj)

(filter-pages page-list page-set)

Filters a page-list by a set of 0-indexed page indices.

If page-set is nil, returns page-list unchanged (all pages).
Otherwise returns only pages whose :page/index is in page-set.
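
The behavior described above fits in a few lines. This is a sketch of equivalent logic under a different name, not the library's source; returning a vector is an assumption.

```clojure
(defn filter-pages-sketch
  "Keeps only pages whose :page/index is in page-set; nil means keep all."
  [page-list page-set]
  (if (nil? page-set)
    page-list
    (filterv #(contains? page-set (:page/index %)) page-list)))
```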

group-continuations (clj)

(group-continuations pages)

Groups continuation nodes across pages by assigning a shared :page.node/group-id.

Walks pages in order. When a visual node (image/table) has continuation?=true,
looks back to the last same-type node on the preceding page and assigns both
the same group-id UUID. Propagates group-id forward for 3+ page chains.

Params:
`pages` - Vector of page maps with :page/nodes (must have UUIDs already).

Returns:
Updated pages with :page.node/group-id assigned to grouped nodes.
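
The look-back and forward propagation can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the key names `:page.node/type` and `:page.node/continuation?` and the type values `:image` and `:table` are guesses; only `:page.node/group-id` is named in the docstring.

```clojure
(def visual-types #{:image :table}) ; assumed node type values

(defn- last-index-of-type
  "Index of the last node of node-type in nodes, or nil if there is none."
  [nodes node-type]
  (->> (map-indexed vector nodes)
       (filter (fn [[_ n]] (= node-type (:page.node/type n))))
       last
       first))

(defn group-continuations-sketch
  "Walks pages in order and links each continuation node to the last
  same-type node on the preceding page via a shared group-id."
  [pages]
  (reduce
   (fn [done page]
     (let [cur  (count done)  ; index the current page gets in `done`
           prev (dec cur)]
       (reduce
        (fn [acc [i node]]
          (let [ntype  (:page.node/type node)
                prev-i (when (and (>= prev 0)
                                  (visual-types ntype)
                                  (:page.node/continuation? node))
                         (last-index-of-type (get-in acc [prev :page/nodes]) ntype))]
            (if prev-i
              ;; Reusing the predecessor's group-id is what carries one id
              ;; through chains spanning three or more pages.
              (let [gid (or (get-in acc [prev :page/nodes prev-i :page.node/group-id])
                            (str (java.util.UUID/randomUUID)))]
                (-> acc
                    (assoc-in [prev :page/nodes prev-i :page.node/group-id] gid)
                    (assoc-in [cur :page/nodes i :page.node/group-id] gid)))
              acc)))
        (conj done (update page :page/nodes vec))
        (map-indexed vector (:page/nodes page)))))
   []
   pages))
```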

index! (clj)

(index! file-path)
(index! file-path
        {:keys [output vision-model config parallel parallel-refine refine?
                refine-model refine-iterations refine-threshold
                refine-sample-size pages]})

Index a document file and save the result as EDN + PNG files.

Takes a file path (PDF, MD, TXT) and runs build-index to extract structure.
The result is saved as a directory alongside the original (or custom path):
  document.pageindex/
    document.edn    - structured data (EDN)
    images/         - extracted images as PNG files

Params:
`file-path` - String. Path to the document file.
`opts` - Map, optional:
  - :output - Custom output directory path (default: same dir, .pageindex extension)
  - :config - LLM config override
  - :pages - Page selector (1-indexed). Limits which pages are indexed.
             Supports: integer, [from to] range, or [[1 3] 5 [7 10]] mixed vector.
             nil = all pages (default).
  
  Vision extraction:
  - :vision-model - String. Model for vision page extraction (default: DEFAULT_VISION_MODEL).
  - :parallel - Integer. Max concurrent vision page extractions for PDFs (default: 3)
  
  Quality refinement (opt-in):
  - :refine? - Boolean. Enable post-extraction quality refinement (default: false)
  - :refine-model - String. Model for eval/refine steps (default: "gpt-4o")
  - :parallel-refine - Integer. Max concurrent eval/refine operations (default: 2)
  - :refine-iterations - Integer. Max refine iterations per page (default: 1)
  - :refine-threshold - Float. Min eval score to pass (default: 0.8)
  - :refine-sample-size - Integer. Pages to sample for eval (default: 3)

Returns:
Map with :document (the indexed document) and :output-path (directory where files were saved).

Throws:
- ex-info if file not found
- ex-info if document fails spec validation
- ex-info if :pages references out-of-bounds or invalid page numbers

Example:
(index! "docs/manual.pdf")
;; => {:document {...} :output-path "docs/manual.pageindex"}

;; Index only pages 1 through 5
(index! "docs/manual.pdf" {:pages [1 5]})

;; Index specific pages: 1-3, 5, and 7-10
(index! "docs/manual.pdf" {:pages [[1 3] 5 [7 10]]})

;; Separate models for vision vs refinement
(index! "docs/manual.pdf" {:vision-model "gpt-4o"
                           :refine? true
                           :refine-model "gpt-4o-mini"
                           :parallel 5
                           :parallel-refine 3})

inspect (clj)

(inspect doc-or-path)

Load and print a full summary of an indexed document including TOC tree.

Params:
`doc-or-path` - Either a document map or String path to EDN file.

Returns:
Summary map with document stats.

Throws:
- ex-info if path provided and file not found
- ex-info if document fails spec validation

Example:
(inspect "docs/manual.edn")
(inspect my-document)

load-index (clj)

(load-index index-path)

Load an indexed document from a pageindex directory (EDN + PNG files).

Also supports loading legacy Nippy files for backward compatibility.

Params:
`index-path` - String. Path to the pageindex directory or legacy .nippy file.

Returns:
The RLM document map.

Throws:
- ex-info if path not found
- ex-info if document fails spec validation

Example:
(load-index "docs/manual.pageindex")

normalize-page-spec (clj)

(normalize-page-spec pages-spec total-page-count)

Normalizes a page specification into a set of 0-indexed page indices.

Accepts:
- nil             → nil (all pages)
- integer n       → #{(dec n)} (single 1-indexed page)
- [from to]       → set of 0-indexed pages in range (both ints, exactly 2 elements)
- [[1 3] 5 [7 10]] → union of expanded ranges and single pages

Throws on invalid input (out of bounds, bad types, reversed ranges).
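
The accepted forms can be sketched as below. This is an independent illustration of the rules listed above, not the library's source; the function names, error messages, and `ex-data` keys are invented for the example.

```clojure
(defn- int-pair?
  "True for a vector of exactly two integers, i.e. a [from to] range."
  [x]
  (and (vector? x) (= 2 (count x)) (every? integer? x)))

(defn- expand-item
  "Expands one 1-indexed page number or [from to] range into 0-indexed pages."
  [item total]
  (cond
    (integer? item)
    (if (<= 1 item total)
      #{(dec item)}
      (throw (ex-info "Page out of bounds" {:page item :total total})))

    (int-pair? item)
    (let [[from to] item]
      ;; A single chained comparison rejects out-of-bounds and reversed ranges.
      (if (<= 1 from to total)
        (set (range (dec from) to))
        (throw (ex-info "Invalid page range" {:range item :total total}))))

    :else
    (throw (ex-info "Unsupported page spec" {:spec item}))))

(defn normalize-page-spec-sketch
  "nil means all pages; otherwise returns a set of 0-indexed page indices."
  [spec total]
  (cond
    (nil? spec)      nil
    (int-pair? spec) (expand-item spec total)
    (vector? spec)   (into #{} (mapcat #(expand-item % total)) spec)
    :else            (expand-item spec total)))
```

Note that a two-integer vector such as `[1 5]` is always read as a range, so selecting exactly pages 1 and 5 needs a form that is not a bare pair of integers, for example `[[1 1] 5]`.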

read-document-edn (clj)

(read-document-edn index-dir)

Reads a document from an EDN file, resolving image paths back to byte arrays.

Image paths in :page.node/image-path are read back as byte arrays
into :page.node/image-data.

Params:
`index-dir` - String. Path to the pageindex directory.

Returns:
The PageIndex document map with image bytes restored.

write-document-edn! (clj)

(write-document-edn! output-dir document)

Writes a document to an EDN file, extracting image bytes to separate PNG files.

Image data (byte arrays) in :page.node/image-data are written as PNG files
in an 'images' subdirectory. The EDN stores the relative path instead of bytes.

Instants are serialized as #inst tagged literals (EDN native).

Params:
`output-dir` - String. Path to the output directory (e.g., 'docs/manual.pageindex').
`document` - Map. The PageIndex document.

Returns:
The output directory path.
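
The bytes-to-path swap can be pictured with a pure function that does no file I/O. This is a sketch under assumptions: `externalize-images` is a hypothetical name, and the `images/p<page>-n<node>.png` naming scheme is invented, since the docstring does not say how files are named.

```clojure
(defn externalize-images
  "Hypothetical helper: returns [document' files], where document' carries
  :page.node/image-path in place of :page.node/image-data and files maps
  each relative path to the bytes that would be written there."
  [document]
  (let [files     (atom {})
        swap-node (fn [page-idx node-idx node]
                    (if-let [image-bytes (:page.node/image-data node)]
                      ;; Invented naming scheme, for illustration only.
                      (let [path (format "images/p%d-n%d.png" page-idx node-idx)]
                        (swap! files assoc path image-bytes)
                        (-> node
                            (dissoc :page.node/image-data)
                            (assoc :page.node/image-path path)))
                      node))
        ;; mapv and vec are eager, so `files` is fully populated before the deref.
        doc'      (update document :document/pages
                          (fn [pages]
                            (mapv (fn [page]
                                    (update page :page/nodes
                                            (fn [nodes]
                                              (vec (map-indexed
                                                    #(swap-node (:page/index page) %1 %2)
                                                    nodes)))))
                                  pages)))]
    [doc' @files]))
```

Reading back is the inverse: each `:page.node/image-path` is resolved against the index directory and its bytes are placed under `:page.node/image-data` again.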