com.blockether.svar.core

LLM interaction utilities for structured and unstructured outputs.

SVAR = Structured Validated Automated Reasoning

Provides main functions:
- `ask!` - Structured output using the spec DSL
- `abstract!` - Text summarization using Chain of Density prompting
- `eval!` - LLM self-evaluation for reliability and accuracy assessment
- `refine!` - Iterative refinement using decomposition and verification
- `models!` - Fetch available models from the LLM API
- `sample!` - Generate test data samples matching a spec

Guardrails:
- `static-guard` - Pattern-based prompt injection detection
- `moderation-guard` - LLM-based content moderation
- `guard` - Run one or more guards on input

Humanization:
- `humanize-string` - Strip AI-style phrases from text
- `humanize-data` - Humanize string values in data structures
- `humanizer` - Create a reusable humanizer function

PageIndex:
- `index!` - Index a document file (PDF, MD, TXT) and save structured data
- `load-index` - Load an indexed document from a pageindex directory

Re-exports the spec DSL (`field`, `spec`, `str->data`, `str->data-with-spec`,
`data->str`, `validate-data`, `spec->prompt`, `build-ref-registry`),
RLM (`create-env`, `register-env-fn!`, `register-env-def!`, `ingest-to-env!`,
`dispose-env!`, `query-env!`, `pprint-trace`, `print-trace`, `generate-qa-env!`),
PageIndex (`index!`, `load-index`), and `make-config`, so users can require
only this namespace.

Configuration:
Config MUST be passed explicitly to all LLM functions via the :config parameter.
No global state. No dependency injection.

Example:
```clojure
(def config (make-config {:api-key "sk-..." :base-url "https://api.openai.com/v1"}))

(ask! {:config config
       :spec my-spec
       :messages [(system "Help the user.")
                  (user "What is 2+2?")]
       :model "gpt-4o"})
```

References:
- Chain of Density: https://arxiv.org/abs/2309.04269
- LLM Self-Evaluation: https://learnprompting.org/docs/reliability/lm_self_eval
- DuTy: https://learnprompting.org/docs/advanced/decomposition/duty-distinct-chain-of-thought
- CoVe: https://learnprompting.org/docs/advanced/self_criticism/chain_of_verification

com.blockether.svar.internal.config

LLM configuration management.

Provides a single `make-config` for creating validated config maps.
No DI, no global state. Config is a plain immutable map.

Environment variables (used as fallback for :api-key and :base-url):
- BLOCKETHER_OPENAI_API_KEY (checked first)
- BLOCKETHER_OPENAI_BASE_URL (checked first)
- OPENAI_API_KEY
- OPENAI_BASE_URL

Usage:
```clojure
(def config (make-config {:api-key "sk-..."
                          :base-url "https://api.openai.com/v1"
                          :model "gpt-4o"}))

(ask! {:config config :spec my-spec :messages [(system "...") (user "...")]})
```

com.blockether.svar.internal.guard

Input guardrails for LLM interactions.

Provides factory functions that create guards to validate user input:
- `static` - Pattern-based detection of prompt injection attempts
- `moderation` - LLM-based content policy violation detection (requires :ask-fn)
- `guard` - Runs one or more guards on input

Guards are functions that take input and return it unchanged on success,
or throw ExceptionInfo on violation.

Usage:
```clojure
(require '[com.blockether.svar.core :as svar])

(def my-guards [(static)
                (moderation {:ask-fn svar/ask! :policies #{:hate}})])

(-> user-input
    (guard my-guards)
    (svar/ask! ...))
```

com.blockether.svar.internal.humanize

AI response humanization module.

Removes AI-style phrases and patterns from LLM outputs to make responses
sound more natural and human-like.

Two tiers of patterns:
- SAFE_PATTERNS (default): AI identity, refusal, knowledge, punctuation.
  Unambiguously AI-generated; safe for arbitrary text.
- AGGRESSIVE_PATTERNS (opt-in): hedging, overused verbs/adjectives/nouns,
  opening/closing cliches. May match valid English in non-AI text.
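
As a quick sketch (the result comment is indicative only, since the exact output depends on the active pattern tier):

```clojure
(require '[com.blockether.svar.core :as svar])

;; Default tier (SAFE_PATTERNS): strips unambiguously AI phrasing
;; such as identity disclaimers, leaving the substance intact.
(svar/humanize-string "As an AI language model, I believe the answer is 4.")
;; Roughly "I believe the answer is 4." -- exact result depends on the patterns.
```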

com.blockether.svar.internal.jsonish

Wrapper for the JsonishParser Java class.

Provides SAP (Schemaless Adaptive Parsing) for malformed JSON from LLMs.
Handles unquoted keys/values, trailing commas, markdown code blocks, etc.
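
From user code this parser is typically reached through the re-exported `str->data` (an assumption based on the spec namespace, which handles schemaless parsing of LLM responses). Malformed output like the following should still yield data:

```clojure
(require '[com.blockether.svar.core :as svar])

;; Unquoted keys, a trailing comma, and a markdown fence --
;; all common in raw LLM output.
(svar/str->data "```json\n{answer: \"4\", confidence: 0.9,}\n```")
;; Expected to produce a plain map; key style (strings vs. keywords)
;; depends on the parser's conventions.
```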

com.blockether.svar.internal.llm

LLM client layer: HTTP transport, message construction, and all LLM interaction
functions (ask!, abstract!, eval!, refine!, models!, sample!).

Extracted from svar.core to break the cyclic dependency between core and rlm.
rlm.clj requires this namespace directly instead of svar.core.

com.blockether.svar.internal.rlm

Recursive Language Model (RLM) for processing arbitrarily large contexts.

RLM enables an LLM to iteratively write and execute Clojure code to examine,
filter, and process large contexts that exceed token limits. The LLM writes
code that runs in a sandboxed SCI (Small Clojure Interpreter) environment,
inspects results, and decides whether to continue iterating or return a final
answer.

## API

```clojure
;; 1. Create environment (holds DB, config, SCI context)
(def env (rlm/create-env {:config llm-config :path "/tmp/my-rlm"}))

;; 2. Ingest documents (can call multiple times)
(rlm/ingest-to-env! env documents)
(rlm/ingest-to-env! env more-documents)

;; 3. Run queries (reuses same env)
(rlm/query-env! env "What is X?")
(rlm/query-env! env "Find Y" {:spec my-spec})

;; 4. Dispose when done
(rlm/dispose-env! env)
```

## Key Features

- Iterative code execution: LLM writes code, sees results, writes more code
- FINAL termination: LLM signals completion by returning `{:FINAL result}`
- Recursive llm-query: Code can call back to the LLM for sub-tasks
- Sandboxed evaluation: Uses SCI for safe, controlled code execution
- Documents: Complete structure stored exactly as-is:
  - Documents with metadata
  - Pages with page nodes (paragraphs, headings, images, tables)
  - TOC entries
- Learnings: DB-backed meta-insights that persist across sessions
- Spec support: Define output shape, validate FINAL answers
- Auto-refinement: Self-critique loop improves answer quality

## LLM Available Functions (in SCI sandbox)

Document search:
- (list-documents) - List all stored documents
- (get-document doc-id) - Get document metadata
- (search-page-nodes query) - List/filter actual content
- (get-page-node node-id) - Get full page node content
- (list-page-nodes opts) - List page nodes with filters
- (search-toc-entries query) - List/filter table of contents
- (get-toc-entry entry-id) - Get TOC entry
- (list-toc-entries) - List all TOC entries

Learnings:
- (store-learning insight) - Store meta-insight
- (search-learnings query) - Search learnings
- (vote-learning id :useful/:not-useful) - Vote on learning

History:
- (search-history n) - Get recent messages (default 5)
- (get-history n) - Get recent messages (default 10)
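
Inside the sandbox these compose like ordinary Clojure. A hypothetical iteration (node IDs and map keys are illustrative, not guaranteed shapes) might look like:

```clojure
;; Find relevant content, record an insight, then terminate with {:FINAL ...}.
;; :node/id and :node/text are illustrative keys -- the actual shapes depend
;; on the ingested documents.
(let [hits (search-page-nodes "warranty period")
      node (get-page-node (:node/id (first hits)))]
  (store-learning "Warranty terms appear under the 'Service' heading.")
  {:FINAL {:answer (:node/text node)
           :source (:node/id node)}})
```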

com.blockether.svar.internal.rlm.internal.pageindex.core

Main API for RLM document indexing - extracts structured data from documents.

Primary functions:
- `build-index` - Extract structure from file path or string content
- `index!` - Index and save to EDN + PNG files
- `load-index` - Load indexed document from EDN directory
- `inspect` - Print full document summary with TOC tree
- `print-toc-tree` - Print a formatted TOC tree from TOC entries

Supported file types:
- PDF (.pdf) - Uses vision LLM for node-based extraction
- Markdown (.md, .markdown) - Parses heading structure into pages (no LLM needed)
- Plain text (.txt, .text) - Uses LLM for text extraction
- Images (.png, .jpg, .jpeg, .gif, .bmp, .webp) - Direct vision LLM extraction

Markdown files are parsed deterministically by heading structure:
- Top-level headings (h1, or first level found) become page boundaries
- Nested headings become section nodes within each page
- No LLM required for structure extraction

Post-processing:
1. Translates local node IDs to globally unique UUIDs
2. If no TOC exists in document, generates one from Section/Heading structure
3. Links TocEntry target-section-id to matching Section nodes
4. Generates document abstract from all section descriptions using Chain of Density

Usage:
```clojure
(require '[com.blockether.svar.internal.rlm.internal.pageindex.core :as pageindex])

;; Index a PDF
(def doc (pageindex/build-index "manual.pdf"))

;; Index and save to EDN + PNG files
(pageindex/index! "manual.pdf")
;; => {:document {...} :output-path "manual.pageindex"}

;; Load and inspect (includes TOC tree)
(pageindex/inspect "manual.pageindex")
```

com.blockether.svar.internal.rlm.internal.pageindex.markdown

Markdown parsing for RLM - extracts hierarchical structure from markdown files.

Primary functions:
- `markdown->pages` - Main API: convert markdown string to page-based format
- `markdown-file->pages` - Convenience: reads file and calls markdown->pages

Design:
- Top-level headings (h1, or first heading level found) become 'pages'
- Nested headings become nodes within each page
- Code blocks are skipped when parsing headings
- Each section includes text from heading to next heading
- No LLM required - deterministic parsing
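
A sketch of the intended behavior (the precise return shape is not documented here, so the comment is descriptive only):

```clojure
(require '[com.blockether.svar.internal.rlm.internal.pageindex.markdown :as md])

(md/markdown->pages "# Intro\nSome text.\n\n## Details\nMore text.\n\n# Usage\nAn example.")
;; Two h1 headings => two pages: "Intro" (containing a nested
;; "Details" section node) and "Usage".
```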

com.blockether.svar.internal.rlm.internal.pageindex.pdf

PDF to images conversion and metadata extraction using Apache PDFBox.

Provides:
- `pdf->images` - Convert PDF file to vector of BufferedImage objects
- `page-count` - Get total page count of a PDF file
- `pdf-metadata` - Extract PDF metadata (author, title, dates, etc.)
- `detect-text-rotation` - Detect content rotation per page using text position heuristics

Uses PDFBox for reliable PDF rendering at configurable DPI.
Handles error cases: encrypted PDFs, corrupted files, file not found.
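
A minimal sketch of the helpers (single-argument arities are assumed from the descriptions above; the option name for configuring DPI is not documented here, so it is omitted):

```clojure
(require '[com.blockether.svar.internal.rlm.internal.pageindex.pdf :as pdf])

(pdf/page-count "manual.pdf")    ;; total number of pages
(pdf/pdf-metadata "manual.pdf")  ;; map of author, title, dates, etc.

(def images (pdf/pdf->images "manual.pdf"))
;; => vector of java.awt.image.BufferedImage, one per page
```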

com.blockether.svar.internal.rlm.internal.pageindex.spec

Comprehensive clojure.spec definitions for RLM data structures.

This namespace centralizes ALL specs for the RLM system to provide
a clear view of the complete data model. Individual namespaces will require
this namespace and use these specs for validation.

Data Model Philosophy:
- FLAT structure with parent references (Datalevin-style)
- All keywords are namespaced (`:node/*`, `:page/*`, `:toc/*`)
- Vector of maps output (not nested trees)
- `:node/parent-id` creates hierarchy (nil for root nodes)
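
Concretely, a chapter with two subsections is stored flat rather than nested; the `:node/title` key below is illustrative (only the `:node/*` namespacing and `:node/parent-id` are documented here):

```clojure
;; Flat vector of maps; the hierarchy lives entirely in :node/parent-id.
[{:node/id "a1" :node/parent-id nil  :node/title "Chapter 1"}   ;; root node
 {:node/id "b2" :node/parent-id "a1" :node/title "Section 1.1"} ;; child of a1
 {:node/id "c3" :node/parent-id "a1" :node/title "Section 1.2"}]
```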

com.blockether.svar.internal.rlm.internal.pageindex.vision

Vision/LLM-based text extraction from documents.

Provides:
- `image->base64` - Convert BufferedImage to base64 PNG string
- `image->bytes` - Convert BufferedImage to PNG byte array
- `image->bytes-region` - Extract and convert a bounding-box region to PNG bytes
- `extract-image-region` - Crop a BufferedImage to a bounding-box region
- `scale-and-clamp-bbox` - Scale and clamp bounding box coordinates to image dimensions
- `extract-text-from-image` - Extract structured nodes from a single BufferedImage (vision)
- `extract-text-from-pdf` - Extract structured nodes from all pages of a PDF (vision)
- `extract-text-from-text-file` - Extract from text/markdown file (LLM, no image rendering)
- `extract-text-from-image-file` - Extract from image file (vision)
- `extract-text-from-string` - Extract from string content (LLM, no image rendering)
- `infer-document-title` - Infer a document title from page content using LLM

Configuration is passed explicitly via opts maps.
Uses multimodal LLM for both image and text extraction.
Parallel extraction using core.async channels for PDFs.
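
A small sketch of the image conversion helpers, using a blank in-memory image (single-argument arities are an assumption drawn from the descriptions above):

```clojure
(import '(java.awt.image BufferedImage))
(require '[com.blockether.svar.internal.rlm.internal.pageindex.vision :as vision])

(def img (BufferedImage. 100 100 BufferedImage/TYPE_INT_RGB))

(vision/image->base64 img) ;; base64-encoded PNG string
(vision/image->bytes img)  ;; PNG byte array
```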

com.blockether.svar.internal.spec

Structured output specification system for LLM responses.

This namespace provides a DSL for defining expected output structures,
converting specs to LLM prompts, and parsing LLM responses back to Clojure data.

Primary functions:
- `field` - Define a field with name, type, cardinality, and description
- `spec` - Create a spec from field definitions
- `build-ref-registry` - Build a registry of referenced specs for nested types
- `spec->prompt` - Generate LLM prompt text from a spec (sent to LLM)
- `str->data` - Parse LLM response string to Clojure data (schemaless)
- `str->data-with-spec` - Parse LLM response with spec-based type coercion
- `validate-data` - Validate parsed data against a spec
- `data->str` - Serialize Clojure data to JSON string

Data Flow:
1. Define spec with `spec` and `field` functions
2. Generate prompt with `spec->prompt` (sent to LLM)
3. Parse response with `str->data-with-spec` (LLM response -> typed Clojure map)
4. Optionally validate with `validate-data`
5. Optionally serialize with `data->str`
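
A minimal sketch of that flow; the exact `field`/`spec` argument shapes below are assumptions based on the one-line descriptions above, not confirmed signatures:

```clojure
;; 1. Define the expected output structure (argument shapes are assumed).
(def answer-spec
  (spec [(field :answer :string :one "The final answer")
         (field :confidence :float :one "Confidence from 0.0 to 1.0")]))

;; 2. Generate prompt text describing the structure (sent to the LLM).
(spec->prompt answer-spec)

;; 3. Parse the LLM's reply into a typed Clojure map.
(str->data-with-spec "{\"answer\": \"4\", \"confidence\": 0.9}" answer-spec)

;; 4. Optionally validate the parsed data against the spec.
(validate-data answer-spec {:answer "4" :confidence 0.9})
```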

com.blockether.svar.internal.tokens

Token counting utilities for LLM API interactions.

Based on JTokkit (https://github.com/knuddelsgmbh/jtokkit) - a Java implementation
of OpenAI's TikToken tokenizer.

Provides:
- `count-tokens` - Count tokens for a string using a specific model's encoding
- `count-messages` - Count tokens for a chat completion message array
- `estimate-cost` - Estimate cost in USD based on model pricing
- `count-and-estimate` - Count tokens and estimate cost in one call
- `context-limit` - Get max context window for a model
- `max-input-tokens` - Get max input tokens (context minus output reserve)
- `truncate-text` - Token-aware text truncation
- `truncate-messages` - Smart message truncation with priority
- `check-context-limit` - Pre-flight check before API calls
- `format-cost` - Format USD cost for display
- `get-model-pricing` - Look up per-model pricing info

Note: Token counts are approximate. Chat completion API payloads have ~25 token
error margin due to internal OpenAI formatting that isn't publicly documented.
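
A usage sketch (argument order and message shape are assumptions drawn from the descriptions above):

```clojure
(require '[com.blockether.svar.internal.tokens :as tokens])

(tokens/count-tokens "What is 2+2?" "gpt-4o") ;; approximate token count
(tokens/context-limit "gpt-4o")               ;; max context window size

(tokens/count-messages
  [{:role "system" :content "Help the user."}
   {:role "user" :content "What is 2+2?"}]
  "gpt-4o")
;; Remember the ~25-token error margin noted above.
```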

com.blockether.svar.internal.util

Shared internal utilities.

