chunk.core

Recursive text splitting (chunking) for RAG / LLM pipelines.

`split` breaks text into chunks no larger than a target size, optionally overlapping. It tries an ordered list of separators from coarsest (paragraph) to finest (character) so that chunks land on natural boundaries. Size is measured by `:length-fn` (default `count`, i.e. characters); pass a token counter (e.g. tokenizers-clj's `count-tokens`) to chunk by tokens instead, which is the correct unit when feeding a model with a token limit.
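As a rough illustration of swapping the size unit, the sketch below defines a stand-in length function and shows how it might be passed to `split`. `approx-token-count` is a hypothetical helper that counts whitespace-separated words; it is not a real tokenizer and not part of this library. `long-text` is a placeholder for your own input. The calls to `chunk.core/split` sit inside a `comment` form because they assume the library is on the classpath.

```clojure
;; Hypothetical stand-in for a real token counter (assumption, not library code):
;; counts runs of non-whitespace characters, i.e. roughly "words".
(defn approx-token-count
  [s]
  (count (re-seq #"\S+" s)))

(comment
  ;; Assumes chunk.core is available; long-text is a placeholder string.
  (require '[chunk.core :as chunk])

  ;; Default behaviour: sizes are measured in characters via count.
  (chunk/split long-text {:chunk-size 500 :overlap 50})

  ;; Same call, but :chunk-size and :overlap are now in approx-token-count units.
  (chunk/split long-text {:chunk-size 200
                          :overlap 20
                          :length-fn approx-token-count}))
```

With a real tokenizer, the same pattern should apply: any function from a string to a number can presumably serve as `:length-fn`.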

default-separators (clj)

Coarsest-to-finest split boundaries (the empty string splits into characters).
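To make the ordering concrete, here is a sketch of what a coarsest-to-finest separator list could look like. The vector `example-separators` and the helper `chars-of` are hypothetical; the actual contents of `default-separators` are not shown on this page and may differ.

```clojure
;; Hypothetical list for illustration only; not necessarily the value of
;; chunk.core/default-separators.
(def example-separators
  ["\n\n"  ; paragraph break (coarsest)
   "\n"    ; line break
   " "     ; word boundary
   ""])    ; empty string: every character is a boundary (finest)

;; Splitting on the empty string amounts to taking characters one by one.
(defn chars-of
  [s]
  (mapv str s))

(chars-of "abc") ;; => ["a" "b" "c"]
```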

split (clj)

(split text)
(split text
       {:keys [chunk-size overlap separators length-fn]
        :or {chunk-size 1000
             overlap 0
             separators default-separators
             length-fn count}})

Split text into a vector of chunk strings.

Options:

  • `:chunk-size`: maximum size of a chunk, in `:length-fn` units (default 1000)
  • `:overlap`: size of the trailing context repeated at the start of the next chunk (default 0)
  • `:separators`: ordered split boundaries, coarsest first (default `default-separators`)
  • `:length-fn`: measures a string's size; default `count` (characters). Pass a token counter to chunk by tokens.

A piece with no admissible finer separator (an "atom" longer than :chunk-size) is emitted whole rather than dropped.
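The fallback through separators and the "atom" rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's implementation: `split-sketch`, `pieces`, and `merge-greedy` are hypothetical names, the default separator vector is a guess, separators are treated as literal strings, and `:overlap` is not handled at all.

```clojure
(require '[clojure.string :as str])

(defn- pieces
  "Split s on the literal separator sep, dropping empty pieces.
  The empty string splits s into single characters."
  [s sep]
  (if (= sep "")
    (mapv str s)
    (vec (remove empty?
                 (str/split s (re-pattern (java.util.regex.Pattern/quote sep)))))))

(defn- merge-greedy
  "Glue adjacent parts back together with sep while the result still fits."
  [parts sep chunk-size length-fn]
  (reduce (fn [acc part]
            (let [joined (when-let [cur (peek acc)] (str cur sep part))]
              (if (and joined (<= (length-fn joined) chunk-size))
                (conj (pop acc) joined)
                (conj acc part))))
          []
          parts))

(defn split-sketch
  "Simplified recursive splitter (no overlap). Assumed defaults, not the library's."
  ([text] (split-sketch text {}))
  ([text {:keys [chunk-size separators length-fn]
          :or {chunk-size 1000 separators ["\n\n" "\n" " " ""] length-fn count}}]
   (letfn [(go [s seps]
             (if (or (<= (length-fn s) chunk-size) (empty? seps))
               ;; Either s fits, or no finer separator is left: emit s whole.
               [s]
               (let [sep    (first seps)
                     merged (merge-greedy (pieces s sep) sep chunk-size length-fn)]
                 ;; Anything still too large is retried with the finer separators.
                 (mapcat #(go % (rest seps)) merged))))]
     (vec (go text separators)))))

;; Pieces are merged up to the size limit:
(split-sketch "aa bb cc dd" {:chunk-size 5 :separators [" "]})
;; => ["aa bb" "cc dd"]

;; An oversized word with no finer separator available is kept, not dropped:
(split-sketch "hi supercalifragilistic yo" {:chunk-size 5 :separators [" "]})
;; => ["hi" "supercalifragilistic" "yo"]
```

In this sketch an atom can only occur when the separator list does not end in the empty string, or when `:length-fn` reports even a single character as larger than `:chunk-size`.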
