Recursive text splitting (chunking) for RAG / LLM pipelines. `split` breaks text into overlapping chunks no larger than a target size, trying an ordered list of separators from coarsest (paragraph) to finest (character) so chunks land on natural boundaries. Size is measured by `:length-fn` (default `count` = characters); pass a token counter (e.g. tokenizers-clj's `count-tokens`) to chunk by tokens instead - the correct unit when feeding a model with a token limit.
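The coarsest-to-finest descent described above can be illustrated with a minimal sketch. This is not the library's implementation: `sketch-separators`, `split-on`, `pieces`, `pack`, and `sketch-split` are hypothetical names, the separator list is an assumed example, overlap is left out, and each separator is kept attached to the piece before it so that concatenating the chunks reproduces the input.

```clojure
(require '[clojure.string :as str])

;; Hypothetical separator list, coarsest first. The library's actual
;; default-separators may differ.
(def sketch-separators ["\n\n" "\n" " " ""])

(defn split-on
  "Split s after each occurrence of the literal sep, keeping sep attached to
  the preceding piece. The empty string splits s into characters."
  [s sep]
  (if (= sep "")
    (mapv str s)
    (str/split s (re-pattern (str "(?<=" (java.util.regex.Pattern/quote sep) ")")))))

(defn pieces
  "Break s into pieces that fit max-size, moving to finer separators only for
  pieces that are still too large. A piece that does not fit once the
  separators run out is kept whole."
  [s seps max-size length-fn]
  (cond
    (<= (length-fn s) max-size) [s]
    (empty? seps)               [s]
    :else (let [[sep & finer] seps]
            (into []
                  (comp (remove empty?)
                        (mapcat #(pieces % finer max-size length-fn)))
                  (split-on s sep)))))

(defn pack
  "Greedily concatenate adjacent pieces while the result still fits max-size."
  [ps max-size length-fn]
  (reduce (fn [chunks p]
            (let [candidate (str (peek chunks) p)]
              (if (and (seq chunks) (<= (length-fn candidate) max-size))
                (conj (pop chunks) candidate)
                (conj chunks p))))
          []
          ps))

(defn sketch-split
  "Sketch only: recursive splitting without overlap."
  [text {:keys [chunk-size separators length-fn]
         :or   {chunk-size 1000 separators sketch-separators length-fn count}}]
  (pack (pieces text separators chunk-size length-fn) chunk-size length-fn))

(def sample-text "alpha beta\n\ngamma delta epsilon")

(sketch-split sample-text {:chunk-size 12})
;; => ["alpha beta\n\n" "gamma delta " "epsilon"]
```

The first paragraph fits within 12 characters and is kept as one chunk; the second does not, so only that paragraph is split again at spaces and then repacked.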
`default-separators`: coarsest-to-finest split boundaries (the empty string splits into characters).
`(split text)`
`(split text {:keys [chunk-size overlap separators length-fn] :or {chunk-size 1000 overlap 0 separators default-separators length-fn count}})`
Split `text` into a vector of chunk strings.
Options:
- `:chunk-size` max size of a chunk, in `:length-fn` units (default 1000)
- `:overlap` size of trailing context repeated at the start of the next chunk (default 0)
- `:separators` ordered split boundaries, coarsest first (default `default-separators`)
- `:length-fn` measures a string's size; default `count` (characters). Pass a token
counter to chunk by tokens.
A piece with no admissible finer separator (an "atom" longer than `:chunk-size`) is
emitted whole rather than dropped.
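To make the role of `:length-fn` concrete, the sketch below measures the same string in two units. `word-count` is a hypothetical stand-in for a real token counter (it only counts whitespace-separated words), and the call inside the `comment` form assumes `split` has been referred from this library's namespace.

```clojure
;; Hypothetical stand-in for a token counter: counts whitespace-separated words.
;; A real tokenizer will generally give different numbers.
(defn word-count [s]
  (count (re-seq #"\S+" s)))

(def sample "Chunk sizes depend on the unit of measure.")

;; The same string has a different "size" depending on the measuring function.
(count sample)       ; size in characters (the default :length-fn)
(word-count sample)  ; size in words

(comment
  ;; Assumes `split` is referred from the library's namespace.
  ;; :chunk-size and :overlap are both interpreted in :length-fn units,
  ;; so here they mean "at most 4 words" and "1 word of trailing context".
  (split sample {:chunk-size 4 :overlap 1 :length-fn word-count}))
```

Because `:chunk-size` and `:overlap` share the unit of `:length-fn`, switching to a token counter usually means lowering both values from their character-based settings.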