chunk-clj

Recursive text splitting (chunking) for RAG and LLM pipelines - split text on natural boundaries into overlapping chunks sized by characters or tokens. Pure Clojure, zero dependencies.

The standard way to prepare documents for retrieval is to split them into chunks that fit a model's context, on sensible boundaries, with a little overlap so a thought split across two chunks survives in both. chunk-clj is the Clojure equivalent of LangChain's RecursiveCharacterTextSplitter: it tries a list of separators from coarsest (paragraph) to finest (character) until each chunk fits, then packs and overlaps them.

The size limit is measured in whatever unit :length-fn returns. It defaults to character count, but since models limit input by tokens, you can pass a token counter (e.g. tokenizers-clj) and chunk by real token budgets.
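As a rough illustration of the strategy described above, here is a minimal sketch of recursive splitting followed by greedy packing with overlap. It is not chunk-clj's source: the names `split-sketch`, `atoms`, `pack`, and `split-keeping-sep` are hypothetical, and details such as keeping each separator attached to the preceding piece and carrying overlap in whole pieces are assumptions made for the example.

```clojure
(ns chunk-sketch
  (:require [clojure.string :as str]))

(defn- split-keeping-sep
  "Splits s on the literal separator sep, leaving sep attached to the end of
  each piece so the pieces concatenate back to s. An empty sep splits s into
  single characters."
  [s sep]
  (if (= sep "")
    (mapv str s)
    (loop [start 0, acc []]
      (if-let [i (str/index-of s sep start)]
        (let [end (+ i (count sep))]
          (recur end (conj acc (subs s start end))))
        (cond-> acc (< start (count s)) (conj (subs s start)))))))

(defn- atoms
  "Breaks s into pieces that each fit size, trying separators coarsest first.
  A piece that still does not fit once separators run out is kept whole."
  [s separators size length-fn]
  (cond
    (<= (length-fn s) size) [s]
    (empty? separators)     [s]
    :else (into []
                (mapcat #(atoms % (rest separators) size length-fn))
                (split-keeping-sep s (first separators)))))

(defn- pack
  "Greedily merges consecutive pieces into chunks of at most size. Each new
  chunk is seeded with trailing pieces of the previous one, up to overlap."
  [pieces size overlap length-fn]
  (let [len (fn [ps] (length-fn (apply str ps)))]
    (loop [pieces pieces, current [], chunks []]
      (if-let [p (first pieces)]
        (if (or (empty? current) (<= (len (conj current p)) size))
          (recur (rest pieces) (conj current p) chunks)
          ;; current is full: emit it, then drop leading pieces until the
          ;; remaining tail fits the overlap budget and leaves room for p.
          (let [tail (loop [t current]
                       (if (and (seq t)
                                (or (> (len t) overlap)
                                    (> (len (conj t p)) size)))
                         (recur (subvec t 1))
                         t))]
            (recur pieces tail (conj chunks (apply str current)))))
        (cond-> chunks (seq current) (conj (apply str current)))))))

(defn split-sketch
  "Illustrative stand-in for a recursive splitter; option names and defaults
  mirror the ones documented for chunk-clj."
  [text {:keys [chunk-size overlap separators length-fn]
         :or   {chunk-size 1000
                overlap    0
                separators ["\n\n" "\n" " " ""]
                length-fn  count}}]
  (pack (atoms text separators chunk-size length-fn)
        chunk-size overlap length-fn))

(split-sketch "aa bb cc dd" {:chunk-size 8 :overlap 3})
;=> ["aa bb " "bb cc dd"]
```

In this sketch the overlap is carried in whole pieces, so a chunk may share less than `:overlap` with its predecessor when a larger tail would not leave room for the next piece. The library's exact trimming and overlap behavior may differ.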

Install

Leiningen / Boot:

[net.clojars.savya/chunk-clj "0.1.0"]

deps.edn:

net.clojars.savya/chunk-clj {:mvn/version "0.1.0"}

Usage

(require '[chunk.core :as chunk])

;; Character-sized chunks with overlap (the default):
(chunk/split long-text {:chunk-size 1000 :overlap 200})
;=> ["first ~1000-char chunk ..." "next chunk, sharing ~200 chars ..." ...]

;; Short text is returned whole:
(chunk/split "hello world" {:chunk-size 100})
;=> ["hello world"]

;; Custom separators (e.g. split markdown on headings first):
(chunk/split doc {:chunk-size 800 :separators ["\n## " "\n\n" "\n" " " ""]})

Chunk by tokens (the right way)

Models cap input by tokens, not characters, so size chunks with a real tokenizer:

(require '[chunk.core :as chunk]
         '[tokenizers.core :as tok])

(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (chunk/split long-text {:chunk-size 256          ; 256 tokens, not chars
                          :overlap    32
                          :length-fn  #(tok/count-tokens t %)}))

Any String -> number function works as :length-fn, so you can target an embedding model's exact token limit.
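For instance, when no tokenizer is available, a crude word count can stand in as a size measure. The helper `approx-tokens` below is hypothetical (it is not part of chunk-clj or any tokenizer library) and only approximates real token counts, so treat it as a sketch of the `:length-fn` contract rather than a recommended measure.

```clojure
(defn approx-tokens
  "Crude size estimate: the number of whitespace-delimited words in s.
  Hypothetical helper for illustration; real tokenizers usually produce
  more tokens than words."
  [s]
  (count (re-seq #"\S+" s)))

(approx-tokens "split text on natural boundaries")
;=> 5

(comment
  ;; Assumes chunk-clj is on the classpath and long-text is bound to a string.
  (require '[chunk.core :as chunk])
  (chunk/split long-text {:chunk-size 256
                          :overlap    32
                          :length-fn  approx-tokens}))
```

Because the function is called on candidate chunks while splitting, a cheap `:length-fn` keeps chunking fast; an exact tokenizer is the better choice when the budget is tight.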

Options

| Key | Default | Meaning |
| --- | --- | --- |
| `:chunk-size` | `1000` | Max chunk size, in `:length-fn` units |
| `:overlap` | `0` | Trailing context repeated at the start of the next chunk |
| `:separators` | `["\n\n" "\n" " " ""]` | Ordered split boundaries, coarsest first |
| `:length-fn` | `count` | Measures a string's size (swap in a token counter) |

An "atom" longer than :chunk-size with no admissible finer separator (e.g. one huge word when "" is not in :separators) is emitted whole rather than dropped.

License
