Idiomatic Clojure tokenization: encode, decode, and count tokens against any
HuggingFace tokenizer.json, backed by the native Rust tokenizers library.
A thin Clojure wrapper over DJL's ai.djl.huggingface/tokenizers, which binds the
same fast Rust tokenizers that HuggingFace ships for Python. It gives you exact
token counts and ids for BERT, GPT, Llama, Qwen, and any other model that
publishes a tokenizer.json - no Python, no network at runtime once the model
file is local.
Leiningen / Boot:
[net.clojars.savya/tokenizers-clj "0.1.0"]
deps.edn:
net.clojars.savya/tokenizers-clj {:mvn/version "0.1.0"}
(require '[tokenizers.core :as tok])

;; From a local tokenizer.json ...
(with-open [t (tok/from-file "bert-base-uncased/tokenizer.json")]
  (tok/count-tokens t "Hello, world!")) ;=> 6

;; ... or straight from the HuggingFace hub (downloads + caches once).
(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (tok/encode t "Hello, world!"))
;=> {:ids [101 7592 1010 2088 999 102]
;    :tokens ["[CLS]" "hello" "," "world" "!" "[SEP]"]
;    :attention-mask [1 1 1 1 1 1]
;    :type-ids [0 0 0 0 0 0]
;    :word-ids [...]
;    :special-tokens-mask [1 0 0 0 0 1]}

;; Drop the framing special tokens for a raw count:
(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (tok/count-tokens t "Hello, world!" {:add-special-tokens? false})) ;=> 4

;; Round-trip:
(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (tok/decode t (tok/ids t "hello there" {:add-special-tokens? false})))
;=> "hello there"
batch-encode pads every sequence to the batch's longest so the result is
rectangular; real token counts are recoverable from each :attention-mask.
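As a rough illustration of recovering real token counts from the masks, the sketch below sums each :attention-mask. It deliberately does not call batch-encode: the data is hand-written, and the assumption that a batch result is a sequence of maps each carrying an :attention-mask (1 for a real token, 0 for padding) is mine, not something this README confirms. The names padded-batch and real-token-counts are hypothetical.

```clojure
;; Hypothetical stand-in for a padded batch result. The real return shape of
;; batch-encode is an assumption; only the :attention-mask key is used here.
(def padded-batch
  [{:ids [101 7592 102 0 0]         :attention-mask [1 1 1 0 0]}
   {:ids [101 7592 1010 2088 102]   :attention-mask [1 1 1 1 1]}])

(defn real-token-counts
  "Returns the unpadded token count of each encoding by summing its
  :attention-mask (1 = real token, 0 = padding)."
  [encodings]
  (mapv #(reduce + (:attention-mask %)) encodings))

(real-token-counts padded-batch) ;=> [3 5]
```

Summing the mask avoids having to know which id the tokenizer uses for padding, which varies between models.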
os.arch, so on Apple Silicon use an arm64 JDK - an x86_64 JVM running under
Rosetta fails to resolve the native tokenizer (Unexpected flavor: cpu).
~/.djl.ai/), and on from-pretrained to download the model file.
Copyright (c) 2026 Savyasachi. Released under the Eclipse Public License 1.0.
Wraps Deep Java Library (Apache-2.0) and the HuggingFace
tokenizers library (Apache-2.0).