tokenizers-clj

Idiomatic Clojure tokenization: encode, decode, and count tokens against any HuggingFace tokenizer.json, backed by the native Rust tokenizers library.

Stack

Clojure, JVM, HuggingFace Tokenizers

A thin Clojure wrapper over DJL's ai.djl.huggingface/tokenizers, which binds the same fast Rust tokenizers that HuggingFace ships for Python. It gives you exact token counts and ids for BERT, GPT, Llama, Qwen, and any other model that publishes a tokenizer.json. No Python is required, and no network access is needed at runtime once the model file and DJL's native library are cached locally.

Install

Leiningen / Boot:

[net.clojars.savya/tokenizers-clj "0.1.0"]

deps.edn:

net.clojars.savya/tokenizers-clj {:mvn/version "0.1.0"}

Usage

(require '[tokenizers.core :as tok])

;; From a local tokenizer.json ...
(with-open [t (tok/from-file "bert-base-uncased/tokenizer.json")]
  (tok/count-tokens t "Hello, world!"))          ;=> 6

;; ... or straight from the HuggingFace hub (downloads + caches once).
(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (tok/encode t "Hello, world!"))
;=> {:ids [101 7592 1010 2088 999 102]
;    :tokens ["[CLS]" "hello" "," "world" "!" "[SEP]"]
;    :attention-mask [1 1 1 1 1 1]
;    :type-ids [0 0 0 0 0 0] :word-ids [...] :special-tokens-mask [1 0 0 0 0 1]}

;; Drop the framing special tokens for a raw count:
(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (tok/count-tokens t "Hello, world!" {:add-special-tokens? false}))  ;=> 4

;; Round-trip:
(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (tok/decode t (tok/ids t "hello there" {:add-special-tokens? false})))  ;=> "hello there"

batch-encode pads every sequence to the length of the longest sequence in the batch, so the result is rectangular. Real token counts are recoverable from each :attention-mask, where 1 marks a real token and 0 marks padding.
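
As a minimal sketch of recovering those counts: summing an :attention-mask gives the number of non-padding tokens in that sequence. The helper real-token-counts is hypothetical (it is not part of tokenizers-clj), and padded-batch is hand-written sample data shaped like the encode output shown above, not captured batch-encode output. It assumes batch-encode returns a sequence of such maps.

```clojure
;; Hypothetical helper, not part of tokenizers-clj.
;; Assumes each encoding is a map with an :attention-mask of 1s (real
;; tokens) and 0s (padding), as in the encode output shown above.
(defn real-token-counts
  "Returns a vector with the number of non-padding tokens per encoding."
  [encodings]
  (mapv #(reduce + 0 (:attention-mask %)) encodings))

;; Hand-written sample shaped like a padded batch (illustrative only;
;; the 0 ids stand in for padding).
(def padded-batch
  [{:ids            [101 7592 102 0 0 0]
    :attention-mask [1 1 1 0 0 0]}
   {:ids            [101 7592 1010 2088 999 102]
    :attention-mask [1 1 1 1 1 1]}])

(real-token-counts padded-batch) ;=> [3 6]
```

Because the helper only reads :attention-mask, it works on any padded result regardless of which padding token the model uses.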

Requirements

  • JDK 8+.
  • A JVM matching your CPU architecture. DJL loads a native library for the JVM's reported os.arch, so on Apple Silicon use an arm64 JDK - an x86_64 JVM running under Rosetta fails to resolve the native tokenizer (Unexpected flavor: cpu).
  • Network access the first time DJL fetches the native library (cached afterwards under ~/.djl.ai/), and on from-pretrained to download the model file.
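
To check which architecture your JVM reports before loading a tokenizer, you can read the os.arch system property from a REPL. This is a small sketch using only the standard java.lang.System API; jvm-arch is a hypothetical helper name, and the exact strings vary by platform and JDK build (a native arm64 JDK on Apple Silicon typically reports aarch64, while an x86_64 JDK under Rosetta reports an x86 value).

```clojure
;; Hypothetical helper, not part of tokenizers-clj.
;; Reads the architecture the running JVM reports, which per the
;; requirement above is what DJL uses to pick a native library.
(defn jvm-arch
  "Returns the JVM's os.arch system property as a string."
  []
  (System/getProperty "os.arch"))

(println "JVM os.arch:" (jvm-arch))
```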

License

Copyright (c) 2026 Savyasachi. Released under the Eclipse Public License 1.0.

Wraps Deep Java Library (Apache-2.0) and the HuggingFace tokenizers library (Apache-2.0).
