Idiomatic Clojure wrapper over DJL's HuggingFace tokenizers, which bind the native Rust `tokenizers` library via JNI. Build a tokenizer with `from-file` / `from-pretrained` / `from-stream`, then `encode`, `decode`, or `count-tokens`. A tokenizer holds a native handle: close it (`with-open` works) to free it.
`(batch-encode t texts)`
Encode many `texts` at once, returning a vector of `encode`-shaped maps.
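A small sketch of how the result of `batch-encode` could be lined up with its inputs. It assumes the returned vector follows the order of `texts`, which the docstring does not state explicitly. `ids-per-text` is a hypothetical helper, not part of this library, and `tok` is a placeholder alias for the library's namespace, whose name is not given here.

```clojure
(defn ids-per-text
  "Pairs each input text with the :ids of its encoding.
  Assumes `encodings` is in the same order as `texts`."
  [texts encodings]
  (mapv (fn [text enc] {:text text :ids (:ids enc)})
        texts
        encodings))

(comment
  ;; `tok` is a placeholder alias; require the library's namespace under it first.
  (with-open [t (tok/from-pretrained "bert-base-uncased")]
    (let [texts ["first text" "a second, longer text"]]
      (ids-per-text texts (tok/batch-encode t texts)))))
```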
`(count-tokens t text)`
`(count-tokens t text opts)`
Number of token ids `text` encodes to (see `encode` for `opts`).
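One possible use of `count-tokens` is keeping a list of texts under a token budget. The sketch below is illustrative only: `take-within-budget` is a hypothetical helper, not part of this library, and `tok` is a placeholder alias for the library's namespace, whose name is not given here. The helper takes the counting function as an argument so it can be tried with any stand-in.

```clojure
(defn take-within-budget
  "Returns the longest prefix of `texts` whose summed token counts stay
  at or under `budget`. `count-fn` maps one text to its token count."
  [count-fn budget texts]
  (let [running-totals (reductions + (map count-fn texts))]
    (mapv first
          (take-while (fn [[_ total]] (<= total budget))
                      (map vector texts running-totals)))))

(comment
  ;; `tok` is a placeholder alias; require the library's namespace under it first.
  (with-open [t (tok/from-pretrained "bert-base-uncased")]
    (take-within-budget #(tok/count-tokens t %)
                        512
                        ["first text" "second text" "third text"])))
```

Passing the counter in as a function keeps the helper independent of any one tokenizer instance.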
`(decode t id-seq)`
`(decode t id-seq {:keys [skip-special-tokens?] :or {skip-special-tokens? true}})`
Decode a seq of token ids, `id-seq`, back to text. Opts: `:skip-special-tokens?` (default true).
`(encode t text)`
`(encode t text {:keys [add-special-tokens? with-overflowing-tokens?] :or {add-special-tokens? true with-overflowing-tokens? false}})`
Encode `text` into a map with the keys `:ids`, `:tokens`, `:type-ids`, `:word-ids`, `:attention-mask`, and `:special-tokens-mask`. Opts: `:add-special-tokens?` (default true), `:with-overflowing-tokens?` (default false).
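A minimal sketch of working with an `encode`-shaped map, assuming `:tokens` and `:ids` are parallel sequences of equal length (the docstring lists the keys but not their exact value types). `token-id-pairs` is a hypothetical helper, not part of this library, and `tok` is a placeholder alias for the library's namespace.

```clojure
(defn token-id-pairs
  "Pairs each token string with its id, given an encode-shaped map.
  Assumes :tokens and :ids are parallel sequences."
  [enc]
  (mapv vector (:tokens enc) (:ids enc)))

(comment
  ;; `tok` is a placeholder alias; require the library's namespace under it first.
  (with-open [t (tok/from-pretrained "bert-base-uncased")]
    ;; With special tokens (the default) and without them.
    {:with-special    (token-id-pairs (tok/encode t "Hello, world!"))
     :without-special (token-id-pairs
                       (tok/encode t "Hello, world!"
                                   {:add-special-tokens? false}))}))
```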
`(from-file path)`
Tokenizer from a `tokenizer.json` file. `path` may be a path string, a `File`, or a `Path`.
`(from-pretrained id)`
Tokenizer by HuggingFace hub id, e.g. `"bert-base-uncased"`. The tokenizer is downloaded on first use and then cached, so network access is needed the first time.
`(from-stream is)`
Tokenizer from an `InputStream` over a `tokenizer.json`.
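`from-stream` may be handy when the `tokenizer.json` ships inside a jar rather than on disk. The sketch below shows one way to open such a stream with `clojure.java.io`. `resource-stream` is a hypothetical helper, the resource name `"tokenizer.json"` is only an example, and `tok` is a placeholder alias for the library's namespace.

```clojure
(require '[clojure.java.io :as io])

(defn resource-stream
  "Opens an InputStream over a classpath resource, or throws if the
  resource cannot be found."
  [resource-name]
  (if-let [url (io/resource resource-name)]
    (io/input-stream url)
    (throw (ex-info "Resource not found on classpath"
                    {:resource resource-name}))))

(comment
  ;; `tok` is a placeholder alias; require the library's namespace under it first.
  ;; with-open closes `t` first, then the stream.
  (with-open [is (resource-stream "tokenizer.json")
              t  (tok/from-stream is)]
    (tok/tokens t "Hello, world!")))
```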
`(ids t text)`
`(ids t text opts)`
Token ids for `text` (see `encode` for `opts`).
`(tokens t text)`
`(tokens t text opts)`
Token strings for `text` (see `encode` for `opts`).