tokenizers.core

Idiomatic Clojure wrapper over DJL's HuggingFace tokenizers, which bind the native Rust `tokenizers` library via JNI. Build a tokenizer with `from-file`, `from-pretrained`, or `from-stream`, then call `encode`, `decode`, or `count-tokens` on it.

A tokenizer holds a native handle, so close it when you are done with it to release the handle; `with-open` works.
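
The lifecycle described above can be sketched as follows. This is a minimal illustration, not taken from the library's own docs: it assumes the library is on the classpath, the alias `tok` is arbitrary, and `"path/to/tokenizer.json"` is a placeholder path.

```clojure
(require '[tokenizers.core :as tok])

;; "path/to/tokenizer.json" is a placeholder; point it at a real file.
(with-open [t (tok/from-file "path/to/tokenizer.json")]
  ;; The native handle is only valid inside this body, so compute
  ;; everything you need before with-open closes the tokenizer.
  {:n-tokens (tok/count-tokens t "Hello, world!")
   :ids      (vec (tok/ids t "Hello, world!"))})
```

Wrapping results in `vec` forces them to be realized while the tokenizer is still open, which is a cautious choice in case any result is lazy.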

batch-encode (clj)

(batch-encode t texts)

Encode many `texts` at once, returning a vector of maps, each shaped like the result of `encode`.
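
A small sketch of batch encoding, assuming (this is not stated in the docstring) that the returned vector is in the same order as `texts`. The path and the sample sentences are placeholders.

```clojure
(require '[tokenizers.core :as tok])

(with-open [t (tok/from-file "path/to/tokenizer.json")]
  (let [texts ["first sentence" "a second, longer sentence"]]
    ;; mapv is eager, so the result is fully built before the
    ;; tokenizer is closed. Pairing by position relies on the
    ;; order assumption noted above.
    (mapv (fn [text enc]
            {:text text :n-ids (count (:ids enc))})
          texts
          (tok/batch-encode t texts))))
```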

count-tokens (clj)

(count-tokens t text)
(count-tokens t text opts)

Returns the number of token ids that `text` encodes to. `opts` are the same as for `encode`.
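
One plausible use is checking text against a token budget. In the sketch below, `fits?` and `max-tokens` are hypothetical names introduced for illustration; only `count-tokens` and its `:add-special-tokens?` option come from this namespace.

```clojure
(require '[tokenizers.core :as tok])

(defn fits?
  "True when `text` encodes to at most `max-tokens` token ids.
  Hypothetical helper, not part of tokenizers.core."
  [t text max-tokens]
  (<= (tok/count-tokens t text) max-tokens))

;; Counting without special tokens goes through the same opts map
;; that encode accepts.
(defn bare-count [t text]
  (tok/count-tokens t text {:add-special-tokens? false}))
```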

decode (clj)

(decode t id-seq)
(decode t id-seq {:keys [skip-special-tokens?] :or {skip-special-tokens? true}})

Decode `id-seq`, a seq of token ids, back to text. Opts: `:skip-special-tokens?` (default true).
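
A round-trip sketch showing both settings of `:skip-special-tokens?`. It assumes network access on first use, since it loads the tokenizer with `from-pretrained`. The decoded text may not match the input exactly; an uncased model, for instance, lowercases its input.

```clojure
(require '[tokenizers.core :as tok])

(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (let [id-seq (tok/ids t "Hello, world!")]
    {;; Default: special tokens are dropped from the output.
     :clean (tok/decode t id-seq)
     ;; Keep whatever special tokens the encoding added.
     :raw   (tok/decode t id-seq {:skip-special-tokens? false})}))
```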

encode (clj)

(encode t text)
(encode t
        text
        {:keys [add-special-tokens? with-overflowing-tokens?]
         :or {add-special-tokens? true with-overflowing-tokens? false}})

Encode `text` into a map with the keys `:ids`, `:tokens`, `:type-ids`, `:word-ids`, `:attention-mask`, and `:special-tokens-mask`. Opts: `:add-special-tokens?` (default true), `:with-overflowing-tokens?` (default false).
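
The result map can be destructured directly. The sketch below also encodes the same text with `:add-special-tokens?` turned off for comparison. It assumes network access on first use; the keys on the returned map (`:bare-tokens` and so on) are illustrative names only.

```clojure
(require '[tokenizers.core :as tok])

(with-open [t (tok/from-pretrained "bert-base-uncased")]
  (let [{:keys [ids tokens attention-mask]} (tok/encode t "Hello, world!")
        ;; Same text, encoded without the model's special tokens.
        bare (tok/encode t "Hello, world!" {:add-special-tokens? false})]
    {:tokens      (vec tokens)
     :ids         (vec ids)
     :mask        (vec attention-mask)
     :bare-tokens (vec (:tokens bare))}))
```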

from-file (clj)

(from-file path)

Create a tokenizer from a `tokenizer.json` file, given as a path string, `File`, or `Path`.

from-pretrained (clj)

(from-pretrained id)

Create a tokenizer from a HuggingFace hub id, e.g. `"bert-base-uncased"`. The tokenizer is downloaded and then cached, so network access is needed on first use.

from-stream (clj)

(from-stream is)

Create a tokenizer from an `InputStream` over a `tokenizer.json`.
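
Loading from a stream is convenient when the `tokenizer.json` ships inside a jar. The sketch below assumes a classpath resource named `tokenizer.json`, which is a placeholder; `io/resource` returns nil when the resource is missing, and `io/input-stream` will then throw.

```clojure
(require '[clojure.java.io :as io]
         '[tokenizers.core :as tok])

;; with-open closes its bindings in reverse order: the tokenizer
;; first, then the underlying stream.
(with-open [in (io/input-stream (io/resource "tokenizer.json"))
            t  (tok/from-stream in)]
  (vec (tok/tokens t "Hello, world!")))
```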

ids (clj)

(ids t text)
(ids t text opts)

Returns the token ids for `text`. `opts` are the same as for `encode`.

tokens (clj)

(tokens t text)
(tokens t text opts)

Returns the token strings for `text`. `opts` are the same as for `encode`.
