easy-onnx.tokenizer


close (clj)

(close tokenizer)

Close the Tokenizer. Same as `(.close tokenizer)`.

component (clj)

(component config)

Build an unstarted Tokenizer for use in a Stuart Sierra Component system.
See `create` for accepted config keys.
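
A minimal sketch of wiring the tokenizer into a Component system. It assumes the value returned by `component` implements Component's `Lifecycle` protocol (as the docstring implies) and that the `com.stuartsierra/component` library is on the classpath. The tokenizer path is hypothetical.

```clojure
(require '[com.stuartsierra.component :as component]
         '[easy-onnx.tokenizer :as tok])

;; Hypothetical path; point this at your model's tokenizer.json.
(def system
  (component/system-map
    :tokenizer (tok/component {:tokenizer-path "models/my-model/tokenizer.json"
                               :max-length     128
                               :truncation?    true})))

;; Starting the system starts the tokenizer; stopping it should release it.
(def started (component/start system))

(tok/encode (:tokenizer started) "some text to tokenize")

(component/stop started)
```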

Config (clj)


create (clj)

(create config)

Build and start a Tokenizer. Use with `with-open` for one-shot use.

Required:
  :tokenizer-path - Path to a HuggingFace tokenizer.json file.

Optional builder settings (passed through to DJL's HuggingFaceTokenizer):
  :truncation?         - boolean. Enable truncation to :max-length. Note:
                         DJL truncates at :max-length even when this is unset;
                         set it explicitly so behavior stays consistent across
                         DJL versions.
  :max-length          - int. Maximum sequence length.
  :padding?            - boolean. Pad to the longest input in a batch
                         (no-op for single-text encoding).
  :pad-to-max-length?  - boolean. Pad encoded sequences out to :max-length.
                         Useful for fixed-shape tensor inputs.
  :pad-to-multiple-of  - int. Pad sequence length up to a multiple of this.
  :add-special-tokens? - boolean. Wrap with [CLS]/[SEP] (default per model).
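
For one-shot use, the keys above can be combined roughly as follows. This is a sketch, not a definitive recipe: the tokenizer path and the chosen lengths are hypothetical, and only the functions documented on this page are used.

```clojure
(require '[easy-onnx.tokenizer :as tok])

;; Hypothetical path and settings; adjust to your model.
(def config
  {:tokenizer-path     "models/my-model/tokenizer.json"
   :truncation?        true   ; set explicitly, per the note above
   :max-length         128
   :pad-to-max-length? true}) ; fixed-shape output for tensor inputs

;; with-open closes the tokenizer when the body exits, even on error.
(with-open [t (tok/create config)]
  (let [encoding (tok/encode t "Hello, world!")]
    (count (tok/get-ids encoding))))
```

Outside of `with-open`, the same tokenizer would be released with `(tok/close t)`.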

encode (clj)

(encode {:keys [tokenizer]} text)

Encode `text` using the underlying HuggingFaceTokenizer.
Returns the raw DJL Encoding; convert with get-ids / get-mask as needed.
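
A short sketch of converting an Encoding into plain Clojure data. The helper name `encode->map` is hypothetical and not part of the library. It assumes `t` is a started Tokenizer from `create`, and that the attention mask holds 1 for real tokens and 0 for padding, which is the usual HuggingFace convention.

```clojure
(require '[easy-onnx.tokenizer :as tok])

(defn encode->map
  "Hypothetical helper: encode `text` and return ids and mask as vectors,
  plus the number of non-padding tokens (the sum of the mask)."
  [t text]
  (let [encoding (tok/encode t text)
        ids      (tok/get-ids encoding)    ; long[]
        mask     (tok/get-mask encoding)]  ; long[]
    {:ids         (vec ids)
     :mask        (vec mask)
     :real-tokens (reduce + mask)}))
```

Keeping the raw `long[]` arrays instead of vectors avoids a copy when the values are passed straight into a tensor.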

get-ids (clj)

(get-ids encoding)

Extract the long[] of token ids from an Encoding.


get-mask (clj)

(get-mask encoding)

Extract the long[] attention mask from an Encoding.

