(close tokenizer)
Close the Tokenizer. Same as `(.close tokenizer)`.
(component config)
Build an unstarted Tokenizer for use in a Stuart Sierra Component system. See `create` for accepted config keys.
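As a rough sketch of how the unstarted Tokenizer might be wired into a system: the example below assumes the `com.stuartsierra.component` library is on the classpath and that the value returned by `component` takes part in the usual start/stop lifecycle. The `example.tokenizer` namespace, the `tokenizer-system` helper, and the file path are placeholders, not names taken from this library.

```clojure
(ns example.system
  (:require [com.stuartsierra.component :as component]
            ;; Hypothetical namespace: substitute the one that defines `component`.
            [example.tokenizer :as tok]))

(defn tokenizer-system
  "Builds an unstarted system map holding a single Tokenizer component.
  `tokenizer-system` is an illustrative helper, not part of the library."
  [tokenizer-path]
  (component/system-map
   :tokenizer (tok/component {:tokenizer-path tokenizer-path
                              :truncation?    true
                              :max-length     512})))

(comment
  ;; Start the system (and the Tokenizer inside it), use it, then stop it.
  (def system (component/start (tokenizer-system "path/to/tokenizer.json")))
  (tok/encode (:tokenizer system) "hello world")
  (component/stop system))
```

Keeping construction (`tokenizer-system`) separate from `component/start` means the config can be inspected or overridden in tests before any native tokenizer resources are allocated.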
(create config)
Build and start a Tokenizer. Use with `with-open` for one-shot use.
Required:
:tokenizer-path - Path to a HuggingFace tokenizer.json file.
Optional builder settings (passed through to DJL's HuggingFaceTokenizer):
:truncation? - boolean. Enable truncation to :max-length. Note: DJL truncates at :max-length even without this set; set explicitly to be safe across DJL versions.
:max-length - int. Maximum sequence length.
:padding? - boolean. Pad to the longest input in a batch (no-op for single-text encoding).
:pad-to-max-length? - boolean. Pad encoded sequences out to :max-length. Useful for fixed-shape tensor inputs.
:pad-to-multiple-of - int. Pad sequence length up to a multiple of this.
:add-special-tokens? - boolean. Wrap with [CLS]/[SEP] (default per model).
(encode {:keys [tokenizer]} text)
Encode `text` using the underlying HuggingFaceTokenizer. Returns the raw DJL Encoding; convert with `get-ids` / `get-mask` as needed.
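A minimal sketch of the one-shot pattern described for `create`, combined with `encode`: the namespace `example.tokenizer`, the `token-ids` helper, the file path, and the specific option values are assumptions chosen for illustration. Only the function names and config keys come from the docstrings above.

```clojure
(ns example.one-shot
  (:require
   ;; Hypothetical namespace: substitute the one that defines `create`,
   ;; `encode`, and `get-ids`.
   [example.tokenizer :as tok]))

(def config
  {:tokenizer-path     "path/to/tokenizer.json" ; placeholder path
   :truncation?        true    ; set explicitly, as the :truncation? note advises
   :max-length         128     ; illustrative value
   :pad-to-max-length? true})  ; fixed-length output for tensor inputs

(defn token-ids
  "Creates a Tokenizer, encodes `text` once, and returns the token ids
  as a vector. The Tokenizer is closed when `with-open` exits."
  [text]
  (with-open [tokenizer (tok/create config)]
    ;; `vec` copies the long[] eagerly, so nothing depends on the
    ;; Tokenizer after it has been closed.
    (vec (tok/get-ids (tok/encode tokenizer text)))))
```

Creating a Tokenizer per call is only sensible for occasional use; for repeated encoding, holding one long-lived instance (for example through `component`) avoids reloading tokenizer.json each time.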
(get-ids encoding)
Extract the long[] of token ids from an Encoding.
(get-mask encoding)
Extract the long[] attention mask from an Encoding.
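Because `get-ids` and `get-mask` both return plain `long[]` arrays, ordinary Clojure sequence functions apply to them. The sketch below uses hand-written stand-in arrays rather than a real Encoding, and assumes the common convention that a mask value of 1 marks a real token and 0 marks padding. The helper names and id values are illustrative only.

```clojure
;; Stand-in arrays; in real use these would come from
;; (get-ids encoding) and (get-mask encoding). The values are made up.
(def ids  (long-array [101 7592 2088 102 0 0]))
(def mask (long-array [1 1 1 1 0 0]))

(defn real-token-count
  "Counts the positions whose mask value is 1 (assumed: non-padding)."
  [^longs mask]
  (count (filter #(= 1 %) mask)))

(defn unpadded-ids
  "Returns, as a vector, the ids at positions where the mask is 1."
  [^longs ids ^longs mask]
  (vec (keep (fn [[id m]] (when (= 1 m) id))
             (map vector ids mask))))

(real-token-count mask)   ; => 4
(unpadded-ids ids mask)   ; => [101 7592 2088 102]
```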