
bosquet.splitter


create-tokkit-gpt-token-count-fn (clj)

(create-tokkit-gpt-token-count-fn encoding-type)

Make a fn which counts the tokens for a given string using the encoding-type. Should result in the same token count as the GPT API.
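A minimal usage sketch. The encoding keyword `:cl100k-base` is an assumption: the accepted values depend on the underlying tokenizer library (the name suggests jtokkit), so check its encoding types.

```clojure
(require '[bosquet.splitter :as splitter])

;; Assumption: :cl100k-base is an accepted encoding-type; the exact set of
;; keywords depends on the underlying tokenizer library.
(def count-gpt-tokens
  (splitter/create-tokkit-gpt-token-count-fn :cl100k-base))

(count-gpt-tokens "Hello, world!")
;; => an integer token count, matching what the GPT API would report
```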

heuristic-gpt-token-count-fn (clj)

(heuristic-gpt-token-count-fn s)

Uses a heuristic to count the tokens for a given string. Should work for most GPT-based models.
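Because this fn takes the string directly, it can also be passed as the `token-count-fn` argument of `split-max-tokens` below. A small sketch; the docstring does not specify the exact heuristic, so the comment below is only a common rule of thumb:

```clojure
(require '[bosquet.splitter :as splitter])

;; Approximate count, no tokenizer model needed. Common rules of thumb for
;; English text are ~4 characters or ~0.75 words per token.
(splitter/heuristic-gpt-token-count-fn
  "The quick brown fox jumps over the lazy dog.")
;; => a small integer, on the order of 10-12 for this sentence
```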

split-max-tokens (clj)

(split-max-tokens text max-tokens token-count-fn)
(split-max-tokens text max-tokens token-count-fn split-char)

Splits a given string `text` into several sub-strings. Each split will have maximum length while having fewer tokens than `max-tokens`. The number of tokens of a substring is obtained by calling `token-count-fn` repeatedly with the current string while growing it. Initially the text is split by `split-char`, which should nearly always be a form of whitespace; the substrings are then grown from the results of that split. Keeping `split-char` as whitespace avoids splitting words in the middle.

In the very rare situation where the text contains words longer than `max-tokens`, the function might return substrings with more tokens than `max-tokens`. In this case `split-char` could be changed to split on something other than word boundaries, which will eventually break words in the middle but guarantees that no substring has more than `max-tokens` tokens.
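A usage sketch combining the two arities; the input file name is hypothetical, and the 3-arity default `split-char` is assumed to be whitespace, per the docstring's recommendation:

```clojure
(require '[bosquet.splitter :as splitter])

(def text (slurp "long-document.txt")) ;; hypothetical input file

;; Each returned chunk should stay under 100 tokens per the given count fn.
(splitter/split-max-tokens text 100 splitter/heuristic-gpt-token-count-fn)

;; The 4-arity version makes the split character explicit, e.g. a newline
;; to keep whole lines together:
(splitter/split-max-tokens text 100 splitter/heuristic-gpt-token-count-fn "\n")
```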
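For intuition, here is a hypothetical re-implementation of the greedy growing strategy the docstring describes; this is an illustration under the assumptions above, not the library's actual code:

```clojure
(require '[clojure.string :as str])

(defn split-max-tokens-sketch
  "Illustrative sketch: split text on split-char, then greedily grow each
  chunk word by word for as long as it stays within max-tokens."
  [text max-tokens token-count-fn split-char]
  (let [words (str/split text (re-pattern (java.util.regex.Pattern/quote split-char)))
        grow  (fn [chunks word]
                (let [candidate (if (str/blank? (peek chunks))
                                  word
                                  (str (peek chunks) split-char word))]
                  (if (<= (token-count-fn candidate) max-tokens)
                    (conj (pop chunks) candidate) ;; grow the current chunk
                    (conj chunks word))))         ;; start a new chunk
        ]
    ;; A single word over the budget still becomes its own chunk, which is
    ;; exactly the rare over-max-tokens case the docstring warns about.
    (->> (reduce grow [""] words)
         (remove str/blank?))))
```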
