(create-tokkit-gpt-token-count-fn encoding-type)
Makes a fn which counts the tokens for a given string using `encoding-type`. Should result in the same token count as the GPT API.
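A minimal usage sketch; the `:cl100k-base` keyword is an assumed `encoding-type` value for illustration and is not taken from this doc:

```clojure
;; Build a counting fn once, then reuse it.
;; :cl100k-base is an assumed encoding-type value; check the
;; library for the exact accepted values.
(def count-tokens
  (create-tokkit-gpt-token-count-fn :cl100k-base))

(count-tokens "Hello, world!")
;; => an integer token count, matching what the GPT API reports
```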
(heuristic-gpt-token-count-fn s)
Uses a heuristic to count the tokens for a given string. Should work for most GPT-based models.
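A quick sketch of the heuristic counter; the example string is arbitrary and the returned value is only an approximation:

```clojure
(heuristic-gpt-token-count-fn
 "The quick brown fox jumps over the lazy dog.")
;; => an approximate token count; cheaper than running a real
;;    tokenizer, but not guaranteed to match the GPT API exactly
```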
(split-max-tokens text max-tokens token-count-fn)
(split-max-tokens text max-tokens token-count-fn split-char)
Splits a given string `text` into several sub-strings. Each split is made as long as possible while having fewer tokens than `max-tokens`. The number of tokens of a substring is obtained by calling `token-count-fn` with the current string repeatedly while growing the string. Initially the text is split by `split-char`, which should nearly always be a form of whitespace. Then the substrings are grown from the result of that splitting. Keeping `split-char` as whitespace avoids words getting split in the middle. In the very rare situation where the text contains words longer than `max-tokens`, the function might return substrings which have more tokens than `max-tokens`. In this case `split-char` could be changed to split on something other than word boundaries, which would eventually break words in the middle, but would guarantee that no substring has more than `max-tokens` tokens.
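A hedged sketch of chunking a long document; `long-text` is a placeholder input, and the 3-arity is used on the assumption that `split-char` then falls back to a whitespace default:

```clojure
(def long-text
  (slurp "document.txt")) ; placeholder input

;; Split into chunks of fewer than 100 tokens each, using the
;; heuristic counter (a tokkit-backed counting fn would be used
;; the same way).
(def chunks
  (split-max-tokens long-text 100 heuristic-gpt-token-count-fn))

;; Barring single words longer than 100 tokens, every chunk
;; stays under the limit:
(every? #(< (heuristic-gpt-token-count-fn %) 100) chunks)
;; => true
```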