Text splitter by individual characters. The text is turned into an array of characters.
English sentence splitting model https://opennlp.apache.org/models.html
Text splitter by sentences. It uses the OpenNLP sentence splitter to partition the text into a vector of sentences.
(split-max-tokens text max-tokens model)
(split-max-tokens text max-tokens model split-char)
Splits the string `text` into several substrings. Each substring is as long as possible while having fewer tokens than `max-tokens`. The number of tokens in a substring is obtained by calling `token-count-fn` repeatedly on the current string as it grows. Initially the text is split by `split-char`, which should nearly always be a form of whitespace. The substrings are then grown from the resulting pieces. Keeping `split-char` as whitespace prevents words from being split in the middle. In the very rare case where the text contains words longer than `max-tokens`, the function might return substrings with more tokens than `max-tokens`. In that case `split-char` can be changed to split on something other than word boundaries, which will eventually break words in the middle but guarantees that no substring has more than `max-tokens` tokens.
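The greedy growth described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `split-by-token-budget` and `word-count` are hypothetical names, the token counter is passed in directly as a function, and the budget is treated as strict ("fewer than `max-tokens`"), which is an assumption based on the wording of the docstring.

```clojure
(require '[clojure.string :as str])

(defn split-by-token-budget
  "Hypothetical sketch: splits `text` on `split-char`, then grows each
  substring piece by piece while it stays under `max-tokens`."
  [text max-tokens token-count-fn split-char]
  (let [sep    (str split-char)
        pieces (str/split text (re-pattern (java.util.regex.Pattern/quote sep)))]
    (loop [pieces pieces, current nil, result []]
      (if (empty? pieces)
        ;; Flush whatever is still being grown.
        (cond-> result current (conj current))
        (let [piece     (first pieces)
              candidate (if current (str current sep piece) piece)]
          (cond
            ;; Candidate is still under budget: keep growing.
            (< (token-count-fn candidate) max-tokens)
            (recur (rest pieces) candidate result)

            ;; Over budget with something accumulated: emit it and
            ;; retry the same piece as the start of a new substring.
            current
            (recur pieces nil (conj result current))

            ;; A single piece is over budget on its own: emit it as is.
            :else
            (recur (rest pieces) nil (conj result piece))))))))

;; Assumption for the example: one token per whitespace-separated word.
(defn word-count [s] (count (re-seq #"\S+" s)))

(split-by-token-budget "one two three four five" 3 word-count " ")
;; => ["one two" "three four" "five"]

;; Counting characters instead shows the over-budget case: "abcdef"
;; cannot be reduced by splitting on whitespace, so it is returned whole.
(split-by-token-budget "ab abcdef c" 3 count " ")
;; => ["ab" "abcdef" "c"]
```

The second call illustrates why a different `split-char` may be needed when single words exceed the budget.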
(text->characters text)
(text->sentences text)
Split `text` into sentences using the OpenNLP sentence splitting model.
(text-chunker {:keys [splitter] :as opts} text)
Chunk `text` into `chunk-size` blocks using the specified `splitter`. Optionally, `overlap` specifies how many text units consecutive chunks may share (defaults to 0).
TODO: `overlap` is currently failing, see the unit test.
Supported text splitters:
- `sentence-splitter`
- `character-splitter`
- TODO `token-splitter`
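To illustrate what chunking with `chunk-size` and `overlap` means for a sequence of text units, here is a minimal sketch built on `partition-all` from `clojure.core`. The helper `chunk-units` is hypothetical and is not the library's `text-splitter`; it only borrows the same option names.

```clojure
(require '[clojure.string :as str])

(defn chunk-units
  "Hypothetical helper: groups `units` into chunks of at most `chunk-size`
  units, with consecutive chunks sharing `overlap` units."
  [{:keys [chunk-size overlap] :or {overlap 0}} units]
  (let [step (- chunk-size overlap)]
    (assert (pos? step) "overlap must be smaller than chunk-size")
    (mapv vec (partition-all chunk-size step units))))

;; Character units: split into characters, chunk, then join each chunk back.
(def letters (vec "abcdefg"))

(mapv str/join (chunk-units {:chunk-size 3} letters))
;; => ["abc" "def" "g"]

(mapv str/join (chunk-units {:chunk-size 3 :overlap 1} letters))
;; => ["abc" "cde" "efg" "g"]
```

Note that with a non-zero overlap, `partition-all` can produce a trailing chunk whose units are already covered by the previous one (the final `"g"` above), so a real implementation may want to drop or merge it.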
(text-splitter {:keys [chunk-size overlap] :or {overlap 0}} text-units)
(text<-characters chars)
(text<-sentences sentences)