
bosquet.nlp.splitter


character-splitter (clj)

Text splitter by individual characters. Text will be turned into an array of characters.

en-sentence-detector (clj)

English sentence splitting model.

https://opennlp.apache.org/models.html

sentence-splitter (clj)

Text splitter by sentences. It will use the OpenNLP sentence splitter to
partition the text into a vector of sentences.
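Sentence detection with OpenNLP can be sketched via Java interop as below. This is an illustrative sketch, not the library's actual code; it assumes the `en-sent` model file has been downloaded from the OpenNLP models page linked above, and the path is hypothetical.

```clojure
(import '[opennlp.tools.sentdetect SentenceDetectorME SentenceModel]
        '[java.io FileInputStream])

(defn detect-sentences
  "Split `text` into a vector of sentences using the OpenNLP model
  stored at `model-path` (e.g. a downloaded en-sent model file)."
  [model-path text]
  (with-open [in (FileInputStream. model-path)]
    (let [detector (SentenceDetectorME. (SentenceModel. in))]
      (vec (.sentDetect detector text)))))
```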

split-max-tokens (clj)

(split-max-tokens text max-tokens model)
(split-max-tokens text max-tokens model split-char)

Splits a given string `text` into several sub-strings. Each split will have
maximum length while having fewer tokens than `max-tokens`. The number of
tokens in a substring is obtained by calling `token-count-fn` with the current
string repeatedly while growing the string. Initially the text is split by
`split-char`, which should nearly always be a form of whitespace. Then the
substrings are grown from the result of that splitting. Keeping `split-char`
as whitespace avoids splitting words in the middle.

In very rare situations where the text has words longer than `max-tokens`, the
function might return substrings with more tokens than `max-tokens`. In this
case `split-char` could be changed to split on something other than word
boundaries, which will eventually break words in the middle but guarantees
that substrings do not have more than `max-tokens` tokens.
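The greedy grow-until-over-budget strategy described above can be sketched in plain Clojure. This is an illustration, not the library's implementation: `token-count` stands in for the real tokenizer-backed count (here it simply counts whitespace-separated words), and `split-max-tokens*` is a hypothetical name.

```clojure
(require '[clojure.string :as str])

;; Stand-in for a real tokenizer: counts whitespace-separated words.
(defn token-count [s]
  (count (str/split s #"\s+")))

(defn split-max-tokens*
  "Greedily grow each chunk by one `split-char`-separated unit at a time,
  starting a new chunk whenever the token budget would be exceeded."
  [text max-tokens split-char]
  (reduce (fn [chunks unit]
            (let [head      (peek chunks)
                  candidate (if (str/blank? head) unit (str head split-char unit))]
              (if (<= (token-count candidate) max-tokens)
                (conj (pop chunks) candidate)   ; unit still fits: extend chunk
                (conj chunks unit))))           ; over budget: start new chunk
          [""]
          (str/split text (re-pattern split-char))))
```

For example, `(split-max-tokens* "a b c d" 2 " ")` yields `["a b" "c d"]`. Note that a single unit longer than the budget still becomes its own over-budget chunk, which is exactly the rare case the docstring warns about.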

text->characters (clj)

(text->characters text)

text->sentences (clj)

(text->sentences text)

Split `text` into sentences using the OpenNLP sentence splitting model.

text-chunker (clj)

(text-chunker {:keys [splitter] :as opts} text)

Chunk `text` into `chunk-size` blocks using the specified `splitter`.
Optionally, `overlap` can specify by how many text units the chunks may
overlap (defaults to 0).

TODO `overlap` is currently failing, see unit test

Supported text splitters:
- `sentence-splitter`
- `character-splitter`
- TODO `token-splitter`
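The intended `chunk-size`/`overlap` semantics over already-split text units can be sketched with `partition` using a step of `(- chunk-size overlap)`. The names below are illustrative, not the library's actual implementation.

```clojure
(defn chunk-units
  "Group `units` into blocks of `chunk-size`, each sharing `overlap`
  trailing units with the next block. A trailing partial block is kept."
  [chunk-size overlap units]
  (map vec (partition chunk-size (- chunk-size overlap) nil units)))
```

For example, `(chunk-units 3 1 ["s1" "s2" "s3" "s4" "s5"])` yields `(["s1" "s2" "s3"] ["s3" "s4" "s5"] ["s5"])`: each chunk repeats the last unit of the previous one.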

text-splitter (clj)

(text-splitter {:keys [chunk-size overlap] :or {overlap 0}} text-units)

text<-characters (clj)

(text<-characters chars)

text<-sentences (clj)

(text<-sentences sentences)
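Judging by the naming convention, the `text<-` vars are the inverses of the `text->` splitters, rejoining split units into a single string. A plausible pure-Clojure sketch (hypothetical names, not the library's actual code):

```clojure
(require '[clojure.string :as str])

;; Inverse of splitting into characters: concatenate them back.
(defn chars->text [chars]
  (apply str chars))

;; Inverse of splitting into sentences: rejoin with a single space.
(defn sentences->text [sentences]
  (str/join " " sentences))
```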
