bosquet.nlp.splitter


character (clj)

Text splitter by individual characters.


chunk-size (clj)

Size of the chunks into which the text gets split.


chunk-text (clj)

(chunk-text {:splitter/keys [unit] :as opts} text)

Chunk text into chunk-size blocks using the specified splitter. Optionally, overlap sets how many text units consecutive chunks can share (defaults to 0).

Supported text splitters:

  • sentence-splitter
  • character-splitter
  • token-splitter
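
The description above can be pictured as a three-step pipeline: split the text into units, group the units into chunk-size blocks, and join each block back into a string. The following is a minimal sketch of that idea for sentence units, not Bosquet's implementation. The names `naive-sentences` and `chunk-by-sentences` are hypothetical, the regex is only a stand-in for a real sentence detector, and overlap is left at its default of 0.

```clojure
(require '[clojure.string :as str])

(defn naive-sentences
  "Hypothetical stand-in for a sentence detector: splits on whitespace
  that follows '.', '!' or '?'."
  [text]
  (str/split text #"(?<=[.!?])\s+"))

(defn chunk-by-sentences
  "Hypothetical helper: groups sentences into blocks of `chunk-size`
  (no overlap) and joins each block back into one string."
  [chunk-size text]
  (->> (naive-sentences text)
       (partition-all chunk-size)          ; last block may be shorter
       (mapv #(str/join " " %))))

(chunk-by-sentences 2 "One. Two! Three? Four. Five.")
;; => ["One. Two!" "Three? Four." "Five."]
```

Because the chunk size is counted in units rather than characters, the same chunk-size value yields very different chunk lengths depending on whether the unit is a sentence, a token, or a character.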

model (clj)

Model used for tokenization.


overlap (clj)

Number of units by which chunks can overlap.


sentence (clj)

Text splitter by sentences. It uses the OpenNLP sentence splitter to partition the text.


split-handlers (clj)

Split handlers turn text into the specified text units via an encode function. A decode function turns those units back into a single text string.
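
To make the encode/decode pairing concrete, here is a small sketch of what such a handler table could look like. It is an illustration only: `example-handlers` and `round-trip` are hypothetical names, the map shape is an assumption rather than the actual value of `split-handlers`, and the sentence handler uses a naive regex instead of OpenNLP.

```clojure
(require '[clojure.string :as str])

;; Hypothetical handler table: one encode/decode pair per text unit.
(def example-handlers
  {:character {:encode (fn [text] (mapv str text))           ; "abc" -> ["a" "b" "c"]
               :decode (fn [units] (str/join units))}        ; joined with no separator
   :sentence  {:encode (fn [text] (str/split text #"(?<=[.!?])\s+"))
               :decode (fn [units] (str/join " " units))}})  ; joined with a space

(defn round-trip
  "Encodes `text` into units of the given kind and decodes it back."
  [unit text]
  (let [{:keys [encode decode]} (example-handlers unit)]
    (decode (encode text))))

(round-trip :character "abc")            ;; => "abc"
(round-trip :sentence "Hi there. Bye!")  ;; => "Hi there. Bye!"
```

The point of keeping decode next to encode is that the way units are rejoined depends on the unit: characters concatenate directly, while sentences need a separator.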


split-unit (clj)

Lexical units in which the text gets split: character, token, sentence.


text-splitter (clj)

(text-splitter {:splitter/keys [chunk-size overlap] :or {overlap 0}} text-units)
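
This function has no docstring. Judging only from its argument list, it appears to take already-split text units and group them into chunks of chunk-size units, with consecutive chunks sharing overlap units. The sketch below shows one way such windowing could work. `window-units` is a hypothetical name, and the handling of the final short chunk is an assumption, not confirmed behaviour of `text-splitter`.

```clojure
(defn window-units
  "Hypothetical helper, not Bosquet's implementation. Groups `text-units`
  into chunks of `chunk-size`; each chunk after the first starts with the
  last `overlap` units of the previous chunk."
  [{:splitter/keys [chunk-size overlap] :or {overlap 0}} text-units]
  (let [units (vec text-units)
        total (count units)
        step  (- chunk-size overlap)]   ; distance between chunk starts
    (assert (pos? step) "overlap must be smaller than chunk-size")
    (loop [start 0, chunks []]
      (if (>= start total)
        chunks
        (let [end    (min total (+ start chunk-size))
              chunks (conj chunks (subvec units start end))]
          (if (= end total)
            chunks                      ; stop once a chunk reaches the end
            (recur (+ start step) chunks)))))))

(window-units {:splitter/chunk-size 4 :splitter/overlap 1} (range 1 11))
;; => [[1 2 3 4] [4 5 6 7] [7 8 9 10]]
```

With chunk-size 4 and overlap 1, each chunk starts 3 units after the previous one. Stopping as soon as a chunk reaches the end avoids emitting a trailing chunk that contains only overlapped units.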

token (clj)

Text splitter by tokens. Tokenization is done based on the provided model.

