bosquet.nlp.splitter

Clojure only.

character
chunk-size
chunk-text
model
overlap
sentence
split-handlers
split-unit
text-splitter
token

character^clj

Text splitter by individual characters.

chunk-size^clj

Size of the chunks into which the text is split.

chunk-text^clj

(chunk-text {:splitter/keys [unit] :as opts} text)

Chunk `text` into `chunk-size` blocks using the specified `splitter`. Optionally, `overlap` specifies how many text units consecutive chunks may share (defaults to 0).

Supported text splitters:
- `sentence-splitter`
- `character-splitter`
- `token-splitter`
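
The flow described above (encode text into units, cut the units into windows, decode each window) can be sketched in plain Clojure. This is an illustration only, not the bosquet implementation: `chunk-text-sketch` and `character-handler` are hypothetical names, the option keys are assumed from the `:splitter/` destructuring in this namespace's signatures, and the rule that consecutive chunks share exactly `overlap` units is an assumption about the semantics.

```clojure
;; Hypothetical sketch; it does not call bosquet.
(defn chunk-text-sketch
  "Encodes `text` into units, windows them, and decodes each window."
  [{:keys [encode decode]}                                ; hypothetical handler map
   {:splitter/keys [chunk-size overlap] :or {overlap 0}}  ; assumed option keys
   text]
  (let [units (vec (encode text))
        n     (count units)
        step  (- chunk-size overlap)]                     ; distance between chunk starts
    (assert (pos? step) "overlap must be smaller than chunk-size")
    (for [start (range 0 n step)
          ;; skip a trailing window that would only repeat overlapping units
          :when (or (zero? start) (< (+ start overlap) n))]
      (decode (subvec units start (min n (+ start chunk-size)))))))

;; Character units: one single-character string per unit.
(def character-handler
  {:encode #(map str %)
   :decode #(apply str %)})

(chunk-text-sketch character-handler
                   {:splitter/chunk-size 4 :splitter/overlap 1}
                   "abcdefghij")
;; => ("abcd" "defg" "ghij")
```

With `overlap` left out, the windows simply follow one another and the last chunk holds whatever remains.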

model^clj

Model used for tokenization.

overlap^clj

Number of units by which chunks can overlap.

sentence^clj

Text splitter by sentences. It uses the OpenNLP sentence splitter to partition the text.

split-handlers^clj

Split handlers turn text into the specified text units via an `encode` function. A `decode` function turns those units back into a single text string.
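
As a rough illustration of the encode/decode pairing, the sketch below keeps one handler per unit in a map. Everything in it is hypothetical: the `:character` and `:word` keys, the `handlers-sketch` name, and the whitespace-based word split are stand-ins chosen to keep the example free of dependencies, and they are not the values or tokenizers bosquet uses.

```clojure
(require '[clojure.string :as str])

;; Hypothetical handlers; the keys and the regex are illustrative assumptions.
(def handlers-sketch
  {:character {:encode (fn [text] (mapv str text))          ; one string per character
               :decode (fn [units] (str/join units))}
   ;; each word keeps its surrounding whitespace, so joining restores the text
   :word      {:encode (fn [text] (vec (re-seq #"\s*\S+\s*" text)))
               :decode (fn [units] (str/join units))}})

(defn round-trip
  "Encodes `text` into `unit`s and decodes the units back into one string."
  [unit text]
  (let [{:keys [encode decode]} (handlers-sketch unit)]
    (decode (encode text))))

((get-in handlers-sketch [:word :encode]) "Split me up")
;; => ["Split " "me " "up"]

(round-trip :word "Split me up")
;; => "Split me up"
```

Keeping `decode` next to `encode` means a chunk of units can always be turned back into text, whichever unit was chosen.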

split-unit^clj

Lexical units in which the text gets split: character, token, sentence.

text-splitter^clj

(text-splitter {:splitter/keys [chunk-size overlap] :or {overlap 0}} text-units)

token^clj

Text splitter by tokens. Tokenization is done based on the provided model.
