
datalevin.search-utils

Some useful utility functions that can be passed as options to the search engine to customize search.

create-analyzer

(create-analyzer {:keys [tokenizer token-filters]
                  :or {tokenizer default-tokenizer token-filters []}})

Creates an analyzer fn ready for use in search.

`opts` have the following keys:

* `:tokenizer` is a tokenizing fn that takes a string and returns a seq of
  [term, position, offset], where term is a word, position is the sequence
  number of the term, and offset is the character offset of this term,
  e.g. `create-regexp-tokenizer` produces such a fn.

* `:token-filters` is an ordered list of token filters. A token filter
  receives a [term, position, offset] token and returns a list of tokens
  to replace it with.
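A minimal sketch of building and trying an analyzer at the REPL, composing the bundled filters documented below. Since an analyzer is just a fn from a string to a seq of tokens, it can be called directly; the tokenizer regex here and the shown output are illustrative assumptions.

```clojure
(require '[datalevin.search-utils :as su])

;; Lower-case each token, then drop English stop words. Filters run in
;; the order given, over every token the tokenizer emits.
(def my-analyzer
  (su/create-analyzer
    {:tokenizer     (su/create-regexp-tokenizer #"[\s,.!?]+")
     :token-filters [su/lower-case-token-filter
                     su/en-stop-words-token-filter]}))

;; An analyzer is a fn from string to tokens:
(my-analyzer "The quick brown Fox!")
;; => e.g. (["quick" 1 4] ["brown" 2 10] ["fox" 3 16])
```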

create-max-length-token-filter

(create-max-length-token-filter max-length)

Removes tokens that are strictly longer than `max-length`.
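For instance, assuming the token-filter contract described under `create-analyzer` (a fn from one token to a list of replacement tokens):

```clojure
(require '[datalevin.search-utils :as su])

(def drop-long (su/create-max-length-token-filter 8))

(drop-long ["short" 0 0])           ;; kept,    e.g. => (["short" 0 0])
(drop-long ["unquestionably" 1 6])  ;; removed, e.g. => ()
```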

create-min-length-token-filter

(create-min-length-token-filter min-length)

Removes tokens that are strictly shorter than `min-length`.
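Combined with the max-length filter above, this can keep tokens within a length band, for example:

```clojure
(require '[datalevin.search-utils :as su])

;; Keep only tokens between 3 and 20 characters long.
(def banded-analyzer
  (su/create-analyzer
    {:token-filters [(su/create-min-length-token-filter 3)
                     (su/create-max-length-token-filter 20)]}))
```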

create-ngram-token-filter

(create-ngram-token-filter gram-size)
(create-ngram-token-filter min-gram-size max-gram-size)

Produces character ngrams between min and max size from the token and returns everything as tokens. This is useful for producing efficient fuzzy search.
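A sketch of the per-token expansion, again under the filter contract described above; the exact positions and offsets of the produced grams are implementation details.

```clojure
(require '[datalevin.search-utils :as su])

(def grams (su/create-ngram-token-filter 2 3))

;; One token in, its 2-grams and 3-grams out.
(grams ["fox" 0 0])
;; => tokens for "fo", "ox", "fox"
```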

create-regexp-tokenizer

(create-regexp-tokenizer pat)

Creates a tokenizer that splits text on the given regular expression pattern `pat`.
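For example, splitting on whitespace; the tokenizer returns [term, position, offset] triples as described under `create-analyzer` (shown output is illustrative):

```clojure
(require '[datalevin.search-utils :as su])

(def whitespace-tok (su/create-regexp-tokenizer #"\s+"))

(whitespace-tok "to be or not")
;; => e.g. (["to" 0 0] ["be" 1 3] ["or" 2 6] ["not" 3 9])
```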

create-stemming-token-filter

(create-stemming-token-filter language)

Creates a token filter that replaces tokens with their stems.

The stemming algorithm is Snowball https://snowballstem.org/

`language` is a string, its value can be one of the following:

arabic, armenian, basque, catalan, danish, dutch, english, french, finnish, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, portuguese, romanian, russian, serbian, swedish, tamil, turkish, spanish, yiddish, and porter
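For example, with the English Snowball stemmer (a sketch under the same filter contract; the Snowball English algorithm maps "running" to "run"):

```clojure
(require '[datalevin.search-utils :as su])

(def stem-en (su/create-stemming-token-filter "english"))

(stem-en ["running" 0 0])
;; => e.g. (["run" 0 0])
```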

create-stop-words-token-filter

(create-stop-words-token-filter stop-word-pred)

Takes a stop words predicate that returns `true` when the given token is a stop word.
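A sketch with a custom stop list; I am assuming here that the predicate is called with the token's term string:

```clojure
(require '[datalevin.search-utils :as su])

;; Assumption: the predicate receives the term string of each token.
(def drop-fillers
  (su/create-stop-words-token-filter
    #(contains? #{"the" "a" "an"} %)))
```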

default-tokenizer

The default tokenizing fn, used by `create-analyzer` when no `:tokenizer` is supplied.

en-stop-words-token-filter

This token filter removes "empty" tokens (for the English language).
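For instance, assuming common English stop words such as "the" are on its list:

```clojure
(require '[datalevin.search-utils :as su])

(su/en-stop-words-token-filter ["the" 0 0])  ;; => e.g. ()
(su/en-stop-words-token-filter ["fox" 1 4])  ;; => e.g. (["fox" 1 4])
```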

lower-case-token-filter

This token filter converts tokens to lower case.
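For instance (a sketch under the same filter contract):

```clojure
(require '[datalevin.search-utils :as su])

(su/lower-case-token-filter ["Hello" 0 0])
;; => e.g. (["hello" 0 0])
```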

prefix-token-filter

Produces every possible prefix of a token and replaces the token with them.

For example: vault -> v, va, vau, vaul, vault

This is useful for producing efficient autocomplete engines, provided this filter is NOT applied at query time.
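A sketch of the index-time/query-time split for autocomplete; per the caveat above, the query-time analyzer must omit this filter:

```clojure
(require '[datalevin.search-utils :as su])

;; Index-time analyzer: expand each token into all of its prefixes.
(def index-analyzer
  (su/create-analyzer
    {:token-filters [su/lower-case-token-filter
                     su/prefix-token-filter]}))

;; Query-time analyzer: same pipeline WITHOUT the prefix expansion.
(def query-analyzer
  (su/create-analyzer
    {:token-filters [su/lower-case-token-filter]}))
```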

unaccent-token-filter

This token filter removes accents and diacritics from tokens.
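For instance (a sketch under the same filter contract):

```clojure
(require '[datalevin.search-utils :as su])

(su/unaccent-token-filter ["café" 0 0])
;; => e.g. (["cafe" 0 0])
```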
