
datalevin.search-utils

Some useful utility functions that can be passed as options to the search engine to customize search.

create-analyzer

(create-analyzer {:keys [tokenizer token-filters]
                  :or {tokenizer default-tokenizer token-filters []}})

Creates an analyzer fn ready for use in search.

`opts` have the following keys:

* `:tokenizer` is a tokenizing fn that takes a string and returns a seq of
  [term, position, offset], where term is a word, position is the sequence
  number of the term, and offset is the character offset of this term,
  e.g. `create-regexp-tokenizer` produces such a fn.

* `:token-filters` is an ordered list of token filters. A token filter
  receives a [term, position, offset] token and returns a list of tokens
  to replace it with.
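A minimal sketch of building and trying an analyzer at the REPL, composing the bundled filters documented below. Since an analyzer is just a fn from a string to a seq of tokens, it can be called directly; the tokenizer regex here and the shown output are illustrative assumptions.

```clojure
(require '[datalevin.search-utils :as su])

;; Lower-case each token, then drop English stop words. Filters run in
;; the order given, over every token the tokenizer emits.
(def my-analyzer
  (su/create-analyzer
    {:tokenizer     (su/create-regexp-tokenizer #"[\s,.!?]+")
     :token-filters [su/lower-case-token-filter
                     su/en-stop-words-token-filter]}))

;; An analyzer is a fn from string to tokens:
(my-analyzer "The quick brown Fox!")
;; => e.g. (["quick" 1 4] ["brown" 2 10] ["fox" 3 16])
```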

create-max-length-token-filter

(create-max-length-token-filter max-length)

Removes tokens that are strictly longer than `max-length`.
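For instance, assuming the token-filter contract described under `create-analyzer` (a fn from one token to a list of replacement tokens):

```clojure
(require '[datalevin.search-utils :as su])

(def drop-long (su/create-max-length-token-filter 8))

(drop-long ["short" 0 0])           ;; kept,    e.g. => (["short" 0 0])
(drop-long ["unquestionably" 1 6])  ;; removed, e.g. => ()
```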

create-min-length-token-filter

(create-min-length-token-filter min-length)

Removes tokens that are strictly shorter than `min-length`.
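Combined with the max-length filter above, this can keep tokens within a length band, for example:

```clojure
(require '[datalevin.search-utils :as su])

;; Keep only tokens between 3 and 20 characters long.
(def banded-analyzer
  (su/create-analyzer
    {:token-filters [(su/create-min-length-token-filter 3)
                     (su/create-max-length-token-filter 20)]}))
```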

create-ngram-token-filter

(create-ngram-token-filter gram-size)
(create-ngram-token-filter min-gram-size max-gram-size)

Produces character ngrams between min and max size from the token and returns everything as tokens. This is useful for producing efficient fuzzy search.
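A sketch of the per-token expansion, again under the filter contract described above; the exact positions and offsets of the produced grams are implementation details.

```clojure
(require '[datalevin.search-utils :as su])

(def grams (su/create-ngram-token-filter 2 3))

;; One token in, its 2-grams and 3-grams out.
(grams ["fox" 0 0])
;; => tokens for "fo", "ox", "fox"
```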

create-regexp-tokenizer

(create-regexp-tokenizer pat)

Creates a tokenizer that splits text on the given regular expression pattern `pat`.
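For example, splitting on whitespace; the tokenizer returns [term, position, offset] triples as described under `create-analyzer` (shown output is illustrative):

```clojure
(require '[datalevin.search-utils :as su])

(def whitespace-tok (su/create-regexp-tokenizer #"\s+"))

(whitespace-tok "to be or not")
;; => e.g. (["to" 0 0] ["be" 1 3] ["or" 2 6] ["not" 3 9])
```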

create-stemming-token-filter

(create-stemming-token-filter language)

Creates a token filter that replaces tokens with their stems.

The stemming algorithm is Snowball https://snowballstem.org/

`language` is a string, its value can be one of the following:

arabic, armenian, basque, catalan, danish, dutch, english, french, finnish, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, portuguese, romanian, russian, serbian, swedish, tamil, turkish, spanish, yiddish, and porter
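For example, with the English Snowball stemmer (a sketch under the same filter contract; the Snowball English algorithm maps "running" to "run"):

```clojure
(require '[datalevin.search-utils :as su])

(def stem-en (su/create-stemming-token-filter "english"))

(stem-en ["running" 0 0])
;; => e.g. (["run" 0 0])
```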

create-stop-words-token-filter

(create-stop-words-token-filter stop-word-pred)

Takes a stop words predicate that returns `true` when the given token is a stop word.
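A sketch with a custom stop list; I am assuming here that the predicate is called with the token's term string:

```clojure
(require '[datalevin.search-utils :as su])

;; Assumption: the predicate receives the term string of each token.
(def drop-fillers
  (su/create-stop-words-token-filter
    #(contains? #{"the" "a" "an"} %)))
```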

default-tokenizer

The default tokenizing fn, used by `create-analyzer` when no `:tokenizer` is supplied.

en-stop-words-token-filter

This token filter removes "empty" tokens (for the English language).
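For instance, assuming common English stop words such as "the" are on its list:

```clojure
(require '[datalevin.search-utils :as su])

(su/en-stop-words-token-filter ["the" 0 0])  ;; => e.g. ()
(su/en-stop-words-token-filter ["fox" 1 4])  ;; => e.g. (["fox" 1 4])
```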

lower-case-token-filter

This token filter converts tokens to lower case.
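For instance (a sketch under the same filter contract):

```clojure
(require '[datalevin.search-utils :as su])

(su/lower-case-token-filter ["Hello" 0 0])
;; => e.g. (["hello" 0 0])
```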

prefix-token-filter

Produces every possible prefix of a token and replaces the token with them.

For example: vault -> v, va, vau, vaul, vault

This is useful for producing efficient autocomplete engines, provided this filter is NOT applied at query time.
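A sketch of the index-time/query-time split for autocomplete; per the caveat above, the query-time analyzer must omit this filter:

```clojure
(require '[datalevin.search-utils :as su])

;; Index-time analyzer: expand each token into all of its prefixes.
(def index-analyzer
  (su/create-analyzer
    {:token-filters [su/lower-case-token-filter
                     su/prefix-token-filter]}))

;; Query-time analyzer: same pipeline WITHOUT the prefix expansion.
(def query-analyzer
  (su/create-analyzer
    {:token-filters [su/lower-case-token-filter]}))
```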

unaccent-token-filter

This token filter removes accents and diacritics from tokens.
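For instance (a sketch under the same filter contract):

```clojure
(require '[datalevin.search-utils :as su])

(su/unaccent-token-filter ["café" 0 0])
;; => e.g. (["cafe" 0 0])
```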
