Some useful utility functions that can be passed as options to the search engine to customize search.
(create-analyzer opts)
Creates an analyzer fn ready for use in search.

`opts` has the following keys:

* `:tokenizer` is a tokenizing fn that takes a string and returns a seq of `[term, position, offset]`, where term is a word, position is the sequence number of the term, and offset is the character offset of this term. e.g. `create-regexp-tokenizer` produces such a fn.
* `:token-filters` is an ordered list of token filters. A token filter receives a `[term, position, offset]` and returns a transformed list of tokens to replace it with.
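As a sketch of how these options fit together (assuming this namespace is required under the alias `su`; the regular expression and the filter choices are illustrative only):

```clojure
;; A minimal sketch, assuming this namespace is aliased as `su`.
;; The tokenizer pattern and the filters chosen here are illustrative.
(def my-analyzer
  (su/create-analyzer
    {:tokenizer     (su/create-regexp-tokenizer #"[\s,.;:!?]+")
     :token-filters [(su/create-stemming-token-filter "english")
                     (su/create-min-length-token-filter 2)]}))

;; `my-analyzer` is now a fn from a string to a seq of [term position offset]
;; tokens, ready to be passed as the analyzer option of the search engine.
```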
(create-max-length-token-filter max-length)
Filters tokens that are strictly longer than `max-length`.
(create-min-length-token-filter min-length)
Filters tokens that are strictly shorter than `min-length`.
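As a sketch (alias `su` as above; the bounds are arbitrary examples), the two length filters are typically combined in an analyzer's `:token-filters`:

```clojure
;; Keep only terms that are at least 2 and at most 30 characters long
;; (both bounds are arbitrary examples).
(def length-filters
  [(su/create-min-length-token-filter 2)
   (su/create-max-length-token-filter 30)])
```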
(create-ngram-token-filter gram-size)
(create-ngram-token-filter min-gram-size max-gram-size)
Produces character ngrams between min and max size from the token and returns everything as tokens. This is useful for producing efficient fuzzy search.
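For example (a sketch, alias `su` as above), a trigram filter would expand a token like "vault" into grams such as "vau", "aul", and "ult", so a query with a small typo can still overlap with the indexed grams:

```clojure
;; A sketch: fixed-size trigrams, or a range of gram sizes.
(def trigram-filter     (su/create-ngram-token-filter 3))
(def gram-2-to-4-filter (su/create-ngram-token-filter 2 4))
```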
(create-regexp-tokenizer pat)
Creates a tokenizer that splits text on the given regular expression `pat`.
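For example (a sketch, alias `su` as above), splitting on runs of whitespace and common punctuation:

```clojure
;; A sketch: split text on any run of whitespace or common punctuation.
(def my-tokenizer
  (su/create-regexp-tokenizer #"[\s,.;:!?]+"))
```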
(create-stemming-token-filter language)
Creates a token filter that replaces tokens with their stems. The stemming algorithm is Snowball, https://snowballstem.org/

`language` is a string; its value can be one of the following: arabic, armenian, basque, catalan, danish, dutch, english, french, finnish, german, greek, hindi, hungarian, indonesian, irish, italian, lithuanian, nepali, norwegian, portuguese, romanian, russian, serbian, swedish, tamil, turkish, spanish, yiddish, and porter.
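For example (a sketch, alias `su` as above):

```clojure
;; A sketch: stem tokens using the Snowball English stemmer.
(def english-stemmer (su/create-stemming-token-filter "english"))
```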
(create-stop-words-token-filter stop-word-pred)
Takes a stop words predicate that returns `true` when the given token is a stop word.
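As a sketch (alias `su` as above, and assuming the predicate is called on the term string, so that a Clojure set of strings can serve as the predicate):

```clojure
;; A sketch: a small custom stop-word list. A Clojure set is itself a
;; predicate, returning a truthy value for its members (assumption: the
;; predicate receives the term string).
(def my-stop-words-filter
  (su/create-stop-words-token-filter #{"the" "a" "an" "of" "to"}))
```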
This token filter removes "empty" tokens (for the English language).
This token filter converts tokens to lower case.
Produces a series of every possible prefix in a token and replaces it with them. For example: vault -> v, va, vau, vaul, vault.

Takes a vector `[word position start-offset]`.

This is useful for producing efficient autocomplete engines, provided this filter is NOT applied at query time.
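As a sketch of an autocomplete setup (alias `su` as above; `prefix-token-filter` is a hypothetical name for the filter described here): apply the prefix filter in the analyzer used for indexing, and use an analyzer without it at query time.

```clojure
;; A sketch; `su/prefix-token-filter` is a hypothetical name for the
;; prefix filter described above.
(def index-analyzer
  (su/create-analyzer
    {:tokenizer     (su/create-regexp-tokenizer #"\s+")
     :token-filters [su/prefix-token-filter]}))

;; The query-time analyzer deliberately omits the prefix filter.
(def query-analyzer
  (su/create-analyzer
    {:tokenizer     (su/create-regexp-tokenizer #"\s+")
     :token-filters []}))
```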
This token filter removes accents and diacritics from tokens.