(->vocabulary-top-n bows n)
Takes the `n` most frequent tokens as the vocabulary.
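The selection can be sketched with plain `clojure.core`. This is an illustration only, not the library's implementation: the helper name `top-n-vocabulary-sketch`, the tie-breaking rule, and the vector return shape are all assumptions.

```clojure
;; Hypothetical helper: a sketch of top-n vocabulary selection,
;; not the library's own code.
(defn top-n-vocabulary-sketch
  "Returns the n most frequent tokens across all bags-of-words."
  [bows n]
  (->> bows
       (apply merge-with +)               ; sum counts per token over the corpus
       (sort-by (juxt (comp - val) key))  ; highest count first, ties broken by token
       (take n)
       (mapv key)))

(top-n-vocabulary-sketch [{"cat" 2 "dog" 1} {"cat" 1 "fish" 3}] 2)
;; => ["cat" "fish"]
```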
(bow->something-sparse ds bow-col indices-col bow->sparse-fn options)
Converts a bag-of-words column `bow-col` to a sparse data column `indices-col`. The exact transformation to the sparse representation is given by `bow->sparse-fn`.
(bow->sparse ds bow-col indices-col bow->sparse-fn vocabulary)
Generic function to convert a column to something sparse, using the given vocabulary.
(bow->sparse-and-vocab ds
bow-col
indices-col
bow->sparse-fn
{:keys [create-vocab-fn]
:or {create-vocab-fn create-vocab-all}})
Converts a bag-of-words column `bow-col` to a sparse data column `indices-col`. The exact transformation to the sparse representation is given by `bow->sparse-fn`.
(bow->sparse-indices bow vocab->index-map)
Converts the token frequencies to the sparse vectors needed by Maxent.
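One plausible reading of this conversion is that each token present in the bag-of-words is replaced by its vocabulary index. The sketch below assumes that reading. The helper name `bow->indices-sketch`, the sorted order, and the plain vector return type are assumptions, and the real function may return a Java array instead.

```clojure
;; Hypothetical helper: maps the tokens of one bag-of-words to vocabulary indices.
(defn bow->indices-sketch
  [bow vocab->index-map]
  (->> (keys bow)
       (keep vocab->index-map)  ; tokens missing from the vocabulary are skipped
       sort
       vec))

(bow->indices-sketch {"fish" 1 "cat" 2 "bird" 1}
                     {"cat" 0 "dog" 1 "fish" 2})
;; => [0 2]
```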
(bow->tfidf ds bow-column tfidf-column)
(bow->tfidf ds bow-column tfidf-column options)
Calculates the tfidf score from bag-of-words (as token-frequency maps) in column `bow-column` and stores them in a new column `tfidf-column` as maps of token->tfidf-score.
Possible `options`:
- `:tf-map-handler-fn`: If present, it is applied to the global term-frequency map after it is created. The function needs to take a map of terms to frequencies and return such a map. A typical use is to prune less frequent terms. Defaults to `identity`, so all terms are retained.
- `:tf-weighting-scheme`: See function [[tfidf]].
- `:idf-weighting-scheme`: See function [[tfidf]].
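A handler suitable for `:tf-map-handler-fn` only has to take a term-frequency map and return one. The following is a minimal sketch of such a pruning handler. `prune-rare-terms` is a hypothetical name, and the column keywords in the commented call are placeholders.

```clojure
;; Hypothetical handler factory: keeps only terms seen at least min-count times.
(defn prune-rare-terms
  [min-count]
  (fn [tf-map]
    (into {}
          (filter (fn [[_ freq]] (>= freq min-count)))
          tf-map)))

((prune-rare-terms 2) {"cat" 3 "dog" 2 "emu" 1})
;; => {"cat" 3, "dog" 2}

;; Assumed usage, following the documented 4-arity signature:
;; (bow->tfidf ds :bow :tfidf {:tf-map-handler-fn (prune-rare-terms 2)})
```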
(count-vectorize ds text-col bow-col)
(count-vectorize
ds
text-col
bow-col
{:keys [text->bow-fn] :or {text->bow-fn default-text->bow} :as options})
Converts text column `text-col` to a bag-of-words representation in the form of a frequency-count map. The default text->bow function is `default-text->bow`. All `options` are passed to it.
(create-vocab-all bow)
Uses all tokens as the vocabulary.
(default-text->bow text)
(default-text->bow text options)
Converts text to token counts (a map token -> count).
Takes options:
- `stopwords`: either a keyword naming a default Smile dictionary (`:default`, `:google`, `:comprehensive`, `:mysql`) or a seq of stop words. The stopwords are normalized in the same way as the text itself, so the seq should contain full (non-stemmed) words. By default, no stopwords are used.
- `stemmer`: either `:none` or `:porter`, the latter selecting the Porter stemmer.
- `freq-handler-fn`: a function that takes a term-frequency map and can further manipulate it. Defaults to `identity`.
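The overall shape of such a conversion can be sketched without Smile. The version below is an approximation under stated assumptions: it lowercases, splits on non-alphanumeric characters, and does no stemming, whereas the library's actual normalization and tokenizer may differ. The name `text->bow-sketch` and the keyword option keys are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for a text->bow function; no stemming is applied.
(defn text->bow-sketch
  [text {:keys [stopwords freq-handler-fn]
         :or   {stopwords #{} freq-handler-fn identity}}]
  (let [stop (set (map str/lower-case stopwords))] ; normalize stopwords like the text
    (->> (str/split (str/lower-case text) #"[^\p{L}\p{N}]+")
         (remove str/blank?)
         (remove stop)
         frequencies
         freq-handler-fn)))

(text->bow-sketch "The cat saw the other cat." {:stopwords ["The"]})
;; => {"cat" 2, "saw" 1, "other" 1}
```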
(default-tokenize text)
(default-tokenize text options)
Tokenizes text. The use of a stemmer can be configured via the `:stemmer` option.
(freqs->SparseArray freq-map vocab->index-map)
Converts the token-frequency map to a Smile SparseArray.
(tf-map-handler-top-n n freqs)
Keeps the `n` most frequent terms in the term-frequency table.
(tfidf term bow bows)
(tfidf term bow bows options)
Calculates tfidf.
- `term`: the term for which to calculate the tfidf value
- `bow`: bag-of-words representation of the document (= term-frequency map)
- `bows`: list of bag-of-words representing the corpus (= list of term-frequency maps)
Supported `options`:
- `:tf-weighting-scheme`: the term-frequency weighting scheme, with supported values `:raw-count` and `:term-frequency`. Default is `:raw-count`. See https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2
- `:idf-weighting-scheme`: the inverse document-frequency weighting scheme, with supported values `:smooth-sklearn`, `:idf`, and `:smooth`. Default is `:smooth-sklearn`. See https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2 and https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
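For orientation, the default combination (raw-count term frequency with scikit-learn's smoothed idf) can be written out directly. The sketch assumes `:smooth-sklearn` corresponds to scikit-learn's `idf = ln((1 + n) / (1 + df)) + 1`, which has not been checked against the library's source. `tfidf-sketch` is a hypothetical name, and no vector normalization is applied.

```clojure
;; Hypothetical sketch: raw-count tf times scikit-learn's smoothed idf.
(defn tfidf-sketch
  [term bow bows]
  (let [n   (count bows)                              ; number of documents
        df  (count (filter #(contains? % term) bows)) ; documents containing term
        tf  (get bow term 0)                          ; raw count in this document
        idf (+ 1.0 (Math/log (/ (+ 1.0 n) (+ 1.0 df))))]
    (* tf idf)))

(let [bows [{"cat" 2 "dog" 1} {"dog" 3}]]
  (tfidf-sketch "dog" (first bows) bows))
;; => 1.0, because "dog" occurs in every document, so idf = 1 + ln(3/3) = 1
```

A term that is absent from `bow` gets a score of 0 here, since its raw count is 0.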
(tfidf->dense-array ds tfidf-column array-column)
Converts the sparse, map-based tfidf representation into dense double arrays.
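For a single row, densifying means writing each score at its token's position and leaving all other positions at zero. The sketch below takes an explicit token-to-index map, which is an assumption: the real function works on dataset columns and presumably derives the indexing itself. `tfidf-map->dense-sketch` is a hypothetical name.

```clojure
;; Hypothetical helper: one tfidf map -> one dense double array.
(defn tfidf-map->dense-sketch
  [tfidf-map vocab->index-map]
  (let [^doubles arr (double-array (count vocab->index-map))] ; zero-initialized
    (doseq [[token score] tfidf-map
            :let  [idx (vocab->index-map token)]
            :when idx]                                        ; skip unknown tokens
      (aset arr (int idx) (double score)))
    arr))

(vec (tfidf-map->dense-sketch {"dog" 1.5 "emu" 9.0}
                              {"cat" 0 "dog" 1 "fish" 2}))
;; => [0.0 1.5 0.0]
```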