scicloj.ml.smile.nlp

Clojure only.

->vocabulary-top-n^clj

(->vocabulary-top-n bows n)

Takes top-n most frequent tokens as vocabulary
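
The idea can be sketched in plain Clojure. This is an illustration only, not the library's implementation: `top-n-vocabulary` is a hypothetical helper, and it assumes that "most frequent" means the highest total count summed over all bags of words.

```clojure
;; Illustrative sketch only; top-n-vocabulary is a hypothetical helper,
;; not part of scicloj.ml.smile.nlp.
(defn top-n-vocabulary
  "Returns the n tokens with the highest total count across all bows."
  [bows n]
  (->> bows
       (apply merge-with +)              ; sum counts per token over the corpus
       (sort-by (juxt (comp - val) key)) ; most frequent first, ties by token
       (take n)
       (mapv key)))

(def example-bows
  [{"cat" 2 "dog" 1}
   {"cat" 1 "fish" 3}
   {"dog" 1 "fish" 1}])

(top-n-vocabulary example-bows 2)
;; => ["fish" "cat"]
```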

bow->something-sparse^clj

(bow->something-sparse ds bow-col indices-col bow->sparse-fn options)

Converts a bag-of-words column `bow-col` to a sparse data column `indices-col`. The exact transformation to the sparse representation is given by `bow->sparse-fn`.

bow->sparse^clj

(bow->sparse ds bow-col indices-col bow->sparse-fn vocabulary)

Generic function to convert a column to something sparse, using the given vocabulary.

bow->sparse-and-vocab^clj

(bow->sparse-and-vocab ds
                       bow-col
                       indices-col
                       bow->sparse-fn
                       {:keys [create-vocab-fn]
                        :or {create-vocab-fn create-vocab-all}})

Converts a bag-of-words column `bow-col` to a sparse data column `indices-col`. The exact transformation to the sparse representation is given by `bow->sparse-fn`. The vocabulary is created by `create-vocab-fn`, which defaults to `create-vocab-all`.

bow->sparse-indices^clj

(bow->sparse-indices bow vocab->index-map)

Converts the token-frequencies to the sparse vectors needed by Maxent
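
A minimal sketch of such a conversion, in plain Clojure. It assumes that the sparse vector is an int array holding the vocabulary indices of the tokens present in the document, with the counts themselves discarded; whether the library does exactly this is not confirmed by the docstring. `build-vocab->index-map` and `bow->index-array` are hypothetical names.

```clojure
;; Illustrative sketch only; both functions are hypothetical helpers.
(defn build-vocab->index-map
  "Assigns each vocabulary token a position 0, 1, 2, ..."
  [vocabulary]
  (zipmap vocabulary (range)))

(defn bow->index-array
  "Returns a sorted int array of the indices of all tokens in bow
  that are part of the vocabulary."
  [bow vocab->index-map]
  (->> (keys bow)
       (keep vocab->index-map) ; tokens outside the vocabulary are dropped
       sort
       int-array))

(def vocab-index (build-vocab->index-map ["cat" "dog" "fish"]))

(vec (bow->index-array {"fish" 2 "cat" 1 "bird" 4} vocab-index))
;; => [0 2]
```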

bow->tfidf^clj

(bow->tfidf ds bow-column tfidf-column)

(bow->tfidf ds bow-column tfidf-column options)

Calculates the tfidf score from bags of words (given as token-frequency maps) in column `bow-column` and stores the result in a new column `tfidf-column` as maps of token -> tfidf score. Possible `options`:

- `:tf-map-handler-fn`: If present, it is applied to the global term-frequency map after it has been created. The function must take a map of terms to frequencies and return such a map. A typical use is to prune less frequent terms. Defaults to `identity`, so all terms are retained.
- `:tf-weighting-scheme`: See function `tfidf`
- `:idf-weighting-scheme`: See function `tfidf`
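
As an illustration of what a `:tf-map-handler-fn` could look like, the sketch below prunes terms under a minimum count. `prune-rare-terms` is a hypothetical function written for this example; it only relies on the contract stated above (a map of terms to frequencies in, such a map out).

```clojure
;; Illustrative sketch only; prune-rare-terms is a hypothetical handler.
(defn prune-rare-terms
  "Keeps only the terms that occur at least min-count times."
  [min-count tf-map]
  (into {}
        (filter (fn [[_term freq]] (>= freq min-count)))
        tf-map))

(prune-rare-terms 2 {"the" 10 "cat" 2 "zebra" 1})
;; => {"the" 10, "cat" 2}
```

Under that contract, a handler like this would presumably be passed as `{:tf-map-handler-fn (partial prune-rare-terms 2)}` in the `options` map.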

count-vectorize^clj

(count-vectorize ds text-col bow-col)

(count-vectorize
  ds
  text-col
  bow-col
  {:keys [text->bow-fn] :or {text->bow-fn default-text->bow} :as options})

Converts text column `text-col` to a bag-of-words representation in the form of a frequency-count map. The default text->bow function is `default-text->bow`. All `options` are passed to it.

create-vocab-all^clj

(create-vocab-all bow)

Uses all tokens as the vocabulary

default-text->bow^clj

(default-text->bow text)

(default-text->bow text options)

Converts text to token counts (a map token -> count). Takes options:

- `stopwords`: either a keyword naming a default Smile dictionary (`:default`, `:google`, `:comprehensive`, `:mysql`) or a seq of stop words. The stop words are normalized in the same way as the text itself, so the seq should contain full (non-stemmed) words. By default, no stop words are used.
- `stemmer`: either `:none` or `:porter`, the latter selecting the Porter stemmer.
- `freq-handler-fn`: a function that takes a term-frequency map and can further manipulate it. Defaults to `identity`.
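
The overall shape of this conversion can be sketched without Smile. The example below is a simplified stand-in, not the library's code: `simple-text->bow` is a hypothetical function, its tokenizer is a plain regular expression, normalization is assumed to be lower-casing only, and no stemming is applied.

```clojure
(require '[clojure.string :as str])

;; Illustrative sketch only; simple-text->bow is a hypothetical function.
(defn simple-text->bow
  "Splits text into word tokens, lower-cases them, removes stop words
  (normalized the same way as the text) and counts the remaining tokens."
  [text stopwords]
  (let [normalize str/lower-case
        stop?     (set (map normalize stopwords))]
    (->> (re-seq #"[\p{L}\p{N}]+" text) ; runs of letters and digits
         (map normalize)
         (remove stop?)
         frequencies)))

(simple-text->bow "The cat saw the other cat" ["The"])
;; => {"cat" 2, "saw" 1, "other" 1}
```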

default-tokenize^clj

(default-tokenize text)

(default-tokenize text options)

Tokenizes text. Stemming can be configured with the `:stemmer` option.

default-word-normalize^clj

(default-word-normalize word)

freqs->SparseArray^clj

(freqs->SparseArray freq-map vocab->index-map)

Converts the token-frequency map to a Smile `SparseArray`

idf^clj

(idf terms bows)

(idf term bows options)

resolve-stemmer^clj

(resolve-stemmer options)

resolve-stopwords^clj

(resolve-stopwords stopwords-option)

simple-normalizer^clj

tf^clj

(tf term bow)

(tf term bow options)

tf-map^clj

(tf-map bows)

tf-map-handler-top-n^clj

(tf-map-handler-top-n n freqs)

Keeps the n most frequent terms in the term-frequency table

tfidf^clj

(tfidf term bow bows)

(tfidf term bow bows options)

Calculates tfidf.

- `term`: the term for which to calculate the tfidf value
- `bow`: bag-of-words representation of the document (= term-frequency map)
- `bows`: list of bags of words representing the corpus (= list of term-frequency maps)

Supported `options`:

- `:tf-weighting-scheme`: the term-frequency weighting scheme, with supported values `:raw-count` and `:term-frequency`. Default is `:raw-count`. See https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Term_frequency_2
- `:idf-weighting-scheme`: the inverse document-frequency weighting scheme, with supported values `:smooth-sklearn`, `:idf` and `:smooth`. Default is `:smooth-sklearn`. See https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2 and https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer
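
To make the default combination concrete, here is a sketch of a raw-count term frequency multiplied by the smoothed inverse document frequency that scikit-learn documents for `smooth_idf=True`, that is `ln((1 + n) / (1 + df)) + 1`. It is an assumption that the library's `:smooth-sklearn` scheme matches this formula exactly; the function names below are hypothetical, and no vector normalization is applied.

```clojure
;; Illustrative sketch only; all three functions are hypothetical helpers.
(defn raw-count-tf
  "Number of times term occurs in the document."
  [term bow]
  (get bow term 0))

(defn smooth-sklearn-idf
  "ln((1 + n) / (1 + df)) + 1, where n is the number of documents
  and df the number of documents containing term."
  [term bows]
  (let [n  (count bows)
        df (count (filter #(contains? % term) bows))]
    (+ 1.0 (Math/log (/ (+ 1.0 n) (+ 1.0 df))))))

(defn tfidf-sketch
  [term bow bows]
  (* (raw-count-tf term bow) (smooth-sklearn-idf term bows)))

(def corpus [{"cat" 2 "dog" 1} {"cat" 1}])

;; "cat" occurs in every document, so its idf is ln(3/3) + 1 = 1.0
(tfidf-sketch "cat" (first corpus) corpus)
;; => 2.0
```

A term such as "dog", which occurs in only one of the two documents, gets an idf above 1.0 and is therefore weighted more heavily per occurrence.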

tfidf->dense-array^clj

(tfidf->dense-array ds tfidf-column array-column)

Converts the sparse tfidf map based representation into dense double arrays
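
The conversion from a sparse map to a dense array can be sketched as follows. `tfidf-map->dense` is a hypothetical helper, and the explicit `vocabulary` vector that fixes the position of each token is an assumption of this example; the docstring does not say how the library orders the array.

```clojure
;; Illustrative sketch only; tfidf-map->dense is a hypothetical helper.
(defn tfidf-map->dense
  "Returns a double array with one slot per vocabulary token,
  using 0.0 for tokens missing from tfidf-map."
  [vocabulary tfidf-map]
  (double-array (map #(get tfidf-map % 0.0) vocabulary)))

(vec (tfidf-map->dense ["cat" "dog" "fish"] {"cat" 1.5 "fish" 0.25}))
;; => [1.5 0.0 0.25]
```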

word-process^clj

(word-process stemmer word-normalizer-fn word options)
