consimilo.core


add-all-to-forest (clj)

(add-all-to-forest feature-coll)
(add-all-to-forest forest feature-coll)

Adds each vector in feature-coll to an LSH forest and returns the forest. To add feature-coll to an existing forest, pass the forest as the first argument. Each item of feature-coll should be a map with :id and :features entries. The :id identifies the minhash vector and is what a query of the forest returns; it can be used to look up the minhash vector in the :keys hashmap of the forest. The :features entry is a collection of strings used to create the minhash vector (e.g. for a document, the :features could be tokens or n-grams).

Note: load items into the forest in as few calls as possible, using large chunks. An expensive sort is called after items are added to the forest to enable ~log(n) queries.
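
As a minimal sketch of the input shape described above, the following builds a feature-coll from two small documents by lower-casing and splitting on whitespace. The tokenize and doc->entry helpers, the string ids, and the sample documents are hypothetical and not part of consimilo. The calls inside the comment form assume the library is on the classpath and that consimilo.core is aliased as consimilo; more-features stands in for a later collection.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: naive whitespace tokenizer, for illustration only.
(defn tokenize [text]
  (str/split (str/lower-case (str/trim text)) #"\s+"))

;; Hypothetical helper: builds one feature-coll item with :id and :features.
(defn doc->entry [id text]
  {:id id :features (tokenize text)})

(def feature-coll
  [(doc->entry "doc-1" "The quick brown fox")
   (doc->entry "doc-2" "The lazy brown dog")])

(comment
  ;; Needs the consimilo dependency, so it is not evaluated here.
  (require '[consimilo.core :as consimilo])
  ;; Build a new forest from the whole collection in one call.
  (def forest (consimilo/add-all-to-forest feature-coll))
  ;; Add a later collection to the existing forest by passing the forest first.
  (consimilo/add-all-to-forest forest more-features))
```

Building the whole collection first and passing it in a single call follows the note above about loading items in large chunks.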

add-files-to-forest (clj)

(add-files-to-forest files & {:keys [forest] :or {forest (f/new-forest)}})

Convenience method for processing files. files should be a collection of File objects. The :id used for entry into the forest is generated from the file name. The :features are generated by extracting the text from each file and tokenizing and/or shingling it per the optional parameters. The feature vector is minhashed and inserted into the lsh-forest.

Optional Keyword Arguments: :forest - add to an existing forest; default: create new forest

Note: load items into the forest in as few calls as possible, using large chunks. An expensive sort is called after items are added to the forest to enable ~log(n) queries.
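
One way to assemble the files argument is sketched below. It assumes that "File objects" means java.io.File instances; the regular-files helper and the "corpus" and "more-docs" directory names are hypothetical. The consimilo calls sit in a comment form because they need the library on the classpath.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: every regular file under dir, as java.io.File objects.
;; file-seq also yields directories, so those are filtered out.
(defn regular-files [dir]
  (->> (file-seq (io/file dir))
       (filter #(.isFile ^java.io.File %))
       vec))

(comment
  ;; Needs the consimilo dependency, so it is not evaluated here.
  (require '[consimilo.core :as consimilo])
  ;; Build a new forest from every file under a directory, in one call.
  (def forest (consimilo/add-files-to-forest (regular-files "corpus")))
  ;; Add further files to the same forest through the :forest keyword argument.
  (consimilo/add-files-to-forest (regular-files "more-docs") :forest forest))
```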

add-strings-to-forest (clj)

(add-strings-to-forest feature-coll
                       &
                       {:keys [forest] :or {forest (f/new-forest)}})

Convenience method for processing documents. Each item of feature-coll should be a map with :id and :features entries. The :id is the identifier for the minhash vector stored in the forest. The :features entry is a string that will be tokenized into features per the optional parameters. The feature vector is minhashed and inserted into the lsh-forest.

Optional Keyword Arguments: :forest - add to an existing forest; default: create new forest

Note: load items into the forest in as few calls as possible, using large chunks. An expensive sort is called after items are added to the forest to enable ~log(n) queries.
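
The note above suggests merging small batches before loading them. A minimal sketch, with hypothetical batch names, ids, and text: the batches are concatenated into one collection so that the forest is built with a single call. The consimilo calls are in a comment form because they need the library on the classpath; later-docs stands in for a collection that arrives afterwards.

```clojure
;; Hypothetical batches, e.g. documents arriving from different sources.
;; Here :features is a plain string, as add-strings-to-forest expects.
(def batch-a [{:id "a-1" :features "the quick brown fox"}
              {:id "a-2" :features "jumps over the lazy dog"}])
(def batch-b [{:id "b-1" :features "a quick brown dog"}])

;; Concatenate the batches into one vector instead of loading each separately.
(def all-docs (into [] cat [batch-a batch-b]))

(comment
  ;; Needs the consimilo dependency, so it is not evaluated here.
  (require '[consimilo.core :as consimilo])
  ;; One call for the whole collection.
  (def forest (consimilo/add-strings-to-forest all-docs))
  ;; Later additions go to the same forest through the :forest keyword argument.
  (consimilo/add-strings-to-forest later-docs :forest forest))
```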

freeze-forest (clj)

(freeze-forest forest file-path)

Serializes forest and saves it to a file. forest should be created using one of the add-*-to-forest functions. file-path should be a string representing the file path. Returns the byte-array representation of the serialized object and creates a file containing the byte-string representation of the serialized object.

get-sim-fn (clj)

(get-sim-fn key)

query-file (clj)

(query-file forest k file)

Convenience method for querying the forest for top-k similar files. forest is the forest to be queried. file is converted to a feature vector through text extraction and tokenizing / shingling per the optional arguments. The feature vector is minhashed and used to query the forest. k is the number of results (top-k most similar items).

query-forest (clj)

(query-forest forest k v)

Finds the closest k vectors to vector v stored in the forest.

query-string (clj)

(query-string forest k string)

Convenience method for querying the forest for top-k similar strings. forest is the forest to be queried. string is converted to a feature vector through tokenization / shingling per the optional parameters. The feature vector is minhashed and used to query the forest. k is the number of results (top-k most similar items).

similarity-k (clj, multimethod)

Queries the forest for the top-k items and returns a hashmap: {item-key1 sim-fn-result1 ... item-key-k sim-fn-result-k}. Available similarity functions are Jaccard similarity, cosine distance, and Hamming distance. sim-fn defaults to :jaccard, but can be overridden by passing the optional :sim-fn key with :jaccard, :cosine, or :hamming. similarity-k dispatches based on input: string, file, or feature-vector.
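
Because the result is an unordered hashmap, callers may want to rank it. The sketch below is illustrative only: sample-results holds made-up values in the documented {item-key sim-fn-result} shape, and rank-results is a hypothetical helper. It assumes, going by the names above, that :jaccard yields a similarity (higher is closer) while :cosine and :hamming yield distances (lower is closer). The argument order shown for similarity-k in the comment form is also an assumption, mirroring query-string, since no signature is listed here.

```clojure
;; Made-up result map in the documented shape {item-key sim-fn-result}.
(def sample-results {"doc-1" 0.82, "doc-2" 0.35, "doc-3" 0.67})

;; Hypothetical helper: orders entries so the closest item comes first.
;; Assumption: :jaccard is a similarity (descending order); :cosine and
;; :hamming are distances (ascending order).
(defn rank-results [results sim-fn]
  (let [closer-first (if (= sim-fn :jaccard) > <)]
    (sort-by val closer-first results)))

(rank-results sample-results :jaccard)
;; => (["doc-1" 0.82] ["doc-3" 0.67] ["doc-2" 0.35])

(comment
  ;; Needs the consimilo dependency and an existing forest; not evaluated here.
  ;; Assumed argument order: forest, k, input, then optional keyword arguments.
  (require '[consimilo.core :as consimilo])
  (rank-results (consimilo/similarity-k forest 3 "quick brown fox" :sim-fn :cosine)
                :cosine))
```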

thaw-forest (clj)

(thaw-forest file-path)

Deserializes forest from file. file-path should be a string representing the filepath of the serialized object. Returns an lsh-forest.
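
A freeze and thaw round trip might look like the sketch below. The file name and its location in the JVM temp directory are hypothetical, as are the sample document and query. The consimilo calls are in a comment form because they need the library on the classpath.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical location for the serialized forest, passed as a string path.
(def forest-path
  (str (io/file (System/getProperty "java.io.tmpdir") "consimilo-forest.bin")))

(comment
  ;; Needs the consimilo dependency, so it is not evaluated here.
  (require '[consimilo.core :as consimilo])
  (def forest
    (consimilo/add-strings-to-forest
      [{:id "doc-1" :features "the quick brown fox"}]))
  ;; Serialize the forest to disk.
  (consimilo/freeze-forest forest forest-path)
  ;; Later, possibly in another process: load it back and query it.
  (def restored (consimilo/thaw-forest forest-path))
  (consimilo/query-string restored 1 "quick brown fox"))
```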
