(band-hash band-size minhash-list)
Takes the minhash signature of a string and partitions it into bands of `band-size` values. Each band (partition) is then hashed, since similar strings will tend to have at least one matching hashed band.
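The banding step described above can be sketched roughly as follows. This is an illustrative sketch, not the library's source: the name `band-hash-sketch` is hypothetical, and the use of `partition-all` (which keeps a shorter trailing band) and of Clojure's built-in `hash` are assumptions.

```clojure
;; Hypothetical sketch of banding; not the library's implementation.
(defn band-hash-sketch
  "Splits a minhash signature into consecutive bands of band-size
  values and hashes each band with Clojure's built-in `hash`."
  [band-size minhash-list]
  (map hash (partition-all band-size minhash-list)))

;; Two signatures that differ only in their middle band.
(def sig-a [1 2 3 4 5 6])
(def sig-b [1 2 9 9 5 6])

;; Bands are compared position by position; one equal band hash is
;; enough to treat the two strings as candidate duplicates.
(some true? (map = (band-hash-sketch 2 sig-a) (band-hash-sketch 2 sig-b)))
;; => true
```

A smaller `band-size` produces more bands and so more candidate pairs, while a larger one requires longer runs of the signature to agree.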
(compare-records records)
Compares the records (strings) in a list with each other using `org.clojars.punit-naik.clj-ml.utils.string/reversed-levenstein-distance`
(find-possible-duplicates shingle-size hash-count band-size match-threshold data)
Takes a collection of strings (`data`) and finds the similar strings in that collection
(hash-n-times sh-list n)
Hashes a list of shingles `n` times
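One plausible reading of this step is sketched below: every shingle is hashed with `n` different seeds, giving one list of `n` hash values per shingle. The name `hash-n-times-sketch`, the seeding scheme (hashing a `[seed shingle]` pair) and the shape of the result are all assumptions made for illustration; the library may do this differently.

```clojure
;; Hypothetical sketch; the seeding scheme and output shape are assumptions.
(defn hash-n-times-sketch
  "Returns one list per shingle, each holding n hash values obtained
  by hashing the shingle together with the seeds 0 .. n-1."
  [sh-list n]
  (map (fn [shingle]
         (map (fn [seed] (hash [seed shingle])) (range n)))
       sh-list))

;; Three shingles hashed four times each: three lists of four values.
(hash-n-times-sketch ["ab" "bc" "cd"] 4)
```

Because every inner list has the same length `n`, lists of this shape can be reduced position by position, which is what a minhash signature requires.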
(min-hash hash-values)
Takes lists of hashed values, all of the same size, and finds the minimum hash value at each position `i` across the lists. The result is a single list of hash values, which is the minhash signature of the string.
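The position-wise minimum described above can be sketched in a few lines. This is a minimal illustration under the stated assumption that all input lists have the same length; the name `min-hash-sketch` is hypothetical and the code is not taken from the library.

```clojure
;; Hypothetical sketch of the element-wise minimum; not the library's source.
(defn min-hash-sketch
  "Given equally sized lists of hash values, returns a list whose
  i-th element is the smallest value at position i across all lists.
  Returns an empty list when no lists are given."
  [hash-values]
  (if (seq hash-values)
    (apply map min hash-values)
    ()))

(min-hash-sketch [[5 9 2]
                  [3 11 7]
                  [8 1 4]])
;; => (3 1 2)
```

Passing the lists to `map` as separate arguments makes it walk them in parallel, so `min` sees exactly the values that share a position.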