K-Means clustering generates a specific number of disjoint, non-hierarchical clusters and is well suited to finding globular clusters. The method is numerical, unsupervised, non-deterministic, and iterative. Every member of a cluster is closer to its own cluster's center than to the center of any other cluster.

The choice of initial partition can greatly affect the final clusters, in terms of inter-cluster and intra-cluster distances and cohesion. As a result, k-means is best run multiple times to avoid the trap of a local minimum.
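Because the initial partition matters, a common pattern is to run the algorithm several times and keep the lowest-cost result. A minimal sketch of that restart strategy, assuming a hypothetical `run-k-means` function (standing in for whatever single-run entry point the library exposes) that returns a map containing a `:cost` key:

```
;; `run-k-means` is a hypothetical single-run k-means that returns a map
;; with a :cost key; it is an assumption for illustration, not the
;; library's actual entry point.
(defn best-of-n-runs
  "Runs k-means n times and keeps the result with the lowest cost."
  [run-k-means points k n]
  (apply min-key :cost
         (repeatedly n #(run-k-means points k))))
```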
Multimethod for distance calculations. The `get-distance-fn` multimethod dispatches on identity to determine which distance function to use when calculating distances. By default the supported distance function keys are:

- :euclidean
- :manhattan
- :chebyshev
- :correlation
- :canberra
- :emd
- :euclidean-sq
- :discrete
- :cosine
- :angular
- :jensen-shannon

The default distance function is :emd, which may not be appropriate for all use cases since it doesn't minimize the variational distance the way :euclidean would. If you don't know why :emd is the default, you should probably switch to :euclidean.
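As a usage sketch, assuming the multimethod returns a function of two points (check the var's actual spec before relying on this shape), and showing how identity dispatch lets you register a custom key:

```
;; Assumption: `get-distance-fn` returns a distance function of two points.
(let [distance (get-distance-fn :euclidean)]
  (distance [0 0] [3 4]))

;; Because dispatch is on identity, a custom distance is registered by
;; adding a method for a new key. The key and body here are illustrative.
(defmethod get-distance-fn :squared-chebyshev
  [_]
  (fn [a b]
    (let [m (apply max (map #(Math/abs (double (- %1 %2))) a b))]
      (* m m))))
```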
Fast and Provably Good Seedings for k-Means is a paper by Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause which introduces an improvement to the Markov chain Monte Carlo approximation of k-means++ D^2 sampling. It accomplishes this by computing the D^2 sampling distribution with respect to the first cluster center. This has the practical benefit of removing some of the assumptions, such as the choice of distance metric, that were imposed by the earlier framing; hence the algorithm's name, assumption-free k-MC^2. A savvy reader may note that by computing the D^2 sampling distribution as one of its steps, this algorithm loses some of the theoretical advantages of the pure Markov chain formulation. The paper argues that this is acceptable because, in practice, computing the first D^2 sampling distribution pays for itself by reducing the chain length necessary to obtain convergence guarantees.
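The proposal distribution described above can be sketched directly from its definition: half the weight follows the D^2 distribution with respect to the first center, half is uniform. A sketch under those assumptions (the function names are illustrative, not the library's API):

```
;; Sketch of the assumption-free proposal distribution, built with respect
;; to the first sampled center c1. `distance` is any function of two
;; points; names here are illustrative only.
(defn proposal-distribution
  "q(x) = 1/2 * d(x, c1)^2 / sum-d2  +  1/(2n), for each point x."
  [distance points c1]
  (let [d2     (mapv #(let [d (distance % c1)] (* d d)) points)
        sum-d2 (reduce + d2)
        n      (count points)]
    (mapv #(+ (/ % (* 2.0 sum-d2)) (/ 1.0 (* 2 n))) d2)))
```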
Approximating K-Means in sublinear time is a paper by Olivier Bachem, Mario Lucic, S. Hamed Hassani, and Andreas Krause which shares a method for obtaining a provably good approximation of k-means++ in sublinear time. The method uses Markov chain Monte Carlo sampling to approximate the D^2 sampling used in k-means++. Since this method provably converges to drawing from the same distribution as D^2 sampling, the theoretical competitiveness guarantees of k-means++ are inherited. The algorithm is sublinear with respect to input size, which distinguishes it from other k-means++ variants like k-means||. Whereas a variant like k-means|| allows a distributed k-means++ computation to be carried out across a cluster of computers, k-means-mc++ is better suited to running on a single machine.
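The Markov chain at the heart of this approximation is a Metropolis-Hastings chain: propose a uniformly random point and accept it with probability proportional to the ratio of squared distances to the nearest existing center. A sketch under those assumptions (names are illustrative, not the library's API):

```
;; Sketch of one MCMC approximation of D^2 sampling: run a chain of
;; length m whose stationary distribution is proportional to the squared
;; distance from each point to its nearest center.
(defn mcmc-d2-sample
  [distance points centers m]
  (let [d2 (fn [x]
             (let [d (apply min (map #(distance x %) centers))]
               (* d d)))]
    (reduce (fn [x _]
              (let [y (rand-nth points)]
                ;; accept y with probability min(1, d2(y) / d2(x))
                (if (< (rand) (min 1.0 (/ (d2 y) (max (d2 x) 1e-12))))
                  y
                  x)))
            (rand-nth points)
            (range m))))
```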
A random initialization strategy for k-means which lacks theoretical guarantees on solution quality for any individual run, but which completes in O(n + k*d) time and takes only O(k*d) space.
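A minimal sketch of this strategy on an in-memory sequence of points: pick k distinct points uniformly at random as the initial centers. The scan is linear in n, and only the k selected d-dimensional centers are retained, giving the O(k*d) space bound above.

```
;; Uniform random initialization sketch: choose k distinct points as the
;; initial centers. This is illustrative; the library's own strategy may
;; sample from disk rather than shuffling in memory.
(defn random-init
  [points k]
  (vec (take k (shuffle points))))
```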
A namespace for functions related to reading and writing datasets in various file formats. This namespace provides functions for loading datasets from files, writing datasets to files, and converting between different file formats. It also provides a configuration map of supported file formats and their associated reader and writer functions.

Examples:

```
(read-dataset-seq state :filename)
(write-dataset-seq state :filename [dataset1 dataset2 ...])
```
The `josh.meanings.protocols.cluster-model` namespace defines a protocol for manipulating cluster models that have already been trained. The `PClusterModel` protocol defines the interface for cluster model implementations, including methods for saving and loading model data and for classifying points with the model. Any implementation of the `PClusterModel` protocol must provide concrete implementations of the protocol methods and adhere to the input and output specs defined in the protocol.
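An implementation might look roughly like the following sketch. The method names mirror those documented for cluster results, but the record fields, method arities, and bodies are assumptions for illustration; consult the protocol's specs for the real signatures.

```
;; Hypothetical in-memory implementation sketch of PClusterModel.
;; Everything here except the protocol name is an assumption.
(defrecord InMemoryClusterModel [centroids assignments configuration]
  PClusterModel
  (load-centroids [_] centroids)
  (load-assignments-datasets [_] assignments)
  (classify [_ x]
    ;; index of the nearest centroid (squared Euclidean, for the sketch)
    (->> centroids
         (map-indexed (fn [i c]
                        [i (reduce + (map #(* % %) (map - x c)))]))
         (apply min-key second)
         first)))
```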
Cluster results are the result of a clustering operation. They contain a reference to the model's centroids, the assignments which generated those centroids, the objective-function cost of the clustering, the format the cluster is saved in, and some configuration details about how the clustering process was run.

Callers who wish to use the cluster result can get access to the centroids with:

```
(.load-centroids cluster-result)
```

They can get access to the assignments with:

```
(.load-assignments-datasets cluster-result)
```

They can also classify points with the cluster result:

```
(.classify cluster-result x)
```

Where x is a vector of points in `(-> cluster-result :configuration :col-names)` order, or a map that contains every col-name as a key.

To save a cluster result to disk:

```
(.save-model cluster-result filename)
```

To load a cluster result from disk:

```
(load-model filename)
```
Provides a defrecord for storing the configuration of a clustering process, and a protocol for retrieving stateful IO such as the potentially larger-than-memory points and assignments datasets.
Simplification in this context means transforming a dataset that has duplicate values into one in which there are no duplicates and one new column: the number of times each duplicate was observed.
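On a plain sequence of rows, the transformation described above can be sketched with `frequencies`: collapse duplicate rows and record each row's multiplicity in a new `:count` column (the column name is illustrative; the library's actual column name may differ).

```
;; Collapse duplicate rows, adding a :count column with the number of
;; times each row was observed.
(defn simplify-rows
  [rows]
  (mapv (fn [[row n]] (assoc row :count n))
        (frequencies rows)))

;; (simplify-rows [{:x 1} {:x 1} {:x 2}])
;; => [{:x 1 :count 2} {:x 2 :count 1}]
```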