Clojure's counterparts of common R data abstractions

Several Clojure libraries have offered counterparts of common R data abstractions.

Here we'll mention a few of those libraries.

Incanter

Incanter, by D.E. Liebke (later maintained by A. Ott), was the first widely known Clojure library for data science. It offers notions of datasets (the counterpart of R data frames), matrices, and categorical variables (the counterpart of R factors). It does more than that, of course, aiming to be Clojure's counterpart of R itself.

Rincanter and its forks have supported Incanter's datasets as the counterparts of R's data frames.

Independently of Incanter, some of Rincanter's forks also support a notion of categorical variables as a counterpart of R's factors.

core.matrix

core.matrix, by Mike Anderson, is mainly a matrix library. It offers a set of abstractions that has several implementations. In one of its later versions, it started supporting a notion of a dataset that generalizes Incanter's notion. This was done in cooperation with Incanter's developers, so that Incanter's own dataset implementation could be replaced by that of core.matrix, allowing Incanter to run on any core.matrix implementation.

As mentioned above, Rojure uses the dataset abstraction of core.matrix, and thus supports Incanter, too.

Neanderthal and Denisovan

Neanderthal by Dragan Djuric is one of the popular Clojure matrix libraries, and surely the most active and comprehensive one nowadays. It focuses on high-performance computation.

Denisovan is a partial implementation of core.matrix that uses Neanderthal as its engine.

Spork's Incanter forks and extensions

The Spork project by Joinr is a collection of libraries for data science and operations research. Among other things, it continues the work of Incanter and offers its own notion of a typed columnar table. See some comments here.

The Tech stack

The 'tech' collection of libraries by Chris Nuernberger offers a new set of abstractions relevant for data science work. One of the main ideas behind this stack is to build bridges rather than islands - that is, the goal is not to create a specific toolset, but rather to create a platform that can connect to any other relevant toolset, thus enjoying the growth and development of any relevant ecosystem in the field.

Two relevant libraries of this stack are the following:

  • tech.datatype offers a set of abstractions for working with various sequential and array-like structures (including tensors of arbitrary dimension and, in particular, matrices).
  • tech.ml.dataset grew out of some discussion following the above-mentioned work of Spork, and other relevant experiences. It offers a 'dataset' abstraction (a data frame, in R's language) of tabular structures with typed columns, and allows for different implementations. It has one specific comprehensive implementation based on the Tablesaw Java library. It also offers its own notion of categorical variables (similar to R's factors).
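
To make the two notions above concrete, here is a minimal sketch in plain Clojure of a typed, column-oriented table and of a factor-like categorical encoding. It uses only clojure.core and is not the API of tech.ml.dataset or any other library mentioned here; the data and the names `rows`, `columns` and `->categorical` are hypothetical, chosen for illustration only.

```clojure
;; Hypothetical row-oriented data, for illustration only.
(def rows
  [{:species "setosa"    :petal-length 1.4}
   {:species "virginica" :petal-length 6.0}
   {:species "setosa"    :petal-length 1.3}])

;; Column-oriented view: one homogeneous container per column.
;; The numeric column is stored as a primitive double array.
(def columns
  {:species      (mapv :species rows)
   :petal-length (double-array (map :petal-length rows))})

;; A factor-like encoding: the sorted distinct levels,
;; plus one integer code per value.
(defn ->categorical [values]
  (let [levels (vec (sort (distinct values)))
        index  (zipmap levels (range))]
    {:levels levels
     :codes  (mapv index values)}))

(->categorical (:species columns))
;; => {:levels ["setosa" "virginica"], :codes [0 1 0]}
```

Note that the codes in this sketch are zero-based, whereas R's factors use one-based integer codes.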

Other libraries offering data-frame-like notions

Here are some other Clojure libraries that offer some data-frame-like notions. Some of them are strongly inspired by the Python Pandas library.

  • dataframe by George Lewis closely follows the main API functions of Pandas.
  • koala by Aria Haghighi also implements Pandas-like functions, and has some nice IO support.
  • wombat by Ribelo is inspired by Pandas, and offers a transducers-based API and implementation.
  • panthera by Alan Marazzi actually wraps Pandas through Libpython-clj.
  • huri by Simon Belak offers a data-frame-like experience by processing simple sequences-of-maps.
  • kixi.stats by Henry Garner, Simon Belak and Elise Huard offers a large composable toolset of statistical functionality on sequences-of-maps, based on transducers.
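
As a rough illustration of the sequences-of-maps and transducers style mentioned in the last few items, the following sketch computes a filtered mean with clojure.core alone. It does not use the API of huri, kixi.stats or wombat; the data and the reducing function `mean-rf` are hypothetical, written by hand for this example.

```clojure
;; Hypothetical data: a "data frame" as a plain sequence of maps.
(def people
  [{:name "a" :age 30 :group :x}
   {:name "b" :age 40 :group :y}
   {:name "c" :age 50 :group :x}])

;; A hand-written reducing function for the arithmetic mean.
;; Arity 0: initial accumulator [sum count].
;; Arity 1: completion step, returns the mean (nil for no input).
;; Arity 2: accumulates one value.
(defn mean-rf
  ([] [0.0 0])
  ([[sum n]] (when (pos? n) (/ sum n)))
  ([[sum n] x] [(+ sum x) (inc n)]))

;; Compose row filtering and column selection as a transducer,
;; then reduce with mean-rf in a single pass.
(transduce (comp (filter #(= :x (:group %)))
                 (map :age))
           mean-rf
           people)
;; => 40.0
```

Because the filtering, the column selection and the statistic are separate composable pieces, the same `mean-rf` can be reused with any other transducer pipeline over the same data.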
