clj-djl.dataframe.reductions


Clojure only.

aggregate-columns
count-distinct
distinct
distinct-int32
first-value
group-by-column-agg
mean
row-count
sum

aggregate-columns^clj

(aggregate-columns ds-or-seq colname agg-map & [options])

count-distinct^clj

(count-distinct colname)

(count-distinct colname op-space)

distinct^clj

(distinct colname)

(distinct colname finalizer)

Create a reducer that will return a set of values.
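As a rough illustration of what a distinct reducer with a finalizer does, the plain-Clojure sketch below collects the values of one column into a set and then applies an optional post-processing function. It does not call this library: `distinct-sketch` and the row-map representation are hypothetical stand-ins chosen for the example.

```clojure
;; Hypothetical stand-in, not the library's implementation: rows are plain maps.
(defn distinct-sketch
  "Return a function that reduces rows to the set of values found under
  colname, then passes that set through finalizer (identity by default)."
  ([colname] (distinct-sketch colname identity))
  ([colname finalizer]
   (fn [rows]
     (finalizer (into #{} (map #(get % colname)) rows)))))

(def rows [{:a "x"} {:a "y"} {:a "x"}])

((distinct-sketch :a) rows)        ; the set of distinct values
((distinct-sketch :a count) rows)  ; the finalizer turns the set into its size
```

The finalizer runs once, after all rows have been seen, so it is a natural place to convert the set into a count, a sorted vector, or another summary.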


distinct-int32^clj

(distinct-int32 colname)

(distinct-int32 colname finalizer)

Get the set of distinct items, given that the value space is known to be no larger than the int32 space. The optional finalizer lets you post-process the resulting set.


first-value^clj

(first-value colname)

group-by-column-agg^clj

(group-by-column-agg colname agg-map ds-seq)

(group-by-column-agg colname agg-map options ds-seq)


Group a sequence of datasets by a column and aggregate down into a new dataset.

  * colname - Either a single scalar column name or a vector of column names to group by.

  * agg-map - map of result column name to reducer.  All values in the agg map must be
    instances of `tech.v3.datatype.IndexReduction`.  Column values will be inferred from
    the finalized result of the first reduction with nil indicating an object column.

  Options:

  * `:map-initial-capacity` - initial hashmap capacity.  Resizing hash-maps is expensive so we
     would like to set this to something reasonable.  Defaults to 100000.
  * `:index-filter` - A function that given a dataset produces a function from long index
    to boolean.  Only indexes for which the index-filter returns true will be added to the
    aggregation.  For very large datasets, this is a bit faster than using filter before
    the aggregation.

  Example:

```clojure
user> (require '[tech.v3.dataset :as ds])
nil
user> (require '[tech.v3.dataset.reductions :as ds-reduce])
nil
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'user/stocks
user> (ds-reduce/group-by-column-agg
       :symbol
       {:symbol (ds-reduce/first-value :symbol)
        :price-avg (ds-reduce/mean :price)
        :price-sum (ds-reduce/sum :price)}
       [stocks stocks stocks])
:symbol-aggregation [5 3]:

| :symbol |   :price-avg | :price-sum |
|---------|--------------|------------|
|    MSFT |  24.73674797 |    9127.86 |
|     IBM |  91.26121951 |   33675.39 |
|    AAPL |  64.73048780 |   23885.55 |
|    GOOG | 415.87044118 |   84837.57 |
|    AMZN |  47.98707317 |   17707.23 |



tech.v3.dataset.reductions-test> (def tstds
                                   (ds/->dataset {:a ["a" "a" "a" "b" "b" "b" "c" "d" "e"]
                                                  :b [22   21  22 44  42  44   77 88 99]}))
#'tech.v3.dataset.reductions-test/tstds
tech.v3.dataset.reductions-test>  (ds-reduce/group-by-column-agg
                                   [:a :b] {:a (ds-reduce/first-value :a)
                                            :b (ds-reduce/first-value :b)
                                            :c (ds-reduce/row-count)}
                                   [tstds tstds tstds])
:tech.v3.dataset.reductions/_temp_col-aggregation [7 3]:

| :a | :b | :c |
|----|---:|---:|
|  a | 21 |  3 |
|  a | 22 |  6 |
|  b | 42 |  3 |
|  b | 44 |  6 |
|  c | 77 |  3 |
|  d | 88 |  3 |
|  e | 99 |  3 |
```
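To make the grouping semantics concrete, here is a minimal plain-Clojure sketch of the same idea: rows from several datasets are pooled, grouped by a key, and each group is reduced to one output row. It is not the library's implementation. `group-agg-sketch`, the use of vectors of row maps as datasets, and the use of plain functions as reducers are all assumptions made for illustration.

```clojure
;; Hypothetical sketch: each "dataset" is a vector of row maps, and each
;; reducer is a plain function from a group's rows to a single value.
(defn group-agg-sketch
  "Group the rows of every dataset in ds-seq by key-fn, then build one result
  row per group by applying each function in agg-map to that group's rows."
  [key-fn agg-map ds-seq]
  (->> (apply concat ds-seq)   ; treat all datasets as one stream of rows
       (group-by key-fn)
       (sort-by key)           ; order groups by key for stable output
       (mapv (fn [[_ rows]]
               (into {} (map (fn [[out-col f]] [out-col (f rows)])) agg-map)))))

;; Same data as the tstds example above, written as row maps.
(def tstds
  (mapv (fn [a b] {:a a :b b})
        ["a" "a" "a" "b" "b" "b" "c" "d" "e"]
        [22 21 22 44 42 44 77 88 99]))

(def result
  (group-agg-sketch (juxt :a :b)              ; group by both columns
                    {:a #(:a (first %))       ; plays the role of first-value
                     :b #(:b (first %))
                     :c count}                ; plays the role of row-count
                    [tstds tstds tstds]))
```

With this input the sketch yields seven groups, and the `["a" 22]` group has a count of 6 because that pair occurs twice in each of the three datasets, which mirrors the second example above.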


mean^clj

(mean colname)

Create a double consumer which will produce a mean of the column.
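One way to see how a mean can be computed across a sequence of datasets is to carry a running sum and count, merge the partial states, and divide only at the end. The sketch below shows that idea in plain Clojure; the function names and the `[sum count]` state are hypothetical and are not taken from the library.

```clojure
;; Hypothetical sketch of a mergeable mean: the running state is [sum count].
(defn mean-step
  "Fold one value into the running [sum count] state."
  [[sum n] x]
  [(+ sum (double x)) (inc n)])

(defn mean-merge
  "Combine two partial states, for example one per dataset."
  [[s1 n1] [s2 n2]]
  [(+ s1 s2) (+ n1 n2)])

(defn mean-finalize
  "Turn the final [sum count] state into the mean."
  [[sum n]]
  (/ sum n))

(def part-a (reduce mean-step [0.0 0] [1 2 3]))
(def part-b (reduce mean-step [0.0 0] [4 5]))

(mean-finalize (mean-merge part-a part-b)) ; mean over all five values
```

Keeping the sum and count separate until the final step is what makes the partial results safe to combine; averaging the two partial means directly would weight the datasets incorrectly when their sizes differ.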


row-count^clj

(row-count)

Create a simple reducer that returns the number of times reduceIndex was called.


sum^clj

(sum colname)

Create a double consumer which will sum the values.

