
tech-ml-version

“7.021”

tablecloth-version

“7.021”

Introduction

tech.ml.dataset is a great and fast library which brings columnar datasets to Clojure. Chris Nuernberger has been working on this library for the last year as part of the bigger tech.ml stack.

I’ve started to test the library and to help fix the bugs uncovered along the way. My main goal was to compare its functionality with the established standards on other platforms. I focused on R solutions: dplyr, tidyr and data.table.

While converting the examples I came up with a way to reorganize the existing tech.ml.dataset functions into a simple-to-use API. The main goals were:

  • Focus on dataset manipulation functionality, leaving aside other parts of tech.ml like pipelines, datatypes, readers, ML, etc.
  • Single entry point for common operations - one function dispatching on given arguments.
  • group-by results in a special kind of dataset - a dataset containing the subsets created by grouping as a column.
  • Most operations recognize regular and grouped datasets and process data accordingly.
  • One function form to enable thread-first on a dataset.

If you want to know more about tech.ml.dataset and dtype-next please refer to their documentation:

SOURCE CODE

Join the discussion on Zulip

Let’s require the main namespace and define the dataset used in most examples:

(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.functional :as dfn])
(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
                      :V2 (range 1 10)
                      :V3 (take 9 (cycle [0.5 1.0 1.5]))
                      :V4 (take 9 (cycle ["A" "B" "C"]))}))
DS

_unnamed [9 4]:

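| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   4 | 0.5 |   A |
|   1 |   5 | 1.0 |   B |
|   2 |   6 | 1.5 |   C |
|   1 |   7 | 0.5 |   A |
|   2 |   8 | 1.0 |   B |
|   1 |   9 | 1.5 |   C |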

Functionality

Dataset

A dataset is a special type which can be considered a map of columns, implemented on top of the tech.ml.dataset library. Each column can be considered a named sequence of typed data. Supported types include integers, floats, strings, booleans, date/time, objects, etc.

Dataset creation

A dataset can be created from various types of Clojure structures and files:

  • single values
  • sequence of maps
  • map of sequences or values
  • sequence of columns (taken from other dataset or created manually)
  • sequence of pairs: [string column-data] or [keyword column-data]
  • array of any arrays
  • file types: raw/gzipped csv/tsv, json, xls(x) taken from local file system or URL
  • input stream

tc/dataset accepts:

  • data
  • options (see documentation of tech.ml.dataset/->dataset function for full list):
    • :dataset-name - name of the dataset
    • :num-rows - number of rows to read from file
    • :header-row? - indication if first row in file is a header
    • :key-fn - function applied to column names (eg. keyword, to convert column names to keywords)
    • :separator - column separator
    • :single-value-column-name - name of the column when single value is provided
    • :column-names - in case you want to name columns - only works for sequential input (arrays) or empty dataset
    • :layout - for numerical, native array of arrays - treat entries :as-rows or :as-columns (default)
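As a minimal sketch of how the options map is passed (it is the second argument to tc/dataset), the snippet below names a dataset at creation time. The data and the name "example" are illustrative only.

```clojure
(require '[tablecloth.api :as tc])

;; Illustrative data; the options map follows the data argument.
(def named-ds
  (tc/dataset {:A [1 2 3]}
              {:dataset-name "example"}))

(tc/dataset-name named-ds) ;; => "example"
(tc/shape named-ds)        ;; => [3 1]
```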

tc/let-dataset accepts bindings symbol-column-data to simulate R’s tibble function. Each binding is converted into a column. You can refer to previous columns in later bindings (as in let).


Empty dataset.

(tc/dataset)
_unnamed [0 0]

Empty dataset with column names

(tc/dataset nil {:column-names [:a :b]})
_unnamed [0 2]:

| :a | :b |
|----|----|

Sequence of pairs (first = column name, second = value(s)).

(tc/dataset [[:A 33] [:B 5] [:C :a]])

_unnamed [1 3]:

| :A | :B | :C |
|---:|---:|----|
| 33 |  5 | :a |

Non-sequential values are repeated row-count times.

(tc/dataset [[:A [1 2 3 4 5 6]] [:B "X"] [:C :a]])

_unnamed [6 3]:

| :A | :B | :C |
|---:|----|----|
|  1 |  X | :a |
|  2 |  X | :a |
|  3 |  X | :a |
|  4 |  X | :a |
|  5 |  X | :a |
|  6 |  X | :a |

Dataset created from map (keys = column names, vals = value(s)). Works the same as sequence of pairs.

(tc/dataset {:A 33})
(tc/dataset {:A [1 2 3]})
(tc/dataset {:A [3 4 5] :B "X"})

_unnamed [1 1]:

| :A |
|---:|
| 33 |

_unnamed [3 1]:

| :A |
|---:|
|  1 |
|  2 |
|  3 |

_unnamed [3 2]:

| :A | :B |
|---:|----|
|  3 |  X |
|  4 |  X |
|  5 |  X |

You can put any value inside a column

(tc/dataset {:A [[3 4 5] [:a :b]] :B "X"})

_unnamed [2 2]:

|      :A | :B |
|---------|----|
| [3 4 5] |  X |
| [:a :b] |  X |

Sequence of maps

(tc/dataset [{:a 1 :b 3} {:b 2 :a 99}])
(tc/dataset [{:a 1 :b [1 2 3]} {:a 2 :b [3 4]}])

_unnamed [2 2]:

| :a | :b |
|---:|---:|
|  1 |  3 |
| 99 |  2 |

_unnamed [2 2]:

| :a |      :b |
|---:|---------|
|  1 | [1 2 3] |
|  2 |   [3 4] |

Missing values are marked by nil

(tc/dataset [{:a nil :b 1} {:a 3 :b 4} {:a 11}])

_unnamed [3 2]:

| :a | :b |
|---:|---:|
|    |  1 |
|  3 |  4 |
| 11 |    |

Reading from arrays, by default :as-rows

(-> (map int-array [[1 2] [3 4] [5 6]])
    (into-array)
    (tc/dataset))

:_unnamed [3 2]:

| 0 | 1 |
|--:|--:|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |

:as-columns

(-> (map int-array [[1 2] [3 4] [5 6]])
    (into-array)
    (tc/dataset {:layout :as-columns}))

:_unnamed [2 3]:

| 0 | 1 | 2 |
|--:|--:|--:|
| 1 | 3 | 5 |
| 2 | 4 | 6 |

:as-rows with names

(-> (map int-array [[1 2] [3 4] [5 6]])
    (into-array)
    (tc/dataset {:layout :as-rows
                 :column-names [:a :b]}))

:_unnamed [3 2]:

| :a | :b |
|---:|---:|
|  1 |  2 |
|  3 |  4 |
|  5 |  6 |

Any objects

(-> (map to-array [[:a :z] ["ee" "ww"] [9 10]])
    (into-array)
    (tc/dataset {:column-names [:a :b :c]
                 :layout :as-columns}))

:_unnamed [2 3]:

| :a | :b | :c |
|----|----|---:|
| :a | ee |  9 |
| :z | ww | 10 |

Create dataset using macro let-dataset to simulate R tibble function. Each binding is converted into a column.

(tc/let-dataset [x (range 1 6)
                  y 1
                  z (dfn/+ x y)])

_unnamed [5 3]:

| :x | :y | :z |
|---:|---:|---:|
|  1 |  1 |  2 |
|  2 |  1 |  3 |
|  3 |  1 |  4 |
|  4 |  1 |  5 |
|  5 |  1 |  6 |

Import CSV file

(tc/dataset "data/family.csv")

data/family.csv [5 5]:

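| family | dob_child1 | dob_child2 | gender_child1 | gender_child2 |
|-------:|------------|------------|--------------:|--------------:|
|      1 | 1998-11-26 | 2000-01-29 |             1 |             2 |
|      2 | 1996-06-22 |            |             2 |               |
|      3 | 2002-07-11 | 2004-04-05 |             2 |             2 |
|      4 | 2004-10-10 | 2009-08-27 |             1 |             1 |
|      5 | 2000-12-05 | 2005-02-28 |             2 |             1 |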

Import from URL

(defonce ds (tc/dataset "https://vega.github.io/vega-lite/examples/data/seattle-weather.csv"))
ds

https://vega.github.io/vega-lite/examples/data/seattle-weather.csv [1461 6]:

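|       date | precipitation | temp_max | temp_min | wind | weather |
|------------|--------------:|---------:|---------:|-----:|---------|
| 2012-01-01 |           0.0 |     12.8 |      5.0 |  4.7 | drizzle |
| 2012-01-02 |          10.9 |     10.6 |      2.8 |  4.5 |    rain |
| 2012-01-03 |           0.8 |     11.7 |      7.2 |  2.3 |    rain |
| 2012-01-04 |          20.3 |     12.2 |      5.6 |  4.7 |    rain |
| 2012-01-05 |           1.3 |      8.9 |      2.8 |  6.1 |    rain |
| 2012-01-06 |           2.5 |      4.4 |      2.2 |  2.2 |    rain |
| 2012-01-07 |           0.0 |      7.2 |      2.8 |  2.3 |    rain |
| 2012-01-08 |           0.0 |     10.0 |      2.8 |  2.0 |     sun |
| 2012-01-09 |           4.3 |      9.4 |      5.0 |  3.4 |    rain |
| 2012-01-10 |           1.0 |      6.1 |      0.6 |  3.4 |    rain |
|        ... |           ... |      ... |      ... |  ... |     ... |
| 2015-12-21 |          27.4 |      5.6 |      2.8 |  4.3 |    rain |
| 2015-12-22 |           4.6 |      7.8 |      2.8 |  5.0 |    rain |
| 2015-12-23 |           6.1 |      5.0 |      2.8 |  7.6 |    rain |
| 2015-12-24 |           2.5 |      5.6 |      2.2 |  4.3 |    rain |
| 2015-12-25 |           5.8 |      5.0 |      2.2 |  1.5 |    rain |
| 2015-12-26 |           0.0 |      4.4 |      0.0 |  2.5 |     sun |
| 2015-12-27 |           8.6 |      4.4 |      1.7 |  2.9 |    rain |
| 2015-12-28 |           1.5 |      5.0 |      1.7 |  1.3 |    rain |
| 2015-12-29 |           0.0 |      7.2 |      0.6 |  2.6 |     fog |
| 2015-12-30 |           0.0 |      5.6 |     -1.0 |  3.4 |     sun |
| 2015-12-31 |           0.0 |      5.6 |     -2.1 |  3.5 |     sun |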

When none of the above works, a singleton dataset is created, along with the error message from the exception thrown by tech.ml.dataset.

(tc/dataset 999)

_unnamed [1 2]:

| :$value |                                            :$error |
|--------:|----------------------------------------------------|
|     999 | Don’t know how to create ISeq from: java.lang.Long |

To see the stack trace, turn it on by setting :stack-trace? to true.
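A minimal sketch of that option, assuming it is passed in the same options map as the other tc/dataset options. The exact content of the error column is not shown here because it depends on the exception raised.

```clojure
(require '[tablecloth.api :as tc])

;; 999 cannot be interpreted as rows or columns, so a singleton dataset is built.
;; With :stack-trace? set to true the error column is expected to carry
;; the stack trace rather than only the exception message.
(def failed (tc/dataset 999 {:stack-trace? true}))

;; Inspect which columns were created.
(tc/column-names failed)
```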


Set column name for single value. Also set the dataset name and turn off creating error message column.

(tc/dataset 999 {:single-value-column-name "my-single-value"
                 :error-column? false})
(tc/dataset 999 {:single-value-column-name ""
                 :dataset-name "Single value"
                 :error-column? false})

_unnamed [1 1]:

| my-single-value |
|----------------:|
|             999 |

Single value [1 1]:

|   0 |
|----:|
| 999 |

Saving

A dataset can be exported to a file or output stream by calling tc/write!. The function accepts:

  • dataset
  • file name with one of the extensions: .csv, .tsv, .csv.gz and .tsv.gz or output stream
  • options:
    • :separator - string or separator char.
(tc/write! ds "output.tsv.gz")
(.exists (clojure.java.io/file "output.tsv.gz"))
1462
true
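The :separator option can be sketched as a round trip: write a small dataset with a semicolon and read it back with the same separator. The dataset, the temporary file and the choice of semicolon are illustrative assumptions, not part of the original examples.

```clojure
(require '[tablecloth.api :as tc])

(def small (tc/dataset {:a [1 2 3] :b ["x" "y" "z"]}))

;; Hypothetical temporary file; the .csv extension selects the output format.
(def path (str (java.io.File/createTempFile "tc-example" ".csv")))

;; Write with a semicolon instead of the default comma ...
(tc/write! small path {:separator \;})

;; ... and read it back, passing the same separator to the reader.
(def round-trip (tc/dataset path {:separator \;}))

(tc/shape round-trip)
```

If the separator were not passed when reading, the file might be parsed as a single column, so comparing shapes is a quick sanity check.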
Nippy
(tc/write! DS "output.nippy.gz")
nil
(tc/dataset "output.nippy.gz")

output.nippy.gz [9 4]:

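| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   4 | 0.5 |   A |
|   1 |   5 | 1.0 |   B |
|   2 |   6 | 1.5 |   C |
|   1 |   7 | 0.5 |   A |
|   2 |   8 | 1.0 |   B |
|   1 |   9 | 1.5 |   C |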

Dataset related functions

Summary functions about the dataset like number of rows, columns and basic stats.


Number of rows

(tc/row-count ds)
1461

Number of columns

(tc/column-count ds)
6

Shape of the dataset, [row count, column count]

(tc/shape ds)
[1461 6]

General info about dataset. There are three variants:

  • default - containing information about columns with basic statistics
  • :basic - just name, row and column count and information if dataset is a result of group-by operation
  • :columns - columns’ metadata
(tc/info ds)
(tc/info ds :basic)
(tc/info ds :columns)

https://vega.github.io/vega-lite/examples/data/seattle-weather.csv: descriptive-stats [6 12]:

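|     :col-name |          :datatype | :n-valid | :n-missing |       :min |      :mean | :mode |       :max | :standard-deviation |           :skew |     :first |      :last |
|---------------|--------------------|---------:|-----------:|------------|------------|-------|------------|--------------------:|----------------:|------------|------------|
|          date | :packed-local-date |     1461 |          0 | 2012-01-01 | 2013-12-31 |       | 2015-12-31 |      3.64520463E+10 |  1.98971418E-17 | 2012-01-01 | 2015-12-31 |
| precipitation |           :float64 |     1461 |          0 |      0.000 |      3.029 |       |      55.90 |      6.68019432E+00 |  3.50564372E+00 |      0.000 |      0.000 |
|      temp_max |           :float64 |     1461 |          0 |     -1.600 |      16.44 |       |      35.60 |      7.34975810E+00 |  2.80929992E-01 |      12.80 |      5.600 |
|      temp_min |           :float64 |     1461 |          0 |     -7.100 |      8.235 |       |      18.30 |      5.02300418E+00 | -2.49458552E-01 |      5.000 |     -2.100 |
|          wind |           :float64 |     1461 |          0 |     0.4000 |      3.241 |       |      9.500 |      1.43782506E+00 |  8.91667519E-01 |      4.700 |      3.500 |
|       weather |            :string |     1461 |          0 |            |            |  rain |            |                     |                 |    drizzle |        sun |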

https://vega.github.io/vega-lite/examples/data/seattle-weather.csv :basic info [1 4]:

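|                                                              :name | :grouped? | :rows | :columns |
|--------------------------------------------------------------------|-----------|------:|---------:|
| https://vega.github.io/vega-lite/examples/data/seattle-weather.csv |     false |  1461 |        6 |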

https://vega.github.io/vega-lite/examples/data/seattle-weather.csv :column info [6 4]:

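|         :name |          :datatype | :n-elems | :categorical? |
|---------------|--------------------|---------:|---------------|
|          date | :packed-local-date |     1461 |               |
| precipitation |           :float64 |     1461 |               |
|      temp_max |           :float64 |     1461 |               |
|      temp_min |           :float64 |     1461 |               |
|          wind |           :float64 |     1461 |               |
|       weather |            :string |     1461 |          true |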

Getting a dataset name

(tc/dataset-name ds)
"https://vega.github.io/vega-lite/examples/data/seattle-weather.csv"

Setting a dataset name (operation is immutable).

(->> "seattle-weather"
     (tc/set-dataset-name ds)
     (tc/dataset-name))
"seattle-weather"

Columns and rows

Get columns and rows as sequences. column, columns and rows treat grouped dataset as regular one. See Groups to read more about grouped datasets.

Possible result types:

  • :as-seq or :as-seqs - sequence of sequences (default)
  • :as-maps - sequence of maps (rows)
  • :as-map - map of sequences (columns)
  • :as-double-arrays - array of double arrays
  • :as-vecs - sequence of vectors (rows)

For rows setting :nil-missing? option to false will elide keys for nil values.
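The result types listed above can be tried on the small DS dataset from the introduction (redefined here so the snippet stands alone). This is only a sketch of two of the variants; vec is applied to the row purely to print it as a plain vector.

```clojure
(require '[tablecloth.api :as tc])

(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
                     :V2 (range 1 10)
                     :V3 (take 9 (cycle [0.5 1.0 1.5]))
                     :V4 (take 9 (cycle ["A" "B" "C"]))}))

;; Rows as vectors: one vector per row, values in column order.
(vec (first (tc/rows DS :as-vecs))) ;; => [1 1 0.5 "A"]

;; Columns as a map: column name -> column.
(keys (tc/columns DS :as-map)) ;; => (:V1 :V2 :V3 :V4)
```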


Select column.

(ds "wind")
(tc/column ds "date")
#tech.v3.dataset.column<float64>[1461]
wind
[4.700, 4.500, 2.300, 4.700, 6.100, 2.200, 2.300, 2.000, 3.400, 3.400, 5.100, 1.900, 1.300, 5.300, 3.200, 5.000, 5.600, 5.000, 1.600, 2.300...]
#tech.v3.dataset.column<packed-local-date>[1461]
date
[2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01-06, 2012-01-07, 2012-01-08, 2012-01-09, 2012-01-10, 2012-01-11, 2012-01-12, 2012-01-13, 2012-01-14, 2012-01-15, 2012-01-16, 2012-01-17, 2012-01-18, 2012-01-19, 2012-01-20...]

Columns as sequence

(take 2 (tc/columns ds))
(#tech.v3.dataset.column<packed-local-date>[1461]
date
[2012-01-01, 2012-01-02, 2012-01-03, 2012-01-04, 2012-01-05, 2012-01-06, 2012-01-07, 2012-01-08, 2012-01-09, 2012-01-10, 2012-01-11, 2012-01-12, 2012-01-13, 2012-01-14, 2012-01-15, 2012-01-16, 2012-01-17, 2012-01-18, 2012-01-19, 2012-01-20...] #tech.v3.dataset.column<float64>[1461]
precipitation
[0.000, 10.90, 0.8000, 20.30, 1.300, 2.500, 0.000, 0.000, 4.300, 1.000, 0.000, 0.000, 0.000, 4.100, 5.300, 2.500, 8.100, 19.80, 15.20, 13.50...])

Columns as map

(keys (tc/columns ds :as-map))
("date" "precipitation" "temp_max" "temp_min" "wind" "weather")

Rows as sequence of sequences

(take 2 (tc/rows ds))
([#object[java.time.LocalDate 0x6fdedb57 "2012-01-01"] 0.0 12.8 5.0 4.7 "drizzle"] [#object[java.time.LocalDate 0x31a1da23 "2012-01-02"] 10.9 10.6 2.8 4.5 "rain"])

Select rows/columns as double-double-array

(-> ds
    (tc/select-columns :type/numerical)
    (tc/head)
    (tc/rows :as-double-arrays))
#object["[[D" 0x620232bc "[[D@620232bc"]
(-> ds
    (tc/select-columns :type/numerical)
    (tc/head)
    (tc/columns :as-double-arrays))
#object["[[D" 0x73a86a0d "[[D@73a86a0d"]

Rows as sequence of maps

(clojure.pprint/pprint (take 2 (tc/rows ds :as-maps)))
({"date" #object[java.time.LocalDate 0x451a2065 "2012-01-01"],
  "precipitation" 0.0,
  "temp_max" 12.8,
  "temp_min" 5.0,
  "wind" 4.7,
  "weather" "drizzle"}
 {"date" #object[java.time.LocalDate 0x38ca78fb "2012-01-02"],
  "precipitation" 10.9,
  "temp_max" 10.6,
  "temp_min" 2.8,
  "wind" 4.5,
  "weather" "rain"})

Rows with missing values

(-> {:a [1 nil 2]
     :b [3 4 nil]}
    (tc/dataset)
    (tc/rows :as-maps))
[{:a 1, :b 3} {:a nil, :b 4} {:a 2, :b nil}]

Rows with elided missing values

(-> {:a [1 nil 2]
     :b [3 4 nil]}
    (tc/dataset)
    (tc/rows :as-maps {:nil-missing? false}))
[{:a 1, :b 3} {:b 4} {:a 2}]

Single entry

Get single value from the table using get-in from Clojure API or get-entry. First argument is column name, second is row number.

(get-in ds ["wind" 2])
2.3
(tc/get-entry ds "wind" 2)
2.3

Printing

A dataset is printed using the dataset->str or print-dataset functions. Options are the same as in tech.ml.dataset/dataset-data->str. The most important one is :print-line-policy, which can be one of: :single, :repl or :markdown.

(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :markdown})
_unnamed [2 3]:

| :name | :group-id |                                                                                                                                                                                                                                                             :data |
|------:|----------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|     1 |         0 | Group: 1 [5 4]:<br><br>\| :V1 \| :V2 \| :V3 \| :V4 \|<br>\|----:\|----:\|----:\|-----\|<br>\|   1 \|   1 \| 0.5 \|   A \|<br>\|   1 \|   3 \| 1.5 \|   C \|<br>\|   1 \|   5 \| 1.0 \|   B \|<br>\|   1 \|   7 \| 0.5 \|   A \|<br>\|   1 \|   9 \| 1.5 \|   C \| |
|     2 |         1 |                                   Group: 2 [4 4]:<br><br>\| :V1 \| :V2 \| :V3 \| :V4 \|<br>\|----:\|----:\|----:\|-----\|<br>\|   2 \|   2 \| 1.0 \|   B \|<br>\|   2 \|   4 \| 0.5 \|   A \|<br>\|   2 \|   6 \| 1.5 \|   C \|<br>\|   2 \|   8 \| 1.0 \|   B \| |
(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :repl})
_unnamed [2 3]:

| :name | :group-id |                          :data |
|------:|----------:|--------------------------------|
|     1 |         0 | Group: 1 [5 4]:                |
|       |           |                                |
|       |           | \| :V1 \| :V2 \| :V3 \| :V4 \| |
|       |           | \|----:\|----:\|----:\|-----\| |
|       |           | \|   1 \|   1 \| 0.5 \|   A \| |
|       |           | \|   1 \|   3 \| 1.5 \|   C \| |
|       |           | \|   1 \|   5 \| 1.0 \|   B \| |
|       |           | \|   1 \|   7 \| 0.5 \|   A \| |
|       |           | \|   1 \|   9 \| 1.5 \|   C \| |
|     2 |         1 | Group: 2 [4 4]:                |
|       |           |                                |
|       |           | \| :V1 \| :V2 \| :V3 \| :V4 \| |
|       |           | \|----:\|----:\|----:\|-----\| |
|       |           | \|   2 \|   2 \| 1.0 \|   B \| |
|       |           | \|   2 \|   4 \| 0.5 \|   A \| |
|       |           | \|   2 \|   6 \| 1.5 \|   C \| |
|       |           | \|   2 \|   8 \| 1.0 \|   B \| |
(tc/print-dataset (tc/group-by DS :V1) {:print-line-policy :single})
_unnamed [2 3]:

| :name | :group-id |           :data |
|------:|----------:|-----------------|
|     1 |         0 | Group: 1 [5 4]: |
|     2 |         1 | Group: 2 [4 4]: |
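The examples above all print directly. When the rendered table is needed as a value (for logging, say), dataset->str can be used instead. A minimal sketch, assuming the function is reachable through the tc alias like the other API functions; the small dataset is illustrative.

```clojure
(require '[tablecloth.api :as tc])

(def small (tc/dataset {:a [1 2] :b ["x" "y"]}))

;; dataset->str returns the rendered table as a string instead of printing it.
(def rendered (tc/dataset->str small))

(string? rendered) ;; => true
```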

Group-by

Grouping is an operation which splits a dataset into subdatasets and packs them into a new, special type of… dataset. I distinguish two types of dataset: regular dataset and grouped dataset. The latter is the result of grouping.

Operations that perform a transformation on a regular dataset, generally apply that same transformation to individual sub-datasets in a grouped dataset. For example,

(tc/select-rows DS [0 1 2])
_unnamed [3 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |

returns a dataset containing only the first three rows of DS, while

(-> DS
    (tc/group-by :V1)
    (tc/select-rows [0 1 2]))
_unnamed [2 3]:

| :name | :group-id |           :data |
|------:|----------:|-----------------|
|     1 |         0 | Group: 1 [3 4]: |
|     2 |         1 | Group: 2 [3 4]: |

returns a grouped dataset, in which each sub-dataset contains only the first three rows of the sub-datasets in the grouped dataset created by (tc/group-by DS :V1).

Almost all functions recognize type of the dataset (grouped or not) and operate accordingly.

However, you can’t apply reshaping or join/concat functions on grouped datasets.

Grouped dataset is annotated by the :grouped? meta tag and consists of the following columns:

  • :name - group name or structure
  • :group-id - integer assigned to the group
  • :data - groups as datasets

Grouping

Grouping is done by calling group-by function with arguments:

  • ds - dataset
  • grouping-selector - what to use for grouping
  • options:
    • :result-type - what to return:
      • :as-dataset (default) - return grouped dataset
      • :as-indexes - return rows ids (row number from original dataset)
      • :as-map - return map with group names as keys and subdataset as values
      • :as-seq - return sequence of subdatasets
    • :select-keys - list of the columns passed to a grouping selector function

Each subdataset (group) is named after its group name; additionally, the group id is stored in its metadata.

Grouping can be done by:

  • single column name
  • seq of column names
  • value returned by function taking row as map (limited to :select-keys)
  • map of keys (arbitrary group names) to sequences of row indexes

In the case of the first three of these methods, each sub-dataset contains all and only rows from the original data set that share the same grouping value:

  • the value of the row in a specified single column
  • a map from column names to corresponding values found in the row
  • the value returned by the function taking row as map

In the case of the map from group names to sequences of indexes, each sub-dataset will contain all and only rows with the indexes listed in the sequence for a given group name (a key).

Note: currently dataset inside dataset is printed recursively so it renders poorly from markdown. So I will use :as-seq result type to show just group names and groups.


List of columns in grouped dataset

(-> DS
    (tc/group-by :V1)
    (tc/column-names))
(:V1 :V2 :V3 :V4)

List of columns in grouped dataset treated as regular dataset

(-> DS
    (tc/group-by :V1)
    (tc/as-regular-dataset)
    (tc/column-names))
(:name :group-id :data)

Content of the grouped dataset

(tc/columns (tc/group-by DS :V1) :as-map)
{:name #tech.v3.dataset.column<int64>[2]
:name
[1, 2], :group-id #tech.v3.dataset.column<int64>[2]
:group-id
[0, 1], :data #tech.v3.dataset.column<dataset>[2]
:data
[Group: 1 [5 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   1 |   3 | 1.5 |   C |
|   1 |   5 | 1.0 |   B |
|   1 |   7 | 0.5 |   A |
|   1 |   9 | 1.5 |   C |
, Group: 2 [4 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   2 |   4 | 0.5 |   A |
|   2 |   6 | 1.5 |   C |
|   2 |   8 | 1.0 |   B |
]}

Grouped dataset as map

(keys (tc/group-by DS :V1 {:result-type :as-map}))
(1 2)
(vals (tc/group-by DS :V1 {:result-type :as-map}))

(Group: 1 [5 4]:

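| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   1 |   3 | 1.5 |   C |
|   1 |   5 | 1.0 |   B |
|   1 |   7 | 0.5 |   A |
|   1 |   9 | 1.5 |   C |

Group: 2 [4 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   2 |   4 | 0.5 |   A |
|   2 |   6 | 1.5 |   C |
|   2 |   8 | 1.0 |   B |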

)


Group dataset as map of indexes (row ids)

(tc/group-by DS :V1 {:result-type :as-indexes})
{1 [0 2 4 6 8], 2 [1 3 5 7]}

Grouped datasets are printed as follows by default.

(tc/group-by DS :V1)

_unnamed [2 3]:

| :name | :group-id |           :data |
|------:|----------:|-----------------|
|     1 |         0 | Group: 1 [5 4]: |
|     2 |         1 | Group: 2 [4 4]: |

Groups can be obtained from a grouped dataset as a sequence or a map using the groups->seq and groups->map functions.

Groups as seq can be obtained by just accessing :data column.

I will use temporary dataset here.

(let [ds (-> {"a" [1 1 2 2]
              "b" ["a" "b" "c" "d"]}
             (tc/dataset)
             (tc/group-by "a"))]
  (seq (ds :data))) ;; seq is not necessary but Markdown treats `:data` as command here

(Group: 1 [2 2]:

| a | b |
|--:|---|
| 1 | a |
| 1 | b |

Group: 2 [2 2]:

| a | b |
|--:|---|
| 2 | c |
| 2 | d |

)

(-> {"a" [1 1 2 2]
     "b" ["a" "b" "c" "d"]}
    (tc/dataset)
    (tc/group-by "a")
    (tc/groups->seq))

(Group: 1 [2 2]:

| a | b |
|--:|---|
| 1 | a |
| 1 | b |

Group: 2 [2 2]:

| a | b |
|--:|---|
| 2 | c |
| 2 | d |

)


Groups as map

(-> {"a" [1 1 2 2]
     "b" ["a" "b" "c" "d"]}
    (tc/dataset)
    (tc/group-by "a")
    (tc/groups->map))

{1 Group: 1 [2 2]:

| a | b |
|--:|---|
| 1 | a |
| 1 | b |

, 2 Group: 2 [2 2]:

| a | b |
|--:|---|
| 2 | c |
| 2 | d |

}


Grouping by more than one column. You can see that group names are maps. When ungrouping is done these maps are used to restore column names.

(tc/group-by DS [:V1 :V3] {:result-type :as-seq})

(Group: {:V1 1, :V3 0.5} [2 4]:

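| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |

Group: {:V1 2, :V3 1.0} [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   2 |   8 | 1.0 |   B |

Group: {:V1 1, :V3 1.5} [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   3 | 1.5 |   C |
|   1 |   9 | 1.5 |   C |

Group: {:V1 2, :V3 0.5} [1 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   4 | 0.5 |   A |

Group: {:V1 1, :V3 1.0} [1 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   5 | 1.0 |   B |

Group: {:V1 2, :V3 1.5} [1 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   6 | 1.5 |   C |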

)


Grouping can be done by providing just row indexes. This way you can assign the same row to more than one group.

(tc/group-by DS {"group-a" [1 2 1 2]
                  "group-b" [5 5 5 1]} {:result-type :as-seq})

(Group: group-a [4 4]:

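| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |

Group: group-b [4 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   6 | 1.5 |   C |
|   2 |   6 | 1.5 |   C |
|   2 |   6 | 1.5 |   C |
|   2 |   2 | 1.0 |   B |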

)


You can group by the result of a grouping function which gets a row as a map and should return the group name. When a map is used as a group name, ungrouping restores the original column names.

(tc/group-by DS (fn [row] (* (:V1 row)
                             (:V3 row))) {:result-type :as-seq})

(Group: 0.5 [2 4]:

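| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |

Group: 2.0 [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   2 |   8 | 1.0 |   B |

Group: 1.5 [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   3 | 1.5 |   C |
|   1 |   9 | 1.5 |   C |

Group: 1.0 [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   4 | 0.5 |   A |
|   1 |   5 | 1.0 |   B |

Group: 3.0 [1 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   6 | 1.5 |   C |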

)


You can use any predicate on column to split dataset into two groups.

(tc/group-by DS (comp #(< % 1.0) :V3) {:result-type :as-seq})

(Group: true [3 4]:

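| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   4 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |

Group: false [6 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   1 |   5 | 1.0 |   B |
|   2 |   6 | 1.5 |   C |
|   2 |   8 | 1.0 |   B |
|   1 |   9 | 1.5 |   C |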

)


juxt is also helpful

(tc/group-by DS (juxt :V1 :V3) {:result-type :as-seq})

(Group: [1 0.5] [2 4]:

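| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |

Group: [2 1.0] [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   2 |   8 | 1.0 |   B |

Group: [1 1.5] [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   3 | 1.5 |   C |
|   1 |   9 | 1.5 |   C |

Group: [2 0.5] [1 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   4 | 0.5 |   A |

Group: [1 1.0] [1 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   5 | 1.0 |   B |

Group: [2 1.5] [1 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   6 | 1.5 |   C |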

)


tech.ml.dataset provides an option to limit columns which are passed to grouping functions. It’s done for performance purposes.

(tc/group-by DS identity {:result-type :as-seq
                           :select-keys [:V1]})

(Group: {:V1 1} [5 4]:

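| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   1 |   3 | 1.5 |   C |
|   1 |   5 | 1.0 |   B |
|   1 |   7 | 0.5 |   A |
|   1 |   9 | 1.5 |   C |

Group: {:V1 2} [4 4]:

| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   2 |   2 | 1.0 |   B |
|   2 |   4 | 0.5 |   A |
|   2 |   6 | 1.5 |   C |
|   2 |   8 | 1.0 |   B |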

)

Ungrouping

Ungrouping simply concats all the groups into a single dataset. The following options are available:

  • :order? - order groups according to the group name ascending order. Default: false
  • :add-group-as-column - should group name become a column? If yes column is created with provided name (or :$group-name if argument is true). Default: nil.
  • :add-group-id-as-column - should group id become a column? If yes column is created with provided name (or :$group-id if argument is true). Default: nil.
  • :dataset-name - to name resulting dataset. Default: nil (_unnamed)

If the group name is a map, it will be split into separate columns. Be sure that the groups (subdatasets) don’t already contain the same columns.

If the group name is a vector, it will be split into separate columns. If you want to name them, set a vector of target column names as the :add-group-as-column argument.

After ungrouping, order of the rows is kept within the groups but groups are ordered according to the internal storage.


Grouping and ungrouping.

(-> DS
    (tc/group-by :V3)
    (tc/ungroup))

_unnamed [9 4]:

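| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   4 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   5 | 1.0 |   B |
|   2 |   8 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   6 | 1.5 |   C |
|   1 |   9 | 1.5 |   C |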

Groups sorted by group name and named.

(-> DS
    (tc/group-by :V3)
    (tc/ungroup {:order? true
                  :dataset-name "Ordered by V3"}))

Ordered by V3 [9 4]:

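| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   4 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   5 | 1.0 |   B |
|   2 |   8 | 1.0 |   B |
|   1 |   3 | 1.5 |   C |
|   2 |   6 | 1.5 |   C |
|   1 |   9 | 1.5 |   C |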

Groups sorted descending by group name and named.

(-> DS
    (tc/group-by :V3)
    (tc/ungroup {:order? :desc
                  :dataset-name "Ordered by V3 descending"}))

Ordered by V3 descending [9 4]:

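| :V1 | :V2 | :V3 | :V4 |
|----:|----:|----:|-----|
|   1 |   3 | 1.5 |   C |
|   2 |   6 | 1.5 |   C |
|   1 |   9 | 1.5 |   C |
|   2 |   2 | 1.0 |   B |
|   1 |   5 | 1.0 |   B |
|   2 |   8 | 1.0 |   B |
|   1 |   1 | 0.5 |   A |
|   2 |   4 | 0.5 |   A |
|   1 |   7 | 0.5 |   A |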

Let’s add group name and id as additional columns

(-> DS
    (tc/group-by (comp #(< % 4) :V2))
    (tc/ungroup {:add-group-as-column true
                  :add-group-id-as-column true}))

_unnamed [9 6]:

| :$group-name | :$group-id | :V1 | :V2 | :V3 | :V4 |
|--------------|-----------:|----:|----:|----:|-----|
|         true |          0 |   1 |   1 | 0.5 |   A |
|         true |          0 |   2 |   2 | 1.0 |   B |
|         true |          0 |   1 |   3 | 1.5 |   C |
|        false |          1 |   2 |   4 | 0.5 |   A |
|        false |          1 |   1 |   5 | 1.0 |   B |
|        false |          1 |   2 |   6 | 1.5 |   C |
|        false |          1 |   1 |   7 | 0.5 |   A |
|        false |          1 |   2 |   8 | 1.0 |   B |
|        false |          1 |   1 |   9 | 1.5 |   C |


Let’s assign different column names

(-> DS
    (tc/group-by (comp #(< % 4) :V2))
    (tc/ungroup {:add-group-as-column "Is V2 less than 4?"
                  :add-group-id-as-column "group id"}))

_unnamed [9 6]:

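| Is V2 less than 4? | group id | :V1 | :V2 | :V3 | :V4 |
|--------------------|---------:|----:|----:|----:|-----|
|               true |        0 |   1 |   1 | 0.5 |   A |
|               true |        0 |   2 |   2 | 1.0 |   B |
|               true |        0 |   1 |   3 | 1.5 |   C |
|              false |        1 |   2 |   4 | 0.5 |   A |
|              false |        1 |   1 |   5 | 1.0 |   B |
|              false |        1 |   2 |   6 | 1.5 |   C |
|              false |        1 |   1 |   7 | 0.5 |   A |
|              false |        1 |   2 |   8 | 1.0 |   B |
|              false |        1 |   1 |   9 | 1.5 |   C |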

If we group by map, we can automatically create new columns out of group names.

(-> DS
    (tc/group-by (fn [row] {"V1 and V3 multiplied" (* (:V1 row)
                                                      (:V3 row))
                            "V4 as lowercase" (clojure.string/lower-case (:V4 row))}))
    (tc/ungroup {:add-group-as-column true}))

_unnamed [9 6]:

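| V1 and V3 multiplied | V4 as lowercase | :V1 | :V2 | :V3 | :V4 |
|---------------------:|-----------------|----:|----:|----:|-----|
|                  0.5 |               a |   1 |   1 | 0.5 |   A |
|                  0.5 |               a |   1 |   7 | 0.5 |   A |
|                  2.0 |               b |   2 |   2 | 1.0 |   B |
|                  2.0 |               b |   2 |   8 | 1.0 |   B |
|                  1.5 |               c |   1 |   3 | 1.5 |   C |
|                  1.5 |               c |   1 |   9 | 1.5 |   C |
|                  1.0 |               a |   2 |   4 | 0.5 |   A |
|                  1.0 |               b |   1 |   5 | 1.0 |   B |
|                  3.0 |               c |   2 |   6 | 1.5 |   C |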

We can add group names without separation

(-> DS
    (tc/group-by (fn [row] {"V1 and V3 multiplied" (* (:V1 row)
                                                      (:V3 row))
                            "V4 as lowercase" (clojure.string/lower-case (:V4 row))}))
    (tc/ungroup {:add-group-as-column "just map"
                  :separate? false}))

_unnamed [9 5]:

|                                            just map | :V1 | :V2 | :V3 | :V4 |
|-----------------------------------------------------|----:|----:|----:|-----|
| {"V1 and V3 multiplied" 0.5, "V4 as lowercase" "a"} |   1 |   1 | 0.5 |   A |
| {"V1 and V3 multiplied" 0.5, "V4 as lowercase" "a"} |   1 |   7 | 0.5 |   A |
| {"V1 and V3 multiplied" 2.0, "V4 as lowercase" "b"} |   2 |   2 | 1.0 |   B |
| {"V1 and V3 multiplied" 2.0, "V4 as lowercase" "b"} |   2 |   8 | 1.0 |   B |
| {"V1 and V3 multiplied" 1.5, "V4 as lowercase" "c"} |   1 |   3 | 1.5 |   C |
| {"V1 and V3 multiplied" 1.5, "V4 as lowercase" "c"} |   1 |   9 | 1.5 |   C |
| {"V1 and V3 multiplied" 1.0, "V4 as lowercase" "a"} |   2 |   4 | 0.5 |   A |
| {"V1 and V3 multiplied" 1.0, "V4 as lowercase" "b"} |   1 |   5 | 1.0 |   B |
| {"V1 and V3 multiplied" 3.0, "V4 as lowercase" "c"} |   2 |   6 | 1.5 |   C |

The same applies to group names as sequences

(-> DS
    (tc/group-by (juxt :V1 :V3))
    (tc/ungroup {:add-group-as-column "abc"}))

_unnamed [9 6]:

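| :abc-0 | :abc-1 | :V1 | :V2 | :V3 | :V4 |
|-------:|-------:|----:|----:|----:|-----|
|      1 |    0.5 |   1 |   1 | 0.5 |   A |
|      1 |    0.5 |   1 |   7 | 0.5 |   A |
|      2 |    1.0 |   2 |   2 | 1.0 |   B |
|      2 |    1.0 |   2 |   8 | 1.0 |   B |
|      1 |    1.5 |   1 |   3 | 1.5 |   C |
|      1 |    1.5 |   1 |   9 | 1.5 |   C |
|      2 |    0.5 |   2 |   4 | 0.5 |   A |
|      1 |    1.0 |   1 |   5 | 1.0 |   B |
|      2 |    1.5 |   2 |   6 | 1.5 |   C |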

Let’s provide column names

(-> DS
    (tc/group-by (juxt :V1 :V3))
    (tc/ungroup {:add-group-as-column ["v1" "v3"]}))

_unnamed [9 6]:

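| v1 |  v3 | :V1 | :V2 | :V3 | :V4 |
|---:|----:|----:|----:|----:|-----|
|  1 | 0.5 |   1 |   1 | 0.5 |   A |
|  1 | 0.5 |   1 |   7 | 0.5 |   A |
|  2 | 1.0 |   2 |   2 | 1.0 |   B |
|  2 | 1.0 |   2 |   8 | 1.0 |   B |
|  1 | 1.5 |   1 |   3 | 1.5 |   C |
|  1 | 1.5 |   1 |   9 | 1.5 |   C |
|  2 | 0.5 |   2 |   4 | 0.5 |   A |
|  1 | 1.0 |   1 |   5 | 1.0 |   B |
|  2 | 1.5 |   2 |   6 | 1.5 |   C |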

We can also suppress separation

(-> DS
    (tc/group-by (juxt :V1 :V3))
    (tc/ungroup {:separate? false
                  :add-group-as-column true}))
;; => _unnamed [9 5]:

_unnamed [9 5]:

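| :$group-name | :V1 | :V2 | :V3 | :V4 |
|--------------|----:|----:|----:|-----|
|      [1 0.5] |   1 |   1 | 0.5 |   A |
|      [1 0.5] |   1 |   7 | 0.5 |   A |
|      [2 1.0] |   2 |   2 | 1.0 |   B |
|      [2 1.0] |   2 |   8 | 1.0 |   B |
|      [1 1.5] |   1 |   3 | 1.5 |   C |
|      [1 1.5] |   1 |   9 | 1.5 |   C |
|      [2 0.5] |   2 |   4 | 0.5 |   A |
|      [1 1.0] |   1 |   5 | 1.0 |   B |
|      [2 1.5] |   2 |   6 | 1.5 |   C |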

Other functions

To check if dataset is grouped or not just use grouped? function.

(tc/grouped? DS)
nil
(tc/grouped? (tc/group-by DS :V1))
true

If you want to remove grouping annotation (to make all the functions work as with regular dataset) you can use unmark-group or as-regular-dataset (alias) functions.

It can be important when you want to remove some groups (rows) from grouped dataset using drop-rows or something like that.

(-> DS
    (tc/group-by :V1)
    (tc/as-regular-dataset)
    (tc/grouped?))
nil
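The point about removing whole groups can be sketched as follows: unmark the grouped dataset, drop a row (which here is an entire group), and mark it as grouped again before ungrouping. Note that tc/mark-as-group is not mentioned in the text above; it is assumed here to be the counterpart of unmark-group in tablecloth.api.

```clojure
(require '[tablecloth.api :as tc])

(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
                     :V2 (range 1 10)
                     :V3 (take 9 (cycle [0.5 1.0 1.5]))
                     :V4 (take 9 (cycle ["A" "B" "C"]))}))

;; Treat the grouped dataset as a regular one, drop its first row
;; (that is, the whole first group), then restore the grouping mark.
(def without-first-group
  (-> DS
      (tc/group-by :V1)
      (tc/as-regular-dataset)
      (tc/drop-rows [0])
      (tc/mark-as-group)))

;; Only the rows of the remaining group (:V1 = 2) survive ungrouping.
(tc/row-count (tc/ungroup without-first-group)) ;; => 4
```

Without as-regular-dataset, drop-rows would instead be applied inside every group and remove the first row of each sub-dataset.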

You can also operate on grouped dataset as a regular one in case you want to access its columns using without-grouping-> threading macro.

(-> DS
    (tc/group-by [:V4 :V1])
    (tc/without-grouping->
     (tc/order-by (comp (juxt :V4 :V1) :name))))

_unnamed [6 3]:

|            :name | :group-id |                           :data |
|------------------|----------:|---------------------------------|
| {:V4 "A", :V1 1} |         0 | Group: {:V4 "A", :V1 1} [2 4]: |
| {:V4 "A", :V1 2} |         3 | Group: {:V4 "A", :V1 2} [1 4]: |
| {:V4 "B", :V1 1} |         4 | Group: {:V4 "B", :V1 1} [1 4]: |
| {:V4 "B", :V1 2} |         1 | Group: {:V4 "B", :V1 2} [2 4]: |
| {:V4 "C", :V1 1} |         2 | Group: {:V4 "C", :V1 1} [2 4]: |
| {:V4 "C", :V1 2} |         5 | Group: {:V4 "C", :V1 2} [1 4]: |

This is considered internal.

If you want to implement your own mapping function on grouped dataset you can call process-group-data and pass function operating on datasets. Result should be a dataset to have ungrouping working.

(-> DS
    (tc/group-by :V1)
    (tc/process-group-data #(str "Shape: " (vector (tc/row-count %) (tc/column-count %))))
    (tc/as-regular-dataset))

_unnamed [2 3]:

| :name | :group-id |        :data |
|------:|----------:|--------------|
|     1 |         0 | Shape: [5 4] |
|     2 |         1 | Shape: [4 4] |
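For contrast, here is a sketch where the mapped function returns a dataset, so the result can be ungrouped. Taking the first row of each group with tc/head is an arbitrary illustrative choice.

```clojure
(require '[tablecloth.api :as tc])

(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
                     :V2 (range 1 10)
                     :V3 (take 9 (cycle [0.5 1.0 1.5]))
                     :V4 (take 9 (cycle ["A" "B" "C"]))}))

;; Keep only the first row of every group; each result is still a dataset,
;; so the grouped dataset can be ungrouped afterwards.
(def first-rows
  (-> DS
      (tc/group-by :V1)
      (tc/process-group-data #(tc/head % 1))
      (tc/ungroup)))

(tc/shape first-rows) ;; => [2 4]
```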

Columns

A column is a special tech.ml.dataset structure. For our purposes we can treat a column as a typed and named sequence bound to a particular dataset.

Type of the data is inferred from a sequence during column creation.

Names

To select dataset columns or column names columns-selector is used. columns-selector can be one of the following:

  • :all keyword - selects all columns
  • column name - for single column
  • sequence of column names - for collection of columns
  • regex - to apply pattern on column names or datatype
  • filter predicate - to filter column names or datatype
  • type namespaced keyword for specific datatype or group of datatypes

Column name can be anything.

column-names function returns names according to columns-selector and optional meta-field. meta-field is one of the following:

  • :name (default) - to operate on column names
  • :datatype - to operate on column types
  • :all - if you want to process all metadata

Datatype groups are:

  • :type/numerical - any numerical type
  • :type/float - floating point number (:float32 and :float64)
  • :type/integer - any integer
  • :type/datetime - any datetime type

If qualified keyword starts with :!type, complement set is used.
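A short sketch of the datatype groups and their complement, using the DS dataset from the introduction (redefined here so the snippet stands alone). In DS, :V1 and :V2 are integer columns, :V3 is float64 and :V4 is a string column.

```clojure
(require '[tablecloth.api :as tc])

(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
                     :V2 (range 1 10)
                     :V3 (take 9 (cycle [0.5 1.0 1.5]))
                     :V4 (take 9 (cycle ["A" "B" "C"]))}))

;; Every numerical column, integer or floating point.
(tc/column-names DS :type/numerical)  ;; => (:V1 :V2 :V3)

;; The complement set: everything that is not numerical.
(tc/column-names DS :!type/numerical) ;; => (:V4)
```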


To select all column names you can use column-names function.

(tc/column-names DS)
(:V1 :V2 :V3 :V4)

or

(tc/column-names DS :all)
(:V1 :V2 :V3 :V4)

In case you want to select column which has name :all (or is sequence or map), put it into a vector. Below code returns empty sequence since there is no such column in the dataset.

(tc/column-names DS [:all])
()

Obviously, selecting a single name returns its name if available

(tc/column-names DS :V1)
(tc/column-names DS "no such column")
(:V1)
()

Select sequence of column names.

(tc/column-names DS [:V1 "V2" :V3 :V4 :V5])
(:V1 :V3 :V4)

Select names based on regex, columns ending with 1 or 4

(tc/column-names DS #".*[14]")
(:V1 :V4)

Select names based on a regex operating on the type of the column (to check the column types, call (tc/info DS :columns)). Here we want to get integer columns only.

(tc/column-names DS #"^:int.*" :datatype)
(:V1 :V2)

or

(tc/column-names DS :type/integer)
(:V1 :V2)

And finally we can use predicate to select names. Let’s select double precision columns.

(tc/column-names DS #{:float64} :datatype)
(:V3)

or

(tc/column-names DS :type/float64)
(:V3)

To select all columns except the given ones, wrap a predicate with the complement function (it works only on a predicate) or use the :!type prefix.

(tc/column-names DS (complement #{:V1}))
(:V2 :V3 :V4)

(tc/column-names DS (complement #{:float64}) :datatype)
(:V1 :V2 :V4)

(tc/column-names DS :!type/float64)
(:V1 :V2 :V4)

You can select column names based on all column metadata at once by using :all metadata selector. Below we want to select column names ending with 1 which have long datatype.

(tc/column-names DS (fn [meta]
                       (and (= :int64 (:datatype meta))
                            (clojure.string/ends-with? (:name meta) "1"))) :all)
(:V1)

Select

select-columns creates dataset with columns selected by columns-selector as described above. Function works on regular and grouped dataset.


Select only float64 columns

(tc/select-columns DS #(= :float64 %) :datatype)

_unnamed [9 1]:

:V3
0.5
1.0
1.5
0.5
1.0
1.5
0.5
1.0
1.5

or

(tc/select-columns DS :type/float64)

_unnamed [9 1]:

:V3
0.5
1.0
1.5
0.5
1.0
1.5
0.5
1.0
1.5

Select all but :V1 columns

(tc/select-columns DS (complement #{:V1}))

_unnamed [9 3]:

:V2:V3:V4
10.5A
21.0B
31.5C
40.5A
51.0B
61.5C
70.5A
81.0B
91.5C

If we have grouped data set, column selection is applied to every group separately.

(-> DS
    (tc/group-by :V1)
    (tc/select-columns [:V2 :V3])
    (tc/groups->map))

{1 Group: 1 [5 2]:

:V2:V3
10.5
31.5
51.0
70.5
91.5

, 2 Group: 2 [4 2]:

:V2:V3
21.0
40.5
61.5
81.0

}

Drop

drop-columns creates dataset with removed columns.


Drop float64 columns

(tc/drop-columns DS #(= :float64 %) :datatype)

_unnamed [9 3]:

:V1:V2:V4
11A
22B
13C
24A
15B
26C
17A
28B
19C

or

(tc/drop-columns DS :type/float64)

_unnamed [9 3]:

:V1:V2:V4
11A
22B
13C
24A
15B
26C
17A
28B
19C

Drop all columns but :V1 and :V2

(tc/drop-columns DS (complement #{:V1 :V2}))

_unnamed [9 2]:

:V1:V2
11
22
13
24
15
26
17
28
19

If we have grouped data set, column selection is applied to every group separately. Selected columns are dropped.

(-> DS
    (tc/group-by :V1)
    (tc/drop-columns [:V2 :V3])
    (tc/groups->map))

{1 Group: 1 [5 2]:

:V1:V4
1A
1C
1B
1A
1C

, 2 Group: 2 [4 2]:

:V1:V4
2B
2A
2C
2B

}

Rename

If you want to rename columns, use rename-columns and pass a map where keys are old names and values are new ones.

You can also pass mapping function with optional columns-selector

(tc/rename-columns DS {:V1 "v1"
                        :V2 "v2"
                        :V3 [1 2 3]
                        :V4 (Object.)})

_unnamed [9 4]:

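| v1 | v2 | [1 2 3] | java.lang.Object@1e88312b |
|----|----|---------|---------------------------|
|  1 |  1 |     0.5 |                         A |
|  2 |  2 |     1.0 |                         B |
|  1 |  3 |     1.5 |                         C |
|  2 |  4 |     0.5 |                         A |
|  1 |  5 |     1.0 |                         B |
|  2 |  6 |     1.5 |                         C |
|  1 |  7 |     0.5 |                         A |
|  2 |  8 |     1.0 |                         B |
|  1 |  9 |     1.5 |                         C |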

Map all names with function

(tc/rename-columns DS (comp str second name))

_unnamed [9 4]:

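| 1 | 2 |   3 | 4 |
|---|---|-----|---|
| 1 | 1 | 0.5 | A |
| 2 | 2 | 1.0 | B |
| 1 | 3 | 1.5 | C |
| 2 | 4 | 0.5 | A |
| 1 | 5 | 1.0 | B |
| 2 | 6 | 1.5 | C |
| 1 | 7 | 0.5 | A |
| 2 | 8 | 1.0 | B |
| 1 | 9 | 1.5 | C |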

Map selected names with function

(tc/rename-columns DS [:V1 :V3] (comp str second name))

_unnamed [9 4]:

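| 1 | :V2 |   3 | :V4 |
|---|-----|-----|-----|
| 1 |   1 | 0.5 |   A |
| 2 |   2 | 1.0 |   B |
| 1 |   3 | 1.5 |   C |
| 2 |   4 | 0.5 |   A |
| 1 |   5 | 1.0 |   B |
| 2 |   6 | 1.5 |   C |
| 1 |   7 | 0.5 |   A |
| 2 |   8 | 1.0 |   B |
| 1 |   9 | 1.5 |   C |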

Function works on grouped dataset

(-> DS
    (tc/group-by :V1)
    (tc/rename-columns {:V1 "v1"
                         :V2 "v2"
                         :V3 [1 2 3]
                         :V4 (Object.)})
    (tc/groups->map))

{1 Group: 1 [5 4]:

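| v1 | v2 | [1 2 3] | java.lang.Object@753f3182 |
|----|----|---------|---------------------------|
|  1 |  1 |     0.5 |                         A |
|  1 |  3 |     1.5 |                         C |
|  1 |  5 |     1.0 |                         B |
|  1 |  7 |     0.5 |                         A |
|  1 |  9 |     1.5 |                         C |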

, 2 Group: 2 [4 4]:

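| v1 | v2 | [1 2 3] | java.lang.Object@753f3182 |
|----|----|---------|---------------------------|
|  2 |  2 |     1.0 |                         B |
|  2 |  4 |     0.5 |                         A |
|  2 |  6 |     1.5 |                         C |
|  2 |  8 |     1.0 |                         B |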

}

Add or update

To add (or replace existing) column call add-column function. Function accepts:

  • ds - a dataset
  • column-name - if it’s existing column name, column will be replaced
  • column - can be column (from other dataset), sequence, single value or function. Too big columns are always trimmed. Too small are cycled or extended with missing values (according to size-strategy argument)
  • size-strategy (optional) - when new column is shorter than dataset row count, following strategies are applied:
    • :cycle - repeat data
    • :na - append missing values
    • :strict - (default) throws an exception when sizes mismatch

Function works on grouped dataset.
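The size strategies can be sketched in plain Clojure on an ordinary sequence. This is only an illustration of the behaviour described above, not tablecloth's code; `fit-column` is a hypothetical helper and `nil` stands in for a missing value.

```clojure
(defn fit-column
  "Fits `values` to `n` rows. Longer input is trimmed; shorter input is
  cycled (:cycle), padded with nil (:na) or rejected (:strict)."
  [values n strategy]
  (let [values (take n values)   ; too-long (even infinite) input is trimmed
        cnt    (count values)]
    (cond
      (= cnt n)           (vec values)
      (= :cycle strategy) (vec (take n (cycle values)))
      (= :na strategy)    (vec (concat values (repeat (- n cnt) nil)))
      :else               (throw (ex-info "Column size should match row count"
                                          {:size cnt :rows n})))))

(fit-column [:r :b] 5 :cycle) ;; => [:r :b :r :b :r]
(fit-column [:r :b] 5 :na)    ;; => [:r :b nil nil nil]
(fit-column (range) 3 :strict) ;; => [0 1 2]
```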


Add single value as column

(tc/add-column DS :V5 "X")

_unnamed [9 5]:

:V1:V2:V3:V4:V5
110.5AX
221.0BX
131.5CX
240.5AX
151.0BX
261.5CX
170.5AX
281.0BX
191.5CX

Replace one column (column is trimmed)

(tc/add-column DS :V1 (repeatedly rand))

_unnamed [9 4]:

:V1:V2:V3:V4
0.2830211710.5A
0.5563984721.0B
0.4123534031.5C
0.0451303740.5A
0.2780056951.0B
0.1852036561.5C
0.1447702070.5A
0.1633512581.0B
0.5322410591.5C

Copy column

(tc/add-column DS :V5 (DS :V1))

_unnamed [9 5]:

:V1:V2:V3:V4:V5
110.5A1
221.0B2
131.5C1
240.5A2
151.0B1
261.5C2
170.5A1
281.0B2
191.5C1

When function is used, argument is whole dataset and the result should be column, sequence or single value

(tc/add-column DS :row-count tc/row-count)

_unnamed [9 5]:

:V1:V2:V3:V4:row-count
110.5A9
221.0B9
131.5C9
240.5A9
151.0B9
261.5C9
170.5A9
281.0B9
191.5C9

The above example, run on a grouped dataset, applies the function to each group separately.

(-> DS
    (tc/group-by :V1)
    (tc/add-column :row-count tc/row-count)
    (tc/ungroup))

_unnamed [9 5]:

:V1:V2:V3:V4:row-count
110.5A5
131.5C5
151.0B5
170.5A5
191.5C5
221.0B4
240.5A4
261.5C4
281.0B4

When column which is added is longer than row count in dataset, column is trimmed. When column is shorter, it’s cycled or missing values are appended.

(tc/add-column DS :V5 [:r :b] :cycle)

_unnamed [9 5]:

:V1:V2:V3:V4:V5
110.5A:r
221.0B:b
131.5C:r
240.5A:b
151.0B:r
261.5C:b
170.5A:r
281.0B:b
191.5C:r

With the :na strategy missing values are appended instead

(tc/add-column DS :V5 [:r :b] :na)

_unnamed [9 5]:

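| :V1 | :V2 | :V3 | :V4 | :V5 |
|-----|-----|-----|-----|-----|
|   1 |   1 | 0.5 |   A |  :r |
|   2 |   2 | 1.0 |   B |  :b |
|   1 |   3 | 1.5 |   C |     |
|   2 |   4 | 0.5 |   A |     |
|   1 |   5 | 1.0 |   B |     |
|   2 |   6 | 1.5 |   C |     |
|   1 |   7 | 0.5 |   A |     |
|   2 |   8 | 1.0 |   B |     |
|   1 |   9 | 1.5 |   C |     |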

An exception is thrown when the :strict (default) strategy is used and the column size is not equal to the row count

(try
  (tc/add-column DS :V5 [:r :b])
  (catch Exception e (str "Exception caught: "(ex-message e))))
"Exception caught: Column size (2) should be exactly the same as dataset row count (9). Consider `:cycle` or `:na` strategy."

The same applies to a grouped dataset

(-> DS
    (tc/group-by :V3)
    (tc/add-column :V5 [:r :b] :na)
    (tc/ungroup))

_unnamed [9 5]:

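| :V1 | :V2 | :V3 | :V4 | :V5 |
|-----|-----|-----|-----|-----|
|   1 |   1 | 0.5 |   A |  :r |
|   2 |   4 | 0.5 |   A |  :b |
|   1 |   7 | 0.5 |   A |     |
|   2 |   2 | 1.0 |   B |  :r |
|   1 |   5 | 1.0 |   B |  :b |
|   2 |   8 | 1.0 |   B |     |
|   1 |   3 | 1.5 |   C |  :r |
|   2 |   6 | 1.5 |   C |  :b |
|   1 |   9 | 1.5 |   C |     |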

Let’s use other column to fill groups

(-> DS
    (tc/group-by :V3)
    (tc/add-column :V5 (DS :V2) :cycle)
    (tc/ungroup))

_unnamed [9 5]:

:V1:V2:V3:V4:V5
110.5A1
240.5A2
170.5A3
221.0B1
151.0B2
281.0B3
131.5C1
261.5C2
191.5C3

In case you want to add or update several columns you can call add-columns and provide map where keys are column names, vals are columns.

(tc/add-columns DS {:V1 #(map inc (% :V1))
                    :V5 #(map (comp keyword str) (% :V4))
                    :V6 11})

_unnamed [9 6]:

:V1:V2:V3:V4:V5:V6
210.5A:A11
321.0B:B11
231.5C:C11
340.5A:A11
251.0B:B11
361.5C:C11
270.5A:A11
381.0B:B11
291.5C:C11

Update

If you want to modify specific column(s) you can call update-columns. Arguments:

  • dataset
  • one of:
    • columns-selector and function (or sequence of functions)
    • map where keys are column names and vals are function

Functions accept column and have to return column or sequence


Reverse all columns

(tc/update-columns DS :all reverse)

_unnamed [9 4]:

:V1:V2:V3:V4
191.5C
281.0B
170.5A
261.5C
151.0B
240.5A
131.5C
221.0B
110.5A

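Apply dec/inc on numerical columns. The functions are assigned to the selected columns in turn: dec to :V1, inc to :V2 and dec again to :V3.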

(tc/update-columns DS :type/numerical [(partial map dec)
                                        (partial map inc)])

_unnamed [9 4]:

:V1:V2:V3:V4
02-0.5A
130.0B
040.5C
15-0.5A
060.0B
170.5C
08-0.5A
190.0B
0100.5C

You can also assign a function to a column by packing operations into the map.

(tc/update-columns DS {:V1 reverse
                        :V2 (comp shuffle seq)})

_unnamed [9 4]:

:V1:V2:V3:V4
160.5A
291.0B
131.5C
250.5A
141.0B
281.5C
170.5A
211.0B
121.5C

Map

The other way of creating or updating column is to map rows as regular map function. The arity of mapping function should be the same as number of selected columns.

Arguments:

  • ds - dataset
  • column-name - target column name
  • columns-selector - columns selected
  • map-fn - mapping function
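Conceptually this is the same as calling Clojure's `map` over several sequences at once. A minimal plain-Clojure sketch, assuming the columns are ordinary sequences (`map-cols` is a hypothetical name, not part of tablecloth):

```clojure
;; The three numerical columns of DS as plain sequences.
(def v1 (take 9 (cycle [1 2])))
(def v2 (range 1 10))
(def v3 (take 9 (cycle [0.5 1.0 1.5])))

(defn map-cols
  "Applies map-fn row by row; its arity must match the number of columns."
  [map-fn & cols]
  (apply map map-fn cols))

(def sum-of-numbers
  (map-cols (fn [& row] (reduce + row)) v1 v2 v3))

sum-of-numbers
;; => (2.5 5.0 5.5 6.5 7.0 9.5 8.5 11.0 11.5)
```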

Let’s add numerical columns together

(tc/map-columns DS
                 :sum-of-numbers
                 (tc/column-names DS  #{:int64 :float64} :datatype)
                 (fn [& rows]
                   (reduce + rows)))

_unnamed [9 5]:

:V1:V2:V3:V4:sum-of-numbers
110.5A2.5
221.0B5.0
131.5C5.5
240.5A6.5
151.0B7.0
261.5C9.5
170.5A8.5
281.0B11.0
191.5C11.5

The same works on grouped dataset

(-> DS
    (tc/group-by :V4)
    (tc/map-columns :sum-of-numbers
                     (tc/column-names DS  #{:int64 :float64} :datatype)
                     (fn [& rows]
                       (reduce + rows)))
    (tc/ungroup))

_unnamed [9 5]:

:V1:V2:V3:V4:sum-of-numbers
110.5A2.5
240.5A6.5
170.5A8.5
221.0B5.0
151.0B7.0
281.0B11.0
131.5C5.5
261.5C9.5
191.5C11.5

Reorder

To reorder columns use columns selectors to choose which columns go first. The unselected columns are appended to the end.

(tc/reorder-columns DS :V4 [:V3 :V2])

_unnamed [9 4]:

:V4:V3:V2:V1
A0.511
B1.022
C1.531
A0.542
B1.051
C1.562
A0.571
B1.082
C1.591

This function doesn’t let you select by a meta field, so you have to call column-names in such a case. Below we want to move integer columns to the end.

(tc/reorder-columns DS (tc/column-names DS (complement #{:int64}) :datatype))

_unnamed [9 4]:

:V3:V4:V1:V2
0.5A11
1.0B22
1.5C13
0.5A24
1.0B15
1.5C26
0.5A17
1.0B28
1.5C19

Type conversion

Converting a column into a given datatype can be done with the convert-types function. Not all types can be converted automatically, and some conversions require slow parsing (every conversion from string). When automatic conversion is not possible you can pass a conversion function.

Arguments:

  • ds - dataset
  • Two options:
    • coltype-map in case when you want to convert several columns, keys are column names, vals are new types
    • column-selector and new-types - column name and new datatype (or datatypes as sequence)

new-types can be:

  • a type like :int64 or :string or sequence of types
  • or sequence of pairs of datatype and conversion function

After conversion additional information is given on problematic values.

The other conversion is casting a column into a Java array (->array) of the column type or of the type provided as an argument. A grouped dataset returns a sequence of arrays.


Basic conversion

(-> DS
    (tc/convert-types :V1 :float64)
    (tc/info :columns))

_unnamed :column info [4 6]:

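| :name | :datatype | :n-elems | :unparsed-indexes | :unparsed-data | :categorical? |
|-------|-----------|----------|-------------------|----------------|---------------|
|   :V1 |  :float64 |        9 |                {} |             [] |               |
|   :V2 |    :int64 |        9 |                   |                |               |
|   :V3 |  :float64 |        9 |                   |                |               |
|   :V4 |   :string |        9 |                   |                |          true |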

Using a custom converter. Let’s treat :V4 as hexadecimal values. Note that this way we can map a column to any value.

(-> DS
    (tc/convert-types :V4 [[:int16 #(Integer/parseInt % 16)]]))

_unnamed [9 4]:

:V1:V2:V3:V4
110.510
221.011
131.512
240.510
151.011
261.512
170.510
281.011
191.512

You can process several columns at once

(-> DS
    (tc/convert-types {:V1 :float64
                        :V2 :object
                        :V3 [:boolean #(< % 1.0)]
                        :V4 :object})
    (tc/info :columns))

_unnamed :column info [4 6]:

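| :name | :datatype | :n-elems | :unparsed-indexes | :unparsed-data | :categorical? |
|-------|-----------|----------|-------------------|----------------|---------------|
|   :V1 |  :float64 |        9 |                {} |             [] |               |
|   :V2 |   :object |        9 |                {} |             [] |          true |
|   :V3 |  :boolean |        9 |                {} |             [] |               |
|   :V4 |   :object |        9 |                   |                |          true |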

Convert one type into another

(-> DS
    (tc/convert-types :type/numerical :int16)
    (tc/info :columns))

_unnamed :column info [4 6]:

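| :name | :datatype | :n-elems | :unparsed-indexes | :unparsed-data | :categorical? |
|-------|-----------|----------|-------------------|----------------|---------------|
|   :V1 |    :int16 |        9 |                {} |             [] |               |
|   :V2 |    :int16 |        9 |                {} |             [] |               |
|   :V3 |    :int16 |        9 |                {} |             [] |               |
|   :V4 |   :string |        9 |                   |                |          true |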

Function works on the grouped dataset

(-> DS
    (tc/group-by :V1)
    (tc/convert-types :V1 :float32)
    (tc/ungroup)
    (tc/info :columns))

_unnamed :column info [4 6]:

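| :name | :datatype | :n-elems | :unparsed-indexes | :unparsed-data | :categorical? |
|-------|-----------|----------|-------------------|----------------|---------------|
|   :V1 |  :float32 |        9 |                {} |             [] |               |
|   :V2 |    :int64 |        9 |                   |                |               |
|   :V3 |  :float64 |        9 |                   |                |               |
|   :V4 |   :string |        9 |                   |                |          true |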

Array conversion. The :V1 column has the :int64 datatype, so an array of longs ("[J") is returned.

(tc/->array DS :V1)
#object["[J" 0x5b89e23 "[J@5b89e23"]

Function also works on grouped dataset

(-> DS
    (tc/group-by :V3)
    (tc/->array :V2))
(#object["[J" 0x7db5ee70 "[J@7db5ee70"] #object["[J" 0x7bcd9eb8 "[J@7bcd9eb8"] #object["[J" 0x2c9c99f2 "[J@2c9c99f2"])

You can also cast the type to the other one (if casting is possible):

(tc/->array DS :V4 :string)
(tc/->array DS :V1 :float32)
#object["[Ljava.lang.String;" 0x4b448561 "[Ljava.lang.String;@4b448561"]
#object["[F" 0x7a0e5ace "[F@7a0e5ace"]

Rows

Rows can be selected or dropped using various selectors:

  • row id(s) - row index as number or sequence of numbers (first row has index 0, second 1 and so on)
  • sequence of true/false values
  • filter by predicate (argument is row as a map)

When predicate is used you may want to limit columns passed to the function (select-keys option).

Additionally you may want to precalculate some values which will be visible for predicate as additional columns. It’s done internally by calling add-columns on a dataset. :pre is used as a column definitions.
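The idea behind :pre can be sketched in plain Clojure with rows as maps: the extra values are computed once for the whole group and made visible to the predicate. This is only an illustration under that assumption; `select-rows-with-pre` and `mean` are hypothetical helpers, not tablecloth functions.

```clojure
;; DS as a sequence of row maps.
(def rows
  (map (fn [v1 v2 v3 v4] {:V1 v1 :V2 v2 :V3 v3 :V4 v4})
       (take 9 (cycle [1 2]))
       (range 1 10)
       (take 9 (cycle [0.5 1.0 1.5]))
       (take 9 (cycle ["A" "B" "C"]))))

(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn select-rows-with-pre
  "Keeps rows for which pred is true. `pre` maps a key to a function of all
  rows; its result is visible to pred as an additional entry of every row."
  [rows pre pred]
  (let [extra (into {} (map (fn [[k f]] [k (f rows)]) pre))]
    (filter #(pred (merge % extra)) rows)))

;; Rows whose :V2 is lower than or equal to the :V2 mean of their :V4 group.
(def selected
  (->> (group-by :V4 rows)
       vals
       (mapcat #(select-rows-with-pre
                 %
                 {:mean (fn [group] (mean (map :V2 group)))}
                 (fn [row] (<= (:V2 row) (:mean row)))))))

(sort (map :V2 selected))
;; => (1 2 3 4 5 6)
```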

Select

Select fifth row

(tc/select-rows DS 4)

_unnamed [1 4]:

:V1:V2:V3:V4
151.0B

Select 3 rows

(tc/select-rows DS [1 4 5])

_unnamed [3 4]:

:V1:V2:V3:V4
221.0B
151.0B
261.5C

Select rows using sequence of true/false values

(tc/select-rows DS [true nil nil true])

_unnamed [2 4]:

:V1:V2:V3:V4
110.5A
240.5A

Select rows using predicate

(tc/select-rows DS (comp #(< % 1) :V3))

_unnamed [3 4]:

:V1:V2:V3:V4
110.5A
240.5A
170.5A

The same works on grouped dataset, let’s select first row from every group.

(-> DS
    (tc/group-by :V1)
    (tc/select-rows 0)
    (tc/ungroup))

_unnamed [2 4]:

:V1:V2:V3:V4
110.5A
221.0B

If you want to select :V2 values which are lower than or equal to the mean in a grouped dataset, you have to precalculate the mean using :pre.

(-> DS
    (tc/group-by :V4)
    (tc/select-rows (fn [row] (<= (:V2 row) (:mean row)))
                     {:pre {:mean #(tech.v3.datatype.functional/mean (% :V2))}})
    (tc/ungroup))

_unnamed [6 4]:

:V1:V2:V3:V4
110.5A
240.5A
221.0B
151.0B
131.5C
261.5C

Drop

drop-rows removes rows, and accepts exactly the same parameters as select-rows


Drop rows whose :V2 value is lower than or equal to the :V2 mean of their group.

(-> DS
    (tc/group-by :V4)
    (tc/drop-rows (fn [row] (<= (:V2 row) (:mean row)))
                   {:pre {:mean #(tech.v3.datatype.functional/mean (% :V2))}})
    (tc/ungroup))

_unnamed [3 4]:

:V1:V2:V3:V4
170.5A
281.0B
191.5C

Map rows

Call a mapping function for every row. Mapping function should return a map, where keys are column names (new or old) and values are column values.

Works on grouped dataset too.

(tc/map-rows DS (fn [{:keys [V1 V2]}] {:V1 0
                                       :V5 (/ (+ V1 V2) (double V2))}))

_unnamed [9 5]:

:V1:V2:V3:V4:V5
010.5A2.00000000
021.0B2.00000000
031.5C1.33333333
040.5A1.50000000
051.0B1.20000000
061.5C1.33333333
070.5A1.14285714
081.0B1.25000000
091.5C1.11111111

Other

There are several functions to select first, last or random rows, or to display the head or tail of the dataset. All functions work on a grouped dataset.

All random functions accept :seed as an option if you want to fix returned result.


First row

(tc/first DS)

_unnamed [1 4]:

:V1:V2:V3:V4
110.5A

Last row

(tc/last DS)

_unnamed [1 4]:

:V1:V2:V3:V4
191.5C

Random row (single)

(tc/rand-nth DS)

_unnamed [1 4]:

:V1:V2:V3:V4
281.0B

Random row (single) with seed

(tc/rand-nth DS {:seed 42})

_unnamed [1 4]:

:V1:V2:V3:V4
261.5C

Random n (default: row count) rows with repetition.

(tc/random DS)

_unnamed [9 4]:

:V1:V2:V3:V4
170.5A
240.5A
261.5C
151.0B
261.5C
261.5C
131.5C
131.5C
170.5A
261.5C
281.0B
221.0B

Five random rows with repetition

(tc/random DS 5)

_unnamed [5 4]:

:V1:V2:V3:V4
110.5A
240.5A
191.5C
221.0B
170.5A
281.0B
261.5C
110.5A

Five random, non-repeating rows

(tc/random DS 5 {:repeat? false})

_unnamed [5 4]:

:V1:V2:V3:V4
131.5C
170.5A
240.5A
191.5C
221.0B

Five random, with seed

(tc/random DS 5 {:seed 42})

_unnamed [5 4]:

:V1:V2:V3:V4
261.5C
151.0B
131.5C
110.5A
191.5C

Shuffle dataset

(tc/shuffle DS)

_unnamed [9 4]:

:V1:V2:V3:V4
151.0B
131.5C
281.0B
240.5A
261.5C
221.0B
191.5C
110.5A
170.5A

Shuffle with seed

(tc/shuffle DS {:seed 42})

_unnamed [9 4]:

:V1:V2:V3:V4
151.0B
221.0B
261.5C
240.5A
281.0B
131.5C
170.5A
110.5A
191.5C

First n rows (default 5)

(tc/head DS)

_unnamed [5 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C
240.5A
151.0B

Last n rows (default 5)

(tc/tail DS)

_unnamed [5 4]:

:V1:V2:V3:V4
151.0B
261.5C
170.5A
281.0B
191.5C

by-rank calculates rank on column(s). It’s based on R rank() with the addition of the :dense (default) tie strategy, which gives consecutive rank numbering.

The :desc? option (default: true) sorts the input in descending order, so the top values get rank 0.

rank is zero based and is defined in the tablecloth.api.utils namespace.
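A zero-based dense rank can be sketched in a few lines of plain Clojure. This is an illustration of the ranking idea only, not the function from tablecloth.api.utils; `dense-rank` is a hypothetical name.

```clojure
(defn dense-rank
  "Zero-based dense rank: equal values share a rank and ranks are consecutive.
  With desc? true (default) the largest value gets rank 0."
  ([xs] (dense-rank xs true))
  ([xs desc?]
   (let [sorted  (sort (if desc? #(compare %2 %1) compare) (distinct xs))
         rank-of (zipmap sorted (range))]
     (mapv rank-of xs))))

(def v3 [0.5 1.0 1.5 0.5 1.0 1.5 0.5 1.0 1.5])

(dense-rank v3)        ;; => [2 1 0 2 1 0 2 1 0]
(dense-rank v3 false)  ;; => [0 1 2 0 1 2 0 1 2]

;; Indexes of rows with rank 0, i.e. the rows a zero? predicate would keep.
(keep-indexed (fn [i r] (when (zero? r) i)) (dense-rank v3))
;; => (2 5 8)
```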


(tc/by-rank DS :V3 zero?) ;; most V3 values

_unnamed [3 4]:

:V1:V2:V3:V4
131.5C
261.5C
191.5C

(tc/by-rank DS :V3 zero? {:desc? false}) ;; least V3 values

_unnamed [3 4]:

:V1:V2:V3:V4
110.5A
240.5A
170.5A

Rank also works on multiple columns

(tc/by-rank DS [:V1 :V3] zero? {:desc? false})

_unnamed [2 4]:

:V1:V2:V3:V4
110.5A
170.5A

Select 5 random rows from each group

(-> DS
    (tc/group-by :V4)
    (tc/random 5)
    (tc/ungroup))

_unnamed [15 4]:

:V1:V2:V3:V4
170.5A
110.5A
110.5A
240.5A
170.5A
170.5A
110.5A
151.0B
281.0B
281.0B
151.0B
221.0B
261.5C
191.5C
131.5C
191.5C
261.5C

Aggregate

Aggregation is an operation which produces a single row out of a dataset.

An aggregator is a function, a sequence of functions or a map of functions. Each function accepts a dataset as an argument and returns a single value, a sequence of values or a map.

Where map is given as an input or result, keys are treated as column names.

Grouped dataset is ungrouped after aggregation. This can be turned off by setting :ungroup? to false. In case you want to pass additional ungrouping parameters add them to the options.

By default resulting column names are prefixed with summary prefix (set it with :default-column-name-prefix option).


Let’s calculate the sum of the :V2 column

(tc/aggregate DS #(reduce + (% :V2)))

_unnamed [1 1]:

summary
45

Let’s give resulting column a name.

(tc/aggregate DS {:sum-of-V2 #(reduce + (% :V2))})

_unnamed [1 1]:

:sum-of-V2
45

Sequential result is spread into separate columns

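(tc/aggregate DS #(take 5 (% :V2)))

_unnamed [1 5]:

| :summary-0 | :summary-1 | :summary-2 | :summary-3 | :summary-4 |
|------------|------------|------------|------------|------------|
|          1 |          2 |          3 |          4 |          5 |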

You can combine all variants and rename default prefix

(tc/aggregate DS [#(take 3 (% :V2))
                   (fn [ds] {:sum-v1 (reduce + (ds :V1))
                            :prod-v3 (reduce * (ds :V3))})] {:default-column-name-prefix "V2-value"})

_unnamed [1 5]:

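| :V2-value-0-0 | :V2-value-0-1 | :V2-value-0-2 | :V2-value-1-sum-v1 | :V2-value-1-prod-v3 |
|---------------|---------------|---------------|--------------------|---------------------|
|             1 |             2 |             3 |                 13 |            0.421875 |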

Processing grouped dataset

(-> DS
    (tc/group-by [:V4])
    (tc/aggregate [#(take 3 (% :V2))
                    (fn [ds] {:sum-v1 (reduce + (ds :V1))
                             :prod-v3 (reduce * (ds :V3))})] {:default-column-name-prefix "V2-value"}))

_unnamed [3 6]:

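| :V4 | :V2-value-0-0 | :V2-value-0-1 | :V2-value-0-2 | :V2-value-1-sum-v1 | :V2-value-1-prod-v3 |
|-----|---------------|---------------|---------------|--------------------|---------------------|
|   A |             1 |             4 |             7 |                  4 |               0.125 |
|   B |             2 |             5 |             8 |                  5 |               1.000 |
|   C |             3 |             6 |             9 |                  4 |               3.375 |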

The result of aggregating is automatically ungrouped; you can skip this step by setting the :ungroup? option to false.

(-> DS
    (tc/group-by [:V3])
    (tc/aggregate [#(take 3 (% :V2))
                    (fn [ds] {:sum-v1 (reduce + (ds :V1))
                             :prod-v3 (reduce * (ds :V3))})] {:default-column-name-prefix "V2-value"
                                                              :ungroup? false}))

_unnamed [3 3]:

|     :name | :group-id |           :data |
|-----------|-----------|-----------------|
| {:V3 0.5} |         0 | _unnamed [1 5]: |
| {:V3 1.0} |         1 | _unnamed [1 5]: |
| {:V3 1.5} |         2 | _unnamed [1 5]: |

Column

You can also perform columnar aggregation. aggregate-columns selects columns and applies an aggregating function (or a sequence of functions) to each column separately.

(tc/aggregate-columns DS [:V1 :V2 :V3] #(reduce + %))

_unnamed [1 3]:

| :V1 | :V2 | :V3 |
|-----|-----|-----|
|  13 |  45 | 9.0 |

(tc/aggregate-columns DS [:V1 :V2 :V3] [#(reduce + %)
                                         #(reduce max %)
                                         #(reduce * %)])

_unnamed [1 3]:

| :V1 | :V2 |      :V3 |
|-----|-----|----------|
|  13 |   9 | 0.421875 |

(-> DS
    (tc/group-by [:V4])
    (tc/aggregate-columns [:V1 :V2 :V3] #(reduce + %)))

_unnamed [3 4]:

| :V4 | :V1 | :V2 | :V3 |
|-----|-----|-----|-----|
|   A |   4 |  12 | 1.5 |
|   B |   5 |  15 | 3.0 |
|   C |   4 |  18 | 4.5 |

You can also aggregate whole dataset

(-> DS
    (tc/drop-columns :V4)
    (tc/aggregate-columns #(reduce + %)))

_unnamed [1 3]:

| :V1 | :V2 | :V3 |
|-----|-----|-----|
|  13 |  45 | 9.0 |

Crosstab

Cross tabulation built from two sets of columns. First rows and cols are used to construct grouped dataset, then aggregation function is applied for each pair. By default it counts rows from each group.

Options are:

  • :aggregator - function which aggregates values of grouped dataset, row-count by default
  • :marginal-rows and :marginal-cols - if true, sums of rows and cols are added as an additional column and row. May be a custom function which accepts a pure row and col as a seq.
  • :replace-missing? - should missing values be replaced (default: true) with :missing-value (default: 0)
  • :pivot? - if false, flat aggregation result is returned (default: true)

(def ctds (tc/dataset {:a [:foo :foo :bar :bar :foo :foo]
                       :b [:one :one :two :one :two :one]
                       :c [:dull :dull :shiny :dull :dull :shiny]}))

#'user/ctds

ctds
_unnamed [6 3]:

|   :a |   :b |     :c |
|------|------|--------|
| :foo | :one |  :dull |
| :foo | :one |  :dull |
| :bar | :two | :shiny |
| :bar | :one |  :dull |
| :foo | :two |  :dull |
| :foo | :one | :shiny |

(tc/crosstab ctds :a [:b :c])

_unnamed [2 5]:

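| rows/cols | [:one :dull] | [:two :shiny] | [:two :dull] | [:one :shiny] |
|-----------|--------------|---------------|--------------|---------------|
|      :foo |            2 |             0 |            1 |             1 |
|      :bar |            1 |             1 |             0 |             0 |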
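
The counts in the table above can be reproduced in plain Clojure with `frequencies` over (row key, column key) pairs. This is a sketch of the counting idea only; `crosstab-counts` is a hypothetical helper, and pairs that never occur are simply absent here, whereas the cross tabulation fills them with :missing-value.

```clojure
;; The ctds dataset as a sequence of row maps.
(def ct-rows
  [{:a :foo :b :one :c :dull}
   {:a :foo :b :one :c :dull}
   {:a :bar :b :two :c :shiny}
   {:a :bar :b :one :c :dull}
   {:a :foo :b :two :c :dull}
   {:a :foo :b :one :c :shiny}])

(defn crosstab-counts
  "Counts rows for every observed pair of row key and column keys."
  [rows row-key col-keys]
  (frequencies
   (map (fn [r] [(get r row-key) (mapv r col-keys)]) rows)))

(def counts (crosstab-counts ct-rows :a [:b :c]))

(get counts [:foo [:one :dull]])     ;; => 2
(get counts [:foo [:two :shiny]] 0)  ;; => 0, pair not observed
```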

With marginals

(tc/crosstab ctds :a [:b :c] {:marginal-rows true :marginal-cols true})

_unnamed [3 6]:

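| rows/cols | [:one :dull] | [:two :shiny] | [:two :dull] | [:one :shiny] | :summary |
|-----------|--------------|---------------|--------------|---------------|----------|
|      :foo |            2 |             0 |            1 |             1 |        4 |
|      :bar |            1 |             1 |            0 |             0 |        2 |
|  :summary |            3 |             1 |            1 |             1 |        6 |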

Set missing value to -1

(tc/crosstab ctds :a [:b :c] {:missing-value -1})

_unnamed [2 5]:

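| rows/cols | [:one :dull] | [:two :shiny] | [:two :dull] | [:one :shiny] |
|-----------|--------------|---------------|--------------|---------------|
|      :foo |            2 |            -1 |            1 |             1 |
|      :bar |            1 |             1 |           -1 |            -1 |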

Turn off pivoting

(tc/crosstab ctds :a [:b :c] {:pivot? false})

_unnamed [5 3]:

| :rows |         :cols | summary |
|-------|---------------|---------|
|  :foo |  [:one :dull] |       2 |
|  :bar | [:two :shiny] |       1 |
|  :bar |  [:one :dull] |       1 |
|  :foo |  [:two :dull] |       1 |
|  :foo | [:one :shiny] |       1 |

Order

Ordering can be done by column(s) or any function operating on row. Possible order can be:

  • :asc for ascending order (default)
  • :desc for descending order
  • custom comparator

:select-keys limits row map provided to ordering functions.
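
Ordering by several keys with a separate direction per key can be sketched in plain Clojure as a comparator that walks the keys until one of them differs. This is an illustration only, with rows as maps and the hypothetical helper `order-rows`; it is not tablecloth's implementation.

```clojure
;; DS as a sequence of row maps.
(def rows
  (map (fn [v1 v2 v3 v4] {:V1 v1 :V2 v2 :V3 v3 :V4 v4})
       (take 9 (cycle [1 2]))
       (range 1 10)
       (take 9 (cycle [0.5 1.0 1.5]))
       (take 9 (cycle ["A" "B" "C"]))))

(defn order-rows
  "Sorts rows by key functions; `orders` holds :asc or :desc for each key."
  [rows key-fns orders]
  (let [cmp (fn [a b]
              (or (->> (map (fn [kf order]
                              (let [c (compare (kf a) (kf b))]
                                (if (= :desc order) (- c) c)))
                            key-fns orders)
                       (remove zero?)   ; skip keys on which rows are equal
                       first)
                  0))]
    (sort cmp rows)))

(map :V2 (order-rows rows [:V1 :V2] [:asc :desc]))
;; => (9 7 5 3 1 8 6 4 2)
```

Because keywords and ordinary functions are both callable on a row map, a key can also be a function such as `(fn [row] (* (:V1 row) (:V2 row)))`.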


Order by single column, ascending

(tc/order-by DS :V1)

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
131.5C
151.0B
170.5A
191.5C
261.5C
240.5A
281.0B
221.0B

Descending order

(tc/order-by DS :V1 :desc)

_unnamed [9 4]:

:V1:V2:V3:V4
221.0B
240.5A
261.5C
281.0B
151.0B
131.5C
170.5A
110.5A
191.5C

Order by two columns

(tc/order-by DS [:V1 :V2])

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
131.5C
151.0B
170.5A
191.5C
221.0B
240.5A
261.5C
281.0B

Use different orders for columns

(tc/order-by DS [:V1 :V2] [:asc :desc])

_unnamed [9 4]:

:V1:V2:V3:V4
191.5C
170.5A
151.0B
131.5C
110.5A
281.0B
261.5C
240.5A
221.0B
(tc/order-by DS [:V1 :V2] [:desc :desc])

_unnamed [9 4]:

:V1:V2:V3:V4
281.0B
261.5C
240.5A
221.0B
191.5C
170.5A
151.0B
131.5C
110.5A
(tc/order-by DS [:V1 :V3] [:desc :asc])

_unnamed [9 4]:

:V1:V2:V3:V4
240.5A
221.0B
281.0B
261.5C
110.5A
170.5A
151.0B
131.5C
191.5C

A custom function can be used to provide the ordering key. Here we order by :V4 descending, then by the product of the other columns ascending.

(tc/order-by DS [:V4 (fn [row] (* (:V1 row)
                                  (:V2 row)
                                  (:V3 row)))] [:desc :asc])

_unnamed [9 4]:

:V1:V2:V3:V4
131.5C
191.5C
261.5C
221.0B
151.0B
281.0B
110.5A
170.5A
240.5A

Custom comparator also can be used in case objects are not comparable by default. Let’s define artificial one: if Euclidean distance is lower than 2, compare along z else along x and y. We use first three columns for that.

(defn dist
  [v1 v2]
  (->> v2
       (map - v1)
       (map #(* % %))
       (reduce +)
       (Math/sqrt)))
#'user/dist
(tc/order-by DS [:V1 :V2 :V3] (fn [[x1 y1 z1 :as v1] [x2 y2 z2 :as v2]]
                                (let [d (dist v1 v2)]
                                  (if (< d 2.0)
                                    (compare z1 z2)
                                    (compare [x1 y1] [x2 y2])))))

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
151.0B
170.5A
191.5C
221.0B
240.5A
131.5C
261.5C
281.0B

Unique

Remove rows which contain the same data. By default unique-by removes duplicates from the whole dataset. You can also pass a list of columns or functions (similar to group-by) to remove duplicates limited by them. The default strategy is to keep the first row. More strategies are described below.

unique-by works on groups


Remove duplicates from whole dataset

(tc/unique-by DS)

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C
240.5A
151.0B
261.5C
170.5A
281.0B
191.5C

Remove duplicates based on a single column: one row is kept for each distinct :V1 value.

(tc/unique-by DS :V1)

_unnamed [2 4]:

:V1:V2:V3:V4
110.5A
221.0B

Pair of columns

(tc/unique-by DS [:V1 :V3])

_unnamed [6 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C
240.5A
151.0B
261.5C

A function can also be used. Here rows are considered duplicates when their :V2 values are equal modulo 3.

(tc/unique-by DS (fn [m] (mod (:V2 m) 3)))

_unnamed [3 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C

The same can be achieved with group-by

(-> DS
    (tc/group-by (fn [m] (mod (:V2 m) 3)))
    (tc/first)
    (tc/ungroup))

_unnamed [3 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C

Grouped dataset

(-> DS
    (tc/group-by :V4)
    (tc/unique-by :V1)
    (tc/ungroup))

_unnamed [6 4]:

:V1:V2:V3:V4
110.5A
240.5A
221.0B
151.0B
131.5C
261.5C

Strategies

There are 4 strategies defined:

  • :first - select first row (default)
  • :last - select last row
  • :random - select random row
  • any function - applied to the columns of each set of duplicated rows
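
The strategies can be pictured in plain Clojure as "group rows by key, then reduce each group to one row". This is a sketch under that assumption, with rows as maps; `unique-rows` is a hypothetical helper and the explicit `cols` argument is a simplification.

```clojure
;; DS as a sequence of row maps.
(def rows
  (map (fn [v1 v2 v3 v4] {:V1 v1 :V2 v2 :V3 v3 :V4 v4})
       (take 9 (cycle [1 2]))
       (range 1 10)
       (take 9 (cycle [0.5 1.0 1.5]))
       (take 9 (cycle ["A" "B" "C"]))))

(defn unique-rows
  "Reduces every group of rows sharing the same key to a single row.
  `cols` is used only when strategy is a function."
  [rows key-fn strategy cols]
  (for [group (vals (group-by key-fn rows))]
    (case strategy
      :first  (first group)
      :last   (last group)
      :random (rand-nth group)
      ;; any other value is assumed to be a function applied column by column
      (into {} (map (fn [c] [c (strategy (map c group))]) cols)))))

(unique-rows rows :V1 :last nil)
(unique-rows rows :V4 (partial reduce +) [:V1 :V2 :V3])
```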

Last

(tc/unique-by DS :V1 {:strategy :last})

_unnamed [2 4]:

:V1:V2:V3:V4
281.0B
191.5C

Random

(tc/unique-by DS :V1 {:strategy :random})

_unnamed [2 4]:

:V1:V2:V3:V4
240.5A
151.0B

Pack columns into vector

(tc/unique-by DS :V4 {:strategy vec})

_unnamed [3 3]:

|     :V1 |     :V2 |           :V3 |
|---------|---------|---------------|
| [1 2 1] | [1 4 7] | [0.5 0.5 0.5] |
| [2 1 2] | [2 5 8] | [1.0 1.0 1.0] |
| [1 2 1] | [3 6 9] | [1.5 1.5 1.5] |

Sum columns

(tc/unique-by DS :V4 {:strategy (partial reduce +)})

_unnamed [3 3]:

| :V1 | :V2 | :V3 |
|-----|-----|-----|
|   4 |  12 | 1.5 |
|   5 |  15 | 3.0 |
|   4 |  18 | 4.5 |

Group by function and apply functions

(tc/unique-by DS (fn [m] (mod (:V2 m) 3)) {:strategy vec})

_unnamed [3 4]:

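|     :V1 |     :V2 |           :V3 |           :V4 |
|---------|---------|---------------|---------------|
| [1 2 1] | [1 4 7] | [0.5 0.5 0.5] | ["A" "A" "A"] |
| [2 1 2] | [2 5 8] | [1.0 1.0 1.0] | ["B" "B" "B"] |
| [1 2 1] | [3 6 9] | [1.5 1.5 1.5] | ["C" "C" "C"] |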

Grouped dataset

(-> DS
    (tc/group-by :V1)
    (tc/unique-by (fn [m] (mod (:V2 m) 3)) {:strategy vec})
    (tc/ungroup {:add-group-as-column :from-V1}))

_unnamed [6 5]:

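| :from-V1 |   :V1 |   :V2 |       :V3 |       :V4 |
|----------|-------|-------|-----------|-----------|
|        1 | [1 1] | [1 7] | [0.5 0.5] | ["A" "A"] |
|        1 | [1 1] | [3 9] | [1.5 1.5] | ["C" "C"] |
|        1 |   [1] |   [5] |     [1.0] |     ["B"] |
|        2 | [2 2] | [2 8] | [1.0 1.0] | ["B" "B"] |
|        2 |   [2] |   [4] |     [0.5] |     ["A"] |
|        2 |   [2] |   [6] |     [1.5] |     ["C"] |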

Missing

When dataset contains missing values you can select or drop rows with missing values or replace them using some strategy.

column-selector can be used to limit considered columns

Let’s define dataset which contains missing values

(def DSm (tc/dataset {:V1 (take 9 (cycle [1 2 nil]))
                      :V2 (range 1 10)
                      :V3 (take 9 (cycle [0.5 1.0 nil 1.5]))
                      :V4 (take 9 (cycle ["A" "B" "C"]))}))
DSm

_unnamed [9 4]:

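| :V1 | :V2 | :V3 | :V4 |
|-----|-----|-----|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|     |   3 |     |   C |
|   1 |   4 | 1.5 |   A |
|   2 |   5 | 0.5 |   B |
|     |   6 | 1.0 |   C |
|   1 |   7 |     |   A |
|   2 |   8 | 1.5 |   B |
|     |   9 | 0.5 |   C |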

Select

Select rows with missing values

(tc/select-missing DSm)

_unnamed [4 4]:

| :V1 | :V2 | :V3 | :V4 |
|-----|-----|-----|-----|
|     |   3 |     |   C |
|     |   6 | 1.0 |   C |
|   1 |   7 |     |   A |
|     |   9 | 0.5 |   C |

Select rows with missing values in :V1

(tc/select-missing DSm :V1)

_unnamed [3 4]:

| :V1 | :V2 | :V3 | :V4 |
|-----|-----|-----|-----|
|     |   3 |     |   C |
|     |   6 | 1.0 |   C |
|     |   9 | 0.5 |   C |

The same with grouped dataset

(-> DSm
    (tc/group-by :V4)
    (tc/select-missing :V3)
    (tc/ungroup))

_unnamed [2 4]:

| :V1 | :V2 | :V3 | :V4 |
|-----|-----|-----|-----|
|   1 |   7 |     |   A |
|     |   3 |     |   C |

Drop

Drop rows with missing values

(tc/drop-missing DSm)

_unnamed [5 4]:

:V1:V2:V3:V4
110.5A
221.0B
141.5A
250.5B
281.5B

Drop rows with missing values in :V1

(tc/drop-missing DSm :V1)

_unnamed [6 4]:

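| :V1 | :V2 | :V3 | :V4 |
|-----|-----|-----|-----|
|   1 |   1 | 0.5 |   A |
|   2 |   2 | 1.0 |   B |
|   1 |   4 | 1.5 |   A |
|   2 |   5 | 0.5 |   B |
|   1 |   7 |     |   A |
|   2 |   8 | 1.5 |   B |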

The same with grouped dataset

(-> DSm
    (tc/group-by :V4)
    (tc/drop-missing :V1)
    (tc/ungroup))

_unnamed [6 4]:

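| :V1 | :V2 | :V3 | :V4 |
|-----|-----|-----|-----|
|   1 |   1 | 0.5 |   A |
|   1 |   4 | 1.5 |   A |
|   1 |   7 |     |   A |
|   2 |   2 | 1.0 |   B |
|   2 |   5 | 0.5 |   B |
|   2 |   8 | 1.5 |   B |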

Replace

Missing values can be replaced using several strategies. replace-missing accepts:

  • dataset
  • column selector, default: :all
  • strategy, default: :nearest
  • value (optional)
    • single value
    • sequence of values (cycled)
    • function, applied on column(s) with stripped missings
    • map with [index,value] pairs

Strategies are:

  • :value - replace with given value
  • :up - copy values up
  • :down - copy values down
  • :updown - copy values up and then down for missing values at the end
  • :downup - copy values down and then up for missing values at the beginning
  • :mid or :nearest - copy values around known values
  • :midpoint - use average value from previous and next non-missing
  • :lerp - tries to linearly approximate values, works for numbers and datetime, otherwise applies :nearest. For numbers it always results in a float datatype.
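
The copying strategies can be pictured on a plain sequence, with nil standing in for a missing value. This is a sketch of the behaviour listed above, not tablecloth's implementation; all function names are hypothetical.

```clojure
(defn fill-down
  "Copies the last known value down into following missing (nil) slots."
  [xs]
  (rest (reductions (fn [prev x] (if (nil? x) prev x)) nil xs)))

(defn fill-up
  "Copies the next known value up into preceding missing (nil) slots."
  [xs]
  (reverse (fill-down (reverse xs))))

(defn fill-downup [xs] (fill-up (fill-down xs)))   ; :downup
(defn fill-updown [xs] (fill-down (fill-up xs)))   ; :updown

;; The :a column of the DSm2 dataset defined below.
(def a [nil nil nil 1.0 2.0 nil nil nil nil nil 4.0 nil 11.0 nil nil])

(fill-down a)
;; => (nil nil nil 1.0 2.0 2.0 2.0 2.0 2.0 2.0 4.0 4.0 11.0 11.0 11.0)
(fill-up a)
;; => (1.0 1.0 1.0 1.0 2.0 4.0 4.0 4.0 4.0 4.0 4.0 11.0 11.0 nil nil)
```

Leading gaps survive :down and trailing gaps survive :up, which is why the combined :downup and :updown variants exist.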

Let’s define special dataset here:

(def DSm2 (tc/dataset {:a [nil nil nil 1.0 2  nil nil nil nil  nil 4   nil  11 nil nil]
                       :b [2   2   2 nil nil nil nil nil nil 13   nil   3  4  5 5]}))
DSm2

_unnamed [15 2]:

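|   :a | :b |
|------|----|
|      |  2 |
|      |  2 |
|      |  2 |
|  1.0 |    |
|  2.0 |    |
|      |    |
|      |    |
|      |    |
|      |    |
|      | 13 |
|  4.0 |    |
|      |  3 |
| 11.0 |  4 |
|      |  5 |
|      |  5 |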

Replace missing with default strategy for all columns

(tc/replace-missing DSm2)

_unnamed [15 2]:

:a:b
1.02
1.02
1.02
1.02
2.02
2.02
2.013
2.013
4.013
4.013
4.013
4.03
11.04
11.05
11.05

Replace missing with single value in whole dataset

(tc/replace-missing DSm2 :all :value 999)

_unnamed [15 2]:

:a:b
999.02
999.02
999.02
1.0999
2.0999
999.0999
999.0999
999.0999
999.0999
999.013
4.0999
999.03
11.04
999.05
999.05

Replace missing with single value in :a column

(tc/replace-missing DSm2 :a :value 999)

_unnamed [15 2]:

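|    :a | :b |
|-------|----|
| 999.0 |  2 |
| 999.0 |  2 |
| 999.0 |  2 |
|   1.0 |    |
|   2.0 |    |
| 999.0 |    |
| 999.0 |    |
| 999.0 |    |
| 999.0 |    |
| 999.0 | 13 |
|   4.0 |    |
| 999.0 |  3 |
|  11.0 |  4 |
| 999.0 |  5 |
| 999.0 |  5 |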

Replace missing with sequence in :a column

(tc/replace-missing DSm2 :a :value [-999 -998 -997])

_unnamed [15 2]:

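|     :a | :b |
|--------|----|
| -999.0 |  2 |
| -998.0 |  2 |
| -997.0 |  2 |
|    1.0 |    |
|    2.0 |    |
| -999.0 |    |
| -998.0 |    |
| -997.0 |    |
| -999.0 |    |
| -998.0 | 13 |
|    4.0 |    |
| -997.0 |  3 |
|   11.0 |  4 |
| -999.0 |  5 |
| -998.0 |  5 |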

Replace missing with a function (mean)

(tc/replace-missing DSm2 :a :value tech.v3.datatype.functional/mean)

_unnamed [15 2]:

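|   :a | :b |
|------|----|
|  4.5 |  2 |
|  4.5 |  2 |
|  4.5 |  2 |
|  1.0 |    |
|  2.0 |    |
|  4.5 |    |
|  4.5 |    |
|  4.5 |    |
|  4.5 |    |
|  4.5 | 13 |
|  4.0 |    |
|  4.5 |  3 |
| 11.0 |  4 |
|  4.5 |  5 |
|  4.5 |  5 |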

Replace some missing values with a map of index and value pairs

(tc/replace-missing DSm2 :a :value {0 100 1 -100 14 -1000})

_unnamed [15 2]:

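|      :a | :b |
|---------|----|
|   100.0 |  2 |
|  -100.0 |  2 |
|         |  2 |
|     1.0 |    |
|     2.0 |    |
|         |    |
|         |    |
|         |    |
|         |    |
|         | 13 |
|     4.0 |    |
|         |  3 |
|    11.0 |  4 |
|         |  5 |
| -1000.0 |  5 |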

The :downup strategy fills gaps with values from above. Missing values at the beginning, which have nothing above them, are then filled with the first known value.

(tc/replace-missing DSm2 [:a :b] :downup)

_unnamed [15 2]:

:a:b
1.02
1.02
1.02
1.02
2.02
2.02
2.02
2.02
2.02
2.013
4.013
4.03
11.04
11.05
11.05

To fill the leading missing values with a chosen value instead, use the :down strategy and provide a value

(tc/replace-missing DSm2 [:a :b] :down 999)

_unnamed [15 2]:

:a:b
999.02
999.02
999.02
1.02
2.02
2.02
2.02
2.02
2.02
2.013
4.013
4.03
11.04
11.05
11.05

The same applies to the :up strategy, which works in the opposite direction. Missing values at the end stay missing.

(tc/replace-missing DSm2 [:a :b] :up)

_unnamed [15 2]:

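|   :a | :b |
|------|----|
|  1.0 |  2 |
|  1.0 |  2 |
|  1.0 |  2 |
|  1.0 | 13 |
|  2.0 | 13 |
|  4.0 | 13 |
|  4.0 | 13 |
|  4.0 | 13 |
|  4.0 | 13 |
|  4.0 | 13 |
|  4.0 |  3 |
| 11.0 |  3 |
| 11.0 |  4 |
|      |  5 |
|      |  5 |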

The :updown strategy additionally fills missing values at the end by copying down.

(tc/replace-missing DSm2 [:a :b] :updown)

_unnamed [15 2]:

:a:b
1.02
1.02
1.02
1.013
2.013
4.013
4.013
4.013
4.013
4.013
4.03
11.03
11.04
11.05
11.05

The :midpoint strategy uses the average of the previous and next non-missing values.

(tc/replace-missing DSm2 [:a :b] :midpoint)

_unnamed [15 2]:

:a:b
1.02.0
1.02.0
1.02.0
1.07.5
2.07.5
3.07.5
3.07.5
3.07.5
3.07.5
3.013.0
4.08.0
7.53.0
11.04.0
11.05.0
11.05.0

We can use a function whose result fills what is still missing after applying :up or :down (here the mean, which is 4.5 for :a)

(tc/replace-missing DSm2 [:a :b] :down tech.v3.datatype.functional/mean)

_unnamed [15 2]:

:a:b
4.52
4.52
4.52
1.02
2.02
2.02
2.02
2.02
2.02
2.013
4.013
4.03
11.04
11.05
11.05

The :lerp strategy fills gaps by linear interpolation between the neighbouring values

(tc/replace-missing DSm2 [:a :b] :lerp)

_unnamed [15 2]:

:a:b
1.000000002.00000000
1.000000002.00000000
1.000000002.00000000
1.000000003.57142857
2.000000005.14285714
2.333333336.71428571
2.666666678.28571429
3.000000009.85714286
3.3333333311.42857143
3.6666666713.00000000
4.000000008.00000000
7.500000003.00000000
11.000000004.00000000
11.000000005.00000000
11.000000005.00000000
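As a rough illustration of what linear interpolation over gaps means, here is a plain Clojure sketch on a vector of doubles with nil as missing. The helper name lerp-fill is hypothetical and the edge handling (copying the nearest known value) is an assumption based on the output above; it is not the library's implementation.

```clojure
;; Hypothetical helper, not part of tablecloth.
(defn lerp-fill
  "Fill nil gaps by linear interpolation between the nearest known
  neighbours; leading and trailing gaps copy the nearest known value."
  [xs]
  (let [xs    (vec xs)
        known (keep-indexed (fn [i x] (when (some? x) i)) xs)]
    (if (empty? known)
      xs
      (mapv (fn [i]
              (let [lo (last (filter #(<= % i) known))    ; nearest known index at or before i
                    hi (first (filter #(>= % i) known))]  ; nearest known index at or after i
                (cond
                  (nil? lo) (xs hi)   ; before the first known value
                  (nil? hi) (xs lo)   ; after the last known value
                  (= lo hi) (xs i)    ; the value is present
                  :else     (let [t (/ (double (- i lo)) (- hi lo))]
                              (+ (xs lo) (* t (- (xs hi) (xs lo))))))))
            (range (count xs))))))

(lerp-fill [nil 1.0 nil 3.0 nil nil nil 11.0 nil])
;; => [1.0 1.0 2.0 3.0 5.0 7.0 9.0 11.0 11.0]
```

Each filled value lies on the straight line between its two known neighbours, so a gap of three values between 3.0 and 11.0 is filled in equal steps of 2.0.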

Lerp also works on dates

(-> (tc/dataset {:dt [(java.time.LocalDateTime/of 2020 1 1 11 22 33)
                      nil nil nil nil nil nil nil
                      (java.time.LocalDateTime/of 2020 10 1 1 1 1)]})
    (tc/replace-missing :lerp))

_unnamed [9 1]:

:dt
2020-01-01T11:22:33
2020-02-04T16:04:51.500
2020-03-09T20:47:10
2020-04-13T01:29:28.500
2020-05-17T06:11:47
2020-06-20T10:54:05.500
2020-07-24T15:36:24
2020-08-27T20:18:42.500
2020-10-01T01:01:01

Inject

When a column contains a non-continuous range of values, you can fill the gaps with the lacking values. Arguments:

  • dataset
  • column name
  • expected step (max-span, milliseconds in case of datetime column)
  • (optional) missing-strategy - how to replace missing, default :down (set to nil if none)
  • (optional) missing-value - optional value for replace missing

(-> (tc/dataset {:a [1 2 9]
                 :b [:a :b :c]})
    (tc/fill-range-replace :a 1))

_unnamed [9 2]:

:a:b
1.0:a
2.0:b
3.0:b
4.0:b
5.0:b
6.0:b
7.0:b
8.0:b
9.0:c

Join/Separate Columns

Joining and separating columns are operations which can help to tidy a messy dataset.

  • join-columns joins the content of the columns (as string concatenation or another structure) and stores it in a new column
  • separate-column splits the content of a column into a set of new columns

Join

join-columns accepts:

  • dataset
  • name of the target column
  • column selector (as in select-columns)
  • options
    • :separator (default "-")
    • :drop-columns? - whether to drop source columns or not (default true)
    • :result-type
      • :map - packs data into map
      • :seq - packs data into sequence
      • :string - join strings with separator (default)
      • or custom function which gets row as a vector
    • :missing-subst - substitution for missing value
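The string result type can be pictured with a small plain Clojure sketch that joins the values of a single row. The name join-values and its option keys are hypothetical, and the behaviour (separators are cycled, missing values are skipped unless a substitution is given) is inferred from the outputs in this section rather than taken from the library source.

```clojure
;; Hypothetical helper, not part of tablecloth; nil plays the role of a missing value.
(defn join-values
  "Join the values of one row into a string. separators are cycled;
  nil values are replaced by missing-subst, or skipped when none is given."
  ([values] (join-values values {}))
  ([values {:keys [separators missing-subst] :or {separators ["-"]}}]
   (let [vs (if (some? missing-subst)
              (map #(if (nil? %) missing-subst %) values)
              (remove nil? values))]
     ;; interleave adds a separator after every value; drop the trailing one
     (apply str (butlast (interleave vs (cycle separators)))))))

(join-values [1 1 "A"])                                             ; => "1-1-A"
(join-values [nil 3 "C"])                                           ; => "3-C"
(join-values [nil 3 "C"] {:missing-subst "NA"})                     ; => "NA-3-C"
(join-values [nil 3 "C"] {:separators ["-" "/"] :missing-subst "."}) ; => ".-3/C"
```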

Default usage. Create :joined column out of other columns.

(tc/join-columns DSm :joined [:V1 :V2 :V4])

_unnamed [9 2]:

:V3:joined
0.51-1-A
1.02-2-B
3-C
1.51-4-A
0.52-5-B
1.06-C
1-7-A
1.52-8-B
0.59-C

Without dropping source columns.

(tc/join-columns DSm :joined [:V1 :V2 :V4] {:drop-columns? false})

_unnamed [9 5]:

:V1:V2:V3:V4:joined
110.5A1-1-A
221.0B2-2-B
3 C3-C
141.5A1-4-A
250.5B2-5-B
61.0C6-C
17 A1-7-A
281.5B2-8-B
90.5C9-C

Let’s replace missing values with the “NA” string.

(tc/join-columns DSm :joined [:V1 :V2 :V4] {:missing-subst "NA"})

_unnamed [9 2]:

:V3:joined
0.51-1-A
1.02-2-B
NA-3-C
1.51-4-A
0.52-5-B
1.0NA-6-C
1-7-A
1.52-8-B
0.5NA-9-C

We can use custom separator.

(tc/join-columns DSm :joined [:V1 :V2 :V4] {:separator "/"
                                            :missing-subst "."})

_unnamed [9 2]:

:V3:joined
0.51/1/A
1.02/2/B
./3/C
1.51/4/A
0.52/5/B
1.0./6/C
1/7/A
1.52/8/B
0.5./9/C

Or even sequence of separators.

(tc/join-columns DSm :joined [:V1 :V2 :V4] {:separator ["-" "/"]
                                            :missing-subst "."})

_unnamed [9 2]:

:V3:joined
0.51-1/A
1.02-2/B
.-3/C
1.51-4/A
0.52-5/B
1.0.-6/C
1-7/A
1.52-8/B
0.5.-9/C

The other types of results, map:

(tc/join-columns DSm :joined [:V1 :V2 :V4] {:result-type :map})

_unnamed [9 2]:

:V3:joined
0.5{:V1 1, :V2 1, :V4 “A”}
1.0{:V1 2, :V2 2, :V4 “B”}
{:V1 nil, :V2 3, :V4 “C”}
1.5{:V1 1, :V2 4, :V4 “A”}
0.5{:V1 2, :V2 5, :V4 “B”}
1.0{:V1 nil, :V2 6, :V4 “C”}
{:V1 1, :V2 7, :V4 “A”}
1.5{:V1 2, :V2 8, :V4 “B”}
0.5{:V1 nil, :V2 9, :V4 “C”}

Sequence

(tc/join-columns DSm :joined [:V1 :V2 :V4] {:result-type :seq})

_unnamed [9 2]:

:V3:joined
0.5(1 1 “A”)
1.0(2 2 “B”)
(nil 3 “C”)
1.5(1 4 “A”)
0.5(2 5 “B”)
1.0(nil 6 “C”)
(1 7 “A”)
1.5(2 8 “B”)
0.5(nil 9 “C”)

Custom function, calculate hash

(tc/join-columns DSm :joined [:V1 :V2 :V4] {:result-type hash})

_unnamed [9 2]:

:V3:joined
0.5535226087
1.01128801549
-1842240303
1.52022347171
0.51884312041
1.0-1555412370
1640237355
1.5-967279152
0.51128367958

Grouped dataset

(-> DSm
    (tc/group-by :V4)
    (tc/join-columns :joined [:V1 :V2 :V4])
    (tc/ungroup))

_unnamed [9 2]:

:V3:joined
0.51-1-A
1.51-4-A
1-7-A
1.02-2-B
0.52-5-B
1.52-8-B
3-C
1.06-C
0.59-C

Tidyr examples

source

(def df (tc/dataset {:x ["a" "a" nil nil]
                      :y ["b" nil "b" nil]}))
#'user/df
df

_unnamed [4 2]:

| :x | :y |
|----|----|
| a  | b  |
| a  |    |
|    | b  |
|    |    |

(tc/join-columns df "z" [:x :y] {:drop-columns? false
                                  :missing-subst "NA"
                                  :separator "_"})

_unnamed [4 3]:

| :x | :y | z     |
|----|----|-------|
| a  | b  | a_b   |
| a  |    | a_NA  |
|    | b  | NA_b  |
|    |    | NA_NA |

(tc/join-columns df "z" [:x :y] {:drop-columns? false
                                  :separator "_"})

_unnamed [4 3]:

| :x | :y | z   |
|----|----|-----|
| a  | b  | a_b |
| a  |    | a   |
|    | b  | b   |
|    |    |     |

Separate

A column can also be separated into several other columns using a string separator, a regex or a custom function. Arguments:

  • dataset
  • source column
  • target columns - can be nil or :infer to automatically create columns
  • separator as:
    • string - it’s converted to regular expression and passed to clojure.string/split function
    • regex
    • or custom function (default: identity)
  • options
    • :drop-column? - whether to drop the source column or not (default: true). Set to :all to keep only the separation result.
    • :missing-subst - values which should be treated as missing, can be set, sequence, value or function (default: "")

A custom function (used as separator) should return, for a given value, either a sequence of values or a map from column name to value.
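The difference between a string separator and a regex can be sketched for a single value in plain Clojure. The helper separate-value is hypothetical; the rule that a string is used for splitting while a regex extracts its groups is inferred from the examples in this section, so treat it as an assumption rather than the library's exact logic.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of tablecloth; nil plays the role of a missing value.
(defn separate-value
  "Turn s into exactly n parts. A string separator is converted to a regex
  and used for splitting; a regex is matched against the whole value and its
  groups are the parts. Extra parts are dropped, lacking parts are nil."
  [s separator n]
  (let [parts (cond
                (nil? s)            []
                (string? separator) (str/split s (re-pattern separator))
                :else               (rest (re-matches separator s)))]
    (vec (take n (concat parts (repeat nil))))))

(separate-value "a.b" "\\." 2)                     ; => ["a" "b"]
(separate-value "a b c" " " 2)                     ; => ["a" "b"], extra data is dropped
(separate-value "a" " " 2)                         ; => ["a" nil]
(separate-value "a?b" #"(.).(.)" 2)                ; => ["a" "b"]
(separate-value "d-e" #"([a-d]+)-([a-d]+)" 2)      ; => [nil nil], no match
```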


Separate float into integer and fractional values

(tc/separate-column DS :V3 [:int-part :frac-part] (fn [^double v]
                                                     [(int (quot v 1.0))
                                                      (mod v 1.0)]))

_unnamed [9 5]:

:V1:V2:int-part:frac-part:V4
1100.5A
2210.0B
1310.5C
2400.5A
1510.0B
2610.5C
1700.5A
2810.0B
1910.5C

Source column can be kept

(tc/separate-column DS :V3 [:int-part :frac-part] (fn [^double v]
                                                     [(int (quot v 1.0))
                                                      (mod v 1.0)]) {:drop-column? false})

_unnamed [9 6]:

:V1:V2:V3:int-part:frac-part:V4
110.500.5A
221.010.0B
131.510.5C
240.500.5A
151.010.0B
261.510.5C
170.500.5A
281.010.0B
191.510.5C

We can treat 0 or 0.0 as missing value

(tc/separate-column DS :V3 [:int-part :frac-part] (fn [^double v]
                                                     [(int (quot v 1.0))
                                                      (mod v 1.0)]) {:missing-subst [0 0.0]})

_unnamed [9 5]:

:V1:V2:int-part:frac-part:V4
11 0.5A
221 B
1310.5C
24 0.5A
151 B
2610.5C
17 0.5A
281 B
1910.5C

Works on grouped dataset

(-> DS
    (tc/group-by :V4)
    (tc/separate-column :V3 [:int-part :fract-part] (fn [^double v]
                                                       [(int (quot v 1.0))
                                                        (mod v 1.0)]))
    (tc/ungroup))

_unnamed [9 5]:

:V1:V2:int-part:fract-part:V4
1100.5A
2400.5A
1700.5A
2210.0B
1510.0B
2810.0B
1310.5C
2610.5C
1910.5C

Separate using a separator function which returns a map for each value.

(tc/separate-column DS :V3 (fn [^double v]
                              {:int-part (int (quot v 1.0))
                               :fract-part (mod v 1.0)}))

_unnamed [9 5]:

:V1:V2:int-part:fract-part:V4
1100.5A
2210.0B
1310.5C
2400.5A
1510.0B
2610.5C
1700.5A
2810.0B
1910.5C

Keeping all columns

(tc/separate-column DS :V3 nil (fn [^double v]
                                  {:int-part (int (quot v 1.0))
                                   :fract-part (mod v 1.0)}) {:drop-column? false})

_unnamed [9 6]:

:V1:V2:V3:int-part:fract-part:V4
110.500.5A
221.010.0B
131.510.5C
240.500.5A
151.010.0B
261.510.5C
170.500.5A
281.010.0B
191.510.5C

Dropping all columns except the separated ones

(tc/separate-column DS :V3 nil (fn [^double v]
                                 {:int-part (int (quot v 1.0))
                                  :fract-part (mod v 1.0)}) {:drop-column? :all})

_unnamed [9 2]:

:int-part:fract-part
00.5
10.0
10.5
00.5
10.0
10.5
00.5
10.0
10.5

Inferring column names

(tc/separate-column DS :V3 (fn [^double v]
                             [(int (quot v 1.0)) (mod v 1.0)]))

_unnamed [9 5]:

:V1:V2:V3-0:V3-1:V4
1100.5A
2210.0B
1310.5C
2400.5A
1510.0B
2610.5C
1700.5A
2810.0B
1910.5C

Join and separate together.

(-> DSm
    (tc/join-columns :joined [:V1 :V2 :V4] {:result-type :map})
    (tc/separate-column :joined [:v1 :v2 :v4] (juxt :V1 :V2 :V4)))

_unnamed [9 4]:

:V3:v1:v2:v4
0.511A
1.022B
3C
1.514A
0.525B
1.0 6C
17A
1.528B
0.5 9C

(-> DSm
    (tc/join-columns :joined [:V1 :V2 :V4] {:result-type :seq})
    (tc/separate-column :joined [:v1 :v2 :v4] identity))

_unnamed [9 4]:

:V3:v1:v2:v4
0.511A
1.022B
3C
1.514A
0.525B
1.0 6C
17A
1.528B
0.5 9C

Tidyr examples

separate source extract source

(def df-separate (tc/dataset {:x [nil "a.b" "a.d" "b.c"]}))
(def df-separate2 (tc/dataset {:x ["a" "a b" nil "a b c"]}))
(def df-separate3 (tc/dataset {:x ["a?b" nil "a.b" "b:c"]}))
(def df-extract (tc/dataset {:x [nil "a-b" "a-d" "b-c" "d-e"]}))
#'user/df-separate
#'user/df-separate2
#'user/df-separate3
#'user/df-extract
df-separate

_unnamed [4 1]:

:x
a.b
a.d
b.c
df-separate2

_unnamed [4 1]:

:x
a
a b
a b c
df-separate3

_unnamed [4 1]:

:x
a?b
a.b
b:c
df-extract

_unnamed [5 1]:

:x
a-b
a-d
b-c
d-e

(tc/separate-column df-separate :x [:A :B] "\\.")

_unnamed [4 2]:

:A:B
ab
ad
bc

You can drop columns after separation by setting nil as a name. Here we need only the second value.

(tc/separate-column df-separate :x [nil :B] "\\.")

_unnamed [4 1]:

:B
b
d
c

Extra data is dropped

(tc/separate-column df-separate2 :x ["a" "b"] " ")

_unnamed [4 2]:

ab
a
ab
ab

Split with regular expression

(tc/separate-column df-separate3 :x ["a" "b"] "[?\\.:]")

_unnamed [4 2]:

ab
ab
ab
bc

Or just regular expression to extract values

(tc/separate-column df-separate3 :x ["a" "b"] #"(.).(.)")

_unnamed [4 2]:

ab
ab
ab
bc

Extract first value only

(tc/separate-column df-extract :x ["A"] "-")

_unnamed [5 1]:

A
a
a
b
d

Split with regex

(tc/separate-column df-extract :x ["A" "B"] #"(\p{Alnum})-(\p{Alnum})")

_unnamed [5 2]:

AB
ab
ad
bc
de

Only a,b,c,d strings

(tc/separate-column df-extract :x ["A" "B"] #"([a-d]+)-([a-d]+)")

_unnamed [5 2]:

AB
ab
ad
bc

Array column conversion

A dataset can also have columns of Java array type. We can convert an array column to normal columns like this:

(-> (tc/dataset {:x [(double-array [1 2 3])
                     (double-array [4 5 6])]
                 :y [:a :b]})
    (tc/array-column->columns :x))

_unnamed [2 4]:

| :y |   0 |   1 |   2 |
|----|----:|----:|----:|
| :a | 1.0 | 2.0 | 3.0 |
| :b | 4.0 | 5.0 | 6.0 |

and the other way around:

(-> (tc/dataset {0 [0.0 1 2]
                 1 [3.0 4 5]
                 :x [:a :b :c]})
    (tc/columns->array-column [0 1] :y))

_unnamed [3 2]:

| :x |          :y |
|----|-------------|
| :a | [D@28ce010b |
| :b | [D@2bf74fe4 |
| :c | [D@37138b9c |

Fold/Unroll Rows

To pack or unpack the data into a single value you can use the fold-by and unroll functions.

fold-by groups the dataset and packs the column data from each group separately into the desired data structure (like a vector or a sequence). unroll does the opposite.
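The packing step can be pictured on a plain sequence of row maps. The sketch below is only an illustration of the idea, assuming rows are maps sharing the same keys; fold-rows is a hypothetical name and has nothing to do with how the library stores columns.

```clojure
;; Hypothetical helper, not part of tablecloth.
(defn fold-rows
  "Group rows (maps) by the keys ks and pack the values of every other
  key with pack (vec by default)."
  ([rows ks] (fold-rows rows ks vec))
  ([rows ks pack]
   (for [[group-key group] (group-by #(select-keys % ks) rows)
         :let [other-keys (remove (set ks) (keys (first group)))]]
     (into group-key
           (for [k other-keys]
             [k (pack (map #(get % k) group))])))))

;; first four rows of DS, columns :V1, :V2 and :V4 only
(def rows [{:V1 1 :V2 1 :V4 "A"}
           {:V1 2 :V2 2 :V4 "B"}
           {:V1 1 :V2 3 :V4 "C"}
           {:V1 2 :V2 4 :V4 "A"}])

(fold-rows rows [:V4])      ; "A" gets :V1 [1 2] and :V2 [1 4]
(fold-rows rows [:V4] set)  ; pack into sets instead of vectors
```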

Fold-by

Group-by and pack columns into vector

(tc/fold-by DS [:V3 :V4 :V1])

_unnamed [6 4]:

:V3:V4:V1:V2
0.5A1[1 7]
1.0B2[2 8]
1.5C1[3 9]
0.5A2[4]
1.0B1[5]
1.5C2[6]

You can pack several columns at once.

(tc/fold-by DS [:V4])

_unnamed [3 4]:

:V4:V1:V2:V3
A[1 2 1][1 4 7][0.5 0.5 0.5]
B[2 1 2][2 5 8][1.0 1.0 1.0]
C[1 2 1][3 6 9][1.5 1.5 1.5]

You can use custom packing function

(tc/fold-by DS [:V4] seq)

_unnamed [3 4]:

:V4:V1:V2:V3
A(1 2 1)(1 4 7)(0.5 0.5 0.5)
B(2 1 2)(2 5 8)(1.0 1.0 1.0)
C(1 2 1)(3 6 9)(1.5 1.5 1.5)

or

(tc/fold-by DS [:V4] set)

_unnamed [3 4]:

:V4:V1:V2:V3
A#{1 2}#{7 1 4}#{0.5}
B#{1 2}#{2 5 8}#{1.0}
C#{1 2}#{6 3 9}#{1.5}

This works also on grouped dataset

(-> DS
    (tc/group-by :V1)
    (tc/fold-by :V4)
    (tc/ungroup))

_unnamed [6 4]:

:V4:V1:V2:V3
A[1 1][1 7][0.5 0.5]
C[1 1][3 9][1.5 1.5]
B[1][5][1.0]
B[2 2][2 8][1.0 1.0]
A[2][4][0.5]
C[2][6][1.5]

Unroll

unroll unfolds sequences stored in the data, repeating the values of the other columns when necessary. You can unroll more than one column at once (folded data should have the same size!).

Options:

  • :indexes? - if true (or a column name), the index within the unrolled sequence is added as a column.
  • :datatypes - a map from column name to the datatype which should be applied to the restored column
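Why unrolling columns together differs from unrolling them one by one can be seen in a plain Clojure sketch over row maps. unroll-rows is a hypothetical helper, not the library function, and the folded data only mirrors two rows of the fold-by output shown earlier.

```clojure
;; Hypothetical helper, not part of tablecloth.
(defn unroll-rows
  "Unfold the vectors stored under ks (equal length within a row), repeating
  all other values. When index-name is given, the position inside the vector
  is stored under that key."
  ([rows ks] (unroll-rows rows ks nil))
  ([rows ks index-name]
   (for [row rows
         i   (range (count (get row (first ks))))]
     (cond-> (into row (for [k ks] [k (nth (get row k) i)]))
       index-name (assoc index-name i)))))

(def folded [{:V4 "A" :V1 [1 2 1] :V2 [1 4 7]}
             {:V4 "B" :V1 [2 1 2] :V2 [2 5 8]}])

(unroll-rows folded [:V1 :V2])            ; 6 rows, values stay paired
(-> folded
    (unroll-rows [:V1])
    (unroll-rows [:V2]))                  ; 18 rows, a cartesian product per group
(unroll-rows folded [:V1 :V2] :indexes)   ; adds the position under :indexes
```

Unrolling :V1 first leaves the whole :V2 vector in every new row, so the second unroll multiplies the rows again.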

Unroll one column

(tc/unroll (tc/fold-by DS [:V4]) [:V1])

_unnamed [9 4]:

:V4:V2:V3:V1
A[1 4 7][0.5 0.5 0.5]1
A[1 4 7][0.5 0.5 0.5]2
A[1 4 7][0.5 0.5 0.5]1
B[2 5 8][1.0 1.0 1.0]2
B[2 5 8][1.0 1.0 1.0]1
B[2 5 8][1.0 1.0 1.0]2
C[3 6 9][1.5 1.5 1.5]1
C[3 6 9][1.5 1.5 1.5]2
C[3 6 9][1.5 1.5 1.5]1

Unroll all folded columns

(tc/unroll (tc/fold-by DS [:V4]) [:V1 :V2 :V3])

_unnamed [9 4]:

:V4:V1:V2:V3
A110.5
A240.5
A170.5
B221.0
B151.0
B281.0
C131.5
C261.5
C191.5

Unroll one by one leads to cartesian product

(-> DS
    (tc/fold-by [:V4 :V1])
    (tc/unroll [:V2])
    (tc/unroll [:V3]))

_unnamed [15 4]:

:V4:V1:V2:V3
A110.5
A110.5
A170.5
A170.5
B221.0
B221.0
B281.0
B281.0
C131.5
C131.5
C191.5
C191.5
A240.5
B151.0
C261.5

You can add indexes

(tc/unroll (tc/fold-by DS [:V1]) [:V4 :V2 :V3] {:indexes? true})

_unnamed [9 5]:

:V1:indexes:V4:V2:V3
10A10.5
11C31.5
12B51.0
13A70.5
14C91.5
20B21.0
21A40.5
22C61.5
23B81.0

(tc/unroll (tc/fold-by DS [:V1]) [:V4 :V2 :V3] {:indexes? "vector idx"})

_unnamed [9 5]:

:V1vector idx:V4:V2:V3
10A10.5
11C31.5
12B51.0
13A70.5
14C91.5
20B21.0
21A40.5
22C61.5
23B81.0

You can also force datatypes

(-> DS
    (tc/fold-by [:V1])
    (tc/unroll [:V4 :V2 :V3] {:datatypes {:V4 :string
                                           :V2 :int16
                                           :V3 :float32}})
    (tc/info :columns))

_unnamed :column info [4 4]:

:name:datatype:n-elems:categorical?
:V1:int649
:V4:string9true
:V2:int169
:V3:float329

This works also on grouped dataset

(-> DS
    (tc/group-by :V1)
    (tc/fold-by [:V1 :V4])
    (tc/unroll :V3 {:indexes? true})
    (tc/ungroup))

_unnamed [9 5]:

:V1:V4:V2:indexes:V3
1A[1 7]00.5
1A[1 7]10.5
1C[3 9]01.5
1C[3 9]11.5
1B[5]01.0
2B[2 8]01.0
2B[2 8]11.0
2A[4]00.5
2C[6]01.5

Reshape

Reshaping data provides two types of operations:

  • pivot->longer - converting columns to rows
  • pivot->wider - converting rows to columns

Both functions are inspired by the tidyr R package and provide almost the same functionality.

All examples are taken from the tidyr documentation mentioned above.

Both functions work only on regular dataset.

Longer

pivot->longer converts columns to rows. Column names are treated as data.

Arguments:

  • dataset
  • columns selector
  • options:
    • :target-columns - names of the columns created or columns pattern (see below) (default: :$column)
    • :value-column-name - name of the column for values (default: :$value)
    • :splitter - string, regular expression or function which splits source column names into data
    • :drop-missing? - remove rows with missing values (default: true)
    • :datatypes - map of target columns data types
    • :coerce-to-number - try to convert extracted values to numbers if possible (default: true)

:target-columns - can be:

  • column name - source columns names are put there as a data
  • column names as a sequence - source column names after the split are put separately into :target-columns as data
  • pattern - is a sequence of names, where some of the names are nil. nil is replaced by a name taken from splitter and such column is used for values.
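The simplest case, a single target column, can be sketched on plain row maps. pivot-longer below is a hypothetical helper written only to show the shape of the transformation; the option keys and the two sample rows (taken from the first values of the relig_income table shown below) are for illustration.

```clojure
;; Hypothetical helper, not part of tablecloth; nil plays the role of a missing value.
(defn pivot-longer
  "Turn the columns cols into rows: one output row per (column, input row)
  pair, skipping missing values. The column name goes to target-column and
  the cell to value-column."
  ([rows cols] (pivot-longer rows cols {}))
  ([rows cols {:keys [target-column value-column]
               :or   {target-column :$column value-column :$value}}]
   (for [col cols
         row rows
         :let  [v (get row col)]
         :when (some? v)]
     (-> (apply dissoc row cols)
         (assoc target-column col value-column v)))))

(def relig [{"religion" "Agnostic" "<$10k" 27 "$10-20k" 34}
            {"religion" "Atheist"  "<$10k" 12 "$10-20k" 27}])

(pivot-longer relig ["<$10k" "$10-20k"])
;; 4 rows; the first one is
;; {"religion" "Agnostic", :$column "<$10k", :$value 27}
```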

Create rows from all columns but "religion".

(def relig-income (tc/dataset "data/relig_income.csv"))
relig-income

data/relig_income.csv [18 11]:

religion<$10k$10-20k$20-30k$30-40k$40-50k$50-75k$75-100k$100-150k>150kDon’t know/refused
Agnostic27346081761371221098496
Atheist12273752357073597476
Buddhist27213034335862395354
Catholic41861773267063811169497926331489
Don’t know/refused151415111035211718116
Evangelical Prot575869106498288114869497234141529
Hindu1979113447485437
Historically Black Prot2282442362381972231318178339
Jehovah’s Witness2027242421301511637
Jewish1919252530956987151162
Mainline Prot28949561965565111079397536341328
Mormon294048515611285494269
Muslim67910923168622
Orthodox13172332324738424673
Other Christian971113131418141218
Other Faiths20334046496346404171
Other World Religions5234273448
Unaffiliated217299374365341528407321258597

(tc/pivot->longer relig-income (complement #{"religion"}))

data/relig_income.csv [180 3]:

| religion                | :$column | :$value |
|-------------------------|----------|--------:|
| Agnostic                | <$10k    |      27 |
| Atheist                 | <$10k    |      12 |
| Buddhist                | <$10k    |      27 |
| Catholic                | <$10k    |     418 |
| Don’t know/refused      | <$10k    |      15 |
| Evangelical Prot        | <$10k    |     575 |
| Hindu                   | <$10k    |       1 |
| Historically Black Prot | <$10k    |     228 |
| Jehovah’s Witness       | <$10k    |      20 |
| Jewish                  | <$10k    |      19 |
| …                       | …        |       … |
| Historically Black Prot | >150k    |      78 |
| Jehovah’s Witness       | >150k    |       6 |
| Jewish                  | >150k    |     151 |
| Mainline Prot           | >150k    |     634 |
| Mormon                  | >150k    |      42 |
| Muslim                  | >150k    |       6 |
| Orthodox                | >150k    |      46 |
| Other Christian         | >150k    |      12 |
| Other Faiths            | >150k    |      41 |
| Other World Religions   | >150k    |       4 |
| Unaffiliated            | >150k    |     258 |


Convert only columns starting with "wk" and pack them into :week column, values go to :rank column

(def bilboard (-> (tc/dataset "data/billboard.csv.gz")
                  (tc/drop-columns :type/boolean))) ;; drop some boolean columns, tidyr just skips them
(->> bilboard
     (tc/column-names)
     (take 13)
     (tc/select-columns bilboard))

data/billboard.csv.gz [317 13]:

artisttrackdate.enteredwk1wk2wk3wk4wk5wk6wk7wk8wk9wk10
2 PacBaby Don’t Cry (Keep…2000-02-2687827277879499
2Ge+herThe Hardest Part Of …2000-09-02918792
3 Doors DownKryptonite2000-04-0881706867665754535151
3 Doors DownLoser2000-10-2176767269676555596261
504 BoyzWobble Wobble2000-04-1557342517173136495357
98^0Give Me Just One Nig…2000-08-195139342626192236
A*TeensDancing Queen2000-07-0897979695100
AaliyahI Don’t Wanna2000-01-2984625141383535383836
AaliyahTry Again2000-03-1859533828211816141210
Adams, YolandaOpen My Heart2000-08-2676767469686761585759
Wallflowers, TheSleepwalker2000-10-28737374809096
WestlifeSwear It Again2000-04-0196826655554644443735
Williams, RobbieAngels1999-11-2085776969625656645453
Wills, MarkBack At One2000-01-1589555143373736394246
Worley, DarrylWhen You Need My Lov…2000-06-1798889392858584808080
Wright, ChelyIt Was2000-03-0486787572716964758598
Yankee GreyAnother Nine Minutes2000-04-298683777483798895
Yearwood, TrishaReal Live Woman2000-04-01858383828191
Ying Yang TwinsWhistle While You Tw…2000-03-1895949185847874788589
Zombie NationKernkraft 4002000-09-029999
matchbox twentyBent2000-04-2960372924222118161312

(tc/pivot->longer bilboard #(clojure.string/starts-with? % "wk") {:target-columns :week
                                                                   :value-column-name :rank})

data/billboard.csv.gz [5307 5]:

artisttrackdate.entered:week:rank
3 Doors DownKryptonite2000-04-08wk354
Braxton, ToniHe Wasn’t Man Enough2000-03-18wk3534
CreedHigher1999-09-11wk3522
CreedWith Arms Wide Open2000-05-13wk355
Hill, FaithBreathe1999-11-06wk358
JoeI Wanna Know2000-01-01wk355
LonestarAmazed1999-06-05wk3514
Vertical HorizonEverything You Want2000-01-22wk3527
matchbox twentyBent2000-04-29wk3533
CreedHigher1999-09-11wk5521
Savage GardenI Knew I Loved You1999-10-23wk2412
SisqoIncomplete2000-06-24wk2431
SisqoThong Song2000-01-29wk2417
Smash MouthThen The Morning Com…1999-10-30wk2435
Son By FourA Puro Dolor (Purest…2000-04-08wk2432
SoniqueIt Feels So Good2000-01-22wk2449
SoulDecisionFaded2000-07-08wk2450
StingDesert Rose2000-05-13wk2445
TrainMeet Virginia1999-10-09wk2442
Vertical HorizonEverything You Want2000-01-22wk246
matchbox twentyBent2000-04-29wk249

We can create numerical column out of column names

(tc/pivot->longer bilboard #(clojure.string/starts-with? % "wk") {:target-columns :week
                                                                   :value-column-name :rank
                                                                   :splitter #"wk(.*)"
                                                                   :datatypes {:week :int16}})

data/billboard.csv.gz [5307 5]:

artisttrackdate.entered:week:rank
3 Doors DownKryptonite2000-04-084621
CreedHigher1999-09-11467
CreedWith Arms Wide Open2000-05-134637
Hill, FaithBreathe1999-11-064631
LonestarAmazed1999-06-05465
3 Doors DownKryptonite2000-04-085142
CreedHigher1999-09-115114
Hill, FaithBreathe1999-11-065149
LonestarAmazed1999-06-055112
2 PacBaby Don’t Cry (Keep…2000-02-26694
matchbox twentyBent2000-04-29522
3 Doors DownKryptonite2000-04-08343
Braxton, ToniHe Wasn’t Man Enough2000-03-183433
CreedHigher1999-09-113423
CreedWith Arms Wide Open2000-05-13345
Hill, FaithBreathe1999-11-06345
JoeI Wanna Know2000-01-01348
LonestarAmazed1999-06-053417
Nelly(Hot S**t) Country G…2000-04-293449
Vertical HorizonEverything You Want2000-01-223420
matchbox twentyBent2000-04-293430

When column names contain observation data, such column names can be split and the data can be restored into separate columns.

(def who (tc/dataset "data/who.csv.gz"))
(->> who
     (tc/column-names)
     (take 10)
     (tc/select-columns who))

data/who.csv.gz [7240 10]:

countryiso2iso3yearnew_sp_m014new_sp_m1524new_sp_m2534new_sp_m3544new_sp_m4554new_sp_m5564
AfghanistanAFAFG1980
AfghanistanAFAFG1981
AfghanistanAFAFG1982
AfghanistanAFAFG1983
AfghanistanAFAFG1984
AfghanistanAFAFG1985
AfghanistanAFAFG1986
AfghanistanAFAFG1987
AfghanistanAFAFG1988
AfghanistanAFAFG1989
ZimbabweZWZWE200313387430482228981367
ZimbabweZWZWE2004187833290822981056366
ZimbabweZWZWE200521083722641855762295
ZimbabweZWZWE200621573623911939896348
ZimbabweZWZWE200713850036930716292
ZimbabweZWZWE200812761403316704263
ZimbabweZWZWE2009125578 3471681293
ZimbabweZWZWE201015071022081682761350
ZimbabweZWZWE201115278424672071780377
ZimbabweZWZWE201212078324212086796360
ZimbabweZWZWE2013

(tc/pivot->longer who #(clojure.string/starts-with? % "new") {:target-columns [:diagnosis :gender :age]
                                                               :splitter #"new_?(.*)_(.)(.*)"
                                                               :value-column-name :count})

data/who.csv.gz [76046 8]:

countryiso2iso3year:diagnosis:gender:age:count
AlbaniaALALB2013relm152460
AlgeriaDZDZA2013relm15241021
AndorraADAND2013relm15240
AngolaAOAGO2013relm15242992
AnguillaAIAIA2013relm15240
Antigua and BarbudaAGATG2013relm15241
ArgentinaARARG2013relm15241124
ArmeniaAMARM2013relm1524116
AustraliaAUAUS2013relm1524105
AustriaATAUT2013relm152444
United Arab EmiratesAEARE2013relm25349
United Kingdom of Great Britain and Northern IrelandGBGBR2013relm25341158
United States of AmericaUSUSA2013relm2534829
UruguayUYURY2013relm2534142
UzbekistanUZUZB2013relm25342371
VanuatuVUVUT2013relm25349
Venezuela (Bolivarian Republic of)VEVEN2013relm2534739
Viet NamVNVNM2013relm25346302
YemenYEYEM2013relm25341113
ZambiaZMZMB2013relm25347808
ZimbabweZWZWE2013relm25345331

When data contains multiple observations per row, we can use a splitter and a pattern for target columns to create new columns and put the values there. In the following dataset we have two observations, dob and gender, for two children. We want to put the child information into a column and leave dob and gender for values.

(def family (tc/dataset "data/family.csv"))
family

data/family.csv [5 5]:

familydob_child1dob_child2gender_child1gender_child2
11998-11-262000-01-2912
21996-06-22 2
32002-07-112004-04-0522
42004-10-102009-08-2711
52000-12-052005-02-2821

(tc/pivot->longer family (complement #{"family"}) {:target-columns [nil :child]
                                                    :splitter "_"
                                                    :datatypes {"gender" :int16}})

data/family.csv [9 4]:

family:childdobgender
1child11998-11-261
2child11996-06-222
3child12002-07-112
4child12004-10-101
5child12000-12-052
1child22000-01-292
3child22004-04-052
4child22009-08-271
5child22005-02-281
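How the [nil :child] pattern works can be sketched on plain row maps: the part of the column name at the nil position becomes the name of a value column, and the other part becomes data in :child. The helper pivot-longer-pattern is hypothetical and handles only this two-part case; dates are kept as strings for simplicity and only the first two families are used.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of tablecloth; nil plays the role of a missing value.
(defn pivot-longer-pattern
  "Split every column name in cols into [value-name group]. value-name becomes
  a new column holding the cell, group is stored under group-col. Rows are
  identified by id-col; missing cells are skipped."
  [rows id-col cols splitter group-col]
  (->> (for [row rows
             col cols
             :let  [[value-name group] (str/split col splitter)
                    v (get row col)]
             :when (some? v)]
         [[(get row id-col) group] value-name v])
       ;; collect all value columns belonging to the same (id, group) pair
       (reduce (fn [acc [k value-name v]] (assoc-in acc [k value-name] v)) {})
       (map (fn [[[id group] values]]
              (assoc values id-col id group-col group)))))

(def family-rows
  [{"family" 1 "dob_child1" "1998-11-26" "dob_child2" "2000-01-29"
    "gender_child1" 1 "gender_child2" 2}
   {"family" 2 "dob_child1" "1996-06-22" "dob_child2" nil
    "gender_child1" 2 "gender_child2" nil}])

(pivot-longer-pattern family-rows "family"
                      ["dob_child1" "dob_child2" "gender_child1" "gender_child2"]
                      #"_" :child)
;; three rows (in no particular order): family 1 has child1 and child2,
;; family 2 only child1 because the child2 cells are missing
```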

Similarly, here we have two observations, x and y, in four groups.

(def anscombe (tc/dataset "data/anscombe.csv"))
anscombe

data/anscombe.csv [11 8]:

x1x2x3x4y1y2y3y4
10101088.049.147.466.58
88886.958.146.775.76
13131387.588.7412.747.71
99988.818.777.118.84
11111188.339.267.818.47
14141489.968.108.847.04
66687.246.136.085.25
444194.263.105.3912.50
121212810.849.138.155.56
77784.827.266.427.91
55585.684.745.736.89

(tc/pivot->longer anscombe :all {:splitter #"(.)(.)"
                                  :target-columns [nil :set]})

data/anscombe.csv [44 3]:

:setxy
1108.04
186.95
1137.58
198.81
1118.33
1149.96
167.24
144.26
11210.84
174.82
486.58
485.76
487.71
488.84
488.47
487.04
485.25
41912.50
485.56
487.91
486.89

(def pnl (tc/dataset {:x [1 2 3 4]
                       :a [1 1 0 0]
                       :b [0 1 1 1]
                       :y1 (repeatedly 4 rand)
                       :y2 (repeatedly 4 rand)
                       :z1 [3 3 3 3]
                       :z2 [-2 -2 -2 -2]}))
pnl

_unnamed [4 7]:

:x:a:b:y1:y2:z1:z2
1100.610195370.807655143-2
2110.723646220.094521773-2
3010.180973360.013968823-2
4010.239583210.785436033-2

(tc/pivot->longer pnl [:y1 :y2 :z1 :z2] {:target-columns [nil :times]
                                          :splitter #":(.)(.)"})

_unnamed [8 6]:

:x:a:b:timesyz
11010.610195373
21110.723646223
30110.180973363
40110.239583213
11020.80765514-2
21120.09452177-2
30120.01396882-2
40120.78543603-2

Wider

pivot->wider converts rows to columns.

Arguments:

  • dataset
  • columns-selector - values from selected columns are converted to new columns
  • value-columns - columns whose values fill the new columns

When multiple columns are used as columns selector, names are joined using :concat-columns-with option. :concat-columns-with can be a string or function (default: "_"). Function accepts sequence of names.

When columns-selector creates non unique set of values, they are folded using :fold-fn (default: vec) option.

When value-columns is a sequence, multiple observations as columns are created appending value column names into new columns. Column names are joined using :concat-value-with option. :concat-value-with can be a string or function (default: "-"). Function accepts current column name and value.
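The basic idea, one name column and one value column, can be sketched on plain row maps. pivot-wider below is a hypothetical helper, not the library function; in particular, applying the fold function only when a cell receives more than one value is an assumption made for this sketch. The sample rows come from the contacts and warpbreaks examples below.

```clojure
;; Hypothetical helper, not part of tablecloth.
(defn pivot-wider
  "Spread the values of name-col into new columns filled from value-col.
  Rows are identified by all remaining columns; when several values land in
  the same cell they are passed to fold-fn (vec by default)."
  ([rows name-col value-col] (pivot-wider rows name-col value-col vec))
  ([rows name-col value-col fold-fn]
   (->> rows
        (group-by #(dissoc % name-col value-col))
        (map (fn [[id group]]
               (reduce-kv (fn [m col-name rs]
                            (let [vs (mapv #(get % value-col) rs)]
                              (assoc m col-name (if (= 1 (count vs))
                                                  (first vs)
                                                  (fold-fn vs)))))
                          id
                          (group-by #(get % name-col) group)))))))

(def contact-rows
  [{"field" "name"    "value" "Jiena McLellan"  "person_id" 1}
   {"field" "company" "value" "Toyota"          "person_id" 1}
   {"field" "name"    "value" "John Smith"      "person_id" 2}
   {"field" "company" "value" "google"          "person_id" 2}
   {"field" "email"   "value" "john@google.com" "person_id" 2}])

(pivot-wider contact-rows "field" "value")
;; one map per person_id, with "name", "company" and (for person 2) "email" keys

(pivot-wider [{"wool" "A" "tension" "L" "breaks" 26}
              {"wool" "A" "tension" "L" "breaks" 30}
              {"wool" "B" "tension" "L" "breaks" 27}
              {"wool" "B" "tension" "L" "breaks" 14}]
             "wool" "breaks")
;; => ({"tension" "L", "A" [26 30], "B" [27 14]})
```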


Use station as a name source for columns and seen for values

(def fish (tc/dataset "data/fish_encounters.csv"))
fish

data/fish_encounters.csv [114 3]:

fishstationseen
4842Release1
4842I80_11
4842Lisbon1
4842Rstr1
4842Base_TD1
4842BCE1
4842BCW1
4842BCE21
4842BCW21
4842MAE1
4862BCE1
4862BCW1
4862BCE21
4862BCW21
4863Release1
4863I80_11
4864Release1
4864I80_11
4865Release1
4865I80_11
4865Lisbon1

(tc/pivot->wider fish "station" "seen" {:drop-missing? false})

data/fish_encounters.csv [19 12]:

fishReleaseI80_1LisbonRstrBase_TDBCEBCWBCE2BCW2MAEMAW
484211111111111
484311111111111
484411111111111
485811111111111
486111111111111
4857111111111
4862111111111
485011 1111
484511111
485511111
485911111
48481111
4847111
4865111
484911
485111
485411
486311
486411

If selected columns contain multiple values, such values should be folded.

(def warpbreaks (tc/dataset "data/warpbreaks.csv"))
warpbreaks

data/warpbreaks.csv [54 3]:

breakswooltension
26AL
30AL
54AL
25AL
70AL
52AL
51AL
26AL
67AL
18AM
39BM
29BM
20BH
21BH
24BH
17BH
13BH
15BH
15BH
16BH
28BH

Let’s see how many values there are for each wool and tension group

(-> warpbreaks
    (tc/group-by ["wool" "tension"])
    (tc/aggregate {:n tc/row-count}))

_unnamed [6 3]:

wooltension:n
AL9
AM9
AH9
BL9
BM9
BH9

(-> warpbreaks
    (tc/reorder-columns ["wool" "tension" "breaks"])
    (tc/pivot->wider "wool" "breaks" {:fold-fn vec}))

data/warpbreaks.csv [3 3]:

tensionAB
L[26 30 54 25 70 52 51 26 67][27 14 29 19 29 31 41 20 44]
M[18 21 29 17 12 18 35 30 36][42 26 19 16 39 28 21 39 29]
H[36 21 24 18 10 43 28 15 26][20 21 24 17 13 15 15 16 28]

We can also calculate the mean (aggregate the values)

(-> warpbreaks
    (tc/reorder-columns ["wool" "tension" "breaks"])
    (tc/pivot->wider "wool" "breaks" {:fold-fn tech.v3.datatype.functional/mean}))

data/warpbreaks.csv [3 3]:

tensionAB
L44.5555555628.22222222
M24.0000000028.77777778
H24.5555555618.77777778

Multiple source columns, joined with default separator.

(def production (tc/dataset "data/production.csv"))
production

data/production.csv [45 4]:

productcountryyearproduction
AAI20001.63727158
AAI20010.15870784
AAI2002-1.56797745
AAI2003-0.44455509
AAI2004-0.07133701
AAI20051.61183090
AAI2006-0.70434682
AAI2007-1.53550542
AAI20080.83907155
AAI2009-0.37424110
BEI20040.62564999
BEI2005-1.34530299
BEI2006-0.97184975
BEI2007-1.69715821
BEI20080.04556128
BEI20091.19315043
BEI2010-1.60557503
BEI2011-0.77235497
BEI2012-2.50262738
BEI2013-1.62753769
BEI20140.03329645

(tc/pivot->wider production ["product" "country"] "production")

data/production.csv [15 4]:

yearA_AIB_AIB_EI
20001.63727158-0.026176611.40470848
20010.15870784-0.68863576-0.59618369
2002-1.567977450.06248741-0.26568579
2003-0.44455509-0.723396860.65257808
2004-0.071337010.472489520.62564999
20051.61183090-0.94173861-1.34530299
2006-0.70434682-0.34782108-0.97184975
2007-1.535505420.52425284-1.69715821
20080.839071551.832309370.04556128
2009-0.374241100.107064911.19315043
2010-0.71158926-0.32903664-1.60557503
20111.12805634-1.78319121-0.77235497
20121.457182470.61125798-2.50262738
2013-1.55934101-0.78526092-1.62753769
2014-0.116958380.978436350.03329645

Joined with custom function

(tc/pivot->wider production ["product" "country"] "production" {:concat-columns-with vec})

data/production.csv [15 4]:

year[“A” “AI”][“B” “AI”][“B” “EI”]
20001.63727158-0.026176611.40470848
20010.15870784-0.68863576-0.59618369
2002-1.567977450.06248741-0.26568579
2003-0.44455509-0.723396860.65257808
2004-0.071337010.472489520.62564999
20051.61183090-0.94173861-1.34530299
2006-0.70434682-0.34782108-0.97184975
2007-1.535505420.52425284-1.69715821
20080.839071551.832309370.04556128
2009-0.374241100.107064911.19315043
2010-0.71158926-0.32903664-1.60557503
20111.12805634-1.78319121-0.77235497
20121.457182470.61125798-2.50262738
2013-1.55934101-0.78526092-1.62753769
2014-0.116958380.978436350.03329645

Multiple value columns

(def income (tc/dataset "data/us_rent_income.csv"))
income

data/us_rent_income.csv [104 5]:

GEOIDNAMEvariableestimatemoe
1Alabamaincome24476136
1Alabamarent7473
2Alaskaincome32940508
2Alaskarent120013
4Arizonaincome27517148
4Arizonarent9724
5Arkansasincome23789165
5Arkansasrent7095
6Californiaincome29454109
6Californiarent13583
51Virginiarent11665
53Washingtonincome32318113
53Washingtonrent11204
54West Virginiaincome23707203
54West Virginiarent6816
55Wisconsinincome29868135
55Wisconsinrent8133
56Wyomingincome30854342
56Wyomingrent82811
72Puerto Ricoincome
72Puerto Ricorent4646

(tc/pivot->wider income "variable" ["estimate" "moe"] {:drop-missing? false})

data/us_rent_income.csv [52 6]:

GEOIDNAMEincome-estimateincome-moerent-estimaterent-moe
1Alabama244761367473
2Alaska32940508120013
4Arizona275171489724
5Arkansas237891657095
6California2945410913583
8Colorado3240110911255
9Connecticut3532619511235
10Delaware31560247107610
11District of Columbia43198681142417
12Florida259527010773
46South Dakota288212766967
47Tennessee254531028084
48Texas280631109522
49Utah279282399486
50Vermont2935136194511
51Virginia3254520211665
53Washington3231811311204
54West Virginia237072036816
55Wisconsin298681358133
56Wyoming3085434282811
72Puerto Rico 4646

Value concatenated by custom function

(tc/pivot->wider income "variable" ["estimate" "moe"] {:concat-columns-with vec
                                                        :concat-value-with vector
                                                        :drop-missing? false})

data/us_rent_income.csv [52 6]:

GEOIDNAME["income" "estimate"]["income" "moe"]["rent" "estimate"]["rent" "moe"]
1Alabama244761367473
2Alaska32940508120013
4Arizona275171489724
5Arkansas237891657095
6California2945410913583
8Colorado3240110911255
9Connecticut3532619511235
10Delaware31560247107610
11District of Columbia43198681142417
12Florida259527010773
46South Dakota288212766967
47Tennessee254531028084
48Texas280631109522
49Utah279282399486
50Vermont2935136194511
51Virginia3254520211665
53Washington3231811311204
54West Virginia237072036816
55Wisconsin298681358133
56Wyoming3085434282811
72Puerto Rico 4646

Reshape contact data

(def contacts (tc/dataset "data/contacts.csv"))
contacts

data/contacts.csv [6 3]:

fieldvalueperson_id
nameJiena McLellan1
companyToyota1
nameJohn Smith2
companygoogle2
emailjohn@google.com2
nameHuxley Ratcliffe3

(tc/pivot->wider contacts "field" "value" {:drop-missing? false})

data/contacts.csv [3 4]:

person_idnamecompanyemail
2John Smithgooglejohn@google.com
1Jiena McLellanToyota
3Huxley Ratcliffe

Reshaping

A couple of tidyr examples of more complex reshaping.


World bank

(def world-bank-pop (tc/dataset "data/world_bank_pop.csv.gz"))
(->> world-bank-pop
     (tc/column-names)
     (take 8)
     (tc/select-columns world-bank-pop))

data/world_bank_pop.csv.gz [1056 8]:

countryindicator200020012002200320042005
ABWSP.URB.TOTL4.24440000E+044.30480000E+044.36700000E+044.42460000E+044.46690000E+044.48890000E+04
ABWSP.URB.GROW1.18263237E+001.41302122E+001.43455953E+001.31036044E+009.51477684E-014.91302715E-01
ABWSP.POP.TOTL9.08530000E+049.28980000E+049.49920000E+049.70170000E+049.87370000E+041.00031000E+05
ABWSP.POP.GROW2.05502678E+002.22593013E+002.22905605E+002.10935434E+001.75735287E+001.30203884E+00
AFGSP.URB.TOTL4.43629900E+064.64805500E+064.89295100E+065.15568600E+065.42677000E+065.69182300E+06
AFGSP.URB.GROW3.91222846E+004.66283822E+005.13467454E+005.23045853E+005.12439302E+004.76864700E+00
AFGSP.POP.TOTL2.00937560E+072.09664630E+072.19799230E+072.30648510E+072.41189790E+072.50707980E+07
AFGSP.POP.GROW3.49465874E+004.25150411E+004.72052846E+004.81804112E+004.46891840E+003.87047016E+00
AGOSP.URB.TOTL8.23476600E+068.70800000E+069.21878700E+069.76519700E+061.03435060E+071.09494240E+07
AGOSP.URB.GROW5.43749411E+005.58771954E+005.70013237E+005.75812711E+005.75341450E+005.69279690E+00
ZAFSP.URB.GROW2.32229180E+002.26080492E+002.29242659E+002.25719919E+002.18014731E+002.09725981E+00
ZAFSP.POP.TOTL4.57283150E+074.63850060E+074.70261730E+074.76487270E+074.82473950E+074.88205860E+07
ZAFSP.POP.GROW1.47499416E+001.42585702E+001.37280586E+001.31515951E+001.24859226E+001.18102315E+00
ZMBSP.URB.TOTL3.66507600E+063.78866000E+063.94496500E+064.10631700E+064.27387500E+064.44857100E+06
ZMBSP.URB.GROW1.50532147E+003.31633227E+004.04276877E+004.00864374E+003.99943902E+004.00620111E+00
ZMBSP.POP.TOTL1.05312210E+071.08241250E+071.11204090E+071.14219840E+071.17317460E+071.20521560E+07
ZMBSP.POP.GROW2.80705843E+002.74331654E+002.70046295E+002.67578507E+002.67585813E+002.69450644E+00
ZWESP.URB.TOTL4.12598700E+064.22551900E+064.32330700E+064.35604100E+064.38192000E+064.41384500E+06
ZWESP.URB.GROW2.52373518E+002.38368296E+002.28785252E+007.54299867E-015.92336717E-017.25920717E-01
ZWESP.POP.TOTL1.22222510E+071.23661650E+071.25005250E+071.26338970E+071.27775110E+071.29400320E+07
ZWESP.POP.GROW1.29878201E+001.17059711E+001.08065293E+001.06127964E+001.13032327E+001.26390895E+00

Step 1 - convert year columns into values

(def pop2 (tc/pivot->longer world-bank-pop (map str (range 2000 2018)) {:drop-missing? false
                                                                         :target-columns ["year"]
                                                                         :value-column-name "value"}))
pop2

data/world_bank_pop.csv.gz [19008 4]:

countryindicatoryearvalue
ABWSP.URB.TOTL20134.43600000E+04
ABWSP.URB.GROW20136.69503994E-01
ABWSP.POP.TOTL20131.03187000E+05
ABWSP.POP.GROW20135.92914005E-01
AFGSP.URB.TOTL20137.73396400E+06
AFGSP.URB.GROW20134.19297967E+00
AFGSP.POP.TOTL20133.17316880E+07
AFGSP.POP.GROW20133.31522413E+00
AGOSP.URB.TOTL20131.61194910E+07
AGOSP.URB.GROW20134.72272270E+00
ZAFSP.URB.GROW20122.23077040E+00
ZAFSP.POP.TOTL20125.29982130E+07
ZAFSP.POP.GROW20121.39596592E+00
ZMBSP.URB.TOTL20125.93201300E+06
ZMBSP.URB.GROW20124.25944078E+00
ZMBSP.POP.TOTL20121.46999370E+07
ZMBSP.POP.GROW20123.00513283E+00
ZWESP.URB.TOTL20124.83015300E+06
ZWESP.URB.GROW20121.67857380E+00
ZWESP.POP.TOTL20121.47108260E+07
ZWESP.POP.GROW20122.22830616E+00

Step 2 - separate "indicator" column

(def pop3 (tc/separate-column pop2
                               "indicator" ["area" "variable"]
                               #(rest (clojure.string/split % #"\."))))
pop3

data/world_bank_pop.csv.gz [19008 5]:

countryareavariableyearvalue
ABWURBTOTL20134.43600000E+04
ABWURBGROW20136.69503994E-01
ABWPOPTOTL20131.03187000E+05
ABWPOPGROW20135.92914005E-01
AFGURBTOTL20137.73396400E+06
AFGURBGROW20134.19297967E+00
AFGPOPTOTL20133.17316880E+07
AFGPOPGROW20133.31522413E+00
AGOURBTOTL20131.61194910E+07
AGOURBGROW20134.72272270E+00
ZAFURBGROW20122.23077040E+00
ZAFPOPTOTL20125.29982130E+07
ZAFPOPGROW20121.39596592E+00
ZMBURBTOTL20125.93201300E+06
ZMBURBGROW20124.25944078E+00
ZMBPOPTOTL20121.46999370E+07
ZMBPOPGROW20123.00513283E+00
ZWEURBTOTL20124.83015300E+06
ZWEURBGROW20121.67857380E+00
ZWEPOPTOTL20121.47108260E+07
ZWEPOPGROW20122.22830616E+00
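
The splitter passed to separate-column in Step 2 is an ordinary function from one cell value to a sequence of new values. The following is a minimal sketch in plain Clojure of what that function returns for a single indicator string; the name split-indicator is made up for illustration and is not part of tablecloth.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper mirroring the anonymous function used in Step 2:
;; split on "." and drop the leading "SP" prefix.
(defn split-indicator [s]
  (rest (str/split s #"\.")))

(split-indicator "SP.URB.TOTL")
;; => ("URB" "TOTL")

;; Pairing the parts with the target column names gives one row fragment.
(zipmap ["area" "variable"] (split-indicator "SP.POP.GROW"))
;; => {"area" "POP", "variable" "GROW"}
```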

Step 3 - Make columns based on "variable" values.

(tc/pivot->wider pop3 "variable" "value" {:drop-missing? false})

data/world_bank_pop.csv.gz [9504 5]:

countryareayearTOTLGROW
ABWURB20134.43600000E+040.66950399
ABWPOP20131.03187000E+050.59291401
AFGURB20137.73396400E+064.19297967
AFGPOP20133.17316880E+073.31522413
AGOURB20131.61194910E+074.72272270
AGOPOP20132.59983400E+073.53182419
ALBURB20131.60350500E+061.74363937
ALBPOP20132.89509200E+06-0.18321138
ANDURB20137.15270000E+04-2.11923331
ANDPOP20138.07880000E+04-2.01331401
WSMPOP20121.89194000E+050.81144852
XKXURB2012
XKXPOP20121.80520000E+060.78972659
YEMURB20128.20982800E+064.49478765
YEMPOP20122.49099690E+072.67605025
ZAFURB20123.35330290E+072.23077040
ZAFPOP20125.29982130E+071.39596592
ZMBURB20125.93201300E+064.25944078
ZMBPOP20121.46999370E+073.00513283
ZWEURB20124.83015300E+061.67857380
ZWEPOP20121.47108260E+072.22830616


Multi-choice

(def multi (tc/dataset {:id [1 2 3 4]
                         :choice1 ["A" "C" "D" "B"]
                         :choice2 ["B" "B" nil "D"]
                         :choice3 ["C" nil nil nil]}))
multi

_unnamed [4 4]:

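| :id | :choice1 | :choice2 | :choice3 |
|----:|----------|----------|----------|
| 1 | A | B | C |
| 2 | C | B | |
| 3 | D | | |
| 4 | B | D | |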

Step 1 - convert all choices into rows and add an artificial column to all values which are not missing.

(def multi2 (-> multi
                (tc/pivot->longer (complement #{:id}))
                (tc/add-column :checked true)))
multi2

_unnamed [8 4]:

| :id | :$column | :$value | :checked |
|----:|----------|---------|----------|
| 1 | :choice1 | A | true |
| 2 | :choice1 | C | true |
| 3 | :choice1 | D | true |
| 4 | :choice1 | B | true |
| 1 | :choice2 | B | true |
| 2 | :choice2 | B | true |
| 4 | :choice2 | D | true |
| 1 | :choice3 | C | true |

Step 2 - Convert back to wide form with actual choices as columns

(-> multi2
    (tc/drop-columns :$column)
    (tc/pivot->wider :$value :checked {:drop-missing? false})
    (tc/order-by :id))

_unnamed [4 5]:

| :id | A | C | D | B |
|----:|---|---|---|---|
| 1 | true | true | | true |
| 2 | | true | | true |
| 3 | | | true | |
| 4 | | | true | true |
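
The two multi-choice steps can also be followed on plain sequences of maps. This is a rough sketch of the idea only, not how tablecloth implements pivoting; rows->longer and longer->wider are hypothetical helper names, and the row order of the long form differs from the dataset output above.

```clojure
(def multi-rows
  [{:id 1 :choice1 "A" :choice2 "B" :choice3 "C"}
   {:id 2 :choice1 "C" :choice2 "B" :choice3 nil}
   {:id 3 :choice1 "D" :choice2 nil :choice3 nil}
   {:id 4 :choice1 "B" :choice2 "D" :choice3 nil}])

;; Step 1: one output row per non-missing cell, tagged with :checked true.
(defn rows->longer [rows id-col]
  (for [row rows
        [col v] (dissoc row id-col)
        :when (some? v)]
    {id-col (get row id-col) :column col :value v :checked true}))

;; Step 2: regroup by id and use each value as a column name.
(defn longer->wider [long-rows id-col]
  (->> long-rows
       (group-by id-col)
       (map (fn [[id rs]]
              (into {id-col id} (map (juxt :value :checked) rs))))
       (sort-by id-col)))

(count (rows->longer multi-rows :id))
;; => 8

(first (longer->wider (rows->longer multi-rows :id) :id))
;; => {:id 1, "A" true, "B" true, "C" true}
```

Missing choices simply never produce a long row, which is why the wide form has gaps rather than false values.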


Construction

(def construction (tc/dataset "data/construction.csv"))
(def construction-unit-map {"1 unit" "1"
                            "2 to 4 units" "2-4"
                            "5 units or more" "5+"})
construction

data/construction.csv [9 9]:

YearMonth1 unit2 to 4 units5 units or moreNortheastMidwestSouthWest
2018January859 348114169596339
2018February882 400138160655336
2018March862 356150154595330
2018April797 447144196613304
2018May875 36490169673319
2018June867 34276170610360
2018July829 360108183594310
2018August939 28690205649286
2018September835 304117175560296

Conversion 1 - Group two column types

(-> construction
    (tc/pivot->longer #"^[125NWS].*|Midwest" {:target-columns [:units :region]
                                               :splitter (fn [col-name]
                                                           (if (re-matches #"^[125].*" col-name)
                                                             [(construction-unit-map col-name) nil]
                                                             [nil col-name]))
                                               :value-column-name :n
                                               :drop-missing? false}))

data/construction.csv [63 5]:

YearMonth:units:region:n
2018January1 859
2018February1 882
2018March1 862
2018April1 797
2018May1 875
2018June1 867
2018July1 829
2018August1 939
2018September1 835
2018January2-4
2018August South649
2018September South560
2018January West339
2018February West336
2018March West330
2018April West304
2018May West319
2018June West360
2018July West310
2018August West286
2018September West296

Conversion 2 - Convert to longer form and back and rename columns

(-> construction
    (tc/pivot->longer #"^[125NWS].*|Midwest" {:target-columns [:units :region]
                                               :splitter (fn [col-name]
                                                           (if (re-matches #"^[125].*" col-name)
                                                             [(construction-unit-map col-name) nil]
                                                             [nil col-name]))
                                               :value-column-name :n
                                               :drop-missing? false})
    (tc/pivot->wider [:units :region] :n {:drop-missing? false})
    (tc/rename-columns (zipmap (vals construction-unit-map)
                                (keys construction-unit-map))))

data/construction.csv [9 9]:

YearMonth12 to 4 units5 units or moreNortheastMidwestSouthWest
2018January859 348114169596339
2018February882 400138160655336
2018March862 356150154595330
2018April797 447144196613304
2018May875 36490169673319
2018June867 34276170610360
2018July829 360108183594310
2018August939 28690205649286
2018September835 304117175560296

Various operations on stocks, examples taken from gather and spread manuals.

(def stocks-tidyr (tc/dataset "data/stockstidyr.csv"))
stocks-tidyr

data/stockstidyr.csv [10 4]:

timeXYZ
2009-01-011.30989806-1.89040193-1.77946880
2009-01-02-0.29993804-1.824730902.39892513
2009-01-030.53647501-1.03606860-3.98697977
2009-01-04-1.88390802-0.52178390-2.83065490
2009-01-05-0.96052361-2.216833491.43715171
2009-01-06-1.18528966-2.893509243.39784140
2009-01-07-0.85207056-2.16794818-1.20108258
2009-01-080.25234172-0.32854117-1.53160473
2009-01-090.402571361.96407898-6.80878830
2009-01-10-0.643835002.68618382-2.55909321

Convert to longer form

(def stocks-long (tc/pivot->longer stocks-tidyr ["X" "Y" "Z"] {:value-column-name :price
                                                                :target-columns :stocks}))
stocks-long

data/stockstidyr.csv [30 3]:

time:stocks:price
2009-01-01X1.30989806
2009-01-02X-0.29993804
2009-01-03X0.53647501
2009-01-04X-1.88390802
2009-01-05X-0.96052361
2009-01-06X-1.18528966
2009-01-07X-0.85207056
2009-01-08X0.25234172
2009-01-09X0.40257136
2009-01-10X-0.64383500
2009-01-10Y2.68618382
2009-01-01Z-1.77946880
2009-01-02Z2.39892513
2009-01-03Z-3.98697977
2009-01-04Z-2.83065490
2009-01-05Z1.43715171
2009-01-06Z3.39784140
2009-01-07Z-1.20108258
2009-01-08Z-1.53160473
2009-01-09Z-6.80878830
2009-01-10Z-2.55909321

Convert back to wide form

(tc/pivot->wider stocks-long :stocks :price)

data/stockstidyr.csv [10 4]:

timeXYZ
2009-01-011.30989806-1.89040193-1.77946880
2009-01-02-0.29993804-1.824730902.39892513
2009-01-030.53647501-1.03606860-3.98697977
2009-01-04-1.88390802-0.52178390-2.83065490
2009-01-05-0.96052361-2.216833491.43715171
2009-01-06-1.18528966-2.893509243.39784140
2009-01-07-0.85207056-2.16794818-1.20108258
2009-01-080.25234172-0.32854117-1.53160473
2009-01-090.402571361.96407898-6.80878830
2009-01-10-0.643835002.68618382-2.55909321

Convert to wide form on time column (let’s limit values to a couple of rows)

(-> stocks-long
    (tc/select-rows (range 0 30 4))
    (tc/pivot->wider "time" :price {:drop-missing? false}))

data/stockstidyr.csv [3 6]:

:stocks2009-01-012009-01-052009-01-092009-01-032009-01-07
Y -1.0360686-2.16794818
X1.30989806-0.960523610.40257136
Z-1.779468801.43715171-6.80878830

Join/Concat Datasets

Dataset join and concatenation functions.

Joins accept left-side and right-side datasets and a columns selector. Options are the same as in tech.ml.dataset functions.

A column selector can be a map with :left and :right keys to specify column names separately for the left and right datasets.

The differences from the tech.ml.dataset join functions are: argument order (datasets first) and the possibility of joining on multiple columns.

Multiple-column joins create a temporary index column from the column selection. The method for creating the index is based on the :hashing option and defaults to identity. Prior to 7.000-beta-50 a hash function was used, which caused hash collisions in certain cases.

Additionally set operations are defined: intersect and difference.

To concat two datasets rowwise you can choose:

  • concat - concats rows for matching columns; the number of columns should be equal.
  • union - like concat but returns unique rows
  • bind - concats rows and adds missing, empty columns

To add two datasets columnwise use append. The number of rows should be equal.
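
To make the difference between the row-wise options concrete, here is a small sketch on plain sequences of row maps. The helpers concat-rows, union-rows and bind-rows are hypothetical names that only imitate the behaviour described above; they are not the tablecloth functions themselves.

```clojure
;; Row-wise concatenation: keep every row, duplicates included.
(defn concat-rows [& datasets]
  (apply concat datasets))

;; Union: concatenate, then keep unique rows only.
(defn union-rows [& datasets]
  (distinct (apply concat datasets)))

;; Bind: concatenate and fill columns a row does not have with nil.
(defn bind-rows [& datasets]
  (let [rows  (apply concat datasets)
        blank (zipmap (distinct (mapcat keys rows)) (repeat nil))]
    (map #(merge blank %) rows)))

(def xs [{:a 1 :b 101} {:a 2 :b 102}])
(def ys [{:a 2 :b 102} {:a 3 :b 103 :d "X"}])

(count (concat-rows xs xs)) ;; => 4
(count (union-rows xs xs))  ;; => 2
(first (bind-rows xs ys))   ;; => {:a 1, :b 101, :d nil}
```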

Datasets used in examples:

(def ds1 (tc/dataset {:a [1 2 1 2 3 4 nil nil 4]
                       :b (range 101 110)
                       :c (map str "abs tract")}))
(def ds2 (tc/dataset {:a [nil 1 2 5 4 3 2 1 nil]
                      :b (range 110 101 -1)
                      :c (map str "datatable")
                      :d (symbol "X")
                      :e [3 4 5 6 7 nil 8 1 1]}))
ds1
ds2

_unnamed [9 3]:

:a:b:c
1101a
2102b
1103s
2104
3105t
4106r
107a
108c
4109t

_unnamed [9 5]:

:a:b:c:d:e
110dX3
1109aX4
2108tX5
5107aX6
4106tX7
3105aX
2104bX8
1103lX1
102eX1

Left

(tc/left-join ds1 ds2 :b)

left-outer-join [9 8]:

:b:a:c:right.b:right.a:right.c:d:e
1094t1091aX4
108 c1082tX5
107 a1075aX6
1064r1064tX7
1053t1053aX
1042 1042bX8
1031s1031lX1
1022b102 eX1
1011a

(tc/left-join ds2 ds1 :b)

left-outer-join [9 8]:

:b:a:c:d:e:right.b:right.a:right.c
102 eX11022b
1031lX11031s
1042bX81042
1053aX 1053t
1064tX71064r
1075aX6107 a
1082tX5108 c
1091aX41094t
110 dX3

(tc/left-join ds1 ds2 [:a :b])

left-outer-join [9 8]:

:a:b:c:right.a:right.b:right.c:d:e
4106r4106tX7
3105t3105aX
2104 2104bX8
1103s1103lX1
1101a
2102b
107a
108c
4109t

(tc/left-join ds2 ds1 [:a :b])

left-outer-join [9 8]:

:a:b:c:d:e:right.a:right.b:right.c
1103lX11103s
2104bX82104
3105aX 3105t
4106tX74106r
110dX3
1109aX4
2108tX5
5107aX6
102eX1

(tc/left-join ds1 ds2 {:left :a :right :e})

left-outer-join [11 8]:

:a:b:c:e:right.a:right.b:right.c:d
3105t3 110dX
4106r41109aX
4109t41109aX
1101a11103lX
1103s11103lX
1101a1 102eX
1103s1 102eX
2102b
2104
107a
108c

(tc/left-join ds2 ds1 {:left :e :right :a})

left-outer-join [12 8]:

:e:a:b:c:d:right.a:right.b:right.c
11103lX1101a
1 102eX1101a
11103lX1103s
1 102eX1103s
3 110dX3105t
41109aX4106r
41109aX4109t
52108tX
65107aX
74106tX
3105aX
82104bX

Right

(tc/right-join ds1 ds2 :b)

right-outer-join [9 8]:

:b:a:c:right.b:right.a:right.c:d:e
1094t1091aX4
108 c1082tX5
107 a1075aX6
1064r1064tX7
1053t1053aX
1042 1042bX8
1031s1031lX1
1022b102 eX1
110 dX3

(tc/right-join ds2 ds1 :b)

right-outer-join [9 8]:

:b:a:c:d:e:right.b:right.a:right.c
102 eX11022b
1031lX11031s
1042bX81042
1053aX 1053t
1064tX71064r
1075aX6107 a
1082tX5108 c
1091aX41094t
1011a

(tc/right-join ds1 ds2 [:a :b])

right-outer-join [9 8]:

:a:b:c:right.a:right.b:right.c:d:e
4106r4106tX7
3105t3105aX
2104 2104bX8
1103s1103lX1
110dX3
1109aX4
2108tX5
5107aX6
102eX1

(tc/right-join ds2 ds1 [:a :b])

right-outer-join [9 8]:

:a:b:c:d:e:right.a:right.b:right.c
1103lX11103s
2104bX82104
3105aX 3105t
4106tX74106r
1101a
2102b
107a
108c
4109t

(tc/right-join ds1 ds2 {:left :a :right :e})

right-outer-join [12 8]:

:a:b:c:e:right.a:right.b:right.c:d
3105t3 110dX
4106r41109aX
4109t41109aX
1101a11103lX
1103s11103lX
1101a1 102eX
1103s1 102eX
52108tX
65107aX
74106tX
3105aX
82104bX

(tc/right-join ds2 ds1 {:left :e :right :a})

right-outer-join [11 8]:

:e:a:b:c:d:right.a:right.b:right.c
11103lX1101a
1 102eX1101a
11103lX1103s
1 102eX1103s
3 110dX3105t
41109aX4106r
41109aX4109t
2102b
2104
107a
108c

Inner

(tc/inner-join ds1 ds2 :b)

inner-join [8 7]:

:b:a:c:right.a:right.c:d:e
1094t1aX4
108 c2tX5
107 a5aX6
1064r4tX7
1053t3aX
1042 2bX8
1031s1lX1
1022b eX1

(tc/inner-join ds2 ds1 :b)

inner-join [8 7]:

:b:a:c:d:e:right.a:right.c
102 eX12b
1031lX11s
1042bX82
1053aX 3t
1064tX74r
1075aX6 a
1082tX5 c
1091aX44t

(tc/inner-join ds1 ds2 [:a :b])

inner-join [4 8]:

:a:b:c:right.a:right.b:right.c:d:e
4106r4106tX7
3105t3105aX
2104 2104bX8
1103s1103lX1

(tc/inner-join ds2 ds1 [:a :b])

inner-join [4 8]:

:a:b:c:d:e:right.a:right.b:right.c
1103lX11103s
2104bX82104
3105aX 3105t
4106tX74106r

(tc/inner-join ds1 ds2 {:left :a :right :e})

inner-join [7 7]:

:a:b:c:right.a:right.b:right.c:d
3105t 110dX
4106r1109aX
4109t1109aX
1101a1103lX
1103s1103lX
1101a 102eX
1103s 102eX

(tc/inner-join ds2 ds1 {:left :e :right :a})

inner-join [7 7]:

:e:a:b:c:d:right.b:right.c
11103lX101a
1 102eX101a
11103lX103s
1 102eX103s
3 110dX105t
41109aX106r
41109aX109t

Full

Join keeping all rows

(tc/full-join ds1 ds2 :b)

full-join [10 8]:

:b:a:c:right.b:right.a:right.c:d:e
1094t1091aX4
108 c1082tX5
107 a1075aX6
1064r1064tX7
1053t1053aX
1042 1042bX8
1031s1031lX1
1022b102 eX1
1011a
110 dX3

(tc/full-join ds2 ds1 :b)

full-join [10 8]:

:b:a:c:d:e:right.b:right.a:right.c
102 eX11022b
1031lX11031s
1042bX81042
1053aX 1053t
1064tX71064r
1075aX6107 a
1082tX5108 c
1091aX41094t
110 dX3
1011a

(tc/full-join ds1 ds2 [:a :b])

full-join [14 8]:

:a:b:c:right.a:right.b:right.c:d:e
4106r4106tX7
3105t3105aX
2104 2104bX8
1103s1103lX1
1101a
2102b
107a
108c
4109t
110dX3
1109aX4
2108tX5
5107aX6
102eX1

(tc/full-join ds2 ds1 [:a :b])

full-join [14 8]:

:a:b:c:d:e:right.a:right.b:right.c
1103lX11103s
2104bX82104
3105aX 3105t
4106tX74106r
110dX3
1109aX4
2108tX5
5107aX6
102eX1
1101a
2102b
107a
108c
4109t

(tc/full-join ds1 ds2 {:left :a :right :e})

full-join [16 8]:

:a:b:c:e:right.a:right.b:right.c:d
3105t3 110dX
4106r41109aX
4109t41109aX
1101a11103lX
1103s11103lX
1101a1 102eX
1103s1 102eX
2102b
2104
107a
108c
52108tX
65107aX
74106tX
3105aX
82104bX

(tc/full-join ds2 ds1 {:left :e :right :a})

full-join [16 8]:

:e:a:b:c:d:right.a:right.b:right.c
11103lX1101a
1 102eX1101a
11103lX1103s
1 102eX1103s
3 110dX3105t
41109aX4106r
41109aX4109t
52108tX
65107aX
74106tX
3105aX
82104bX
2102b
2104
107a
108c

Semi

Return rows from ds1 matching ds2

(tc/semi-join ds1 ds2 :b)

semi-join [8 3]:

:b:a:c
1094t
108 c
107 a
1064r
1053t
1042
1031s
1022b

(tc/semi-join ds2 ds1 :b)

semi-join [8 5]:

:b:a:c:d:e
102 eX1
1031lX1
1042bX8
1053aX
1064tX7
1075aX6
1082tX5
1091aX4

(tc/semi-join ds1 ds2 [:a :b])

semi-join [4 3]:

:a:b:c
4106r
3105t
2104
1103s

(tc/semi-join ds2 ds1 [:a :b])

semi-join [4 5]:

:a:b:c:d:e
1103lX1
2104bX8
3105aX
4106tX7

(tc/semi-join ds1 ds2 {:left :a :right :e})

semi-join [7 3]:

:a:b:c
3105t
4106r
4109t
1101a
1103s
107a
108c

(tc/semi-join ds2 ds1 {:left :e :right :a})

semi-join [5 5]:

:e:a:b:c:d
11103lX
1 102eX
3 110dX
41109aX
3105aX

Anti

Return rows from ds1 not matching ds2

(tc/anti-join ds1 ds2 :b)

anti-join [1 3]:

:b:a:c
1011a

(tc/anti-join ds2 ds1 :b)

anti-join [1 5]:

:b:a:c:d:e
110 dX3

(tc/anti-join ds1 ds2 [:a :b])

anti-join [5 3]:

:a:b:c
1101a
2102b
107a
108c
4109t

(tc/anti-join ds1 ds2 {:left :a :right :e})

anti-join [2 3]:

:a:b:c
2102b
2104

(tc/anti-join ds2 ds1 {:left :e :right :a})

anti-join [4 5]:

:e:a:b:c:d
52108tX
65107aX
74106tX
82104bX
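
Semi and anti joins only filter the left side, so they can be sketched as a set lookup. The code below is an illustration in plain Clojure under that assumption; semi-join-rows and anti-join-rows are made-up names, and only the :b values of ds1 and ds2 are reproduced.

```clojure
;; Keep left rows whose key occurs on the right side.
(defn semi-join-rows [left right col]
  (let [right-keys (set (map col right))]
    (filter #(contains? right-keys (col %)) left)))

;; Keep left rows whose key does NOT occur on the right side.
(defn anti-join-rows [left right col]
  (let [right-keys (set (map col right))]
    (remove #(contains? right-keys (col %)) left)))

;; Only the :b columns of ds1 and ds2.
(def left-b  (map (fn [b] {:b b}) (range 101 110)))
(def right-b (map (fn [b] {:b b}) (range 110 101 -1)))

(count (semi-join-rows left-b right-b :b)) ;; => 8
(anti-join-rows left-b right-b :b)         ;; => ({:b 101})
(anti-join-rows right-b left-b :b)         ;; => ({:b 110})
```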

Hashing

When the :hashing option is used, data from the join columns are preprocessed by applying the join-columns function with :result-type set to the value of :hashing. This helps to create custom joining behaviour. The function used for hashing receives a vector of row values from the join columns.

In the following example we will join columns on value modulo 5.

(tc/left-join ds1 ds2 :b {:hashing (fn [[v]] (mod v 5))})

left-outer-join [16 8]:

:a:b:c:right.a:right.b:right.c:d:e
3105t 110dX3
2104 1109aX4
4109t1109aX4
1103s2108tX5
108c2108tX5
2102b5107aX6
107a5107aX6
1101a4106tX7
4106r4106tX7
3105t3105aX
2104 2104bX8
4109t2104bX8
1103s1103lX1
108c1103lX1
2102b 102eX1
107a 102eX1
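
The effect of a hashing function on matching can be sketched without the library. In the code below, hashed-left-join is a hypothetical helper working on sequences of row maps; it only illustrates the idea of computing a key from the vector of join-column values and is not tablecloth's implementation. Only the :b values of ds1 and ds2 are used.

```clojure
(defn hashed-left-join
  "Left join of two seqs of row maps. `hashing` receives a vector with the
  join-column values of a row and returns the key used for matching."
  [left right cols hashing]
  (let [row-key (fn [row] (hashing (mapv row cols)))
        index   (group-by row-key right)]
    (mapcat (fn [l]
              (if-let [matches (seq (get index (row-key l)))]
                (map (fn [r] {:left l :right r}) matches)
                [{:left l :right nil}]))
            left)))

(def left-rows  (map (fn [b] {:b b}) (range 101 110)))
(def right-rows (map (fn [b] {:b b}) (range 110 101 -1)))

;; Matching on the value modulo 5 pairs many more rows.
(count (hashed-left-join left-rows right-rows [:b] (fn [[v]] (mod v 5))))
;; => 16

;; With identity the key is the value vector itself.
(count (hashed-left-join left-rows right-rows [:b] identity))
;; => 9
```

Both counts agree with the row counts of the corresponding left joins on :b shown in this section.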

Cross

Cross product from selected columns

(tc/cross-join ds1 ds2 [:a :b])

cross-join [81 4]:

:a:b:right.a:right.b
1101 110
11011109
11012108
11015107
11014106
11013105
11012104
11011103
1101 102
2102 110
1081103
108 102
4109 110
41091109
41092108
41095107
41094106
41093105
41092104
41091103
4109 102

(tc/cross-join ds1 ds2 {:left [:a :b] :right :e})

cross-join [81 3]:

:a:b:e
11013
11014
11015
11016
11017
1101
11018
11011
11011
21023
1081
1081
41093
41094
41095
41096
41097
4109
41098
41091
41091

Expand

Similar to cross product but works on a single dataset.

(tc/expand ds2 :a :c :d)

cross-join [36 3]:

:a:c:d
dX
aX
tX
bX
lX
eX
1dX
1aX
1tX
1bX
4aX
4tX
4bX
4lX
4eX
3dX
3aX
3tX
3bX
3lX
3eX
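
The row count of the expanded dataset is the product of the numbers of distinct values per column. A small sketch in plain Clojure, assuming rows are maps and using a made-up helper expand-rows, reproduces that count for the :a, :c and :d values of ds2:

```clojure
;; Build every combination of the distinct values found in each column.
(defn expand-rows [rows & cols]
  (reduce (fn [acc col]
            (for [m acc
                  v (distinct (map col rows))]
              (assoc m col v)))
          [{}]
          cols))

;; The :a, :c and :d columns of ds2 as row maps.
(def ds2-rows
  (map (fn [a c] {:a a :c (str c) :d 'X})
       [nil 1 2 5 4 3 2 1 nil]
       "datatable"))

(count (expand-rows ds2-rows :a :c :d))
;; => 36   (6 distinct :a values, 6 distinct :c values, 1 :d value)
```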

Columns can also be bundled (nested) in tuples which are treated as a single entity during cross product.

(tc/expand ds2 [:a :c] [:e :b])

cross-join [81 4]:

:a:c:e:b
d3110
d4109
d5108
d6107
d7106
d 105
d8104
d1103
d1102
1a3110
1l1103
1l1102
e3110
e4109
e5108
e6107
e7106
e 105
e8104
e1103
e1102

Complete

Same as expand with all other columns preserved (filled with missing values if necessary).

(tc/complete ds2 :a :c :d)

left-outer-join [36 5]:

:a:c:d:b:e
dX1103
1aX1094
2tX1085
5aX1076
4tX1067
3aX105
2bX1048
1lX1031
eX1021
aX
5eX
4dX
4aX
4bX
4lX
4eX
3dX
3tX
3bX
3lX
3eX

(tc/complete ds2 [:a :c] [:e :b])

left-outer-join [81 5]:

:a:c:e:b:d
d3110X
1a4109X
2t5108X
5a6107X
4t7106X
3a 105X
2b8104X
1l1103X
e1102X
d4109
1l 105
1l8104
1l1102
e3110
e4109
e5108
e6107
e7106
e 105
e8104
e1103

asof

(def left-ds (tc/dataset {:a [1 5 10]
                          :left-val ["a" "b" "c"]}))
(def right-ds (tc/dataset {:a [1 2 3 6 7]
                           :right-val [:a :b :c :d :e]}))
left-ds
right-ds

_unnamed [3 2]:

:a:left-val
1a
5b
10c

_unnamed [5 2]:

:a:right-val
1:a
2:b
3:c
6:d
7:e

(tc/asof-join left-ds right-ds :a)

asof-<= [3 4]:

:a:left-val:right.a:right-val
1a1:a
5b6:d
10c

(tc/asof-join left-ds right-ds :a {:asof-op :nearest})

asof-nearest [3 4]:

:a:left-val:right.a:right-val
1a1:a
5b6:d
10c7:e

(tc/asof-join left-ds right-ds :a {:asof-op :>=})

asof->= [3 4]:

:a:left-val:right.a:right-val
1a1:a
5b3:c
10c7:e
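
The three results above differ only in which right-side key is picked for each left-side key. The sketch below reproduces that choice in plain Clojure for the :a values of left-ds and right-ds; asof-match is a hypothetical helper, and details such as tie handling in the real implementation are not covered.

```clojure
(defn asof-match
  "Pick the right-side value paired with `left-val` under `op`."
  [op left-val right-vals]
  (let [sorted (sort right-vals)
        dist   (fn [r] (let [d (- left-val r)] (if (neg? d) (- d) d)))]
    (case op
      ;; smallest right value that is not below the left value
      :<=      (first (filter #(<= left-val %) sorted))
      ;; largest right value that is not above the left value
      :>=      (last (filter #(>= left-val %) sorted))
      ;; right value with the smallest absolute distance
      :nearest (when (seq sorted) (apply min-key dist sorted)))))

(def right-a [1 2 3 6 7])

(map #(asof-match :<= % right-a) [1 5 10])      ;; => (1 6 nil)
(map #(asof-match :nearest % right-a) [1 5 10]) ;; => (1 6 7)
(map #(asof-match :>= % right-a) [1 5 10])      ;; => (1 3 7)
```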

Concat

concat joins rows from other datasets

(tc/concat ds1)

_unnamed [9 3]:

:a:b:c
1101a
2102b
1103s
2104
3105t
4106r
107a
108c
4109t

concat-copying ensures all readers are evaluated.

(tc/concat-copying ds1)

_unnamed [9 3]:

:a:b:c
1101a
2102b
1103s
2104
3105t
4106r
107a
108c
4109t

(tc/concat ds1 (tc/drop-columns ds2 :d))

_unnamed [18 4]:

:a:b:c:e
1101a
2102b
1103s
2104
3105t
4106r
107a
108c
4109t
110d3
1109a4
2108t5
5107a6
4106t7
3105a
2104b8
1103l1
102e1

(apply tc/concat (repeatedly 3 #(tc/random DS)))

_unnamed [27 4]:

:V1:V2:V3:V4
261.5C
240.5A
240.5A
110.5A
191.5C
151.0B
261.5C
170.5A
110.5A
131.5C
281.0B
261.5C
131.5C
110.5A
170.5A
110.5A
170.5A
261.5C
240.5A
281.0B
221.0B
281.0B
191.5C
221.0B
281.0B
240.5A
191.5C

Concat grouped dataset

Concatenation of grouped datasets also results in a grouped dataset.

(tc/concat (tc/group-by DS [:V3])
           (tc/group-by DS [:V4]))

_unnamed [6 3]:

:name:group-id:data
{:V3 0.5}0Group: {:V3 0.5} [3 4]:
{:V3 1.0}1Group: {:V3 1.0} [3 4]:
{:V3 1.5}2Group: {:V3 1.5} [3 4]:
{:V4 "A"}3Group: {:V4 "A"} [3 4]:
{:V4 "B"}4Group: {:V4 "B"} [3 4]:
{:V4 "C"}5Group: {:V4 "C"} [3 4]:

Union

The same as concat but returns unique rows

(apply tc/union (tc/drop-columns ds2 :d) (repeat 10 ds1))

union [18 4]:

:a:b:c:e
110d3
1109a4
2108t5
5107a6
4106t7
3105a
2104b8
1103l1
102e1
1101a
2102b
1103s
2104
3105t
4106r
107a
108c
4109t

(apply tc/union (repeatedly 10 #(tc/random DS)))

union [9 4]:

:V1:V2:V3:V4
170.5A
240.5A
110.5A
151.0B
191.5C
221.0B
131.5C
281.0B
261.5C

Bind

bind adds empty columns during concat

(tc/bind ds1 ds2)

_unnamed [18 5]:

:a:b:c:e:d
1101a
2102b
1103s
2104
3105t
4106r
107a
108c
4109t
110d3X
1109a4X
2108t5X
5107a6X
4106t7X
3105a X
2104b8X
1103l1X
102e1X

(tc/bind ds2 ds1)

_unnamed [18 5]:

:a:b:c:d:e
110dX3
1109aX4
2108tX5
5107aX6
4106tX7
3105aX
2104bX8
1103lX1
102eX1
1101a
2102b
1103s
2104
3105t
4106r
107a
108c
4109t

Append

append concats columns

(tc/append ds1 ds2)

_unnamed [9 8]:

:a:b:c:a:b:c:d:e
1101a 110dX3
2102b1109aX4
1103s2108tX5
2104 5107aX6
3105t4106tX7
4106r3105aX
107a2104bX8
108c1103lX1
4109t 102eX1

Intersection

(tc/intersect (tc/select-columns ds1 :b)
              (tc/select-columns ds2 :b))

intersection [8 1]:

:b
109
108
107
106
105
104
103
102

Difference

(tc/difference (tc/select-columns ds1 :b)
               (tc/select-columns ds2 :b))

difference [1 1]:

:b
101

(tc/difference (tc/select-columns ds2 :b)
               (tc/select-columns ds1 :b))

difference [1 1]:

:b
110
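
For single-column datasets such as these, intersect and difference behave like the corresponding set operations. A quick check with clojure.set on the :b values of ds1 and ds2 (an analogy only, not the dataset implementation):

```clojure
(require '[clojure.set :as set])

(def b1 (set (range 101 110)))     ; :b values of ds1
(def b2 (set (range 110 101 -1)))  ; :b values of ds2

(sort (set/intersection b1 b2)) ;; => (102 103 104 105 106 107 108 109)
(set/difference b1 b2)          ;; => #{101}
(set/difference b2 b1)          ;; => #{110}
```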

Split into train/test

In the ML world you very often need to test a given model and prepare a collection of train and test datasets. split creates a new dataset with two additional columns:

  • :$split-name - with :train, :test, :split-2, … values
  • :$split-id - id of the split group (for k-fold and repeating)

split-type can be one of the following:

  • :kfold (default) - k-fold strategy, :k defines number of folds (defaults to 5), produces k splits
  • :bootstrap - :ratio defines ratio of observations put into result (defaults to 1.0), produces 1 split
  • :holdout - split into two or more parts with given ratio(s) (defaults to 2/3), produces 1 split
  • :holdouts - splits into two parts for ascending ratios. The range of ratios is given by the :steps option
  • :loo - leave one out, produces the same number of splits as number of observations

:holdout can also accept probabilities or ratios and can split into more than 2 subdatasets

Additionally you can provide:

  • :seed - for random number generator
  • :shuffle? - turn on/off shuffle of the rows (default: true)
  • :repeats - repeat procedure :repeats times
  • :partition-selector - same as in group-by for stratified splitting to reflect dataset structure in splits.
  • :split-names - names of subdatasets different than default, i.e. [:train :test :split-2 ...]
  • :split-col-name - a column where name of split is stored, either :train or :test values (default: :$split-name)
  • :split-id-col-name - a column where id of the train/test pair is stored (default: :$split-id)

In case of grouped dataset each group is processed separately.

See more

(def for-splitting (tc/dataset (map-indexed (fn [id v] {:id id
                                                        :partition v
                                                        :group (rand-nth [:g1 :g2 :g3])})
                                            (concat (repeat 20 :a) (repeat 5 :b)))))
for-splitting

_unnamed [25 3]:

:id:partition:group
0:a:g3
1:a:g3
2:a:g3
3:a:g2
4:a:g3
5:a:g3
6:a:g2
7:a:g1
8:a:g3
9:a:g1
14:a:g3
15:a:g1
16:a:g3
17:a:g1
18:a:g2
19:a:g1
20:b:g2
21:b:g1
22:b:g2
23:b:g2
24:b:g2

k-Fold

Produces k=5 splits

(-> for-splitting
    (tc/split)
    (tc/head 30))

_unnamed, (splitted) [30 5]:

:id:partition:group:$split-name:$split-id
23:b:g2:train0
0:a:g3:train0
3:a:g2:train0
13:a:g3:train0
24:b:g2:train0
18:a:g2:train0
10:a:g2:train0
22:b:g2:train0
1:a:g3:train0
20:b:g2:train0
5:a:g3:train0
17:a:g1:train0
14:a:g3:train0
4:a:g3:train0
2:a:g3:train0
8:a:g3:train0
19:a:g1:train0
6:a:g2:train0
12:a:g3:train0
16:a:g3:train0
11:a:g2:test0
7:a:g1:test0
15:a:g1:test0
21:b:g1:test0
9:a:g1:test0
11:a:g2:train1
7:a:g1:train1
15:a:g1:train1
21:b:g1:train1
9:a:g1:train1
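
The shape of a k-fold result can be sketched in a few lines of plain Clojure. This is a simplified illustration, assuming no shuffling and a made-up helper k-fold-splits; it is not the code behind tc/split. Each fold serves as the test set once, and the remaining folds form the train set.

```clojure
(defn k-fold-splits
  "Split `rows` into k folds; returns one {:split-id :train :test} map per fold."
  [rows k]
  (let [folds (mapv (fn [i] (take-nth k (drop i rows))) (range k))]
    (map (fn [id]
           {:split-id id
            :test     (nth folds id)
            :train    (apply concat (concat (take id folds)
                                            (drop (inc id) folds)))})
         (range k))))

(def splits (k-fold-splits (range 25) 5))

(count splits)                   ;; => 5
(map (comp count :test) splits)  ;; => (5 5 5 5 5)
(map (comp count :train) splits) ;; => (20 20 20 20 20)

;; Leave-one-out corresponds to k equal to the number of rows:
;; 25 splits of 25 rows each.
(reduce + (map #(+ (count (:train %)) (count (:test %)))
               (k-fold-splits (range 25) 25)))
;; => 625
```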

Partition according to the :partition column to reflect its distribution

(-> for-splitting
    (tc/split :kfold {:partition-selector :partition})
    (tc/head 30))

_unnamed, (splitted) [30 5]:

:id:partition:group:$split-name:$split-id
11:a:g2:train0
12:a:g3:train0
8:a:g3:train0
15:a:g1:train0
5:a:g3:train0
18:a:g2:train0
2:a:g3:train0
14:a:g3:train0
9:a:g1:train0
4:a:g3:train0
19:a:g1:train0
1:a:g3:train0
6:a:g2:train0
13:a:g3:train0
16:a:g3:train0
17:a:g1:train0
10:a:g2:test0
0:a:g3:test0
3:a:g2:test0
7:a:g1:test0
10:a:g2:train1
0:a:g3:train1
3:a:g2:train1
7:a:g1:train1
5:a:g3:train1
18:a:g2:train1
2:a:g3:train1
14:a:g3:train1
9:a:g1:train1
4:a:g3:train1

Bootstrap

(tc/split for-splitting :bootstrap)

_unnamed, (splitted) [33 5]:

:id:partition:group:$split-name:$split-id
16:a:g3:train0
16:a:g3:train0
21:b:g1:train0
20:b:g2:train0
15:a:g1:train0
1:a:g3:train0
5:a:g3:train0
18:a:g2:train0
4:a:g3:train0
21:b:g1:train0
20:b:g2:train0
22:b:g2:train0
21:b:g1:train0
0:a:g3:test0
2:a:g3:test0
3:a:g2:test0
7:a:g1:test0
8:a:g3:test0
9:a:g1:test0
10:a:g2:test0
12:a:g3:test0
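
A bootstrap split draws the train rows with replacement, so some rows repeat and the rows that were never drawn form the test set. The sketch below shows this in plain Clojure under those assumptions; bootstrap-split and its seed argument are hypothetical and only imitate the behaviour described for :bootstrap.

```clojure
(defn bootstrap-split
  "Draw (* ratio n) rows with replacement as :train; undrawn rows become :test."
  [rows ratio seed]
  (let [^java.util.Random rng (java.util.Random. (long seed))
        v     (vec rows)
        n     (count v)
        draws (long (* ratio n))
        train (vec (repeatedly draws #(nth v (.nextInt rng (int n)))))
        drawn (set train)]
    {:train train
     :test  (vec (remove #(contains? drawn %) v))}))

(def bs (bootstrap-split (range 25) 1.0 42))

(count (:train bs))
;; => 25

;; Distinct train rows plus test rows cover the whole dataset exactly once.
(+ (count (distinct (:train bs))) (count (:test bs)))
;; => 25
```

Because train rows repeat, the combined number of train and test rows is usually larger than the size of the original dataset, as in the output above.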

with repeats, to get 100 splits

(-> for-splitting
    (tc/split :bootstrap {:repeats 100})
    (:$split-id)
    (distinct)
    (count))

100

Holdout

with small ratio

(tc/split for-splitting :holdout {:ratio 0.2})

_unnamed, (splitted) [25 5]:

:id:partition:group:$split-name:$split-id
19:a:g1:train0
4:a:g3:train0
16:a:g3:train0
15:a:g1:train0
22:b:g2:train0
7:a:g1:test0
11:a:g2:test0
2:a:g3:test0
20:b:g2:test0
6:a:g2:test0
14:a:g3:test0
21:b:g1:test0
3:a:g2:test0
12:a:g3:test0
9:a:g1:test0
13:a:g3:test0
0:a:g3:test0
5:a:g3:test0
18:a:g2:test0
23:b:g2:test0
1:a:g3:test0

you can split to more than two subdatasets with holdout

(tc/split for-splitting :holdout {:ratio [0.1 0.2 0.3 0.15 0.25]})

_unnamed, (splitted) [25 5]:

:id:partition:group:$split-name:$split-id
22:b:g2:train0
17:a:g1:train0
23:b:g2:test0
5:a:g3:test0
15:a:g1:test0
18:a:g2:test0
20:b:g2:test0
0:a:g3:split-20
4:a:g3:split-20
2:a:g3:split-20
13:a:g3:split-30
16:a:g3:split-30
21:b:g1:split-30
7:a:g1:split-40
6:a:g2:split-40
9:a:g1:split-40
10:a:g2:split-40
19:a:g1:split-40
14:a:g3:split-40
3:a:g2:split-40
1:a:g3:split-40

you can use also proportions with custom names

(tc/split for-splitting :holdout {:ratio [5 3 11 2]
                                  :split-names ["small" "smaller" "big" "the rest"]})

_unnamed, (splitted) [25 5]:

:id:partition:group:$split-name:$split-id
10:a:g2small0
2:a:g3small0
14:a:g3small0
13:a:g3small0
15:a:g1small0
17:a:g1smaller0
9:a:g1smaller0
21:b:g1smaller0
5:a:g3big0
24:b:g2big0
20:b:g2big0
0:a:g3big0
19:a:g1big0
1:a:g3big0
6:a:g2big0
23:b:g2big0
8:a:g3big0
3:a:g2the rest0
16:a:g3the rest0
7:a:g1the rest0
11:a:g2the rest0

Holdouts

Ratios from 5% to 95% of the dataset with step 1.5 generate 15 splits with an ascending number of rows in the train dataset.

(-> (tc/split for-splitting :holdouts {:steps [0.05 0.95 1.5]
                                       :shuffle? false})
    (tc/group-by [:$split-id :$split-name]))

_unnamed [30 3]:

:name:group-id:data
{:$split-id 0, :$split-name :train}0Group: {:$split-id 0, :$split-name :train} [1 5]:
{:$split-id 0, :$split-name :test}1Group: {:$split-id 0, :$split-name :test} [24 5]:
{:$split-id 1, :$split-name :train}2Group: {:$split-id 1, :$split-name :train} [2 5]:
{:$split-id 1, :$split-name :test}3Group: {:$split-id 1, :$split-name :test} [23 5]:
{:$split-id 2, :$split-name :train}4Group: {:$split-id 2, :$split-name :train} [4 5]:
{:$split-id 2, :$split-name :test}5Group: {:$split-id 2, :$split-name :test} [21 5]:
{:$split-id 3, :$split-name :train}6Group: {:$split-id 3, :$split-name :train} [5 5]:
{:$split-id 3, :$split-name :test}7Group: {:$split-id 3, :$split-name :test} [20 5]:
{:$split-id 4, :$split-name :train}8Group: {:$split-id 4, :$split-name :train} [7 5]:
{:$split-id 4, :$split-name :test}9Group: {:$split-id 4, :$split-name :test} [18 5]:
{:$split-id 9, :$split-name :test}19Group: {:$split-id 9, :$split-name :test} [11 5]:
{:$split-id 10, :$split-name :train}20Group: {:$split-id 10, :$split-name :train} [16 5]:
{:$split-id 10, :$split-name :test}21Group: {:$split-id 10, :$split-name :test} [9 5]:
{:$split-id 11, :$split-name :train}22Group: {:$split-id 11, :$split-name :train} [17 5]:
{:$split-id 11, :$split-name :test}23Group: {:$split-id 11, :$split-name :test} [8 5]:
{:$split-id 12, :$split-name :train}24Group: {:$split-id 12, :$split-name :train} [19 5]:
{:$split-id 12, :$split-name :test}25Group: {:$split-id 12, :$split-name :test} [6 5]:
{:$split-id 13, :$split-name :train}26Group: {:$split-id 13, :$split-name :train} [20 5]:
{:$split-id 13, :$split-name :test}27Group: {:$split-id 13, :$split-name :test} [5 5]:
{:$split-id 14, :$split-name :train}28Group: {:$split-id 14, :$split-name :train} [22 5]:
{:$split-id 14, :$split-name :test}29Group: {:$split-id 14, :$split-name :test} [3 5]:
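
The train group sizes listed above (1, 2, 4, 5, 7, ... 22 out of 25 rows) can be reproduced with a little arithmetic. The sketch below is only an illustration in plain Clojure. It assumes, based on the printed output rather than on the library source, that sizes start at 5% of the rows, grow by 1.5 rows, stop before 95%, and are truncated to whole rows. The names n-rows and train-sizes are hypothetical.

```clojure
;; Sketch only: reproduces the train-group sizes from the output above.
;; Assumption (inferred from the output, not from the library source):
;; start at 5% of rows, step by 1.5 rows, stop before 95%, truncate.
(def n-rows 25)

(def train-sizes
  ;; exact ratios avoid any floating-point drift in the range
  (->> (range (* 5/100 n-rows) (* 95/100 n-rows) 3/2)
       (mapv #(long (Math/floor (double %))))))

(count train-sizes)
;; => 15

train-sizes
;; => [1 2 4 5 7 8 10 11 13 14 16 17 19 20 22]
```

Each test group then holds the remaining rows, for example 25 - 1 = 24 rows for split 0.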

Leave One Out

(-> for-splitting
    (tc/split :loo)
    (tc/head 30))

_unnamed, (splitted) [30 5]:

:id:partition:group:$split-name:$split-id
23:b:g2:train0
3:a:g2:train0
20:b:g2:train0
2:a:g3:train0
7:a:g1:train0
24:b:g2:train0
21:b:g1:train0
14:a:g3:train0
13:a:g3:train0
0:a:g3:train0
8:a:g3:train0
19:a:g1:train0
17:a:g1:train0
11:a:g2:train0
5:a:g3:train0
12:a:g3:train0
16:a:g3:train0
1:a:g3:train0
22:b:g2:train0
18:a:g2:train0
15:a:g1:train0
10:a:g2:train0
6:a:g2:train0
9:a:g1:train0
4:a:g3:test0
4:a:g3:train1
3:a:g2:train1
20:b:g2:train1
2:a:g3:train1
7:a:g1:train1
(-> for-splitting
    (tc/split :loo)
    (tc/row-count))

625
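
The row count of 625 is consistent with 25 splits of 25 rows each: every split holds one test row and 24 train rows, as seen for split 0 in the output above. A minimal plain-Clojure sketch of the idea follows. The function loo-splits is hypothetical and is not tablecloth's implementation.

```clojure
;; Sketch of leave-one-out on a plain sequence of rows.
;; `loo-splits` is a hypothetical helper, not part of tablecloth.
(defn loo-splits [rows]
  (map-indexed (fn [i held-out]
                 {:train (concat (take i rows) (drop (inc i) rows))
                  :test  [held-out]})
               rows))

(def splits (loo-splits (range 25)))

(count splits)
;; => 25

;; total rows over all splits: 25 * (24 train + 1 test)
(reduce + (map #(+ (count (:train %)) (count (:test %))) splits))
;; => 625
```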

Grouped dataset with partitioning

(-> for-splitting
    (tc/group-by :group)
    (tc/split :bootstrap {:partition-selector :partition :seed 11 :ratio 0.8}))

_unnamed [3 3]:

:name:group-id:data
:g30Group: :g3, (splitted) [13 5]:
:g21Group: :g2, (splitted) [12 5]:
:g12Group: :g1, (splitted) [8 5]:

Split as a sequence

To get a sequence of train/test pairs, use the split->seq function.

(-> for-splitting
    (tc/split->seq :kfold {:partition-selector :partition})
    (first))

{:train Group: 0 [20 3]:

:id:partition:group
4:a:g3
5:a:g3
9:a:g1
2:a:g3
12:a:g3
14:a:g3
10:a:g2
1:a:g3
7:a:g1
13:a:g3
15:a:g1
16:a:g3
3:a:g2
19:a:g1
11:a:g2
18:a:g2
23:b:g2
21:b:g1
24:b:g2
22:b:g2

, :test Group: 0 [5 3]:

:id:partition:group
8:a:g3
0:a:g3
6:a:g2
17:a:g1
20:b:g2

}

(-> for-splitting
    (tc/group-by :group)
    (tc/split->seq :bootstrap {:partition-selector :partition :seed 11 :ratio 0.8 :repeats 2})
    (first))

[:g3 ({:train Group: 0 [8 3]:]

Pipeline

tablecloth.pipeline exports special versions of the API functions. Instead of operating on a dataset directly, each call returns a function that operates only on a dataset, which makes operations easy to chain and compose.

There are two ways to create pipelines:

  • functional, as a composition of functions
  • declarative, separating task declarations and concrete parametrization.

Pipeline operations are prepared to work with the metamorph library. That means the result of a pipeline is wrapped in a map, and the dataset is stored under the :metamorph/data key.

Warning: Duplicated metamorph pipeline functions have been removed from the tablecloth.pipeline namespace.
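
To illustrate the functional style and the context map described above, here is a minimal sketch in plain Clojure. It does not use tablecloth.pipeline or metamorph. The helper lift and the steps keep-v1-ones and only-v2 are hypothetical, and a vector of row maps stands in for a dataset.

```clojure
;; Hypothetical helper: turns a function of the data into a function
;; of the context map, touching only the :metamorph/data key.
(defn lift [f]
  (fn [ctx]
    (update ctx :metamorph/data f)))

;; Two illustrative steps working on a vector of row maps.
(def keep-v1-ones (lift (fn [rows] (filterv #(= 1 (:V1 %)) rows))))
(def only-v2      (lift (fn [rows] (mapv :V2 rows))))

;; `comp` applies right-to-left: filter first, then project.
(def pipeline (comp only-v2 keep-v1-ones))

(pipeline {:metamorph/data [{:V1 1 :V2 1}
                            {:V1 2 :V2 2}
                            {:V1 1 :V2 3}]})
;; => {:metamorph/data [1 3]}
```

Because every step takes and returns the same kind of map, steps compose without knowing about each other.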

Functions

This API doesn’t provide any statistical, numerical or date/time functions. Use the namespaces below:

Namespace | Functions
tech.v3.datatype.functional | primitive operations, reducers, statistics
tech.v3.datatype.datetime | date/time converters and operations

Other examples

Stocks

(defonce stocks (tc/dataset "https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv" {:key-fn keyword}))
stocks

https://raw.githubusercontent.com/techascent/tech.ml.dataset/master/test/data/stocks.csv [560 3]:

:symbol:date:price
MSFT2000-01-0139.81
MSFT2000-02-0136.35
MSFT2000-03-0143.22
MSFT2000-04-0128.37
MSFT2000-05-0125.45
MSFT2000-06-0132.54
MSFT2000-07-0128.40
MSFT2000-08-0128.40
MSFT2000-09-0124.53
MSFT2000-10-0128.02
AAPL2009-05-01135.81
AAPL2009-06-01142.43
AAPL2009-07-01163.39
AAPL2009-08-01168.21
AAPL2009-09-01185.35
AAPL2009-10-01188.50
AAPL2009-11-01199.91
AAPL2009-12-01210.73
AAPL2010-01-01192.06
AAPL2010-02-01204.62
AAPL2010-03-01223.02
(-> stocks
    (tc/group-by (fn [row]
                    {:symbol (:symbol row)
                     :year (tech.v3.datatype.datetime/long-temporal-field :years (:date row))}))
    (tc/aggregate #(tech.v3.datatype.functional/mean (% :price)))
    (tc/order-by [:symbol :year]))

_unnamed [51 3]:

:symbol:yearsummary
AAPL200021.74833333
AAPL200110.17583333
AAPL20029.40833333
AAPL20039.34750000
AAPL200418.72333333
AAPL200548.17166667
AAPL200672.04333333
AAPL2007133.35333333
AAPL2008138.48083333
AAPL2009150.39333333
MSFT200029.67333333
MSFT200125.34750000
MSFT200221.82666667
MSFT200320.93416667
MSFT200422.67416667
MSFT200523.84583333
MSFT200624.75833333
MSFT200729.28416667
MSFT200825.20833333
MSFT200922.87250000
MSFT201028.50666667
(-> stocks
    (tc/group-by (juxt :symbol #(tech.v3.datatype.datetime/long-temporal-field :years (% :date))))
    (tc/aggregate #(tech.v3.datatype.functional/mean (% :price)))
    (tc/rename-columns {:$group-name-0 :symbol
                         :$group-name-1 :year}))

_unnamed [51 3]:

:symbol:yearsummary
MSFT200029.67333333
MSFT200125.34750000
MSFT200221.82666667
MSFT200320.93416667
MSFT200422.67416667
MSFT200523.84583333
MSFT200624.75833333
MSFT200729.28416667
MSFT200825.20833333
MSFT200922.87250000
AAPL200021.74833333
AAPL200110.17583333
AAPL20029.40833333
AAPL20039.34750000
AAPL200418.72333333
AAPL200548.17166667
AAPL200672.04333333
AAPL2007133.35333333
AAPL2008138.48083333
AAPL2009150.39333333
AAPL2010206.56666667

data.table

Below you can find a comparison between the functionality of data.table and the Clojure dataset API. I leave it without comments; please refer to the original document, which explains the details:

Introduction to data.table

R

library(data.table)
library(knitr)

flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")

kable(head(flights))
yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
2014111413AAJFKLAX35924759
201411-313AAJFKLAX363247511
20141129AAJFKLAX351247519
201411-8-26AALGAPBI15710357
20141121AAJFKLAX350247513
20141140AAEWRLAX339245418

Clojure

(require '[tech.v3.datatype.functional :as dfn]
         '[tech.v3.datatype.argops :as aops]
         '[tech.v3.datatype :as dtype])

(defonce flights (tc/dataset "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"))
(tc/head flights 6)

https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv [6 11]:

yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
2014111413AAJFKLAX35924759
201411-313AAJFKLAX363247511
20141129AAJFKLAX351247519
201411-8-26AALGAPBI15710357
20141121AAJFKLAX350247513
20141140AAEWRLAX339245418

Basics

Shape of loaded data

R

dim(flights)
[1] 253316     11

Clojure

(tc/shape flights)
[253316 11]
What is data.table?

R

DT = data.table(
  ID = c("b","b","b","a","a","c"),
  a = 1:6,
  b = 7:12,
  c = 13:18
)

kable(DT)
IDabc
b1713
b2814
b3915
a41016
a51117
c61218
class(DT$ID)
[1] "character"

Clojure

(def DT (tc/dataset {:ID ["b" "b" "b" "a" "a" "c"]
                      :a (range 1 7)
                      :b (range 7 13)
                      :c (range 13 19)}))
DT

_unnamed [6 4]:

:ID:a:b:c
b1713
b2814
b3915
a41016
a51117
c61218
(-> :ID DT meta :datatype)
:string
Get all the flights with “JFK” as the origin airport in the month of June.

R

ans <- flights[origin == "JFK" & month == 6L]
kable(head(ans))
yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
201461-9-5AAJFKLAX32424758
201461-10-13AAJFKLAX329247512
20146118-1AAJFKLAX32624757
201461-6-16AAJFKLAX320247510
201461-4-45AAJFKLAX326247518
201461-6-23AAJFKLAX329247514

Clojure

(-> flights
    (tc/select-rows (fn [row] (and (= (get row "origin") "JFK")
                                   (= (get row "month") 6))))
    (tc/head 6))

https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv [6 11]:

yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
201461-9-5AAJFKLAX32424758
201461-10-13AAJFKLAX329247512
20146118-1AAJFKLAX32624757
201461-6-16AAJFKLAX320247510
201461-4-45AAJFKLAX326247518
201461-6-23AAJFKLAX329247514
Get the first two rows from flights.

R

ans <- flights[1:2]
kable(ans)
yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
2014111413AAJFKLAX35924759
201411-313AAJFKLAX363247511

Clojure

(tc/select-rows flights (range 2))

https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv [2 11]:

yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
2014111413AAJFKLAX35924759
201411-313AAJFKLAX363247511
Sort flights first by column origin in ascending order, and then by dest in descending order

R

ans <- flights[order(origin, -dest)]
kable(head(ans))
yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
201415649EVEWRXNA19511318
201416713EVEWRXNA19011318
201417-6-13EVEWRXNA17911318
201418-7-12EVEWRXNA18411318
201419167EVEWRXNA18111318
20141136666EVEWRXNA18811319

Clojure

(-> flights
    (tc/order-by ["origin" "dest"] [:asc :desc])
    (tc/head 6))

https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv [6 11]:

yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
201463-6-38EVEWRXNA15411316
2014120-9-17EVEWRXNA17711318
2014319-610EVEWRXNA20111316
201423231268EVEWRXNA184113112
2014425-8-32EVEWRXNA15911316
20142192110EVEWRXNA17611318
Select arr_delay column, but return it as a vector

R

ans <- flights[, arr_delay]
head(ans)
[1]  13  13   9 -26   1   0

Clojure

(take 6 (flights "arr_delay"))
(13 13 9 -26 1 0)
Select arr_delay column, but return as a data.table instead

R

ans <- flights[, list(arr_delay)]
kable(head(ans))
arr_delay
13
13
9
-26
1
0

Clojure

(-> flights
    (tc/select-columns "arr_delay")
    (tc/head 6))

https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv [6 1]:

arr_delay
13
13
9
-26
1
0
Select both arr_delay and dep_delay columns

R

ans <- flights[, .(arr_delay, dep_delay)]
kable(head(ans))
arr_delaydep_delay
1314
13-3
92
-26-8
12
04

Clojure

(-> flights
    (tc/select-columns ["arr_delay" "dep_delay"])
    (tc/head 6))

https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv [6 2]:

arr_delaydep_delay
1314
13-3
92
-26-8
12
04
Select both arr_delay and dep_delay columns and rename them to delay_arr and delay_dep

R

ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]
kable(head(ans))
delay_arrdelay_dep
1314
13-3
92
-26-8
12
04

Clojure

(-> flights
    (tc/select-columns {"arr_delay" "delay_arr"
                         "dep_delay" "delay_dep"})
    (tc/head 6))

https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv [6 2]:

delay_arrdelay_dep
1314
13-3
92
-26-8
12
04
How many trips have had total delay < 0?

R

ans <- flights[, sum( (arr_delay + dep_delay) < 0 )]
ans
[1] 141814

Clojure

(->> (dfn/+ (flights "arr_delay") (flights "dep_delay"))
     (aops/argfilter #(< % 0.0))
     (dtype/ecount))
141814

or with pure Clojure functions (much, much slower)

(->> (map + (flights "arr_delay") (flights "dep_delay"))
     (filter neg?)
     (count))
141814
Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June

R

ans <- flights[origin == "JFK" & month == 6L,
               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
kable(ans)
m_arrm_dep
5.8393499.807884

Clojure

(-> flights
    (tc/select-rows (fn [row] (and (= (get row "origin") "JFK")
                                   (= (get row "month") 6))))
    (tc/aggregate {:m_arr #(dfn/mean (% "arr_delay"))
                    :m_dep #(dfn/mean (% "dep_delay"))}))

_unnamed [1 2]:

:m_arr:m_dep
5.839349329.80788411
How many trips have been made in 2014 from “JFK” airport in the month of June?

R

ans <- flights[origin == "JFK" & month == 6L, length(dest)]
ans
[1] 8422

or

ans <- flights[origin == "JFK" & month == 6L, .N]
ans
[1] 8422

Clojure

(-> flights
    (tc/select-rows (fn [row] (and (= (get row "origin") "JFK")
                                   (= (get row "month") 6))))
    (tc/row-count))
8422
Deselect columns using - or !

R

ans <- flights[, !c("arr_delay", "dep_delay")]
kable(head(ans))
yearmonthdaycarrierorigindestair_timedistancehour
201411AAJFKLAX35924759
201411AAJFKLAX363247511
201411AAJFKLAX351247519
201411AALGAPBI15710357
201411AAJFKLAX350247513
201411AAEWRLAX339245418

or

ans <- flights[, -c("arr_delay", "dep_delay")]
kable(head(ans))
yearmonthdaycarrierorigindestair_timedistancehour
201411AAJFKLAX35924759
201411AAJFKLAX363247511
201411AAJFKLAX351247519
201411AALGAPBI15710357
201411AAJFKLAX350247513
201411AAEWRLAX339245418

Clojure

(-> flights
    (tc/select-columns (complement #{"arr_delay" "dep_delay"}))
    (tc/head 6))

https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv [6 9]:

yearmonthdaycarrierorigindestair_timedistancehour
201411AAJFKLAX35924759
201411AAJFKLAX363247511
201411AAJFKLAX351247519
201411AALGAPBI15710357
201411AAJFKLAX350247513
201411AAEWRLAX339245418

Aggregations

How can we get the number of trips corresponding to each origin airport?

R

ans <- flights[, .(.N), by = .(origin)]
kable(ans)
originN
JFK81483
LGA84433
EWR87400

Clojure

(-> flights
    (tc/group-by ["origin"])
    (tc/aggregate {:N tc/row-count}))

_unnamed [3 2]:

origin:N
JFK81483
LGA84433
EWR87400
How can we calculate the number of trips for each origin airport for carrier code “AA”?

R

ans <- flights[carrier == "AA", .N, by = origin]
kable(ans)
originN
JFK11923
LGA11730
EWR2649

Clojure

(-> flights
    (tc/select-rows #(= (get % "carrier") "AA"))
    (tc/group-by ["origin"])
    (tc/aggregate {:N tc/row-count}))

_unnamed [3 2]:

origin:N
JFK11923
LGA11730
EWR2649
How can we get the total number of trips for each origin, dest pair for carrier code “AA”?

R

ans <- flights[carrier == "AA", .N, by = .(origin, dest)]
kable(head(ans))
origindestN
JFKLAX3387
LGAPBI245
EWRLAX62
JFKMIA1876
JFKSEA298
EWRMIA848

Clojure

(-> flights
    (tc/select-rows #(= (get % "carrier") "AA"))
    (tc/group-by ["origin" "dest"])
    (tc/aggregate {:N tc/row-count})
    (tc/head 6))

_unnamed [6 3]:

origindest:N
JFKLAX3387
LGAPBI245
EWRLAX62
JFKMIA1876
JFKSEA298
EWRMIA848
How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”?

R

ans <- flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]
kable(head(ans,10))
origindestmonthV1V2
JFKLAX16.59036114.2289157
LGAPBI1-7.7586210.3103448
EWRLAX11.3666677.5000000
JFKMIA115.72067018.7430168
JFKSEA114.35714330.7500000
EWRMIA111.01123612.1235955
JFKSFO119.25225228.6396396
JFKBOS112.91964315.2142857
JFKORD131.58620740.1724138
JFKIAH128.85714314.2857143

Clojure

(-> flights
    (tc/select-rows #(= (get % "carrier") "AA"))
    (tc/group-by ["origin" "dest" "month"])
    (tc/aggregate [#(dfn/mean (% "arr_delay"))
                    #(dfn/mean (% "dep_delay"))])
    (tc/head 10))

_unnamed [10 5]:

origindestmonth:summary-0:summary-1
EWRDFW210.5367647111.34558824
LGAORD25.285266468.78996865
JFKBOS211.1800000011.76000000
JFKLAX222.7520000015.08000000
EWRLAX210.333333334.11111111
JFKSJU210.3500000010.81250000
LGAMIA29.6250000010.70000000
JFKSTT215.1153846220.96153846
LGADFW25.371335507.47882736
EWRMIA21.564102564.75641026
So how can we directly order by all the grouping variables?

R

ans <- flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        keyby = .(origin, dest, month)]
kable(head(ans,10))
origindestmonthV1V2
EWRDFW16.42767310.012579
EWRDFW210.53676511.345588
EWRDFW312.8650318.079755
EWRDFW417.79268312.920732
EWRDFW518.48780518.682927
EWRDFW637.00595238.744048
EWRDFW720.25000021.154762
EWRDFW816.93604622.069767
EWRDFW95.86503113.055215
EWRDFW1018.81366518.894410

Clojure

(-> flights
    (tc/select-rows #(= (get % "carrier") "AA"))
    (tc/group-by ["origin" "dest" "month"])
    (tc/aggregate [#(dfn/mean (% "arr_delay"))
                    #(dfn/mean (% "dep_delay"))])
    (tc/order-by ["origin" "dest" "month"])
    (tc/head 10))

_unnamed [10 5]:

origindestmonth:summary-0:summary-1
EWRDFW16.4276729610.01257862
EWRDFW210.5367647111.34558824
EWRDFW312.865030678.07975460
EWRDFW417.7926829312.92073171
EWRDFW518.4878048818.68292683
EWRDFW637.0059523838.74404762
EWRDFW720.2500000021.15476190
EWRDFW816.9360465122.06976744
EWRDFW95.8650306713.05521472
EWRDFW1018.8136646018.89440994
Can by accept expressions as well or does it just take columns?

R

ans <- flights[, .N, .(dep_delay>0, arr_delay>0)]
kable(ans)
dep_delayarr_delayN
TRUETRUE72836
FALSETRUE34583
FALSEFALSE119304
TRUEFALSE26593

Clojure

(-> flights
    (tc/group-by (fn [row]
                    {:dep_delay (pos? (get row "dep_delay"))
                     :arr_delay (pos? (get row "arr_delay"))}))
    (tc/aggregate {:N tc/row-count}))

_unnamed [4 3]:

:dep_delay:arr_delay:N
truetrue72836
falsetrue34583
falsefalse119304
truefalse26593
Do we have to compute mean() for each column individually?

R

kable(DT)
IDabc
b1713
b2814
b3915
a41016
a51117
c61218
DT[, print(.SD), by = ID]
   a b  c
1: 1 7 13
2: 2 8 14
3: 3 9 15
   a  b  c
1: 4 10 16
2: 5 11 17
   a  b  c
1: 6 12 18

Empty data.table (0 rows and 1 cols): ID
kable(DT[, lapply(.SD, mean), by = ID])
IDabc
b2.08.014.0
a4.510.516.5
c6.012.018.0

Clojure

DT

(tc/group-by DT :ID {:result-type :as-map})

_unnamed [6 4]:

:ID:a:b:c
b1713
b2814
b3915
a41016
a51117
c61218

{“b” Group: b [3 4]:

:ID:a:b:c
b1713
b2814
b3915

, “a” Group: a [2 4]:

:ID:a:b:c
a41016
a51117

, “c” Group: c [1 4]:

:ID:a:b:c
c61218

}

(-> DT
    (tc/group-by [:ID])
    (tc/aggregate-columns (complement #{:ID}) dfn/mean))

_unnamed [3 4]:

:ID:a:b:c
b2.08.014.0
a4.510.516.5
c6.012.018.0
How can we specify just the columns we would like to compute the mean() on?

R

kable(head(flights[carrier == "AA",                         ## Only on trips with carrier "AA"
                   lapply(.SD, mean),                       ## compute the mean
                   by = .(origin, dest, month),             ## for every 'origin,dest,month'
                   .SDcols = c("arr_delay", "dep_delay")])) ## for just those specified in .SDcols
origindestmontharr_delaydep_delay
JFKLAX16.59036114.2289157
LGAPBI1-7.7586210.3103448
EWRLAX11.3666677.5000000
JFKMIA115.72067018.7430168
JFKSEA114.35714330.7500000
EWRMIA111.01123612.1235955

Clojure

(-> flights
    (tc/select-rows #(= (get % "carrier") "AA"))
    (tc/group-by ["origin" "dest" "month"])
    (tc/aggregate-columns ["arr_delay" "dep_delay"] dfn/mean)
    (tc/head 6))

_unnamed [6 5]:

origindestmontharr_delaydep_delay
EWRDFW210.5367647111.34558824
LGAORD25.285266468.78996865
JFKBOS211.1800000011.76000000
JFKLAX222.7520000015.08000000
EWRLAX210.333333334.11111111
JFKSJU210.3500000010.81250000
How can we return the first two rows for each month?

R

ans <- flights[, head(.SD, 2), by = month]
kable(head(ans))
monthyeardaydep_delayarr_delaycarrierorigindestair_timedistancehour
1201411413AAJFKLAX35924759
120141-313AAJFKLAX363247511
220141-11AAJFKLAX35824758
220141-53AAJFKLAX358247511
320141-1136AAJFKLAX37524758
320141-314AAJFKLAX368247511

Clojure

(-> flights
    (tc/group-by ["month"])
    (tc/head 2) ;; head applied on each group
    (tc/ungroup)
    (tc/head 6))

_unnamed [6 11]:

yearmonthdaydep_delayarr_delaycarrierorigindestair_timedistancehour
20142204-13DLLGAATL11276220
201422083DLLGAPBI141103510
201431-1136AAJFKLAX37524758
201431-314AAJFKLAX368247511
201441-8-23MQLGABNA11376418
201441-8-11MQLGARDU7143118
How can we concatenate columns a and b for each group in ID?

R

kable(DT[, .(val = c(a,b)), by = ID])
IDval
b1
b2
b3
b7
b8
b9
a4
a5
a10
a11
c6
c12

Clojure

(-> DT
    (tc/pivot->longer [:a :b] {:value-column-name :val})
    (tc/drop-columns [:$column :c]))

_unnamed [12 2]:

:ID:val
b1
b2
b3
a4
a5
c6
b7
b8
b9
a10
a11
c12
What if we would like to have all the values of column a and b concatenated, but returned as a list column?

R

kable(DT[, .(val = list(c(a,b))), by = ID])
IDval
b1, 2, 3, 7, 8, 9
a4, 5, 10, 11
c6, 12

Clojure

(-> DT
    (tc/pivot->longer [:a :b] {:value-column-name :val})
    (tc/drop-columns [:$column :c])
    (tc/fold-by :ID))

_unnamed [3 2]:

:ID:val
b[1 2 3 7 8 9]
a[4 5 10 11]
c[6 12]

API tour

The snippets below are taken from A data.table and dplyr tour written by Atrebas (permission granted).

I keep the structure and subtitles but skip the data.table and dplyr examples.

Example data

(def DS (tc/dataset {:V1 (take 9 (cycle [1 2]))
                      :V2 (range 1 10)
                      :V3 (take 9 (cycle [0.5 1.0 1.5]))
                      :V4 (take 9 (cycle ["A" "B" "C"]))}))
(tc/dataset? DS)
(class DS)
true
tech.v3.dataset.impl.dataset.Dataset
DS

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C
240.5A
151.0B
261.5C
170.5A
281.0B
191.5C

Basic Operations

Filter rows

Filter rows using indices

(tc/select-rows DS [2 3])

_unnamed [2 4]:

:V1:V2:V3:V4
131.5C
240.5A

Discard rows using negative indices

The Clojure API has a separate function for that: drop-rows.

(tc/drop-rows DS (range 2 7))

_unnamed [4 4]:

:V1:V2:V3:V4
110.5A
221.0B
281.0B
191.5C

Filter rows using a logical expression

(tc/select-rows DS (comp #(> % 5) :V2))

_unnamed [4 4]:

:V1:V2:V3:V4
261.5C
170.5A
281.0B
191.5C
(tc/select-rows DS (comp #{"A" "C"} :V4))

_unnamed [6 4]:

:V1:V2:V3:V4
110.5A
131.5C
240.5A
261.5C
170.5A
191.5C

Filter rows using multiple conditions

(tc/select-rows DS #(and (= (:V1 %) 1)
                          (= (:V4 %) "A")))

_unnamed [2 4]:

:V1:V2:V3:V4
110.5A
170.5A

Filter unique rows

(tc/unique-by DS)

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C
240.5A
151.0B
261.5C
170.5A
281.0B
191.5C
(tc/unique-by DS [:V1 :V4])

_unnamed [6 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C
240.5A
151.0B
261.5C
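
In the output above, each distinct (:V1, :V4) pair is represented by the first row in which it occurs. A plain-Clojure sketch of that behaviour is shown below. It assumes, based on the printed result only, that the first row per key is kept. The helper first-by is hypothetical and is not how tablecloth implements unique-by.

```clojure
;; Hypothetical helper: keeps the first row seen for every key.
(defn first-by [key-fn rows]
  (->> rows
       (reduce (fn [[seen out] row]
                 (let [k (key-fn row)]
                   (if (contains? seen k)
                     [seen out]
                     [(conj seen k) (conj out row)])))
               [#{} []])
       second))

;; The same data as DS, as a sequence of row maps.
(def rows (map (fn [v1 v2 v3 v4] {:V1 v1 :V2 v2 :V3 v3 :V4 v4})
               (take 9 (cycle [1 2]))
               (range 1 10)
               (take 9 (cycle [0.5 1.0 1.5]))
               (take 9 (cycle ["A" "B" "C"]))))

(mapv :V2 (first-by (juxt :V1 :V4) rows))
;; => [1 2 3 4 5 6]
```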

Discard rows with missing values

(tc/drop-missing DS)

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C
240.5A
151.0B
261.5C
170.5A
281.0B
191.5C

Other filters

(tc/random DS 3) ;; 3 random rows

_unnamed [3 4]:

:V1:V2:V3:V4
131.5C
110.5A
170.5A
(tc/random DS (/ (tc/row-count DS) 2)) ;; fraction of random rows

_unnamed [5 4]:

:V1:V2:V3:V4
281.0B
151.0B
170.5A
110.5A
240.5A
(tc/by-rank DS :V1 zero?) ;; take top n entries

_unnamed [4 4]:

:V1:V2:V3:V4
221.0B
240.5A
261.5C
281.0B

Convenience functions

(tc/select-rows DS (comp (partial re-matches #"^B") str :V4))

_unnamed [3 4]:

:V1:V2:V3:V4
221.0B
151.0B
281.0B
(tc/select-rows DS (comp #(<= 3 % 5) :V2))

_unnamed [3 4]:

:V1:V2:V3:V4
131.5C
240.5A
151.0B
(tc/select-rows DS (comp #(< 3 % 5) :V2))

_unnamed [1 4]:

:V1:V2:V3:V4
240.5A
(tc/select-rows DS (comp #(<= 3 % 5) :V2))

_unnamed [3 4]:

:V1:V2:V3:V4
131.5C
240.5A
151.0B

Last example skipped.

Sort rows

Sort rows by column

(tc/order-by DS :V3)

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
240.5A
170.5A
221.0B
151.0B
281.0B
131.5C
261.5C
191.5C

Sort rows in decreasing order

(tc/order-by DS :V3 :desc)

_unnamed [9 4]:

:V1:V2:V3:V4
131.5C
261.5C
191.5C
151.0B
221.0B
281.0B
170.5A
240.5A
110.5A

Sort rows based on several columns

(tc/order-by DS [:V1 :V2] [:asc :desc])

_unnamed [9 4]:

:V1:V2:V3:V4
191.5C
170.5A
151.0B
131.5C
110.5A
281.0B
261.5C
240.5A
221.0B
Select columns

Select one column using an index (not recommended)

(nth (tc/columns DS :as-seq) 2) ;; as column (iterable)
#tech.v3.dataset.column<float64>[9]
:V3
[0.5000, 1.000, 1.500, 0.5000, 1.000, 1.500, 0.5000, 1.000, 1.500]
(tc/dataset [(nth (tc/columns DS :as-seq) 2)])

_unnamed [9 1]:

:V3
0.5
1.0
1.5
0.5
1.0
1.5
0.5
1.0
1.5

Select one column using column name

(tc/select-columns DS :V2) ;; as dataset

_unnamed [9 1]:

:V2
1
2
3
4
5
6
7
8
9
(tc/select-columns DS [:V2]) ;; as dataset

_unnamed [9 1]:

:V2
1
2
3
4
5
6
7
8
9
(DS :V2) ;; as column (iterable)
#tech.v3.dataset.column<int64>[9]
:V2
[1, 2, 3, 4, 5, 6, 7, 8, 9]

Select several columns

(tc/select-columns DS [:V2 :V3 :V4])

_unnamed [9 3]:

:V2:V3:V4
10.5A
21.0B
31.5C
40.5A
51.0B
61.5C
70.5A
81.0B
91.5C

Exclude columns

(tc/select-columns DS (complement #{:V2 :V3 :V4}))

_unnamed [9 1]:

:V1
1
2
1
2
1
2
1
2
1
(tc/drop-columns DS [:V2 :V3 :V4])

_unnamed [9 1]:

:V1
1
2
1
2
1
2
1
2
1

Other selections

(->> (range 1 3)
     (map (comp keyword (partial format "V%d")))
     (tc/select-columns DS))

_unnamed [9 2]:

:V1:V2
11
22
13
24
15
26
17
28
19
(tc/reorder-columns DS :V4)

_unnamed [9 4]:

:V4:V1:V2:V3
A110.5
B221.0
C131.5
A240.5
B151.0
C261.5
A170.5
B281.0
C191.5
(tc/select-columns DS #(clojure.string/starts-with? (name %) "V"))

_unnamed [9 4]:

:V1:V2:V3:V4
110.5A
221.0B
131.5C
240.5A
151.0B
261.5C
170.5A
281.0B
191.5C
(tc/select-columns DS #(clojure.string/ends-with? (name %) "3"))

_unnamed [9 1]:

:V3
0.5
1.0
1.5
0.5
1.0
1.5
0.5
1.0
1.5
(tc/select-columns DS #"..2") ;; regex converts to string using `str` function

_unnamed [9 1]:

:V2
1
2
3
4
5
6
7
8
9
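
The two dots in the pattern above are explained by the comment in the call: the column name is converted with str, and str keeps the leading colon of a keyword. The plain-Clojure sketch below shows the difference between str and name. Using re-matches here is an assumption made for illustration, since the source does not state which matching function the library uses.

```clojure
;; `str` keeps the colon of a keyword, `name` drops it.
(str :V2)
;; => ":V2"
(name :V2)
;; => "V2"

;; Assumption for illustration: the whole string must match.
(re-matches #"..2" (str :V2))
;; => ":V2"
(re-matches #"..2" (name :V2))
;; => nil

(def matching-names
  (filterv #(re-matches #"..2" (str %)) [:V1 :V2 :V3 :V4]))

matching-names
;; => [:V2]
```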
(tc/select-columns DS #{:V1 "X"})

_unnamed [9 1]:

:V1
1
2
1
2
1
2
1
2
1
(tc/select-columns DS #(not (clojure.string/starts-with? (name %) "V2")))

_unnamed [9 3]:

:V1:V3:V4
10.5A
21.0B
11.5C
20.5A
11.0B
21.5C
10.5A
21.0B
11.5C
Summarise data

Summarise one column

(reduce + (DS :V1)) ;; using pure Clojure, as value
13
(tc/aggregate-columns DS :V1 dfn/sum) ;; as dataset

_unnamed [1 1]:

:V1
13.0
(tc/aggregate DS {:sumV1 #(dfn/sum (% :V1))})

_unnamed [1 1]:

:sumV1
13.0

Summarise several columns

(tc/aggregate DS [#(dfn/sum (% :V1))
                   #(dfn/standard-deviation (% :V3))])

_unnamed [1 2]:

:summary-0:summary-1
13.00.4330127
(tc/aggregate-columns DS [:V1 :V3] [dfn/sum
                                     dfn/standard-deviation])

_unnamed [1 2]:

:V1:V3
13.00.4330127

Summarise several columns and assign column names

(tc/aggregate DS {:sumv1 #(dfn/sum (% :V1))
                   :sdv3 #(dfn/standard-deviation (% :V3))})

_unnamed [1 2]:

:sumv1:sdv3
13.00.4330127

Summarise a subset of rows

(-> DS
    (tc/select-rows (range 4))
    (tc/aggregate-columns :V1 dfn/sum))

_unnamed [1 1]:

:V1
6.0
Additional helpers
(-> DS
    (tc/first)
    (tc/select-columns :V3)) ;; select first row from `:V3` column

_unnamed [1 1]:

:V3
0.5
(-> DS
    (tc/last)
    (tc/select-columns :V3)) ;; select last row from `:V3` column

_unnamed [1 1]:

:V3
1.5
(-> DS
    (tc/select-rows 4)
    (tc/select-columns :V3)) ;; select the fifth row (index 4) from `:V3` column

_unnamed [1 1]:

:V3
1.0
(-> DS
    (tc/select :V3 4)) ;; select the fifth row (index 4) from `:V3` column

_unnamed [1 1]:

:V3
1.0
(-> DS
    (tc/unique-by :V4)
    (tc/aggregate tc/row-count)) ;; number of unique rows in `:V4` column, as dataset

_unnamed [1 1]:

summary
3
(-> DS
    (tc/unique-by :V4)
    (tc/row-count)) ;; number of unique rows in `:V4` column, as value
3
(-> DS
    (tc/unique-by)
    (tc/row-count)) ;; number of unique rows in dataset, as value
9
Add/update/delete columns

Modify a column

(tc/map-columns DS :V1 [:V1] #(dfn/pow % 2))

_unnamed [9 4]:

:V1:V2:V3:V4
1.010.5A
4.021.0B
1.031.5C
4.040.5A
1.051.0B
4.061.5C
1.070.5A
4.081.0B
1.091.5C
(def DS (tc/add-column DS :V1 (dfn/pow (DS :V1) 2)))
DS

_unnamed [9 4]:

:V1:V2:V3:V4
1.010.5A
4.021.0B
1.031.5C
4.040.5A
1.051.0B
4.061.5C
1.070.5A
4.081.0B
1.091.5C

Add one column

(tc/map-columns DS :v5 [:V1] dfn/log)

_unnamed [9 5]:

:V1:V2:V3:V4:v5
1.010.5A0.00000000
4.021.0B1.38629436
1.031.5C0.00000000
4.040.5A1.38629436
1.051.0B0.00000000
4.061.5C1.38629436
1.070.5A0.00000000
4.081.0B1.38629436
1.091.5C0.00000000
(def DS (tc/add-column DS :v5 (dfn/log (DS :V1))))
DS

_unnamed [9 5]:

:V1:V2:V3:V4:v5
1.010.5A0.00000000
4.021.0B1.38629436
1.031.5C0.00000000
4.040.5A1.38629436
1.051.0B0.00000000
4.061.5C1.38629436
1.070.5A0.00000000
4.081.0B1.38629436
1.091.5C0.00000000

Add several columns

(def DS (tc/add-columns DS {:v6 (dfn/sqrt (DS :V1))
                                       :v7 "X"}))
DS

_unnamed [9 7]:

:V1:V2:V3:V4:v5:v6:v7
1.010.5A0.000000001.0X
4.021.0B1.386294362.0X
1.031.5C0.000000001.0X
4.040.5A1.386294362.0X
1.051.0B0.000000001.0X
4.061.5C1.386294362.0X
1.070.5A0.000000001.0X
4.081.0B1.386294362.0X
1.091.5C0.000000001.0X

Create one column and remove the others

(tc/dataset {:v8 (dfn/+ (DS :V3) 1)})

_unnamed [9 1]:

:v8
1.5
2.0
2.5
1.5
2.0
2.5
1.5
2.0
2.5

Remove one column

(def DS (tc/drop-columns DS :v5))
DS

_unnamed [9 6]:

:V1:V2:V3:V4:v6:v7
1.010.5A1.0X
4.021.0B2.0X
1.031.5C1.0X
4.040.5A2.0X
1.051.0B1.0X
4.061.5C2.0X
1.070.5A1.0X
4.081.0B2.0X
1.091.5C1.0X

Remove several columns

(def DS (tc/drop-columns DS [:v6 :v7]))
DS

_unnamed [9 4]:

:V1:V2:V3:V4
1.010.5A
4.021.0B
1.031.5C
4.040.5A
1.051.0B
4.061.5C
1.070.5A
4.081.0B
1.091.5C

Remove columns using a vector of colnames

We use a set here.

(def DS (tc/select-columns DS (complement #{:V3})))
DS

_unnamed [9 3]:

:V1:V2:V4
1.01A
4.02B
1.03C
4.04A
1.05B
4.06C
1.07A
4.08B
1.09C

Replace values for rows matching a condition

(def DS (tc/map-columns DS :V2 [:V2] #(if (< % 4.0) 0.0 %)))
DS

_unnamed [9 3]:

:V1:V2:V4
1.00.0A
4.00.0B
1.00.0C
4.04.0A
1.05.0B
4.06.0C
1.07.0A
4.08.0B
1.09.0C
by

By group

(-> DS
    (tc/group-by [:V4])
    (tc/aggregate {:sumV2 #(dfn/sum (% :V2))}))

_unnamed [3 2]:

:V4:sumV2
A11.0
B13.0
C15.0

By several groups

(-> DS
    (tc/group-by [:V4 :V1])
    (tc/aggregate {:sumV2 #(dfn/sum (% :V2))}))

_unnamed [6 3]:

:V4:V1:sumV2
A1.07.0
B4.08.0
C1.09.0
A4.04.0
B1.05.0
C4.06.0

Calling function in by

(-> DS
    (tc/group-by (fn [row]
                    (clojure.string/lower-case (:V4 row))))
    (tc/aggregate {:sumV1 #(dfn/sum (% :V1))}))

_unnamed [3 2]:

:$group-name:sumV1
a6.0
b9.0
c6.0

Assigning column name in by

(-> DS
    (tc/group-by (fn [row]
                    {:abc (clojure.string/lower-case (:V4 row))}))
    (tc/aggregate {:sumV1 #(dfn/sum (% :V1))}))

_unnamed [3 2]:

:abc:sumV1
a6.0
b9.0
c6.0
(-> DS
    (tc/group-by (fn [row]
                    (clojure.string/lower-case (:V4 row))))
    (tc/aggregate {:sumV1 #(dfn/sum (% :V1))} {:add-group-as-column :abc}))

_unnamed [3 2]:

:$group-name:sumV1
a6.0
b9.0
c6.0

Using a condition in by

(-> DS
    (tc/group-by #(= (:V4 %) "A"))
    (tc/aggregate #(dfn/sum (% :V1))))

_unnamed [2 2]:

:$group-namesummary
true6.0
false15.0

By on a subset of rows

(-> DS
    (tc/select-rows (range 5))
    (tc/group-by :V4)
    (tc/aggregate {:sumV1 #(dfn/sum (% :V1))}))

_unnamed [3 2]:

:$group-name:sumV1
A5.0
B5.0
C1.0

Count number of observations for each group

(-> DS
    (tc/group-by :V4)
    (tc/aggregate tc/row-count))

_unnamed [3 2]:

:$group-namesummary
A3
B3
C3

Add a column with number of observations for each group

(-> DS
    (tc/group-by [:V1])
    (tc/add-column :n tc/row-count)
    (tc/ungroup))

_unnamed [9 4]:

:V1:V2:V4:n
1.00.0A5
1.00.0C5
1.05.0B5
1.07.0A5
1.09.0C5
4.00.0B4
4.04.0A4
4.06.0C4
4.08.0B4

Retrieve the first/last/nth observation for each group

(-> DS
    (tc/group-by [:V4])
    (tc/aggregate-columns :V2 first))

_unnamed [3 2]:

:V4:V2
A0.0
B0.0
C0.0
(-> DS
    (tc/group-by [:V4])
    (tc/aggregate-columns :V2 last))

_unnamed [3 2]:

:V4:V2
A7.0
B8.0
C9.0
(-> DS
    (tc/group-by [:V4])
    (tc/aggregate-columns :V2 #(nth % 1)))

_unnamed [3 2]:

:V4:V2
A4.0
B5.0
C6.0

Going further

Advanced columns manipulation

Summarise all the columns

;; custom max function which works on every type
(tc/aggregate-columns DS :all (fn [col] (first (sort #(compare %2 %1) col))))

_unnamed [1 3]:

:V1:V2:V4
4.09.0C

Summarise several columns

(tc/aggregate-columns DS [:V1 :V2] dfn/mean)

_unnamed [1 2]:

| :V1 | :V2 |
|---|---|
| 2.33333333 | 4.33333333 |

Summarise several columns by group

(-> DS
    (tc/group-by [:V4])
    (tc/aggregate-columns [:V1 :V2] dfn/mean))

_unnamed [3 3]:

| :V4 | :V1 | :V2 |
|---|---|---|
| A | 2.0 | 3.66666667 |
| B | 3.0 | 4.33333333 |
| C | 2.0 | 5.00000000 |

Summarise with more than one function by group

(-> DS
    (tc/group-by [:V4])
    (tc/aggregate-columns [:V1 :V2] (fn [col]
                                       {:sum (dfn/sum col)
                                        :mean (dfn/mean col)})))

_unnamed [3 5]:

| :V4 | :V1-sum | :V1-mean | :V2-sum | :V2-mean |
|---|---|---|---|---|
| A | 6.0 | 2.0 | 11.0 | 3.66666667 |
| B | 9.0 | 3.0 | 13.0 | 4.33333333 |
| C | 6.0 | 2.0 | 15.0 | 5.00000000 |

Summarise using a condition

(-> DS
    (tc/select-columns :type/numerical)
    (tc/aggregate-columns :all dfn/mean))

_unnamed [1 2]:

| :V1 | :V2 |
|---|---|
| 2.33333333 | 4.33333333 |

Modify all the columns

(tc/update-columns DS :all reverse)

_unnamed [9 3]:

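| :V1 | :V2 | :V4 |
|---|---|---|
| 1.0 | 9.0 | C |
| 4.0 | 8.0 | B |
| 1.0 | 7.0 | A |
| 4.0 | 6.0 | C |
| 1.0 | 5.0 | B |
| 4.0 | 4.0 | A |
| 1.0 | 0.0 | C |
| 4.0 | 0.0 | B |
| 1.0 | 0.0 | A |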

Modify several columns (dropping the others)

(-> DS
    (tc/select-columns [:V1 :V2])
    (tc/update-columns :all dfn/sqrt))

_unnamed [9 2]:

| :V1 | :V2 |
|---|---|
| 1.0 | 0.00000000 |
| 2.0 | 0.00000000 |
| 1.0 | 0.00000000 |
| 2.0 | 2.00000000 |
| 1.0 | 2.23606798 |
| 2.0 | 2.44948974 |
| 1.0 | 2.64575131 |
| 2.0 | 2.82842712 |
| 1.0 | 3.00000000 |

(-> DS
    (tc/select-columns (complement #{:V4}))
    (tc/update-columns :all dfn/exp))

_unnamed [9 2]:

| :V1 | :V2 |
|---|---|
| 2.71828183 | 1.00000000 |
| 54.59815003 | 1.00000000 |
| 2.71828183 | 1.00000000 |
| 54.59815003 | 54.59815003 |
| 2.71828183 | 148.41315910 |
| 54.59815003 | 403.42879349 |
| 2.71828183 | 1096.63315843 |
| 54.59815003 | 2980.95798704 |
| 2.71828183 | 8103.08392758 |

Modify several columns (keeping the others)

(def DS (tc/update-columns DS [:V1 :V2] dfn/sqrt))
DS

_unnamed [9 3]:

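| :V1 | :V2 | :V4 |
|---|---|---|
| 1.0 | 0.00000000 | A |
| 2.0 | 0.00000000 | B |
| 1.0 | 0.00000000 | C |
| 2.0 | 2.00000000 | A |
| 1.0 | 2.23606798 | B |
| 2.0 | 2.44948974 | C |
| 1.0 | 2.64575131 | A |
| 2.0 | 2.82842712 | B |
| 1.0 | 3.00000000 | C |

(def DS (tc/update-columns DS (complement #{:V4}) #(dfn/pow % 2)))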
DS

_unnamed [9 3]:

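| :V1 | :V2 | :V4 |
|---|---|---|
| 1.0 | 0.0 | A |
| 4.0 | 0.0 | B |
| 1.0 | 0.0 | C |
| 4.0 | 4.0 | A |
| 1.0 | 5.0 | B |
| 4.0 | 6.0 | C |
| 1.0 | 7.0 | A |
| 4.0 | 8.0 | B |
| 1.0 | 9.0 | C |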

Modify columns using a condition (dropping the others)

(-> DS
    (tc/select-columns :type/numerical)
    (tc/update-columns :all #(dfn/- % 1)))

_unnamed [9 2]:

| :V1 | :V2 |
|---|---|
| 0.0 | -1.0 |
| 3.0 | -1.0 |
| 0.0 | -1.0 |
| 3.0 | 3.0 |
| 0.0 | 4.0 |
| 3.0 | 5.0 |
| 0.0 | 6.0 |
| 3.0 | 7.0 |
| 0.0 | 8.0 |

Modify columns using a condition (keeping the others)

(def DS (tc/convert-types DS :type/numerical :int32))
DS

_unnamed [9 3]:

:V1:V2:V4
10A
40B
10C
44A
15B
45C
17A
48B
19C

Use a complex expression

(-> DS
    (tc/group-by [:V4])
    (tc/head 2)
    (tc/add-column :V2 "X")
    (tc/ungroup))

_unnamed [6 3]:

:V1:V2:V4
1XA
4XA
4XB
1XB
1XC
4XC

Use multiple expressions

(tc/dataset (let [x (dfn/+ (DS :V1) (dfn/sum (DS :V2)))]
               (println (seq (DS :V1)))
               (println (tc/info (tc/select-columns DS :V1)))
               {:A (range 1 (inc (tc/row-count DS)))
                :B x}))

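(1 4 1 4 1 4 1 4 1)
_unnamed: descriptive-stats [1 11]:

| :col-name | :datatype | :n-valid | :n-missing | :min | :mean | :max | :standard-deviation | :skew | :first | :last |
|---|---|---|---|---|---|---|---|---|---|---|
| :V1 | :int32 | 9 | 0 | 1.0 | 2.33333333 | 4.0 | 1.58113883 | 0.27105237 | 1 | 1 |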

_unnamed [9 2]:

| :A | :B |
|---|---|
| 1 | 39.0 |
| 2 | 42.0 |
| 3 | 39.0 |
| 4 | 42.0 |
| 5 | 39.0 |
| 6 | 42.0 |
| 7 | 39.0 |
| 8 | 42.0 |
| 9 | 39.0 |

Chain expressions

Expression chaining using ->

(-> DS
    (tc/group-by [:V4])
    (tc/aggregate {:V1sum #(dfn/sum (% :V1))})
    (tc/select-rows #(>= (:V1sum %) 5)))

_unnamed [3 2]:

:V4:V1sum
A6.0
B9.0
C6.0
(-> DS
    (tc/group-by [:V4])
    (tc/aggregate {:V1sum #(dfn/sum (% :V1))})
    (tc/order-by :V1sum :desc))

_unnamed [3 2]:

:V4:V1sum
B9.0
A6.0
C6.0

Indexing and Keys

Set the key/index (order)

(def DS (tc/order-by DS :V4))
DS

_unnamed [9 3]:

:V1:V2:V4
10A
44A
17A
40B
15B
48B
10C
45C
19C

Select the matching rows

(tc/select-rows DS #(= (:V4 %) "A"))

_unnamed [3 3]:

:V1:V2:V4
10A
44A
17A
(tc/select-rows DS (comp #{"A" "C"} :V4))

_unnamed [6 3]:

:V1:V2:V4
10A
44A
17A
10C
45C
19C

Select the first matching row

(-> DS
    (tc/select-rows #(= (:V4 %) "B"))
    (tc/first))

_unnamed [1 3]:

:V1:V2:V4
40B
(-> DS
    (tc/unique-by :V4)
    (tc/select-rows (comp #{"B" "C"} :V4)))

_unnamed [2 3]:

:V1:V2:V4
40B
10C

Select the last matching row

(-> DS
    (tc/select-rows #(= (:V4 %) "A"))
    (tc/last))

_unnamed [1 3]:

:V1:V2:V4
17A

Nomatch argument

(tc/select-rows DS (comp #{"A" "D"} :V4))

_unnamed [3 3]:

:V1:V2:V4
10A
44A
17A

Apply a function on the matching rows

(-> DS
    (tc/select-rows (comp #{"A" "C"} :V4))
    (tc/aggregate-columns :V1 (fn [col]
                                 {:sum (dfn/sum col)})))

_unnamed [1 1]:

:V1-sum
12.0

Modify values for matching rows

(def DS (-> DS
            (tc/map-columns :V1 [:V1 :V4] #(if (= %2 "A") 0 %1))
            (tc/order-by :V4)))
DS

_unnamed [9 3]:

:V1:V2:V4
00A
04A
07A
40B
15B
48B
10C
45C
19C

Use keys in by

(-> DS
    (tc/select-rows (comp (complement #{"B"}) :V4))
    (tc/group-by [:V4])
    (tc/aggregate-columns :V1 dfn/sum))

_unnamed [2 2]:

:V4:V1
A0.0
C6.0

Set keys/indices for multiple columns (ordered)

(tc/order-by DS [:V4 :V1])

_unnamed [9 3]:

:V1:V2:V4
00A
04A
07A
15B
40B
48B
10C
19C
45C

Subset using multiple keys/indices

(-> DS
    (tc/select-rows #(and (= (:V1 %) 1)
                           (= (:V4 %) "C"))))

_unnamed [2 3]:

:V1:V2:V4
10C
19C
(-> DS
    (tc/select-rows #(and (= (:V1 %) 1)
                           (#{"B" "C"} (:V4 %)))))

_unnamed [3 3]:

:V1:V2:V4
15B
10C
19C
(-> DS
    (tc/select-rows #(and (= (:V1 %) 1)
                           (#{"B" "C"} (:V4 %))) {:result-type :as-indexes}))

(4 6 8)

set*() modifications

Replace values

tech.ml.dataset has no mutating operations and no easy way to set a single value, so the whole column is rebuilt instead.

(def DS (tc/update-columns DS :V2 #(map-indexed (fn [idx v]
                                                   (if (zero? idx) 3 v)) %)))
DS

_unnamed [9 3]:

:V1:V2:V4
03A
04A
07A
40B
15B
48B
10C
45C
19C

Reorder rows

(def DS (tc/order-by DS [:V4 :V1] [:asc :desc]))
DS

_unnamed [9 3]:

:V1:V2:V4
03A
04A
07A
40B
48B
15B
45C
10C
19C

Modify colnames

(def DS (tc/rename-columns DS {:V2 "v2"}))
DS

_unnamed [9 3]:

:V1v2:V4
03A
04A
07A
40B
48B
15B
45C
10C
19C
(def DS (tc/rename-columns DS {"v2" :V2})) ;; revert back

Reorder columns

(def DS (tc/reorder-columns DS :V4 :V1 :V2))
DS

_unnamed [9 3]:

:V4:V1:V2
A03
A04
A07
B40
B48
B15
C45
C10
C19

Advanced use of by

Select first/last/… row by group

(-> DS
    (tc/group-by :V4)
    (tc/first)
    (tc/ungroup))

_unnamed [3 3]:

:V4:V1:V2
A03
B40
C45
(-> DS
    (tc/group-by :V4)
    (tc/select-rows [0 2])
    (tc/ungroup))

_unnamed [6 3]:

:V4:V1:V2
A03
A07
B40
B15
C45
C19
(-> DS
    (tc/group-by :V4)
    (tc/tail 2)
    (tc/ungroup))

_unnamed [6 3]:

:V4:V1:V2
A04
A07
B48
B15
C10
C19

Select rows using a nested query

(-> DS
    (tc/group-by :V4)
    (tc/order-by :V2)
    (tc/first)
    (tc/ungroup))

_unnamed [3 3]:

:V4:V1:V2
A03
B40
C10

Add a group counter column

(-> DS
    (tc/group-by [:V4 :V1])
    (tc/ungroup {:add-group-id-as-column :Grp}))

_unnamed [9 4]:

:Grp:V4:V1:V2
0A03
0A04
0A07
1B40
1B48
2B15
3C45
4C10
4C19

Get row number of first (and last) observation by group

(-> DS
    (tc/add-column :row-id (range))
    (tc/select-columns [:V4 :row-id])
    (tc/group-by :V4)
    (tc/ungroup))

_unnamed [9 2]:

:V4:row-id
A0
A1
A2
B3
B4
B5
C6
C7
C8
(-> DS
    (tc/add-column :row-id (range))
    (tc/select-columns [:V4 :row-id])
    (tc/group-by :V4)
    (tc/first)
    (tc/ungroup))

_unnamed [3 2]:

:V4:row-id
A0
B3
C6
(-> DS
    (tc/add-column :row-id (range))
    (tc/select-columns [:V4 :row-id])
    (tc/group-by :V4)
    (tc/select-rows [0 2])
    (tc/ungroup))

_unnamed [6 2]:

:V4:row-id
A0
A2
B3
B5
C6
C8

Handle list-columns by group

(-> DS
    (tc/select-columns [:V1 :V4])
    (tc/fold-by :V4))

_unnamed [3 2]:

:V4:V1
A[0 0 0]
B[4 4 1]
C[4 1 1]
(-> DS    
    (tc/group-by :V4)
    (tc/unmark-group))

_unnamed [3 3]:

:name:group-id:data
A0Group: A [3 3]:
B1Group: B [3 3]:
C2Group: C [3 3]:

Grouping sets (multiple by at once)

Not available.
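
Although grouping sets are not available as a single call, a similar result can be approximated by aggregating once per grouping and stacking the results with tc/bind. The following is only a sketch: the toy dataset `gs-ds`, the helper `sum-v2-by` and the column name `:sumV2` are made up for this illustration and are not part of the library.

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical toy dataset, used only for this illustration.
(def gs-ds (tc/dataset {:V1 [1 2 1 2]
                        :V2 [1 2 3 4]
                        :V4 ["A" "A" "B" "B"]}))

;; Hypothetical helper: sum :V2 for one grouping,
;; where the grouping is a vector of column names.
(defn sum-v2-by [ds grouping]
  (-> ds
      (tc/group-by grouping)
      (tc/aggregate {:sumV2 #(dfn/sum (% :V2))})))

;; One aggregation per grouping, stacked row-wise.
;; Columns that a given grouping does not use are left missing by tc/bind.
(def grouping-sets
  (->> [[:V1] [:V4]]
       (map #(sum-v2-by gs-ds %))
       (apply tc/bind)))

grouping-sets
```

Each grouping contributes its own block of rows, so the result here has one row per group of :V1 followed by one row per group of :V4.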

Miscellaneous

Read / Write data

Write data to a csv file

(tc/write! DS "DF.csv")
10

Write data to a tab-delimited file

(tc/write! DS "DF.txt" {:separator \tab})
10

or

(tc/write! DS "DF.tsv")
10

Read a csv / tab-delimited file

(tc/dataset "DF.csv" {:key-fn keyword})

DF.csv [9 3]:

:V4:V1:V2
A03
A04
A07
B40
B48
B15
C45
C10
C19
(tc/dataset "DF.txt" {:key-fn keyword})

DF.txt [9 1]:

:V4 V1 V2
A 0 3
A 0 4
A 0 7
B 4 0
B 4 8
B 1 5
C 4 5
C 1 0
C 1 9
(tc/dataset "DF.tsv" {:key-fn keyword})

DF.tsv [9 3]:

:V4:V1:V2
A03
A04
A07
B40
B48
B15
C45
C10
C19

Read a csv file selecting / dropping columns

(tc/dataset "DF.csv" {:key-fn keyword
                       :column-whitelist ["V1" "V4"]})

DF.csv [9 2]:

:V4:V1
A0
A0
A0
B4
B4
B1
C4
C1
C1
(tc/dataset "DF.csv" {:key-fn keyword
                       :column-blacklist ["V4"]})

DF.csv [9 2]:

:V1:V2
03
04
07
40
48
15
45
10
19

Read and rbind several files

(apply tc/concat (map tc/dataset ["DF.csv" "DF.csv"]))

DF.csv [18 3]:

V4V1V2
A03
A04
A07
B40
B48
B15
C45
C10
C19
A03
A04
A07
B40
B48
B15
C45
C10
C19

Reshape data

Melt data (from wide to long)

(def mDS (tc/pivot->longer DS [:V1 :V2] {:target-columns :variable
                                          :value-column-name :value}))
mDS

_unnamed [18 3]:

:V4:variable:value
A:V10
A:V10
A:V10
B:V14
B:V14
B:V11
C:V14
C:V11
C:V11
A:V23
A:V24
A:V27
B:V20
B:V28
B:V25
C:V25
C:V20
C:V29

Cast data (from long to wide)

(-> mDS
    (tc/pivot->wider :variable :value {:fold-fn vec})
    (tc/update-columns [:V1 :V2] (partial map count)))

_unnamed [3 3]:

| :V4 | :V1 | :V2 |
|---|---|---|
| A | 3 | 3 |
| B | 3 | 3 |
| C | 3 | 3 |

(-> mDS
    (tc/pivot->wider :variable :value {:fold-fn vec})
    (tc/update-columns [:V1 :V2] (partial map dfn/sum)))

_unnamed [3 3]:

| :V4 | :V1 | :V2 |
|---|---|---|
| A | 0.0 | 14.0 |
| B | 9.0 | 13.0 |
| C | 6.0 | 14.0 |

(-> mDS
    (tc/map-columns :value #(str (> % 5))) ;; coerce to strings
    (tc/pivot->wider :value :variable {:fold-fn vec})
    (tc/update-columns ["true" "false"] (partial map #(if (sequential? %) (count %) 1))))

_unnamed [3 3]:

:V4falsetrue
A51
B51
C51

Split

(tc/group-by DS :V4 {:result-type :as-map})

{"A" Group: A [3 3]:

:V4:V1:V2
A03
A04
A07

, "B" Group: B [3 3]:

:V4:V1:V2
B40
B48
B15

, "C" Group: C [3 3]:

:V4:V1:V2
C45
C10
C19

}


Split and transpose a vector/column

(-> {:a ["A:a" "B:b" "C:c"]}
    (tc/dataset)
    (tc/separate-column :a [:V1 :V2] ":"))

_unnamed [3 2]:

:V1:V2
Aa
Bb
Cc

Other

Skipped

Join/Bind data sets

(def x (tc/dataset {"Id" ["A" "B" "C" "C"]
                     "X1" [1 3 5 7]
                     "XY" ["x2" "x4" "x6" "x8"]}))
(def y (tc/dataset {"Id" ["A" "B" "B" "D"]
                     "Y1" [1 3 5 7]
                     "XY" ["y1" "y3" "y5" "y7"]}))
x y

_unnamed [4 3]:

IdX1XY
A1x2
B3x4
C5x6
C7x8

_unnamed [4 3]:

IdY1XY
A1y1
B3y3
B5y5
D7y7

Join

Join matching rows from y to x

(tc/left-join x y "Id")

left-outer-join [5 6]:

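| Id | X1 | XY | right.Id | Y1 | right.XY |
|---|---|---|---|---|---|
| A | 1 | x2 | A | 1 | y1 |
| B | 3 | x4 | B | 3 | y3 |
| B | 3 | x4 | B | 5 | y5 |
| C | 5 | x6 |  |  |  |
| C | 7 | x8 |  |  |  |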

Join matching rows from x to y

(tc/right-join x y "Id")

right-outer-join [4 6]:

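| Id | X1 | XY | right.Id | Y1 | right.XY |
|---|---|---|---|---|---|
| A | 1 | x2 | A | 1 | y1 |
| B | 3 | x4 | B | 3 | y3 |
| B | 3 | x4 | B | 5 | y5 |
|  |  |  | D | 7 | y7 |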

Join matching rows from both x and y

(tc/inner-join x y "Id")

inner-join [3 5]:

| Id | X1 | XY | Y1 | right.XY |
|---|---|---|---|---|
| A | 1 | x2 | 1 | y1 |
| B | 3 | x4 | 3 | y3 |
| B | 3 | x4 | 5 | y5 |

Join keeping all the rows

(tc/full-join x y "Id")

full-join [6 6]:

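| Id | X1 | XY | right.Id | Y1 | right.XY |
|---|---|---|---|---|---|
| A | 1 | x2 | A | 1 | y1 |
| B | 3 | x4 | B | 3 | y3 |
| B | 3 | x4 | B | 5 | y5 |
| C | 5 | x6 |  |  |  |
| C | 7 | x8 |  |  |  |
|  |  |  | D | 7 | y7 |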

Return rows from x matching y

(tc/semi-join x y "Id")

semi-join [2 3]:

IdX1XY
A1x2
B3x4

Return rows from x not matching y

(tc/anti-join x y "Id")

anti-join [2 3]:

IdX1XY
C5x6
C7x8

More joins

Select columns while joining

(tc/right-join (tc/select-columns x ["Id" "X1"])
                (tc/select-columns y ["Id" "XY"])
                "Id")

right-outer-join [4 4]:

| Id | X1 | right.Id | XY |
|---|---|---|---|
| A | 1 | A | y1 |
| B | 3 | B | y3 |
| B | 3 | B | y5 |
|  |  | D | y7 |

(tc/right-join (tc/select-columns x ["Id" "XY"])
                (tc/select-columns y ["Id" "XY"])
                "Id")

right-outer-join [4 4]:

| Id | XY | right.Id | right.XY |
|---|---|---|---|
| A | x2 | A | y1 |
| B | x4 | B | y3 |
| B | x4 | B | y5 |
|  |  | D | y7 |

Aggregate columns while joining

(-> y
    (tc/group-by ["Id"])
    (tc/aggregate {"sumY1" #(dfn/sum (% "Y1"))})
    (tc/right-join x "Id")
    (tc/add-column "X1Y1" (fn [ds] (dfn/* (ds "sumY1")
                                                    (ds "X1"))))
    (tc/select-columns ["right.Id" "X1Y1"]))

right-outer-join [4 2]:

| right.Id | X1Y1 |
|---|---|
| A | 1.0 |
| B | 24.0 |
| C |  |
| C |  |

Update columns while joining

(-> x
    (tc/select-columns ["Id" "X1"])
    (tc/map-columns "SqX1" "X1" (fn [x] (* x x)))
    (tc/right-join y "Id")
    (tc/drop-columns ["X1" "Id"]))

right-outer-join [4 4]:

| SqX1 | right.Id | Y1 | XY |
|---|---|---|---|
| 1 | A | 1 | y1 |
| 9 | B | 3 | y3 |
| 9 | B | 5 | y5 |
|  | D | 7 | y7 |

Adds a list column with rows from y matching x (nest-join)

(-> (tc/left-join x y "Id")
    (tc/drop-columns ["right.Id"])
    (tc/fold-by (tc/column-names x)))

_unnamed [4 5]:

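| Id | X1 | XY | Y1 | right.XY |
|---|---|---|---|---|
| A | 1 | x2 | [1] | ["y1"] |
| B | 3 | x4 | [3 5] | ["y3" "y5"] |
| C | 5 | x6 | [] | [] |
| C | 7 | x8 | [] | [] |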

Some joins are skipped


Cross join

(def cjds (tc/dataset {:V1 [[2 1 1]]
                        :V2 [[3 2]]}))
cjds

_unnamed [1 2]:

:V1:V2
[2 1 1][3 2]
(reduce #(tc/unroll %1 %2) cjds (tc/column-names cjds))

_unnamed [6 2]:

:V1:V2
23
22
13
12
13
12
(-> (reduce #(tc/unroll %1 %2) cjds (tc/column-names cjds))
    (tc/unique-by))

_unnamed [4 2]:

:V1:V2
23
22
13
12
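
The unroll-based cross join above can also be written by building the pairings directly in plain Clojure and handing them to tc/dataset as a sequence of row maps. This is just an alternative sketch with the same input values; the name `cross-joined` is made up for the illustration.

```clojure
(require '[tablecloth.api :as tc])

;; Every pairing of the two input vectors, as a sequence of row maps.
(def cross-joined
  (tc/dataset (for [v1 [2 1 1]
                    v2 [3 2]]
                {:V1 v1 :V2 v2})))

;; 3 values * 2 values = 6 pairings
(tc/row-count cross-joined)

;; dropping duplicated pairings, as with tc/unique-by above
(tc/row-count (tc/unique-by cross-joined))
```

The `for` comprehension makes the pairing explicit, at the cost of leaving the dataset API while the rows are built.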

Bind

(def x (tc/dataset {:V1 [1 2 3]}))
(def y (tc/dataset {:V1 [4 5 6]}))
(def z (tc/dataset {:V1 [7 8 9]
                     :V2 [0 0 0]}))
x y z

_unnamed [3 1]:

:V1
1
2
3

_unnamed [3 1]:

:V1
4
5
6

_unnamed [3 2]:

:V1:V2
70
80
90

Bind rows

(tc/bind x y)

_unnamed [6 1]:

:V1
1
2
3
4
5
6
(tc/bind x z)

_unnamed [6 2]:

| :V1 | :V2 |
|---|---|
| 1 |  |
| 2 |  |
| 3 |  |
| 7 | 0 |
| 8 | 0 |
| 9 | 0 |

Bind rows using a list

(->> [x y]
     (map-indexed #(tc/add-column %2 :id (repeat %1)))
     (apply tc/bind))

_unnamed [6 2]:

:V1:id
10
20
30
41
51
61

Bind columns

(tc/append x y)

_unnamed [3 2]:

| :V1 | :V1 |
|---|---|
| 1 | 4 |
| 2 | 5 |
| 3 | 6 |

Set operations

(def x (tc/dataset {:V1 [1 2 2 3 3]}))
(def y (tc/dataset {:V1 [2 2 3 4 4]}))
x y

_unnamed [5 1]:

:V1
1
2
2
3
3

_unnamed [5 1]:

:V1
2
2
3
4
4

Intersection

(tc/intersect x y)

intersection [2 1]:

:V1
2
3

Difference

(tc/difference x y)

difference [1 1]:

:V1
1

Union

(tc/union x y)

union [4 1]:

:V1
1
2
3
4
(tc/concat x y)

_unnamed [10 1]:

:V1
1
2
2
3
3
2
2
3
4
4

Equality not implemented
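
A set-style equality check can still be sketched with existing functions by comparing column names and the sets of rows. The helper `same-rows?` below is hypothetical (it is not part of tablecloth) and assumes that duplicates and row order should be ignored.

```clojure
(require '[tablecloth.api :as tc])

;; Hypothetical helper: true when both datasets have the same column names
;; (in the same order) and the same set of distinct rows.
(defn same-rows? [a b]
  (and (= (tc/column-names a) (tc/column-names b))
       (= (set (map vec (tc/rows a)))
          (set (map vec (tc/rows b))))))

;; same distinct values, different order and duplicates
(same-rows? (tc/dataset {:V1 [1 2 2 3]})
            (tc/dataset {:V1 [3 2 1]}))

;; different distinct values
(same-rows? (tc/dataset {:V1 [1 2 2 3]})
            (tc/dataset {:V1 [2 2 3 4]}))
```

Rows are converted with `vec` so that they are compared by value. A stricter check would also compare row order and column datatypes.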

Can you improve this documentation? These fine people already did:
Ethan Miller & daslu