Column-major dataset abstraction for efficiently manipulating in-memory datasets.
(->>dataset dataset)
(->>dataset options dataset)
Please see documentation of ->dataset. Options are the same.
(->dataset dataset)
(->dataset dataset {:keys [table-name dataset-name] :as options})
Create a dataset from either csv/tsv or a sequence of maps.

- A `String` or `InputStream` will be interpreted as a file (or gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv, and will likewise attempt to detect column datatypes; all of this can be overridden.
- A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

Options:

- `:dataset-name` - set the name of the dataset.
- `:file-type` - Override the filetype discovery mechanism for strings, or force a particular parser for an input stream. Note that arrow and parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :arrow :parquet}.
- `:gzipped?` - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.
- `:column-whitelist` - either a sequence of string column names or a sequence of column indices of columns to whitelist.
- `:column-blacklist` - either a sequence of string column names or a sequence of column indices of columns to blacklist.
- `:num-rows` - Number of rows to read.
- `:header-row?` - Defaults to true; indicates the first row is a header.
- `:key-fn` - function to be applied to column names. Typical use is `:key-fn keyword`.
- `:separator` - Add a character separator to the list of separators to auto-detect.
- `:csv-parser` - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used. This way you can parse anything univocity supports (so flat files and such).
- `:bad-row-policy` - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows, in which case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.
- `:skip-bad-rows?` - Legacy option. Use :bad-row-policy.
- `:max-chars-per-column` - Defaults to 4096. Columns with more characters than this will result in an exception.
- `:max-num-columns` - Defaults to 8192. CSV/TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301
- `:n-initial-skip-rows` - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.
- `:parser-fn` -
  - `keyword?` - all columns parsed to this datatype.
  - `ifn?` - called with two arguments: (parser-fn column-name-or-idx column-data). The return value must implement tech.ml.dataset.parser.PColumnParser, in which case it is used, or may be nil, in which case the default column parser is used.
  - tuple - a pair of [datatype `parse-data`], in which case a container of type [datatype] will be created. `parse-data` can be one of:
    - `:relaxed?` - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column and tell you the values that failed to parse and their respective indexes.
    - `fn?` - a function from str to one of `:tech.ml.dataset.parser/missing`, `:tech.ml.dataset.parser/parse-failure`, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to missing and the column's :unparsed-values and :unparsed-indexes being updated.
    - `string?` - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid argument to Charset/forName.
    - `DateTimeFormatter` - used with the appropriate temporal parse static function to parse the value.
    - :encoded-text datatype - `parse-data` can be a string, a java.nio.charset.Charset, or an implementation of tech.ml.dataset.text/PEncodingToFn. If you want to serialize this format to nippy your encoding had better be nippy serializable (defrecords always are).
  - `map?` - the header-name-or-idx is used to look up a value. If the value is not nil, it can be any of the above options. Else the default column parser is used.
- `:parser-scan-len` - Length of the initial column data used for parser-fn's datatype detection routine. Defaults to 100.

Returns a new dataset.
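A brief usage sketch, assuming the namespace is aliased as `ds`; the file name is hypothetical, and the `:parser-fn` form mirrors the example given under `parallelized-load-csv` below:

```clojure
(require '[tech.ml.dataset :as ds])

;; Load a csv, parsing the "date" column as a packed local date.
(def my-ds
  (ds/->dataset "data.csv"
                {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}}))

;; A dataset may also be built directly from a sequence of maps.
(def small-ds (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))
```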
(->flyweight dataset
&
{:keys [column-name-seq error-on-missing-values? number->string?]
:or {error-on-missing-values? true}})
Convert a dataset to a seq-of-maps dataset. The error-on-missing-values? flag indicates whether errors should be thrown on missing values or whether nil should be inserted in the map. If the dataset has a label and number->string? is true, then columns that have been converted from categorical to numeric will be reverse-mapped back to string columns.
(->k-fold-datasets dataset k)
(->k-fold-datasets
dataset
k
{:keys [randomize-dataset?] :or {randomize-dataset? true} :as options})
Given one dataset, prepare K datasets using the k-fold algorithm. randomize-dataset? defaults to true, which will realize the entire dataset, so use with care if you have large datasets.
(->row-major dataset)
(->row-major dataset options)
(->row-major dataset
key-colname-seq-map
{:keys [datatype] :or {datatype :float64}})
Given a dataset and a map of desired key names to sequences of columns, produce a sequence of maps where each key name points to a contiguous vector composed of the concatenated column values. If key-colname-seq-map is not provided, each row defaults to {:features [feature-columns] :label [label-columns]}.
(->sort-by dataset key-fn)
(->sort-by dataset key-fn compare-fn)
(->sort-by dataset key-fn compare-fn column-name-seq)
Version of sort-by used in -> statements common in dataflows
(->sort-by-column dataset colname)
(->sort-by-column dataset colname compare-fn)
sort-by-column used in -> dataflows
(->train-test-split dataset)
(->train-test-split dataset
{:keys [randomize-dataset? train-fraction]
:or {randomize-dataset? true train-fraction 0.7}
:as options})
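No docstring accompanies this signature; below is a minimal hedged sketch based only on the signature above (my-ds is the placeholder dataset from the ->dataset sketch):

```clojure
;; Split 80/20; randomize-dataset? defaults to true per the :or map.
(def split (ds/->train-test-split my-ds {:train-fraction 0.8}))
```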
(add-column dataset column)
Add a new column. Errors if there is a name collision.
(add-or-update-column dataset column)
(add-or-update-column dataset colname column)
If column exists, replace. Else append new column.
(aggregate-by map-fn
dataset
&
{:keys [column-name-seq numeric-aggregate-fn boolean-aggregate-fn
default-aggregate-fn count-column-name]
:or {numeric-aggregate-fn dfn/reduce-+
boolean-aggregate-fn count-true
default-aggregate-fn first}})
Group the dataset by map-fn, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns; must return a new column with the same number of entries as the count of the column sequences.
(aggregate-by-column colname
dataset
&
{:keys [numeric-aggregate-fn boolean-aggregate-fn
default-aggregate-fn count-column-name]
:or {numeric-aggregate-fn dfn/reduce-+
boolean-aggregate-fn count-true
default-aggregate-fn first}})
Group the dataset by the given column, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns; must return a new column with the same number of entries as the count of the column sequences.
(all-descriptive-stats-names)
Returns the names of all descriptive stats in the order they will be returned in the resulting dataset of descriptive stats. This allows easy filtering of the form: (descriptive-stats ds {:stat-names (->> (all-descriptive-stats-names) (remove #{:values :num-distinct-values}))})
(assoc dataset colname coldata)
(assoc dataset colname coldata & more)
If the column exists, replace it. Else append a new column. The datatype of the new column will be the datatype of the coldata.
coldata may be a sequence, in which case 'vec' will be called and the datatype will be :object.
coldata may be a reader, in which case the datatype will be the datatype of the reader. One way to make a reader is tech.v2.datatype/make-reader; anything deriving from java.util.List and java.util.RandomAccess will also do.
coldata may also be a new column (tech.ml.dataset.column/new-column), in which case the missing set and the column metadata can be provided.
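A minimal sketch of the first two coldata forms (the values are made up):

```clojure
(def ds1 (ds/->dataset [{:a 1} {:a 2} {:a 3}]))

;; A (non-random-access) sequence: 'vec' is called and the column
;; datatype becomes :object.
(def ds2 (ds/assoc ds1 :b (map str [:x :y :z])))

;; Anything deriving from java.util.List and java.util.RandomAccess acts
;; as a reader, so the column takes the reader's datatype.
(def ds3 (ds/assoc ds2 :c (java.util.ArrayList. [1.0 2.0 3.0])))
```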
(brief ds)
(brief ds options)
Get a brief description, in mapseq form, of a dataset. A brief description is the mapseq form of descriptive stats.
(column dataset column-name)
Return the column or throw if it doesn't exist.
(column->dataset dataset colname transform-fn)
(column->dataset dataset colname transform-fn options)
Transform a column into a sequence of maps using transform-fn. Return dataset created out of the sequence of maps.
(column-cast dataset colname datatype)
Cast a column to a new datatype. This is never a lazy operation. If the old and new datatypes match and no cast-fn is provided then dtype/clone is called on the column.
colname may be a scalar or a tuple of [src-col dst-col].
datatype may be a datatype enumeration or a tuple of [datatype cast-fn] where cast-fn may return either a new value, :tech.ml.dataset.parse/missing, or :tech.ml.dataset.parse/parse-failure. Exceptions are propagated to the caller. The new column has at least the existing missing set (if no attempt returns :missing or :cast-failure). :cast-failure means the value gets added to the metadata key :unparsed-data and the index gets added to :unparsed-indexes.
If the existing datatype is string, then tech.ml.dataset.column/parse-column is called.
Casts between numeric datatypes need no cast-fn, but one may be provided. Casts to string need no cast-fn, but one may be provided. Casts from string to anything will call tech.ml.dataset.column/parse-column.
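A short sketch; the column names here are hypothetical:

```clojure
;; String source columns go through tech.ml.dataset.column/parse-column.
(ds/column-cast my-ds "age" :int16)

;; A [src-col dst-col] tuple plus a [datatype cast-fn] tuple.
(ds/column-cast my-ds ["price" "price-cents"]
                [:int64 #(Math/round (* 100.0 (double %)))])
```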
(column-labeled-mapseq dataset value-colname-seq)
Given a dataset, return a sequence of maps where several columns are all stored in a :value key and a :label key contains a column name. Used for quickly creating timeseries or scatterplot labeled graphs. Returns a lazy sequence, not a reader!

See also `columnwise-concat`.

Returns a sequence of maps of the form:

```clojure
{... - columns not in colname-seq
 :value - value from one of the value columns
 :label - name of the column the value came from}
```
(column-map dataset result-colname map-fn colname & colnames)
Produce a new column as the result of mapping a fn over other columns. The result column will have a datatype that is the widened combination of all the input column datatypes. The result column's missing-index set is the union of those of all input columns.
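For instance, a minimal sketch (the column names are assumptions):

```clojure
;; New :total column from two numeric columns; the result datatype
;; widens across the inputs and the missing sets are unioned.
(ds/column-map my-ds :total + :price :quantity)
```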
(column-name->column-map datatypes)
Return a clojure map of column-name->column.
(column-names dataset)
In-order sequence of column names
(column-values->categorical dataset src-column)
Given a column encoded via either string->number or one-hot, reverse map to a sequence of the original string column values.
(columns dataset)
Return sequence of all columns in dataset.
(columns-with-missing-seq dataset)
Return a sequence of:

```clojure
{:column-name column-name
 :missing-count missing-count}
```

or nil if no columns are missing data.
(columnwise-concat dataset colnames)
(columnwise-concat dataset
colnames
{:keys [value-column-name colname-column-name]
:or {value-column-name :value colname-column-name :column}
:as _options})
Given a dataset and a list of columns, produce a new dataset with those columns concatenated into a new :value column, with a :column column indicating which column each original value came from. Any columns not mentioned in the list of columns are duplicated.

Example:

```clojure
user> (-> [{:a 1 :b 2 :c 3 :d 1} {:a 4 :b 5 :c 6 :d 2}]
          (ds/->dataset)
          (ds/columnwise-concat [:c :a :b]))
null [6 3]:

| :column | :value | :d |
|---------+--------+----|
|      :c |      3 |  1 |
|      :c |      6 |  2 |
|      :a |      1 |  1 |
|      :a |      4 |  2 |
|      :b |      2 |  1 |
|      :b |      5 |  2 |
```

Options:

- :value-column-name - defaults to :value
- :colname-column-name - defaults to :column
(compute-centroid-and-global-means dataset row-major-centroids)
Return a map of:
- :centroid-means - centroid-index -> (double array) column means.
- :global-means - global means (double array) for the dataset.
(concat dataset & datasets)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes. Also see concat-copying as this may be faster in many situations.
(concat-copying dataset & datasets)
Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(concat-inplace dataset & datasets)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(correlation-table dataset & {:keys [correlation-type colname-seq]})
Return a map of colname -> sorted list of [colname, coefficient] tuples. The sort is: (sort-by (comp #(Math/abs (double %)) second) >)
Thus the first entry is: [colname, 1.0]
There are three possible correlation types: :pearson, :spearman, and :kendall.
:pearson is the default.
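A small hedged sketch (the column names are assumptions):

```clojure
;; Spearman correlations restricted to a subset of columns.
(ds/correlation-table my-ds
                      :correlation-type :spearman
                      :colname-seq ["a" "b" "c"])
```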
(data->dataset {:keys [metadata columns]})
Convert a data-ized dataset created via dataset->data back into a full dataset
(dataset->data ds)
Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}.
:columns is a vector of column definitions appropriate for passing directly back into new-dataset.
A column definition in this case is a map of {:name :missing :data :metadata}.
(dataset->smile-dataframe ds)
Convert a dataset to a smile dataframe.
This operation may clone columns if they aren't backed by java heap arrays. See ensure-array-backed
It is important to note that smile supports a subset of the functionality in tech.ml.dataset. One difference is smile columns have string column names and have no missing set.
Returns a smile.data.DataFrame
(dataset->str ds)
(dataset->str ds options)
Convert a dataset to a string. Prints a single line header and then calls dataset-data->str.
For options documentation see dataset-data->str.
Deprecated method. See dataset->str
(dataset-data->str dataset)
(dataset-data->str dataset options)
Convert the dataset values to a string.

Options may be provided in the dataset metadata or may be provided as an options map. The options map overrides the dataset metadata.

- :print-index-range - The set of indexes to print. Defaults to: (range *default-table-row-print-length*)
- :print-line-policy - defaults to :repl - one of:
  - :repl - multiline table - default nice printing for the repl
  - :markdown - lines delimited by <br>
  - :single - only print the first line
- :print-column-max-width - set the max width of a column when printing.

Example of conservative printing:

tech.ml.dataset.github-test> (def ds (with-meta ds (assoc (meta ds) :print-column-max-width 25 :print-line-policy :single)))
(descriptive-stats dataset)
(descriptive-stats dataset options)
Get descriptive statistics across the columns of the dataset, in addition to the standard stats.

Options:
- :stat-names - defaults to (remove #{:values :num-distinct-values} (all-descriptive-stats-names))
- :n-categorical-values - Number of categorical values to report in the 'values' field. Defaults to 21.
(dissoc dataset colname)
(dissoc dataset colname & more)
Remove one or more columns from the dataset.
(drop-columns dataset col-name-seq)
Same as remove-columns
(drop-missing ds)
Drop rows with missing entries
(drop-rows dataset row-indexes)
Same as remove-rows.
(ds-concat dataset & other-datasets)
Legacy method. Please see concat
(ds-take-nth n-val dataset)
Legacy method. Please see take-nth
(ensure-array-backed ds)
(ensure-array-backed ds {:keys [unpack?] :or {unpack? true}})
Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries, and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.
Columns that are already array backed and that have no missing values are not changed and are returned as-is.
The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.
options - :unpack? - unpack packed datetime types. Defaults to true.
(feature-ecount dataset)
When columns aren't scalars then this will change. For now, just the number of feature columns.
(fill-range-replace ds colname max-span)
(fill-range-replace ds colname max-span missing-strategy)
(fill-range-replace ds colname max-span missing-strategy missing-value)
Given an in-order column of a numeric or datetime type, fill in spans that are larger than the given max-span. The source column must not have missing values. For more documentation on fill-range, see tech.v2.datatype.function.fill-range.
If the column is a datetime type the operation happens in millisecond space and max-span may be a datetime type convertible to milliseconds.
The result column has the same datatype as the input column.
After the operation, if missing strategy is not nil the newly produced missing
values along with the existing missing values will be replaced using the given
missing strategy for all other columns. See
tech.ml.dataset.missing/replace-missing
for documentation on missing strategies.
The missing strategy defaults to :down unless explicitly set.
Returns a new dataset.
(filter predicate dataset)
(filter predicate column-name-seq dataset)
dataset->dataset transformation. Predicate is passed a map of colname->column-value.
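A quick sketch (the column name is an assumption):

```clojure
;; Keep rows where :age is greater than 30; passing the column subset
;; the predicate actually reads improves performance.
(ds/filter #(> (:age %) 30) [:age] my-ds)
```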
(filter-column predicate colname dataset)
Filter a given column by a predicate. The predicate is passed column values. If the predicate is *not* an instance of IFn it is treated as a value and used as if the predicate were #(= value %). Returns a dataset.
(from-prototype dataset table-name column-seq)
Create a new dataset that is the same type as this one but with a potentially different table name and column sequence. Take care that the columns are all of the correct type.
(g-means dataset & [max-k error-on-missing?])
g-means. Not NaN aware; missing is an error. Returns an array of centroids in row-major array-of-array-of-doubles format.
(group-by key-fn dataset)
(group-by key-fn column-name-seq dataset)
Produce a map of key-fn-value->dataset. key-fn is a function taking a map of colname->column-value. Selecting which columns are used in the key-fn using column-name-seq is optional but will greatly improve performance.
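A brief sketch (the column name is an assumption):

```clojure
;; Returns a map of key value -> sub-dataset; narrowing key-fn's input
;; to [:category] greatly improves performance.
(ds/group-by #(:category %) [:category] my-ds)
```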
(group-by->indexes key-fn dataset)
(group-by->indexes key-fn column-name-seq dataset)
(group-by-column colname dataset)
Return a map of column-value->dataset.
(hash-join colname lhs rhs)
(hash-join colname
lhs
rhs
{:keys [operation-space] :or {operation-space :int32} :as options})
Join by column. For efficiency, lhs should be smaller than rhs.

colname - may be a single item or a tuple, which destructures as: (let [[lhs-colname rhs-colname] colname] ...)

An options map can be passed in with optional arguments:
- :lhs-missing? - Calculate the missing lhs indexes and the left outer join table.
- :rhs-missing? - Calculate the missing rhs indexes and the right outer join table.
- :operation-space - either :int32 or :int64. Defaults to :int32.

Returns:
{:join-table - joined table
 :lhs-indexes - matched lhs indexes
 :rhs-indexes - matched rhs indexes
 ;; -- when rhs-missing? is true --
 :rhs-missing - missing indexes of rhs
 :rhs-outer-join - rhs outer join table
 ;; -- when lhs-missing? is true --
 :lhs-missing - missing indexes of lhs
 :lhs-outer-join - lhs outer join table}
(head dataset)
(head n dataset)
Get the first n rows of a dataset. Equivalent to (select-rows ds (range n)). Arguments are reversed, however, so this can be used in ->> operators.
(impute-missing-by-centroid-averages dataset
row-major-centroids
{:keys [centroid-means global-means]})
Impute missing columns by first grouping by nearest centroids and then computing the mean. In the case where the grouping for a given centroid contains all NaN's, use the global dataset mean. In the case where this is NaN, this algorithm will fail to replace the missing values with meaningful values. Return a new dataset.
(inference-target-label-inverse-map dataset & [label-columns])
Given options generated during ETL operations and annotated with a :label-columns sequence containing one label column, generate a reverse map that maps from a dataset value back to the label that generated that value.
(inner-join colname lhs rhs)
(inner-join colname lhs rhs options)
Inner join by column. For efficiency, lhs should be smaller than rhs. colname may be a single item or a tuple, which destructures as: (let [[lhs-colname rhs-colname] colname] ...). An options map can be passed in with optional arguments: :operation-space - either :int32 or :int64, defaults to :int32. Returns the joined table.
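A minimal sketch; lhs-ds, rhs-ds, and the column names are placeholders:

```clojure
;; Join on a shared column name...
(ds/inner-join "id" lhs-ds rhs-ds)

;; ...or on differently named columns via a [lhs-colname rhs-colname] tuple.
(ds/inner-join ["id" "user-id"] lhs-ds rhs-ds)
```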
(interpolate-loess ds x-colname y-colname)
(interpolate-loess ds
x-colname
y-colname
{:keys [bandwidth iterations accuracy result-name]
:or {bandwidth 0.75
iterations 4
accuracy LoessInterpolator/DEFAULT_ACCURACY}})
Interpolate using the LOESS regression engine. Useful for smoothing out graphs.
(invert-string->number ds)
When ds-pipe/string->number is called it creates label maps. This reverts the dataset back to those labels. Currently results in object columns so a cast operation may be needed to convert to desired datatype.
(k-means dataset & [k max-iterations num-runs error-on-missing? tolerance])
NaN-aware k-means. Returns an array of centroids in row-major array-of-array-of-doubles format.
(labels dataset)
Given a dataset and an options map, generate a sequence of label-values. If the label count is 1 and there is a label-map associated with the column, generate the sequence of labels by reverse mapping the column(s) back to the original dataset values. If there are multiple label columns, the results are presented in a dataset. Returns a reader of labels.
(left-join colname lhs rhs)
(left-join colname lhs rhs options)
Left join by column. For efficiency, lhs should be smaller than rhs. colname may be a single item or a tuple, which destructures as: (let [[lhs-colname rhs-colname] colname] ...). An options map can be passed in with optional arguments: :operation-space - either :int32 or :int64, defaults to :int32. Returns the joined table.
(left-join-asof colname lhs rhs)
(left-join-asof colname lhs rhs {:keys [asof-op] :or {asof-op :<=} :as options})
Perform a left join asof. Similar to left join except this will join on the nearest value. lhs and rhs must be sorted by the join column. Join columns must either be datetime columns, in which case the join happens in millisecond space, or they must be numeric - integer or floating point datatypes.

Options:

- `:asof-op` - may be one of [:< :<= :nearest :>= :>] - the type of join operation. Defaults to :<=.
(mapseq-reader dataset)
(mapseq-reader dataset options)
Return a reader that produces a map of column-name->column-value.

Options:
- :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing-value detection is independent of the column datatype.
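For example (my-ds is a placeholder dataset):

```clojure
;; Realize the first two rows as maps; missing values come back as nil
;; by default.
(take 2 (ds/mapseq-reader my-ds))
```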
(maybe-column dataset column-name)
Return the column if it exists, else nil.
(model-type dataset & [column-name-seq])
Check the label column after dataset processing. Returns either :regression or :classification.
(n-feature-permutations n dataset)
Given a dataset with at least one inference target column, produce all datasets with n feature columns and the label columns.
(n-permutations n dataset)
Return datasets for all possible n-column permutations of the columns. n must be less than (column-count dataset).
(name-values-seq->dataset name-values-seq & {:as options})
Given a sequence of [name data-seq], produce columns. If data-seq is of unknown (:object) datatype, the first item is checked: if it is a number then doubles are used; if it is a string then strings are used for the column datatype. All sequences must be the same length. Returns a new dataset.
(new-column dataset column-name values)
Create a new column from some values
(new-dataset column-seq)
(new-dataset options column-seq)
(new-dataset options ds-metadata column-seq)
Create a new dataset from a sequence of columns. Data will be converted into columns using ds-col-proto/ensure-column-seq. If the column seq is simply a collection of vectors, for instance, columns will be named ordinally.

options map:
- :dataset-name - Name of the dataset. Defaults to "_unnamed".
- :key-fn - Key function used on all column names before insertion into the dataset.

The return value fulfills the dataset protocols.
(num-inference-classes dataset)
Given a dataset and correctly built options from pipeline operations, return the number of classes used for the label. Error if not classification dataset.
(order-column-names dataset colname-seq)
Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.
(parallelized-load-csv input)
(parallelized-load-csv input options)
Load a csv, distributing rows between N different datasets. Concat them at the end and return the final dataset. Loads data into an out-of-order dataset.
Type-hinting your columns and providing specific parsers for datetime types, e.g. (ds/->dataset input {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}}), may have a larger effect than parallelization in most cases.
Loading multiple files in parallel will also have a larger effect than single-file parallelization in most cases.
(rand-nth dataset)
Return a random row from the dataset in map format
(reduce-column-names dataset colname-seq)
Reverse map from the one-hot encoded columns to the original source column.
(remove-columns dataset colname-seq)
Same as drop-columns
(remove-rows dataset row-indexes)
Same as drop-rows.
(rename-columns dataset colname-map)
Rename columns using a map. Does not reorder columns.
(replace-missing ds)
(replace-missing ds strategy)
(replace-missing ds columns-selector strategy)
(replace-missing ds columns-selector strategy value)
Replace missing values in some columns with a given strategy. The columns selector may be any legal argument to select-columns. Strategies may be:

- `:down` - take the value from the previous non-missing row if possible, else use the next non-missing row.
- `:up` - take the value from the next non-missing row if possible, else use the previous non-missing row.
- `:mid` - Use the midpoint of averaged values between the previous and next non-missing rows.
- `:lerp` - Linearly interpolate values between the previous and next non-missing rows.
- `:value` - A value will be provided - see below.

A value may be provided, which will then be used. The value may be a function, in which case it will be called on the column with missing values elided and the return value will be used as the filler.
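A short sketch (the column names are assumptions):

```clojure
;; Carry the previous non-missing :price value down.
(ds/replace-missing my-ds [:price] :down)

;; Replace missing :qty values with a constant via the :value strategy.
(ds/replace-missing my-ds [:qty] :value 0)
```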
(reverse-map-categorical-columns dataset {:keys [column-name-seq]})
Given a dataset where we have converted columns from a categorical representation to either a numeric representation or a one-hot representation, reverse map back to the original dataset given the reverse mapping of label->number in the column's metadata.
(right-join colname lhs rhs)
(right-join colname lhs rhs options)
Right join by column. For efficiency, lhs should be smaller than rhs. colname may be a single item or a tuple, which destructures as: (let [[lhs-colname rhs-colname] colname] ...). An options map can be passed in with optional arguments: :operation-space - either :int32 or :int64, defaults to :int32. Returns the joined table.
(sample dataset)
(sample n dataset)
(sample n replacement? dataset)
Sample n rows from a dataset. Defaults to sampling *without* replacement.
(select dataset colname-seq index-seq)
Reorder/trim the dataset according to this sequence of indexes. Returns a new dataset.

colname-seq - one of:

- :all - all the columns
- a sequence of column names - those columns in that order.
- an implementation of java.util.Map - column order is dictated by map iteration order, and selected columns are subsequently named after the corresponding value in the map. Similar to `rename-columns` except this trims the result to only the columns in the map.

index-seq - either the keyword :all or a list of indexes. May contain duplicates.
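Two quick sketches (my-ds and its column names are placeholders):

```clojure
;; Take two columns in the given order and the first three rows.
(ds/select my-ds [:b :a] [0 1 2])

;; Map form: trims to the mapped columns and renames them.
(ds/select my-ds {:a "alpha" :b "beta"} :all)
```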
(select-missing ds)
Select only rows with missing values
(shape dataset)
Returns shape in row-major format of [n-columns n-rows].
(sort-by key-fn dataset)
(sort-by key-fn compare-fn dataset)
(sort-by key-fn compare-fn column-name-seq dataset)
Sort a dataset by a key-fn and compare-fn.
(sort-by-column colname dataset)
(sort-by-column colname compare-fn dataset)
Sort a dataset by a given column using the given compare fn.
(tail dataset)
(tail n dataset)
Get the last n rows of a dataset. Equivalent to (select-rows ds (range ...)). Argument order is dataset-last, however, so this can be used in ->> operators.
(unique-by map-fn dataset)
(unique-by map-fn
           {:keys [column-name-seq keep-fn]
            :or {keep-fn #(first %2)}
            :as _options}
           dataset)
The map-fn gets passed a map for each row; rows are grouped by the return value. keep-fn is used to decide which index to keep.
:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).
(unique-by-column colname dataset)
(unique-by-column colname
                  {:keys [keep-fn]
                   :or {keep-fn #(first %2)}
                   :as _options}
                  dataset)
Rows are grouped by the value of colname. keep-fn is used to decide which index to keep.
:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).
(unordered-select dataset colname-seq index-seq)
Perform a selection but use the order of the columns in the existing table; do *not* reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.
(unroll-column dataset column-name)
(unroll-column dataset column-name options)
Unroll a column that has some (or all) sequential data as entries. Returns a new dataset with the same columns, but with the other columns duplicated where the unroll happened. The column now contains only scalar data.

Any missing indexes are dropped.

```clojure
user> (-> (ds/->dataset [{:a 1 :b [2 3]}
                         {:a 2 :b [4 5]}
                         {:a 3 :b :a}])
          (ds/unroll-column :b {:indexes? true}))
_unnamed [5 3]:

| :a | :b | :indexes |
|----+----+----------|
|  1 |  2 |        0 |
|  1 |  3 |        1 |
|  2 |  4 |        0 |
|  2 |  5 |        1 |
|  3 | :a |        0 |
```

Options:
- :datatype - datatype of the resulting column if one aside from :object is desired.
- :indexes? - If true, create a new column that records the indexes of the values from the original column. Can also be a truthy value (like a keyword), in which case the column will be named this.
(update-column dataset col-name update-fn)
Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.
(update-columns dataset column-name-seq update-fn)
Update a sequence of columns.
(value-reader dataset)
(value-reader dataset options)
Return a reader that produces a reader of column values per index.

Options:
- :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing-value detection is independent of the column datatype.
(write-csv! ds output)
(write-csv! ds output options)
Write a dataset to a tsv or csv output stream. Closes the output if a stream is passed in. The file output format will be inferred if output is a string:

- .csv, .tsv - switches between tsv and csv. Tsv is the default.
- *.gz - write to a gzipped stream.

At this time writing to json is not supported.

options:
- :separator - in case output isn't a string, you can use either \, or \tab to switch between csv or tsv output respectively.
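For example (my-ds is a placeholder dataset):

```clojure
;; Format is inferred from the filename; a .gz suffix gzips the stream.
(ds/write-csv! my-ds "out.csv")
(ds/write-csv! my-ds "out.tsv.gz")
```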
(x-means dataset & [max-k error-on-missing?])
x-means. Not NaN aware; missing is an error. Returns an array of centroids in row-major array-of-array-of-doubles format.