(->>dataset dataset)
(->>dataset options dataset)
Please see documentation of ->dataset. Options are the same.
(->dataset dataset)
(->dataset dataset {:keys [table-name dataset-name] :as options})
Create a dataset from either csv/tsv or a sequence of maps.

* A `String` or `InputStream` will be interpreted as a file (or gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv and to detect the column datatypes; all of this can be overridden.
* A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

Options:

:table-name - set the name of the dataset (deprecated in favor of :dataset-name).
:dataset-name - set the name of the dataset.
:file-type - Override the file-type discovery mechanism for strings, or force a particular parser for an input stream. Note that arrow and parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :arrow :parquet}.
:gzipped? - For file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.
:column-whitelist - Either a sequence of string column names or a sequence of column indices of columns to whitelist.
:column-blacklist - Either a sequence of string column names or a sequence of column indices of columns to blacklist.
:num-rows - Number of rows to read.
:header-row? - Defaults to true; indicates the first row is a header.
:key-fn - Function applied to column names. Typical use is `:key-fn keyword`.
:separator - Add a character separator to the list of separators to auto-detect.
:csv-parser - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used. This way you can parse anything that univocity supports (so flat files and such).
:bad-row-policy - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows, in which case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.
:skip-bad-rows? - Legacy option. Use :bad-row-policy.
:max-chars-per-column - Defaults to 4096. Columns with more characters than this will result in an exception.
:max-num-columns - Defaults to 8192. CSV/TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301
:parser-fn -
  - keyword - All columns are parsed to this datatype.
  - ifn? - Called with two arguments: (parser-fn column-name-or-idx column-data). The return value must implement tech.ml.dataset.parser.PColumnParser, in which case it is used, or may be nil, in which case the default column parser is used.
  - tuple - Pair of [datatype parse-fn], in which case a container of type [datatype] will be created. parse-fn can be one of:
    - :relaxed? - Data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column and tell you the values that failed to parse and their respective indexes.
    - fn? - Function from str -> one of #{:missing :parse-failure value}. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to missing and the column's :unparsed-values and :unparsed-indexes being updated.
    - string? - For datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern.
    - DateTimeFormatter - Used with the appropriate temporal parse static function to parse the value.
  - map - The header-name-or-idx is used to look up the value. If the result is not nil, it can be any of the above options. Else the default column parser is used.
:parser-scan-len - Length of the initial column data used for parser-fn's datatype detection routine. Defaults to 100.

Returns a new dataset.
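A minimal usage sketch (the ds alias and the file path below are illustrative assumptions, not part of this docstring):

(require '[tech.ml.dataset :as ds])

;; Parse a csv file; "data.csv" is a hypothetical path. :key-fn keyword
;; converts the string header names into keywords.
(def from-file
  (ds/->dataset "data.csv" {:key-fn keyword}))

;; Build a dataset from a sequence of maps; column datatypes are derived
;; by scanning the first :parser-scan-len maps.
(def from-maps
  (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]
                {:dataset-name "example"}))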
(->sort-by dataset key-fn)
(->sort-by dataset key-fn compare-fn)
(->sort-by dataset key-fn compare-fn column-name-seq)
Version of sort-by for use in -> (thread-first) pipelines common in dataflows.
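For example, a sketch assuming a hypothetical :score column (key-fn is passed each row as a map, so a keyword works as the key-fn):

(require '[tech.ml.dataset :as ds])

;; The dataset-first argument order fits thread-first pipelines.
(-> (ds/->dataset [{:score 3} {:score 1} {:score 2}])
    (ds/->sort-by :score))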
(->sort-by-column dataset colname)
(->sort-by-column dataset colname compare-fn)
Version of sort-by-column for use in -> dataflows.
(add-column dataset column)
Add a new column. Errors if there is a name collision.
(add-or-update-column dataset column)
(add-or-update-column dataset colname column)
If the column exists, replace it. Else append a new column.
(aggregate-by map-fn
dataset
&
{:keys [column-name-seq numeric-aggregate-fn boolean-aggregate-fn
default-aggregate-fn count-column-name]
:or {numeric-aggregate-fn dfn/reduce-+
boolean-aggregate-fn count-true
default-aggregate-fn first}})
Group the dataset by map-fn, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns and must return a new column with the same number of entries as the count of the column sequences.
(aggregate-by-column colname
dataset
&
{:keys [numeric-aggregate-fn boolean-aggregate-fn
default-aggregate-fn count-column-name]
:or {numeric-aggregate-fn dfn/reduce-+
boolean-aggregate-fn count-true
default-aggregate-fn first}})
Group the dataset by the values in colname, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns and must return a new column with the same number of entries as the count of the column sequences.
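A hedged sketch of the default behavior (column names hypothetical): with no options, numeric columns are reduced with dfn/reduce-+ and other columns fall back to first.

(require '[tech.ml.dataset :as ds])

;; Group rows by the :category column; :amount is summed within each group.
(ds/aggregate-by-column :category
                        (ds/->dataset [{:category "a" :amount 1}
                                       {:category "a" :amount 2}
                                       {:category "b" :amount 5}]))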
(column dataset column-name)
Return the column or throw if it doesn't exist.
(column-name->column-map datatypes)
Return a clojure map of column-name->column.
(column-names dataset)
In-order sequence of column names.
(columns dataset)
Return sequence of all columns in dataset.
(columns-with-missing-seq dataset)
Return a sequence of: {:column-name column-name :missing-count missing-count} or nil if no columns are missing data.
(concat dataset & datasets)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes. Also see concat-copying as this may be faster in many situations.
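A small sketch of the widening behavior (assuming the ds alias):

(require '[tech.ml.dataset :as ds])

(def ds-a (ds/->dataset [{:x 1} {:x 2}])) ;; :x parses as an integer type
(def ds-b (ds/->dataset [{:x 3.5}]))      ;; :x parses as a floating-point type
;; The result's :x column datatype is a widening cast covering both inputs.
(ds/concat ds-a ds-b)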
(concat-copying dataset & datasets)
Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(concat-inplace dataset & datasets)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(drop-columns dataset col-name-seq)
Same as remove-columns.
(drop-rows dataset row-indexes)
Same as remove-rows.
(ds-concat dataset & other-datasets)
Legacy method. Please see concat.
(ds-filter predicate dataset & [column-name-seq])
Legacy method. Please see filter.
(ds-filter-column predicate colname dataset)
Legacy method. Please see filter-column.
(ds-group-by key-fn dataset & [column-name-seq])
Legacy method. Please see group-by.
(ds-group-by-column colname dataset)
Legacy method. Please see group-by-column.
(ds-sort-by key-fn dataset)
(ds-sort-by key-fn compare-fn dataset)
(ds-sort-by key-fn compare-fn column-name-seq dataset)
Legacy method. Please see sort-by.
(ds-sort-by-column colname dataset)
(ds-sort-by-column colname compare-fn dataset)
Legacy method. Please see sort-by-column.
(ds-take-nth n-val dataset)
Legacy method. Please see take-nth.
(ensure-array-backed ds)
(ensure-array-backed ds {:keys [unpack?] :or {unpack? true}})
Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries, and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.
Columns that are already array backed and that have no missing values are not changed and are returned as-is.
The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.
options - :unpack? - unpack packed datetime types. Defaults to true.
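A hedged sketch of the postcondition; the dtype alias below assumes the tech.v2.datatype library that dtype/->array refers to:

(require '[tech.ml.dataset :as ds]
         '[tech.v2.datatype :as dtype])

(def arr-ds (ds/ensure-array-backed (ds/->dataset [{:x 1} {:x 2}])))
;; Each column's data is now retrievable as a plain java array.
(dtype/->array (ds/column arr-ds :x))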
(filter predicate dataset)
(filter predicate column-name-seq dataset)
dataset->dataset transformation. Predicate is passed a map of colname->column-value.
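For example (hypothetical :age column, assuming the ds alias):

(require '[tech.ml.dataset :as ds])

;; The predicate receives each row as a map of colname->value;
;; rows for which it returns truthy are kept.
(ds/filter #(> (:age %) 30)
           (ds/->dataset [{:age 25} {:age 40}]))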
(filter-column predicate colname dataset)
Filter a given column by a predicate. The predicate is passed column values; truthy values are kept. Returns a dataset.
(from-prototype dataset table-name column-seq)
Create a new dataset that is the same type as this one but with a potentially different table name and column sequence. Take care that the columns are all of the correct type.
(group-by key-fn dataset)
(group-by key-fn column-name-seq dataset)
Produce a map of key-fn-value->dataset. key-fn is a function taking a map of colname->column-value. Selecting which columns are used in the key-fn via column-name-seq is optional but will greatly improve performance.
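A sketch (hypothetical :city column): a keyword works as the key-fn since each row is passed as a map.

(require '[tech.ml.dataset :as ds])

;; Restricting key-fn input to the :city column improves performance.
(ds/group-by :city
             [:city]
             (ds/->dataset [{:city "NYC" :n 1} {:city "LA" :n 2}]))
;; => {"NYC" <dataset>, "LA" <dataset>}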
(group-by->indexes key-fn dataset)
(group-by->indexes key-fn column-name-seq dataset)
(group-by-column colname dataset)
Return a map of column-value->dataset.
(index-value-seq dataset & [reader-options])
Get a sequence of tuples: [idx col-value-vec]
Values are in order of column-name-seq. Duplicate names are allowed and result in duplicate values.
(mapseq-reader dataset)
(mapseq-reader dataset options)
Return a reader that produces a map of column-name->column-value for each row.
Options: :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing-value detection is independent of column datatype.
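For example (assuming the ds alias):

(require '[tech.ml.dataset :as ds])

;; Each element is a map of column-name->value; missing values appear as nil.
(doseq [row (ds/mapseq-reader (ds/->dataset [{:a 1} {:a 2}]))]
  (println row))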
(maybe-column dataset column-name)
Return the column if it exists, else nil.
(new-column dataset column-name values)
Create a new column from some values.
(order-column-names dataset colname-seq)
Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.
(remove-columns dataset colname-seq)
Same as drop-columns.
(remove-rows dataset row-indexes)
Same as drop-rows.
(rename-columns dataset colname-map)
Rename columns using a map. Does not reorder columns.
(select dataset colname-seq index-seq)
Reorder/trim dataset according to this sequence of indexes. Returns a new dataset.
colname-seq - one of:
  - :all - all the columns.
  - sequence of column names - those columns in that order.
  - implementation of java.util.Map - column order is dictated by map iteration order; selected columns are subsequently named after the corresponding value in the map. Similar to `rename-columns` except this trims the result to be only the columns in the map.
index-seq - either the keyword :all or a list of indexes. May contain duplicates.
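Sketches of the three colname-seq forms (column names hypothetical, assuming the ds alias):

(require '[tech.ml.dataset :as ds])

(def example (ds/->dataset [{:a 1 :b 2 :c 3} {:a 4 :b 5 :c 6}]))
;; Two named columns, first row only.
(ds/select example [:a :b] [0])
;; All columns, rows 1 then 0 (order and duplicates are respected).
(ds/select example :all [1 0])
;; Map form: selects :a and renames it to :alpha.
(ds/select example {:a :alpha} [0 1])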
(sort-by key-fn dataset)
(sort-by key-fn compare-fn dataset)
(sort-by key-fn compare-fn column-name-seq dataset)
Sort a dataset by a key-fn and compare-fn.
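For example (hypothetical :age column; key-fn receives each row as a map, so a keyword suffices):

(require '[tech.ml.dataset :as ds])

(ds/sort-by :age
            (ds/->dataset [{:age 40} {:age 25}]))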
(sort-by-column colname dataset)
(sort-by-column colname compare-fn dataset)
Sort a dataset by a given column using the given compare fn.
(supported-column-stats dataset)
Return the set of natively supported stats for the dataset. This must be at least #{:mean :variance :median :skew}.
(unique-by map-fn dataset)
(unique-by map-fn
           {:keys [column-name-seq keep-fn]
            :or {keep-fn #(first %2)}
            :as _options}
           dataset)
map-fn gets passed a map for each row; rows are grouped by its return value. keep-fn is used to decide which index to keep.
:keep-fn - Function from key, idx-seq -> idx. Defaults to #(first %2).
(unique-by-column colname dataset)
(unique-by-column colname
                  {:keys [keep-fn]
                   :or {keep-fn #(first %2)}
                   :as _options}
                  dataset)
Rows are grouped by the value in colname. keep-fn is used to decide which index to keep.
:keep-fn - Function from key, idx-seq -> idx. Defaults to #(first %2).
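For example (hypothetical :id column): with the default keep-fn, the first row seen for each distinct :id value is kept.

(require '[tech.ml.dataset :as ds])

(ds/unique-by-column :id
                     (ds/->dataset [{:id 1 :v "a"}
                                    {:id 1 :v "b"}
                                    {:id 2 :v "c"}]))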
(unordered-select dataset colname-seq index-seq)
Perform a selection but use the order of the columns in the existing table; do *not* reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.
(update-column dataset col-name update-fn)
Update a column returning a new dataset. update-fn is a column->column transformation. Errors if the column does not exist.
(update-columns dataset column-name-seq update-fn)
Update a sequence of columns.
(value-reader dataset)
(value-reader dataset options)
Return a reader that produces a reader of column values per index.
Options: :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing-value detection is independent of column datatype.
(write-csv! ds output)
(write-csv! ds output options)
Write a dataset to a tsv or csv output stream. Closes output if a stream is passed in. The file output format will be inferred if output is a string:
  - .csv, .tsv - switches between tsv and csv. Tsv is the default.
  - *.gz - write to a gzipped stream.
At this time writing to json is not supported.
options - :separator - in case output isn't a string, you can use either \, or \tab to switch between csv or tsv output respectively.
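For example (output path hypothetical):

(require '[tech.ml.dataset :as ds])

;; The .tsv.gz suffix selects tsv output written through a gzipped stream.
(ds/write-csv! (ds/->dataset [{:a 1} {:a 2}]) "out.tsv.gz")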