(->>dataset dataset)
(->>dataset options dataset)
Please see documentation of ->dataset. Options are the same.
(->dataset dataset)
(->dataset dataset {:keys [table-name dataset-name] :as options})
Create a dataset from either csv/tsv or a sequence of maps.

* A `String` or `InputStream` will be interpreted as a file (or gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv and to detect the column datatypes; all of this can be overridden.
* A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

Options:

:table-name - set the name of the dataset (deprecated in favor of :dataset-name).
:dataset-name - set the name of the dataset.
:file-type - Override the file-type discovery mechanism for strings, or force a particular parser for an input stream. Note that arrow and parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :arrow :parquet}.
:gzipped? - For file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.
:column-whitelist - Either a sequence of string column names or a sequence of column indices of columns to whitelist.
:column-blacklist - Either a sequence of string column names or a sequence of column indices of columns to blacklist.
:num-rows - Number of rows to read.
:header-row? - Defaults to true; indicates the first row is a header.
:key-fn - Function applied to column names. Typical use is `:key-fn keyword`.
:separator - Add a character separator to the list of separators to auto-detect.
:csv-parser - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used. This way you can parse anything that univocity supports (so flat files and such).
:bad-row-policy - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows, in which case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.
:skip-bad-rows? - Legacy option. Use :bad-row-policy.
:max-chars-per-column - Defaults to 4096. Columns with more characters than this will result in an exception.
:max-num-columns - Defaults to 8192. CSV/TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301
:parser-fn -
  - keyword - All columns are parsed to this datatype.
  - ifn? - Called with two arguments: (parser-fn column-name-or-idx column-data). The return value must implement tech.ml.dataset.parser.PColumnParser, in which case it is used, or may be nil, in which case the default column parser is used.
  - tuple - Pair of [datatype parse-fn], in which case a container of type [datatype] will be created. parse-fn can be one of:
    - :relaxed? - Data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column and tell you the values that failed to parse and their respective indexes.
    - fn? - Function from str -> one of #{:missing :parse-failure value}. Exceptions here always kill the parse process. :missing will get marked in the missing indexes, and :parse-failure will result in the index being added to missing and the column's :unparsed-values and :unparsed-indexes being updated.
    - string? - For datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern.
    - DateTimeFormatter - Used with the appropriate temporal parse static function to parse the value.
  - map - The header-name-or-idx is used to look up the value. If the result is not nil, it can be any of the above options. Else the default column parser is used.
:parser-scan-len - Length of the initial column data used for parser-fn's datatype detection routine. Defaults to 100.

Returns a new dataset.
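A minimal usage sketch (the ds alias and the file path below are illustrative assumptions, not part of this docstring):

(require '[tech.ml.dataset :as ds])

;; Parse a csv file; "data.csv" is a hypothetical path. :key-fn keyword
;; converts the string header names into keywords.
(def from-file
  (ds/->dataset "data.csv" {:key-fn keyword}))

;; Build a dataset from a sequence of maps; column datatypes are derived
;; by scanning the first :parser-scan-len maps.
(def from-maps
  (ds/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]
                {:dataset-name "example"}))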
(->sort-by dataset key-fn)
(->sort-by dataset key-fn compare-fn)
(->sort-by dataset key-fn compare-fn column-name-seq)
Version of sort-by for use in -> (thread-first) pipelines common in dataflows.
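For example, a sketch assuming a hypothetical :score column (key-fn is passed each row as a map, so a keyword works as the key-fn):

(require '[tech.ml.dataset :as ds])

;; The dataset-first argument order fits thread-first pipelines.
(-> (ds/->dataset [{:score 3} {:score 1} {:score 2}])
    (ds/->sort-by :score))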
(->sort-by-column dataset colname)
(->sort-by-column dataset colname compare-fn)
Version of sort-by-column for use in -> dataflows.
(add-column dataset column)
Add a new column. Errors if there is a name collision.
(add-or-update-column dataset column)
(add-or-update-column dataset colname column)
If the column exists, replace it. Else append a new column.
(aggregate-by map-fn
dataset
&
{:keys [column-name-seq numeric-aggregate-fn boolean-aggregate-fn
default-aggregate-fn count-column-name]
:or {numeric-aggregate-fn dfn/reduce-+
boolean-aggregate-fn count-true
default-aggregate-fn first}})
Group the dataset by map-fn, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns and must return a new column with the same number of entries as the count of the column sequences.
(aggregate-by-column colname
dataset
&
{:keys [numeric-aggregate-fn boolean-aggregate-fn
default-aggregate-fn count-column-name]
:or {numeric-aggregate-fn dfn/reduce-+
boolean-aggregate-fn count-true
default-aggregate-fn first}})
Group the dataset by the values in colname, then aggregate by the aggregate fn. Returns the aggregated dataset. :aggregate-fn - passed a sequence of columns and must return a new column with the same number of entries as the count of the column sequences.
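A hedged sketch of the default behavior (column names hypothetical): with no options, numeric columns are reduced with dfn/reduce-+ and other columns fall back to first.

(require '[tech.ml.dataset :as ds])

;; Group rows by the :category column; :amount is summed within each group.
(ds/aggregate-by-column :category
                        (ds/->dataset [{:category "a" :amount 1}
                                       {:category "a" :amount 2}
                                       {:category "b" :amount 5}]))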
(column dataset column-name)
Return the column or throw if it doesn't exist.
(column-name->column-map datatypes)
Return a clojure map of column-name->column.
(column-names dataset)
In-order sequence of column names.
(columns dataset)
Return sequence of all columns in dataset.
(columns-with-missing-seq dataset)
Return a sequence of: {:column-name column-name :missing-count missing-count} or nil if no columns are missing data.
(concat dataset & datasets)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes. Also see concat-copying as this may be faster in many situations.
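A small sketch of the widening behavior (assuming the ds alias):

(require '[tech.ml.dataset :as ds])

(def ds-a (ds/->dataset [{:x 1} {:x 2}])) ;; :x parses as an integer type
(def ds-b (ds/->dataset [{:x 3.5}]))      ;; :x parses as a floating-point type
;; The result's :x column datatype is a widening cast covering both inputs.
(ds/concat ds-a ds-b)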
(concat-copying dataset & datasets)
Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(concat-inplace dataset & datasets)
Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.
(drop-columns dataset col-name-seq)
Same as remove-columns.
(drop-rows dataset row-indexes)
Same as remove-rows.
(ds-concat dataset & other-datasets)
Legacy method. Please see concat.
(ds-filter predicate dataset & [column-name-seq])
Legacy method. Please see filter.
(ds-filter-column predicate colname dataset)
Legacy method. Please see filter-column.
(ds-group-by key-fn dataset & [column-name-seq])
Legacy method. Please see group-by.
(ds-group-by-column colname dataset)
Legacy method. Please see group-by-column.
(ds-sort-by key-fn dataset)
(ds-sort-by key-fn compare-fn dataset)
(ds-sort-by key-fn compare-fn column-name-seq dataset)
Legacy method. Please see sort-by.
(ds-sort-by-column colname dataset)
(ds-sort-by-column colname compare-fn dataset)
Legacy method. Please see sort-by-column.
(ds-take-nth n-val dataset)
Legacy method. Please see take-nth.
(ensure-array-backed ds)
(ensure-array-backed ds {:keys [unpack?] :or {unpack? true}})
Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries, and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.
Columns that are already array backed and that have no missing values are not changed and are returned as-is.
The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.
options - :unpack? - unpack packed datetime types. Defaults to true.
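A hedged sketch of the postcondition; the dtype alias below assumes the tech.v2.datatype library that dtype/->array refers to:

(require '[tech.ml.dataset :as ds]
         '[tech.v2.datatype :as dtype])

(def arr-ds (ds/ensure-array-backed (ds/->dataset [{:x 1} {:x 2}])))
;; Each column's data is now retrievable as a plain java array.
(dtype/->array (ds/column arr-ds :x))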
(filter predicate dataset)
(filter predicate column-name-seq dataset)
dataset->dataset transformation. Predicate is passed a map of colname->column-value.
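For example (hypothetical :age column, assuming the ds alias):

(require '[tech.ml.dataset :as ds])

;; The predicate receives each row as a map of colname->value;
;; rows for which it returns truthy are kept.
(ds/filter #(> (:age %) 30)
           (ds/->dataset [{:age 25} {:age 40}]))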
(filter-column predicate colname dataset)
Filter a given column by a predicate. The predicate is passed column values; truthy values are kept. Returns a dataset.
(from-prototype dataset table-name column-seq)
Create a new dataset that is the same type as this one but with a potentially different table name and column sequence. Take care that the columns are all of the correct type.
(group-by key-fn dataset)
(group-by key-fn column-name-seq dataset)
Produce a map of key-fn-value->dataset. key-fn is a function taking a map of colname->column-value. Selecting which columns are used in the key-fn via column-name-seq is optional but will greatly improve performance.
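A sketch (hypothetical :city column): a keyword works as the key-fn since each row is passed as a map.

(require '[tech.ml.dataset :as ds])

;; Restricting key-fn input to the :city column improves performance.
(ds/group-by :city
             [:city]
             (ds/->dataset [{:city "NYC" :n 1} {:city "LA" :n 2}]))
;; => {"NYC" <dataset>, "LA" <dataset>}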
(group-by->indexes key-fn dataset)
(group-by->indexes key-fn column-name-seq dataset)
(group-by-column colname dataset)
Return a map of column-value->dataset.
(index-value-seq dataset & [reader-options])
Get a sequence of tuples: [idx col-value-vec]
Values are in order of column-name-seq. Duplicate names are allowed and result in duplicate values.
(mapseq-reader dataset)
(mapseq-reader dataset options)
Return a reader that produces a map of column-name->column-value for each row.
Options: :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing-value detection is independent of column datatype.
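For example (assuming the ds alias):

(require '[tech.ml.dataset :as ds])

;; Each element is a map of column-name->value; missing values appear as nil.
(doseq [row (ds/mapseq-reader (ds/->dataset [{:a 1} {:a 2}]))]
  (println row))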
(maybe-column dataset column-name)
Return the column if it exists, else nil.
(new-column dataset column-name values)
Create a new column from some values.
(order-column-names dataset colname-seq)
Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.
(remove-columns dataset colname-seq)
Same as drop-columns.
(remove-rows dataset row-indexes)
Same as drop-rows.
(rename-columns dataset colname-map)
Rename columns using a map. Does not reorder columns.
(select dataset colname-seq index-seq)
Reorder/trim dataset according to this sequence of indexes. Returns a new dataset.
colname-seq - one of:
  - :all - all the columns.
  - sequence of column names - those columns in that order.
  - implementation of java.util.Map - column order is dictated by map iteration order; selected columns are subsequently named after the corresponding value in the map. Similar to `rename-columns` except this trims the result to be only the columns in the map.
index-seq - either the keyword :all or a list of indexes. May contain duplicates.
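Sketches of the three colname-seq forms (column names hypothetical, assuming the ds alias):

(require '[tech.ml.dataset :as ds])

(def example (ds/->dataset [{:a 1 :b 2 :c 3} {:a 4 :b 5 :c 6}]))
;; Two named columns, first row only.
(ds/select example [:a :b] [0])
;; All columns, rows 1 then 0 (order and duplicates are respected).
(ds/select example :all [1 0])
;; Map form: selects :a and renames it to :alpha.
(ds/select example {:a :alpha} [0 1])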
(sort-by key-fn dataset)
(sort-by key-fn compare-fn dataset)
(sort-by key-fn compare-fn column-name-seq dataset)
Sort a dataset by a key-fn and compare-fn.
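For example (hypothetical :age column; key-fn receives each row as a map, so a keyword suffices):

(require '[tech.ml.dataset :as ds])

(ds/sort-by :age
            (ds/->dataset [{:age 40} {:age 25}]))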
(sort-by-column colname dataset)
(sort-by-column colname compare-fn dataset)
Sort a dataset by a given column using the given compare fn.
(supported-column-stats dataset)
Return the set of natively supported stats for the dataset. This must be at least #{:mean :variance :median :skew}.
(unique-by map-fn dataset)
(unique-by map-fn
           {:keys [column-name-seq keep-fn]
            :or {keep-fn #(first %2)}
            :as _options}
           dataset)
map-fn gets passed a map for each row; rows are grouped by its return value. keep-fn is used to decide which index to keep.
:keep-fn - Function from key, idx-seq -> idx. Defaults to #(first %2).
(unique-by-column colname dataset)
(unique-by-column colname
                  {:keys [keep-fn]
                   :or {keep-fn #(first %2)}
                   :as _options}
                  dataset)
Rows are grouped by the value in colname. keep-fn is used to decide which index to keep.
:keep-fn - Function from key, idx-seq -> idx. Defaults to #(first %2).
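For example (hypothetical :id column): with the default keep-fn, the first row seen for each distinct :id value is kept.

(require '[tech.ml.dataset :as ds])

(ds/unique-by-column :id
                     (ds/->dataset [{:id 1 :v "a"}
                                    {:id 1 :v "b"}
                                    {:id 2 :v "c"}]))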
(unordered-select dataset colname-seq index-seq)
Perform a selection but use the order of the columns in the existing table; do *not* reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.
(update-column dataset col-name update-fn)
Update a column returning a new dataset. update-fn is a column->column transformation. Errors if the column does not exist.
(update-columns dataset column-name-seq update-fn)
Update a sequence of columns.
(value-reader dataset)
(value-reader dataset options)
Return a reader that produces a reader of column values per index.
Options: :missing-nil? - Defaults to true - Substitute nil for missing values so that downstream missing-value detection is independent of column datatype.
(write-csv! ds output)
(write-csv! ds output options)
Write a dataset to a tsv or csv output stream. Closes output if a stream is passed in. The file output format will be inferred if output is a string:
  - .csv, .tsv - switches between tsv and csv. Tsv is the default.
  - *.gz - write to a gzipped stream.
At this time writing to json is not supported.
options - :separator - in case output isn't a string, you can use either \, or \tab to switch between csv or tsv output respectively.
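For example (output path hypothetical):

(require '[tech.ml.dataset :as ds])

;; The .tsv.gz suffix selects tsv output written through a gzipped stream.
(ds/write-csv! (ds/->dataset [{:a 1} {:a 2}]) "out.tsv.gz")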