clj-djl.dataframe


->dataframeclj

source

->datasetclj

(->dataset dataset)
(->dataset dataset {:keys [table-name dataset-name] :as options})

Create a dataset from either csv/tsv or a sequence of maps.

  • A String will be interpreted as a file (or a gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv, and to detect the column datatypes; all of this can be overridden.

  • InputStreams have no file type and thus a file-type must be provided in the options.

  • A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

Parquet, xlsx, and xls formats require that you require the appropriate libraries which are tech.v3.libs.parquet for parquet, tech.v3.libs.fastexcel for xlsx, and tech.v3.libs.poi for xls.

Arrow support is provided via the tech.v3.libs.arrow namespace rather than a file-type overload, as the Arrow project currently has 3 different file types and it is not clear what their final suffix will be or which of the three file types it will indicate. Please see the documentation in the tech.v3.libs.arrow namespace for further information on Arrow file types.

Options:

  • :dataset-name - set the name of the dataset.
  • :file-type - Override the filetype discovery mechanism for strings, or force a particular parser for an input stream. Note that parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :parquet}.
  • :gzipped? - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.
  • :column-whitelist - either sequence of string column names or sequence of column indices of columns to whitelist.
  • :column-blacklist - either sequence of string column names or sequence of column indices of columns to blacklist.
  • :num-rows - Number of rows to read
  • :header-row? - Defaults to true, indicates the first row is a header.
  • :key-fn - function to be applied to column names. Typical use is: :key-fn keyword.
  • :separator - Add a character separator to the list of separators to auto-detect.
  • :csv-parser - Implementation of univocity's AbstractParser to use. If not provided a default permissive parser is used. This way you parse anything that univocity supports (so flat files and such).
  • :bad-row-policy - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows and in this case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.
  • :skip-bad-rows? - Legacy option. Use :bad-row-policy.
  • :max-chars-per-column - Defaults to 4096. Columns with more characters than this will result in an exception.
  • :max-num-columns - Defaults to 8192. CSV/TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301
  • :n-initial-skip-rows - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.
  • :parser-fn -
    • keyword? - all columns parsed to this datatype. For example: {:parser-fn :string}
    • map? - {column-name parse-method} parse each column with specified parse-method. The parse-method can be:
      • keyword? - parse the specified column to this datatype. For example: {:parser-fn {:answer :boolean :id :int32}}
      • tuple - pair of [datatype parse-data] in which case container of type [datatype] will be created. parse-data can be one of:
        • :relaxed? - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column that tell you the values that failed to parse and their respective indexes.
        • fn? - a function from str -> one of :tech.ml.dataset.parser/missing, :tech.ml.dataset.parser/parse-failure, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes; :parse-failure will result in the index being added to missing, and the column's :unparsed-values and :unparsed-indexes will be updated.
        • string? - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid argument to Charset/forName.
        • DateTimeFormatter - use with the appropriate temporal parse static function to parse the value.
  • map? - the header-name-or-idx is used to look up the value. If the result is not nil, the value can be any of the above options; else the default column parser is used.

Returns a new dataset
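
Below is a minimal, hedged sketch; the file name and column names are hypothetical, and the options are those documented above:

```clojure
(require '[clj-djl.dataframe :as df])

;; Parse a hypothetical CSV file, keywordizing column names and
;; forcing :id to int32 while parsing :date with an explicit pattern.
(def ds
  (df/->dataset "data.csv"
                {:key-fn    keyword
                 :parser-fn {:id   :int32
                             :date [:local-date "yyyy-MM-dd"]}}))
```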

source

->ndarrayclj

(->ndarray ndm dataframe)

Convert dataframe to NDArray
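
A sketch, assuming clj-djl.ndarray and its new-base-manager (as used elsewhere in clj-djl) are available, with df aliasing clj-djl.dataframe:

```clojure
(require '[clj-djl.dataframe :as df]
         '[clj-djl.ndarray :as nd])

;; NDArrays are allocated from an NDManager.
(def ndm (nd/new-base-manager))

(df/->ndarray ndm (df/->dataset [{:a 1 :b 2} {:a 3 :b 4}]))
;; => an NDArray holding the dataframe's values
```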

source

add-columnclj

(add-column dataset column)

Add a new column. Error if name collision

source

add-or-update-columnclj

(add-or-update-column dataset column)
(add-or-update-column dataset colname column)

If column exists, replace. Else append new column.

source

assoc-dsclj

(assoc-ds dataset cname cdata & args)

If dataset is not nil, calls clojure.core/assoc. Else creates a new empty dataset and then calls clojure.core/assoc. Guaranteed to return a dataset (unlike assoc).
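
A quick sketch (df aliases clj-djl.dataframe; the column data are illustrative):

```clojure
;; Safe even when the dataset argument is nil, unlike clojure.core/assoc.
(df/assoc-ds nil :a [1 2 3] :b [4 5 6])
;; => a new dataset with columns :a and :b
```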

source

briefclj

(brief ds)
(brief ds options)

Get a brief description, in mapseq form, of a dataset. A brief description is the mapseq form of descriptive stats.
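
For example (the exact stat keys depend on the column datatypes):

```clojure
(df/brief (df/->dataset [{:a 1} {:a 2} {:a nil}]))
;; => a seq of maps, one per column, each holding that column's
;;    descriptive stats (min/mean/max, missing count, and so on)
```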

source

categorical->one-hotclj

(categorical->one-hot dataset filter-fn-or-ds)
(categorical->one-hot dataset filter-fn-or-ds table-args)
(categorical->one-hot dataset filter-fn-or-ds table-args result-datatype)

Convert string columns to numeric columns. See tech.v3.dataset.categorical/fit-one-hot
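
A hedged sketch, assuming the filter argument may be given as a sequence of column names, as with select-columns:

```clojure
(def ds (df/->dataset [{:color "red"} {:color "blue"} {:color "red"}]))

;; Expand the string column :color into one numeric indicator
;; column per distinct value.
(df/categorical->one-hot ds [:color])
```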

source

columnclj

(column dataset colname)
source

column-countclj

(column-count dataset)
source

column-namesclj

(column-names dataset)

In-order sequence of column names

source

columnsclj

(columns dataset)

Return sequence of all columns in dataset.

source

columns-with-missing-seqclj

(columns-with-missing-seq dataset)

Return a sequence of:

  {:column-name column-name
   :missing-count missing-count}

or nil if no columns are missing data.

source

concatclj

(concat dataset & datasets)

Concatenate datasets in place. See also concat-copying as it may be more efficient for your use case.
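
For instance:

```clojure
(def ds-a (df/->dataset [{:x 1} {:x 2}]))
(def ds-b (df/->dataset [{:x 3}]))

(df/concat ds-a ds-b)
;; => a 3-row dataset; see concat-copying when a copying
;;    concatenation is more efficient for your workload.
```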

source

concat-copyingclj

(concat-copying dataset & datasets)

Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

source

concat-inplaceclj

(concat-inplace dataset & datasets)

Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

source

data->datasetclj

(data->dataset {:keys [metadata columns] :as input})

Convert a data-ized dataset created via dataset->data back into a full dataset

source

dataset->dataclj

(dataset->data ds)

Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}. :columns is a vector of column definitions appropriate for passing directly back into new-dataset. A column definition in this case is a map of {:name :missing :data :metadata}.
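
A round-trip sketch:

```clojure
(def ds (df/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

;; Pure-Clojure form: {:metadata ... :columns [{:name ... :data ...} ...]}
(def plain (df/dataset->data ds))

;; ...and back to a full dataset.
(df/data->dataset plain)
```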

source

dataset-nameclj

(dataset-name dataset)
source

drop-columnsclj

(drop-columns dataset col-name-seq)

Same as remove-columns

source

drop-missingclj

(drop-missing dataset-or-col)

Remove missing entries by simply selecting out the missing indexes

source

drop-rowsclj

(drop-rows dataset-or-col row-indexes)

Drop rows from dataset or column

source

ensure-array-backedclj

(ensure-array-backed ds)
(ensure-array-backed ds {:keys [unpack?] :or {unpack? true}})

Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.

Columns that are already array backed and that have no missing values are not changed and are returned as-is.

The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.

options - :unpack? - unpack packed datetime types. Defaults to true

source

filterclj

(filter dataset predicate)

dataset->dataset transformation. Predicate is passed a map of colname->column-value.
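
For example:

```clojure
;; Keep rows whose :age exceeds 30; the predicate receives each
;; row as a map of colname->column-value.
(df/filter (df/->dataset [{:age 25} {:age 40}])
           (fn [row] (> (:age row) 30)))
```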

source

filter-columnclj

(filter-column dataset colname predicate)

Filter a given column by a predicate. The predicate is passed column values. If the predicate is not an instance of IFn it is treated as a value and used as if the predicate were #(= value %). Returns a dataset.
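
Two quick sketches:

```clojure
(def ds (df/->dataset [{:x 1} {:x 2} {:x 3}]))

;; Predicate form: keep rows whose :x is odd.
(df/filter-column ds :x odd?)

;; Value form: a non-IFn predicate is compared with =.
(df/filter-column ds :x 2)
```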

source

group-byclj

(group-by dataset key-fn)

Produce a map of key-fn-value->dataset. key-fn is a function taking a map of colname->column-value. Selecting which columns are used in the key-fn using column-name-seq is optional but will greatly improve performance.
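
For example:

```clojure
(def ds (df/->dataset [{:k :a :v 1} {:k :b :v 2} {:k :a :v 3}]))

;; key-fn receives each row as a map, so a keyword works as the key-fn.
(df/group-by ds :k)
;; => {:a <2-row dataset>, :b <1-row dataset>}
```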

source

group-by->indexesclj

(group-by->indexes dataset key-fn)

(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes is an in-order contiguous group of indexes.

source

group-by-columnclj

(group-by-column dataset colname)

Return a map of column-value->dataset.

source

group-by-column->indexesclj

(group-by-column->indexes dataset colname)

(Non-lazy) - Group a dataset by a column and return a map of column-val->indexes where indexes is an in-order contiguous group of indexes.

source

has-column?clj

(has-column? dataset column-name)
source

headclj

(head dataset)
(head dataset n)

Get the first n rows of a dataset. Equivalent to `(select-rows ds (range n))`. Arguments are reversed, however, so this can be used in ->> operators.
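
For example:

```clojure
(def ds (df/->dataset [{:i 1} {:i 2} {:i 3} {:i 4}]))

;; First two rows; with no n argument a default row count is used.
(df/head ds 2)
```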

source

missingclj

(missing dataset-or-col)

Given a dataset or a column, return the missing set as a roaring bitmap

source

new-columnclj

(new-column name data)
(new-column name data metadata)
(new-column name data metadata missing)

Create a new column. Data will be scanned for missing values unless the full 4-argument pathway is used.
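
A sketch combining new-column with add-column:

```clojure
(def ds (df/->dataset [{:a 1} {:a 2}]))

;; Build a column and append it; add-column errors if :b already exists.
(df/add-column ds (df/new-column :b [10 20]))
```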

source

order-column-namesclj

(order-column-names dataset colname-seq)

Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.

source

remove-columnclj

(remove-column dataset col-name)

Same as:

(dissoc dataset col-name)
source

remove-columnsclj

(remove-columns dataset colname-seq)

Same as drop-columns

source

remove-rowsclj

(remove-rows dataset-or-col row-indexes)

Same as drop-rows.

source

rename-columnsclj

(rename-columns dataset colname-map)

Rename columns using a map. Does not reorder columns.

source

replace-missingclj

(replace-missing ds)
(replace-missing ds strategy)
(replace-missing ds columns-selector strategy)
(replace-missing ds columns-selector strategy value)

Replace missing values in some columns with a given strategy. The columns selector may be any legal argument to select-columns. Strategies may be:

  • :down - take the value from the previous non-missing row if possible, else use the next non-missing row.
  • :up - take the value from the next non-missing row if possible, else use the previous non-missing row.
  • :mid - use the midpoint of the averaged values between the previous and next non-missing rows.
  • :lerp - linearly interpolate values between the previous and next non-missing rows.
  • :value - replace with a provided value (see the sketch below). The value may be a function, in which case it is called on the column with missing values elided and its return value is used as the filler.
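
Two sketches:

```clojure
(def ds (df/->dataset [{:x 1} {:x nil} {:x 3}]))

;; Fill down from the previous non-missing row.
(df/replace-missing ds :down)

;; Replace missing :x values with a constant.
(df/replace-missing ds [:x] :value 0)
```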
source

row-countclj

(row-count dataset-or-col)
source

selectclj

(select dataset colname-seq index-seq)

Reorder/trim the dataset according to this sequence of indexes. Returns a new dataset. colname-seq - one of:

  • :all - all the columns.
  • sequence of column names - those columns, in that order.
  • implementation of java.util.Map - column order is dictated by map iteration order, and selected columns are renamed to the corresponding value in the map. Similar to rename-columns, except the result is trimmed to only the columns in the map (see the example below).

index-seq - either the keyword :all or a list of indexes. May contain duplicates.
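
For example:

```clojure
(def ds (df/->dataset [{:a 1 :b 2 :c 3} {:a 4 :b 5 :c 6}]))

;; Keep :a as-is, rename :b to :bee, trim away :c,
;; and take row 1 followed by row 0.
(df/select ds {:a :a :b :bee} [1 0])
```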
source

select-by-indexclj

(select-by-index dataframe row-index col-index)
source

select-columnsclj

(select-columns dataset col-name-seq)

Select columns from the dataset by seq of column names or :all.

source

select-columns-by-indexclj

(select-columns-by-index dataset col-index)

Select columns from the dataset by a seq of indexes (negative indexes allowed) or :all.

See documentation for select-by-index.

source

select-rowsclj

(select-rows dataset-or-col row-indexes)

Select rows from the dataset or column.

source

select-rows-by-indexclj

(select-rows-by-index dataset-or-col row-index)

Select rows from the dataset or column by a seq of indexes (negative indexes allowed) or :all.

See documentation for select-by-index.

source

set-dataset-nameclj

(set-dataset-name dataset ds-name)
source

shapeclj

(shape dataframe)

Get the shape of the dataset, row count first.

source

sort-byclj

(sort-by dataset key-fn)
(sort-by dataset key-fn compare-fn)

Sort a dataset by a key-fn and compare-fn.

source

sort-by-columnclj

(sort-by-column dataset colname)
(sort-by-column dataset colname compare-fn)

Sort a dataset by a given column using the given compare fn.

source

tailclj

(tail dataset)
(tail dataset n)

Get the last n rows of a dataset. Equivalent to `(select-rows ds (range ...))`. Argument order is dataset-last, however, so this can be used in ->> operators.

source

take-nthclj

(take-nth dataset n-val)
source

unique-byclj

(unique-by dataset map-fn)
(unique-by dataset {:keys [keep-fn] :or {keep-fn #(first %2)} :as _options} map-fn)

Map-fn gets passed a map for each row; rows are grouped by the return value. Keep-fn is used to decide which index to keep.

:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).
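
For example:

```clojure
(def ds (df/->dataset [{:id 1 :v :a} {:id 1 :v :b} {:id 2 :v :c}]))

;; One row per distinct :id; keep the last duplicate index rather
;; than the default first.
(df/unique-by ds {:keep-fn (fn [_k idxs] (last idxs))} :id)
```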

source

unique-by-columnclj

(unique-by-column dataset colname)
(unique-by-column dataset {:keys [keep-fn] :or {keep-fn #(first %2)} :as _options} colname)

Map-fn gets passed a map for each row; rows are grouped by the return value. Keep-fn is used to decide which index to keep.

:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).

source

unordered-selectclj

(unordered-select dataset colname-seq index-seq)

Perform a selection but use the order of the columns in the existing table; do not reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.

source

update-columnclj

(update-column dataset col-name update-fn)

Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.
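
For example, assuming tech.v3.datatype.functional is on the classpath (clj-djl builds on the tech.v3 stack):

```clojure
(require '[tech.v3.datatype.functional :as dfn])

;; Add 10 to every value of :x; update-fn is a column->column transform.
(df/update-column (df/->dataset [{:x 1} {:x 2}])
                  :x
                  #(dfn/+ % 10))
```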

source

update-columnsclj

(update-columns dataset column-name-seq update-fn)

source
