clj-djl.dataframe


->dataframeclj

source

->datasetclj

(->dataset dataset)
(->dataset dataset {:keys [table-name dataset-name] :as options})

Create a dataset from either csv/tsv or a sequence of maps.

  • A String will be interpreted as a file (or a gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv, and to detect the column datatypes; all of this can be overridden.

  • InputStreams have no file type and thus a file-type must be provided in the options.

  • A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

Parquet, xlsx, and xls formats require that you require the appropriate libraries which are tech.v3.libs.parquet for parquet, tech.v3.libs.fastexcel for xlsx, and tech.v3.libs.poi for xls.

Arrow support is provided via the tech.v3.libs.arrow namespace rather than a file-type overload, as the Arrow project currently has 3 different file types and it is not clear what their final suffix will be or which of the three file types it will indicate. Please see the documentation in the tech.v3.libs.arrow namespace for further information on Arrow file types.

Options:

  • :dataset-name - set the name of the dataset.
  • :file-type - Override the filetype discovery mechanism for strings, or force a particular parser for an input stream. Note that parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: #{:csv :tsv :xlsx :xls :parquet}.
  • :gzipped? - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.
  • :column-whitelist - either sequence of string column names or sequence of column indices of columns to whitelist.
  • :column-blacklist - either sequence of string column names or sequence of column indices of columns to blacklist.
  • :num-rows - Number of rows to read
  • :header-row? - Defaults to true, indicates the first row is a header.
  • :key-fn - function to be applied to column names. Typical use is: :key-fn keyword.
  • :separator - Add a character separator to the list of separators to auto-detect.
  • :csv-parser - Implementation of univocity's AbstractParser to use. If not provided a default permissive parser is used. This way you parse anything that univocity supports (so flat files and such).
  • :bad-row-policy - One of three options: :skip, :error, :carry-on. Defaults to :carry-on. Some csv data has ragged rows and in this case we have several options. If the option is :carry-on then we either create a new column or add missing values for columns that had no data for that row.
  • :skip-bad-rows? - Legacy option. Use :bad-row-policy.
  • :max-chars-per-column - Defaults to 4096. Columns with more characters than this will result in an exception.
  • :max-num-columns - Defaults to 8192. CSV/TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301
  • :n-initial-skip-rows - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.
  • :parser-fn -
    • keyword? - all columns parsed to this datatype. For example: {:parser-fn :string}
    • map? - {column-name parse-method} parse each column with specified parse-method. The parse-method can be:
      • keyword? - parse the specified column to this datatype. For example: {:parser-fn {:answer :boolean :id :int32}}
      • tuple - pair of [datatype parse-data] in which case container of type [datatype] will be created. parse-data can be one of:
        • :relaxed? - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. :unparsed-values and :unparsed-indexes are available in the metadata of the column that tell you the values that failed to parse and their respective indexes.
        • fn? - a function from str -> one of :tech.ml.dataset.parser/missing, :tech.ml.dataset.parser/parse-failure, or the parsed value. Exceptions here always kill the parse process. :missing will get marked in the missing indexes; :parse-failure will result in the index being added to missing, and the column's :unparsed-values and :unparsed-indexes will be updated.
        • string? - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For encoded-text, this has to be a valid argument to Charset/forName.
        • DateTimeFormatter - use with the appropriate temporal parse static function to parse the value.
  • map? - the header-name-or-idx is used to look up the value. If the result is not nil, the value can be any of the above options; else the default column parser is used.

Returns a new dataset
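
Below is a minimal, hedged sketch; the file name and column names are hypothetical, and the options are those documented above:

```clojure
(require '[clj-djl.dataframe :as df])

;; Parse a hypothetical CSV file, keywordizing column names and
;; forcing :id to int32 while parsing :date with an explicit pattern.
(def ds
  (df/->dataset "data.csv"
                {:key-fn    keyword
                 :parser-fn {:id   :int32
                             :date [:local-date "yyyy-MM-dd"]}}))
```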

source

->ndarrayclj

(->ndarray ndm dataframe)

Convert dataframe to NDArray
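
A sketch, assuming clj-djl.ndarray and its new-base-manager (as used elsewhere in clj-djl) are available, with df aliasing clj-djl.dataframe:

```clojure
(require '[clj-djl.dataframe :as df]
         '[clj-djl.ndarray :as nd])

;; NDArrays are allocated from an NDManager.
(def ndm (nd/new-base-manager))

(df/->ndarray ndm (df/->dataset [{:a 1 :b 2} {:a 3 :b 4}]))
;; => an NDArray holding the dataframe's values
```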

source

add-columnclj

(add-column dataset column)

Add a new column. Error if name collision

source

add-or-update-columnclj

(add-or-update-column dataset column)
(add-or-update-column dataset colname column)

If column exists, replace. Else append new column.

source

assoc-dsclj

(assoc-ds dataset cname cdata & args)

If dataset is not nil, calls clojure.core/assoc. Else creates a new empty dataset and then calls clojure.core/assoc. Guaranteed to return a dataset (unlike assoc).
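
A quick sketch (df aliases clj-djl.dataframe; the column data are illustrative):

```clojure
;; Safe even when the dataset argument is nil, unlike clojure.core/assoc.
(df/assoc-ds nil :a [1 2 3] :b [4 5 6])
;; => a new dataset with columns :a and :b
```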

source

briefclj

(brief ds)
(brief ds options)

Get a brief description, in mapseq form, of a dataset. A brief description is the mapseq form of descriptive stats.
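
For example (the exact stat keys depend on the column datatypes):

```clojure
(df/brief (df/->dataset [{:a 1} {:a 2} {:a nil}]))
;; => a seq of maps, one per column, each holding that column's
;;    descriptive stats (min/mean/max, missing count, and so on)
```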

source

categorical->one-hotclj

(categorical->one-hot dataset filter-fn-or-ds)
(categorical->one-hot dataset filter-fn-or-ds table-args)
(categorical->one-hot dataset filter-fn-or-ds table-args result-datatype)

Convert string columns to numeric columns. See tech.v3.dataset.categorical/fit-one-hot
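
A hedged sketch, assuming the filter argument may be given as a sequence of column names, as with select-columns:

```clojure
(def ds (df/->dataset [{:color "red"} {:color "blue"} {:color "red"}]))

;; Expand the string column :color into one numeric indicator
;; column per distinct value.
(df/categorical->one-hot ds [:color])
```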

source

columnclj

(column dataset colname)
source

column-countclj

(column-count dataset)
source

column-namesclj

(column-names dataset)

In-order sequence of column names

source

columnsclj

(columns dataset)

Return sequence of all columns in dataset.

source

columns-with-missing-seqclj

(columns-with-missing-seq dataset)

Return a sequence of:

  {:column-name column-name
   :missing-count missing-count}

or nil if no columns are missing data.

source

concatclj

(concat dataset & datasets)

Concatenate datasets in place. See also concat-copying as it may be more efficient for your use case.
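
For instance:

```clojure
(def ds-a (df/->dataset [{:x 1} {:x 2}]))
(def ds-b (df/->dataset [{:x 3}]))

(df/concat ds-a ds-b)
;; => a 3-row dataset; see concat-copying when a copying
;;    concatenation is more efficient for your workload.
```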

source

concat-copyingclj

(concat-copying dataset & datasets)

Concatenate datasets into a new dataset copying data. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

source

concat-inplaceclj

(concat-inplace dataset & datasets)

Concatenate datasets in place. Respects missing values. Datasets must all have the same columns. Result column datatypes will be a widening cast of the datatypes.

source

data->datasetclj

(data->dataset {:keys [metadata columns] :as input})

Convert a data-ized dataset created via dataset->data back into a full dataset

source

dataset->dataclj

(dataset->data ds)

Convert a dataset to a pure clojure datastructure. Returns a map with two keys: {:metadata :columns}. :columns is a vector of column definitions appropriate for passing directly back into new-dataset. A column definition in this case is a map of {:name :missing :data :metadata}.
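
A round-trip sketch:

```clojure
(def ds (df/->dataset [{:a 1 :b "x"} {:a 2 :b "y"}]))

;; Pure-Clojure form: {:metadata ... :columns [{:name ... :data ...} ...]}
(def plain (df/dataset->data ds))

;; ...and back to a full dataset.
(df/data->dataset plain)
```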

source

dataset-nameclj

(dataset-name dataset)
source

drop-columnsclj

(drop-columns dataset col-name-seq)

Same as remove-columns

source

drop-missingclj

(drop-missing dataset-or-col)

Remove missing entries by simply selecting out the missing indexes

source

drop-rowsclj

(drop-rows dataset-or-col row-indexes)

Drop rows from dataset or column

source

ensure-array-backedclj

(ensure-array-backed ds)
(ensure-array-backed ds {:keys [unpack?] :or {unpack? true}})

Ensure the column data in the dataset is stored in pure java arrays. This is sometimes necessary for interop with other libraries and this operation will force any lazy computations to complete. This also clears the missing set for each column and writes the missing values to the new arrays.

Columns that are already array backed and that have no missing values are not changed and are returned as-is.

The postcondition is that dtype/->array will return a java array in the appropriate datatype for each column.

options - :unpack? - unpack packed datetime types. Defaults to true

source

filterclj

(filter dataset predicate)

dataset->dataset transformation. Predicate is passed a map of colname->column-value.
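
For example:

```clojure
;; Keep rows whose :age exceeds 30; the predicate receives each
;; row as a map of colname->column-value.
(df/filter (df/->dataset [{:age 25} {:age 40}])
           (fn [row] (> (:age row) 30)))
```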

source

filter-columnclj

(filter-column dataset colname predicate)

Filter a given column by a predicate. The predicate is passed column values. If the predicate is not an instance of IFn it is treated as a value and used as if the predicate were #(= value %). Returns a dataset.
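
Two quick sketches:

```clojure
(def ds (df/->dataset [{:x 1} {:x 2} {:x 3}]))

;; Predicate form: keep rows whose :x is odd.
(df/filter-column ds :x odd?)

;; Value form: a non-IFn predicate is compared with =.
(df/filter-column ds :x 2)
```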

source

group-byclj

(group-by dataset key-fn)

Produce a map of key-fn-value->dataset. key-fn is a function taking a map of colname->column-value. Selecting which columns are used in the key-fn using column-name-seq is optional but will greatly improve performance.
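
For example:

```clojure
(def ds (df/->dataset [{:k :a :v 1} {:k :b :v 2} {:k :a :v 3}]))

;; key-fn receives each row as a map, so a keyword works as the key-fn.
(df/group-by ds :k)
;; => {:a <2-row dataset>, :b <1-row dataset>}
```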

source

group-by->indexesclj

(group-by->indexes dataset key-fn)

(Non-lazy) - Group a dataset and return a map of key-fn-value->indexes where indexes is an in-order contiguous group of indexes.

source

group-by-columnclj

(group-by-column dataset colname)

Return a map of column-value->dataset.

source

group-by-column->indexesclj

(group-by-column->indexes dataset colname)

(Non-lazy) - Group a dataset by a column and return a map of column-val->indexes where indexes is an in-order contiguous group of indexes.

source

has-column?clj

(has-column? dataset column-name)
source

headclj

(head dataset)
(head dataset n)

Get the first n rows of a dataset. Equivalent to `(select-rows ds (range n))`. Arguments are reversed, however, so this can be used in ->> operators.
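
For example:

```clojure
(def ds (df/->dataset [{:i 1} {:i 2} {:i 3} {:i 4}]))

;; First two rows; with no n argument a default row count is used.
(df/head ds 2)
```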

source

missingclj

(missing dataset-or-col)

Given a dataset or a column, return the missing set as a roaring bitmap

source

new-columnclj

(new-column name data)
(new-column name data metadata)
(new-column name data metadata missing)

Create a new column. Data will be scanned for missing values unless the full 4-argument pathway is used.
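
A sketch combining new-column with add-column:

```clojure
(def ds (df/->dataset [{:a 1} {:a 2}]))

;; Build a column and append it; add-column errors if :b already exists.
(df/add-column ds (df/new-column :b [10 20]))
```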

source

order-column-namesclj

(order-column-names dataset colname-seq)

Order a sequence of column names so they match the order in the original dataset. Missing columns are placed last.

source

remove-columnclj

(remove-column dataset col-name)

Same as:

(dissoc dataset col-name)
source

remove-columnsclj

(remove-columns dataset colname-seq)

Same as drop-columns

source

remove-rowsclj

(remove-rows dataset-or-col row-indexes)

Same as drop-rows.

source

rename-columnsclj

(rename-columns dataset colname-map)

Rename columns using a map. Does not reorder columns.

source

replace-missingclj

(replace-missing ds)
(replace-missing ds strategy)
(replace-missing ds columns-selector strategy)
(replace-missing ds columns-selector strategy value)

Replace missing values in some columns with a given strategy. The columns selector may be any legal argument to select-columns. Strategies may be:

  • :down - take the value from the previous non-missing row if possible, else use the next non-missing row.
  • :up - take the value from the next non-missing row if possible, else use the previous non-missing row.
  • :mid - use the midpoint of the averaged values between the previous and next non-missing rows.
  • :lerp - linearly interpolate values between the previous and next non-missing rows.
  • :value - replace with a provided value (see the sketch below). The value may be a function, in which case it is called on the column with missing values elided and its return value is used as the filler.
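
Two sketches:

```clojure
(def ds (df/->dataset [{:x 1} {:x nil} {:x 3}]))

;; Fill down from the previous non-missing row.
(df/replace-missing ds :down)

;; Replace missing :x values with a constant.
(df/replace-missing ds [:x] :value 0)
```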
source

row-countclj

(row-count dataset-or-col)
source

selectclj

(select dataset colname-seq index-seq)

Reorder/trim the dataset according to this sequence of indexes. Returns a new dataset. colname-seq - one of:

  • :all - all the columns.
  • sequence of column names - those columns, in that order.
  • implementation of java.util.Map - column order is dictated by map iteration order, and selected columns are renamed to the corresponding value in the map. Similar to rename-columns, except the result is trimmed to only the columns in the map (see the example below).

index-seq - either the keyword :all or a list of indexes. May contain duplicates.
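
For example:

```clojure
(def ds (df/->dataset [{:a 1 :b 2 :c 3} {:a 4 :b 5 :c 6}]))

;; Keep :a as-is, rename :b to :bee, trim away :c,
;; and take row 1 followed by row 0.
(df/select ds {:a :a :b :bee} [1 0])
```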
source

select-by-indexclj

(select-by-index dataframe row-index col-index)
source

select-columnsclj

(select-columns dataset col-name-seq)

Select columns from the dataset by seq of column names or :all.

source

select-columns-by-indexclj

(select-columns-by-index dataset col-index)

Select columns from the dataset by a seq of indexes (negative indexes allowed) or :all.

See documentation for select-by-index.

source

select-rowsclj

(select-rows dataset-or-col row-indexes)

Select rows from the dataset or column.

source

select-rows-by-indexclj

(select-rows-by-index dataset-or-col row-index)

Select rows from the dataset or column by a seq of indexes (negative indexes allowed) or :all.

See documentation for select-by-index.

source

set-dataset-nameclj

(set-dataset-name dataset ds-name)
source

shapeclj

(shape dataframe)

Get the shape of the dataset, row count first.

source

sort-byclj

(sort-by dataset key-fn)
(sort-by dataset key-fn compare-fn)

Sort a dataset by a key-fn and compare-fn.

source

sort-by-columnclj

(sort-by-column dataset colname)
(sort-by-column dataset colname compare-fn)

Sort a dataset by a given column using the given compare fn.

source

tailclj

(tail dataset)
(tail dataset n)

Get the last n rows of a dataset. Equivalent to `(select-rows ds (range ...))`. Argument order is dataset-last, however, so this can be used in ->> operators.

source

take-nthclj

(take-nth dataset n-val)
source

unique-byclj

(unique-by dataset map-fn)
(unique-by dataset {:keys [keep-fn] :or {keep-fn #(first %2)} :as _options} map-fn)

Map-fn gets passed a map for each row; rows are grouped by the return value. Keep-fn is used to decide which index to keep.

:keep-fn - Function from key,idx-seq->idx. Defaults to #(first %2).
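
For example:

```clojure
(def ds (df/->dataset [{:id 1 :v :a} {:id 1 :v :b} {:id 2 :v :c}]))

;; One row per distinct :id; keep the last duplicate index rather
;; than the default first.
(df/unique-by ds {:keep-fn (fn [_k idxs] (last idxs))} :id)
```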

source

unique-by-columnclj

(unique-by-column dataset colname)
(unique-by-column dataset {:keys [keep-fn] :or {keep-fn #(first %2)} :as _options} colname)

Map-fn gets passed a map for each row; rows are grouped by the return value. Keep-fn is used to decide which index to keep.

:keep-fn - Function from key, idx-seq->idx. Defaults to #(first %2).

source

unordered-selectclj

(unordered-select dataset colname-seq index-seq)

Perform a selection but use the order of the columns in the existing table; do not reorder the columns based on colname-seq. Useful when doing selection based on sets or persistent hash maps.

source

update-columnclj

(update-column dataset col-name update-fn)

Update a column returning a new dataset. update-fn is a column->column transformation. Error if column does not exist.
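
For example, assuming tech.v3.datatype.functional is on the classpath (clj-djl builds on the tech.v3 stack):

```clojure
(require '[tech.v3.datatype.functional :as dfn])

;; Add 10 to every value of :x; update-fn is a column->column transform.
(df/update-column (df/->dataset [{:x 1} {:x 2}])
                  :x
                  #(dfn/+ % 10))
```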

source

update-columnsclj

(update-columns dataset column-name-seq update-fn)

source
