(columns ds)
(columns ds result-type)
Returns columns of dataset. Result type can be any of:

* `:as-map`
* `:as-double-arrays`
* `:as-seqs`
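A short usage sketch. The `tc` alias and the `tablecloth.api` coordinates are assumptions; substitute the namespace this page documents:

```clojure
(require '[tablecloth.api :as tc]) ; assumed namespace/alias

(def ds (tc/dataset {:a [1 2 3] :b [4.0 5.0 6.0]}))

(tc/columns ds :as-map)           ; map of column-name -> column
(tc/columns ds :as-seqs)          ; one seq per column
(tc/columns ds :as-double-arrays) ; one double[] per column (numeric columns)
```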
(concat dataset & datasets)
Joins rows from other datasets
(concat-copying dataset & datasets)
Joins rows from other datasets via a copy of data
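For example (assuming the same `tc` alias as above):

```clojure
(def ds1 (tc/dataset {:a [1 2]}))
(def ds2 (tc/dataset {:a [3 4]}))

(tc/concat ds1 ds2)         ; 4-row dataset; may share the underlying data
(tc/concat-copying ds1 ds2) ; 4-row dataset; row data is copied
```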
(dataset)
(dataset data)
(dataset data
{:keys [single-value-column-name column-names layout dataset-name
stack-trace? error-column?]
:or {single-value-column-name :$value
layout :as-rows
stack-trace? false
error-column? true}
:as options})
Create a dataset.

Dataset can be created from:

* map of values and/or sequences
* sequence of maps
* sequence of columns
* file or url
* array of arrays
* single value

Single value is set only when it's not possible to find a path for given data. If tech.ml.dataset throws an exception, it won't be printed. To print a stack trace, set the `stack-trace?` option to `true`.
ds/->dataset documentation:

Create a dataset from either csv/tsv or a sequence of maps.

* A `String` will be interpreted as a file (or gzipped file if it ends with .gz) of tsv or csv data. The system will attempt to autodetect whether the data is csv or tsv, and to detect column datatypes; all of this can be overridden.
* InputStreams have no file type and thus a `file-type` must be provided in the options.
* A sequence of maps may be passed in, in which case the first N maps are scanned in order to derive the column datatypes before the actual columns are created.

Parquet, xlsx, and xls formats require that you require the appropriate libraries: `tech.v3.libs.parquet` for parquet, `tech.v3.libs.fastexcel` for xlsx, and `tech.v3.libs.poi` for xls.

Arrow support is provided via the `tech.v3.libs.arrow` namespace, not via a file-type overload, as the Arrow project currently has 3 different file types and it is not clear what their final suffix will be or which of the three file types it will indicate. Please see the documentation in the `tech.v3.libs.arrow` namespace for further information on Arrow file types.
Options:

- `:dataset-name` - set the name of the dataset.
- `:file-type` - Override the filetype discovery mechanism for strings, or force a particular parser for an input stream. Note that parquet must have paths on disk and cannot currently load from an input stream. Acceptable file types are: `#{:csv :tsv :xlsx :xls :parquet}`.
- `:gzipped?` - for file formats that support it, override autodetection and force creation of a gzipped input stream as opposed to a normal input stream.
- `:column-allowlist` - either a sequence of string column names or a sequence of column indices of columns to allowlist. This is preferred to `:column-whitelist`.
- `:column-blocklist` - either a sequence of string column names or a sequence of column indices of columns to blocklist. This is preferred to `:column-blacklist`.
- `:num-rows` - Number of rows to read.
- `:header-row?` - Defaults to `true`; indicates the first row is a header.
- `:key-fn` - function to be applied to column names. Typical use is: `:key-fn keyword`.
- `:separator` - Add a character separator to the list of separators to auto-detect.
- `:csv-parser` - Implementation of univocity's AbstractParser to use. If not provided, a default permissive parser is used. This way you can parse anything that univocity supports (flat files and such).
- `:bad-row-policy` - One of three options: `:skip`, `:error`, `:carry-on`. Defaults to `:carry-on`. Some csv data has ragged rows and in this case we have several options. If the option is `:carry-on` then we either create a new column or add missing values for columns that had no data for that row.
- `:skip-bad-rows?` - Legacy option. Use `:bad-row-policy`.
- `:disable-comment-skipping?` - By default, the `#` character is recognised as a line comment when found at the beginning of a line of text in a CSV file, and the row will be ignored. Set to `true` to disable this behavior.
- `:max-chars-per-column` - Defaults to 4096. Columns with more characters than this will result in an exception.
- `:max-num-columns` - Defaults to 8192. CSV/TSV files with more columns than this will fail to parse. For more information on this option, please visit: https://github.com/uniVocity/univocity-parsers/issues/301
- `:text-temp-dir` - The temporary directory to use for file-backed text. Setting this value to boolean `false` turns off file-backed text, which is the default. If a tech.v3.resource stack context is opened, the file will be deleted when the context closes; else it will be deleted when the gc cleans up the dataset. A shutdown hook is added as a last resort to ensure the file is cleaned up.
- `:n-initial-skip-rows` - Skip N rows initially. This currently may include the header row. Works across both csv and spreadsheet datasets.
- `:parser-type` - Default parser to use if no parser-fn is specified for that column. For csv files, the default parser type is `:string`, which indicates a promotional string parser. For sequences of maps, the default parser type is `:object`. It can be useful in some contexts to use the `:string` parser with sequences of maps or maps of columns.
- `:parser-fn` -
  - `keyword?` - all columns parsed to this datatype. For example: `{:parser-fn :string}`
  - `map?` - `{column-name parse-method}` - parse each column with the specified `parse-method`. The `parse-method` can be:
    - `keyword?` - parse the specified column to this datatype. For example: `{:parser-fn {:answer :boolean :id :int32}}`
    - tuple - pair of `[datatype parse-data]`, in which case a container of type `[datatype]` will be created. `parse-data` can be one of:
      - `:relaxed?` - data will be parsed such that parse failures of the standard parse functions do not stop the parsing process. `:unparsed-values` and `:unparsed-indexes` are available in the metadata of the column; they tell you the values that failed to parse and their respective indexes.
      - `fn?` - function from str to one of `:tech.v3.dataset/missing`, `:tech.v3.dataset/parse-failure`, or the parsed value. Exceptions here always kill the parse process. `:missing` will get marked in the missing indexes, and `:parse-failure` will result in the index being added to missing, and the column's `:unparsed-values` and `:unparsed-indexes` will be updated.
      - `string?` - for datetime types, this will be turned into a DateTimeFormatter via DateTimeFormatter/ofPattern. For `:text` you can specify the backing file to use.
      - `DateTimeFormatter` - use with the appropriate temporal parse static function to parse the value.
      - `map?` - the header-name-or-idx is used to look up the value. If not nil, then the value can be any of the above options. Else the default column parser is used.

Returns a new dataset.
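A few creation sketches covering the input shapes listed above (the `tc` alias and the file path are assumptions):

```clojure
(tc/dataset {:a [1 2 3] :b ["x" "y" "z"]}) ; map of sequences
(tc/dataset [{:a 1 :b "x"} {:a 2 :b "y"}]) ; sequence of maps
(tc/dataset 42)                            ; single value, stored under :$value

;; hypothetical csv file: keywordize column names, force :id to :int32
(tc/dataset "data.csv" {:key-fn    keyword
                        :parser-fn {:id :int32}})
```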
(get-entry ds column row)
Returns a single value from the given column and row.
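For example (assuming the `tc` alias; rows are 0-indexed):

```clojure
(def ds (tc/dataset {:a [10 20 30]}))

(tc/get-entry ds :a 1) ;; => 20
```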
(info ds)
(info ds result-type)
Returns statistical information about the columns of a dataset. `result-type` can be `:descriptive` or `:columns`.
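For example (assuming the `tc` alias):

```clojure
(tc/info ds)          ; descriptive statistics (min, mean, max, ...) per column
(tc/info ds :columns) ; column-level metadata (name, datatype, ...)
```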
(print-dataset ds)
(print-dataset ds options)
Prints dataset to the console. For options, see `tech.v3.dataset.print/dataset-data->str`.
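For example (assuming the `tc` alias; the option shown is an assumed `tech.v3.dataset.print` option):

```clojure
(tc/print-dataset ds)
(tc/print-dataset ds {:print-line-policy :repl})
```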
(rows ds)
(rows ds result-type)
(rows ds result-type {:keys [nil-missing?] :or {nil-missing? true} :as options})
Returns rows of dataset. Result type can be any of:

* `:as-maps` - maps
* `:as-double-arrays` - double arrays
* `:as-seqs` - reader (sequence, default)
* `:as-vecs` - vectors

If you want to elide nils in maps, set the `:nil-missing?` option to `false` (default: `true`).

Another option - `:copying?` - when `true`, row values are copied on read (default: `false`).
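For example (assuming the `tc` alias):

```clojure
(def ds (tc/dataset {:a [1 2] :b [3 nil]}))

(tc/rows ds)                                ; reader of row vectors (default)
(tc/rows ds :as-maps)                       ; missing values present as nils
(tc/rows ds :as-maps {:nil-missing? false}) ; keys with missing values elided
```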
(shape ds)
Returns the shape of the dataset: `[rows, cols]`.
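For example (assuming the `tc` alias):

```clojure
(tc/shape (tc/dataset {:a [1 2 3] :b [4 5 6]})) ;; => [3 2]
```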