
tech.ml.dataset.parse

This file really should be named univocity.clj, but its purpose is parsing and
writing CSV and TSV data.
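
A minimal usage sketch, assuming the alias below and a local file named
example.csv (both hypothetical, and assuming a file path is an acceptable input):

;; Hypothetical alias and file name, for illustration only.
(require '[tech.ml.dataset.parse :as parse])

(def columns (parse/csv->dataset "example.csv" {:header-row? true}))
;; columns is a vector of {:name ... :missing ... :data ...} maps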

create-csv-parser

(create-csv-parser
  {:keys [header-row? num-rows column-whitelist column-blacklist separator
          n-initial-skip-rows max-chars-per-column max-num-columns]
   :or {header-row? true max-chars-per-column (* 64 1024) max-num-columns 8192}
   :as options})

Create an implementation of the univocity CSV parser.
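
A sketch of constructing a parser with non-default limits; the specific option
values here are illustrative, not recommendations:

;; Skip two leading rows and raise the per-column character limit.
(def parser
  (create-csv-parser {:header-row? true
                      :n-initial-skip-rows 2
                      :max-chars-per-column (* 128 1024)}))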

csv->dataset

(csv->dataset input)
(csv->dataset input options)

Non-lazily and serially parse the columns.  Returns a vector of maps of
{
 :name column-name
 :missing long-reader of in-order missing indexes
 :data typed reader/writer of data
 :metadata - optional map with unparsed-indexes and unparsed-values
}
Supports a subset of tech.ml.dataset/->dataset options:
:column-whitelist
:column-blacklist
:n-initial-skip-rows
:num-rows
:header-row?
:separator
:parser-fn
:parser-scan-len
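
For example (the file name and column names are hypothetical; the options are
the documented subset above):

;; Parse only three columns from the first 1000 rows.
(csv->dataset "example.csv"
              {:column-whitelist ["id" "date" "amount"]
               :num-rows 1000
               :header-row? true})
;; => [{:name "id" :missing ... :data ...} ...]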

csv->rows

(csv->rows input)
(csv->rows input options)

Given a csv, produces a sequence of rows.  The csv options from ->dataset
apply here.

options:
:column-whitelist - either sequence of string column names or sequence of column
   indices of columns to whitelist.
:column-blacklist - either sequence of string column names or sequence of column
   indices of columns to blacklist.
:num-rows - Number of rows to read
:separator - Add a character separator to the list of separators to auto-detect.
:max-chars-per-column - Defaults to 4096.  Columns with more characters than this
   will result in an exception.
:max-num-columns - Defaults to 8192.  CSV/TSV files with more columns than this
   will fail to parse.  For more information on this option, please visit:
   https://github.com/uniVocity/univocity-parsers/issues/301
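
For example (hypothetical file; each produced row is a string array):

;; Print the first five rows of a tab-separated file.
(doseq [row (csv->rows "example.tsv" {:separator \tab
                                      :num-rows 5})]
  (println (vec row)))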

default-parsers


dtype->missing-val (macro)

(dtype->missing-val datatype)

dtype->parse-fn (macro)

(dtype->parse-fn datatype val)

make-datetime-simple-parser (macro)

(make-datetime-simple-parser datatype)

PColumnParser (protocol)

column-data

(column-data parser)

Return a map containing
{:data - convertible-to-reader column data.
 :missing - convertible-to-reader array of missing values.}

missing!

(missing! parser)

Mark a value as missing.

parse!

(parse! parser str-val)

Side-effecting parse the value and store it.  Exceptions escaping from here
will stop the parsing system.
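
A sketch of a custom parser satisfying this protocol.  The ArrayList-based
accumulation is an illustrative assumption; the parsers in this namespace use
typed readers/writers instead:

;; Illustrative only: uppercases values and records missing indexes.
(defn upper-string-parser
  []
  (let [data (java.util.ArrayList.)
        missing (java.util.ArrayList.)]
    (reify PColumnParser
      (parse! [_ str-val]
        (.add data (clojure.string/upper-case str-val)))
      (missing! [_]
        ;; Record the index of the missing value, then pad the data.
        (.add missing (.size data))
        (.add data ""))
      (column-data [_]
        {:data data :missing missing}))))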

PSimpleColumnParser (protocol)

can-parse?

(can-parse? parser str-val)

make-parser-container

(make-parser-container parser)

simple-missing!

(simple-missing! parser container)

simple-parse!

(simple-parse! parser container str-val)

raw-row-iterable

(raw-row-iterable input)
(raw-row-iterable input parser)

Returns an iterable that produces a map of
{:header-row - string[]
 :rows - iterable producing string[] rows}
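
A sketch of consuming the result, reading the docstring above as: iterating
yields a map with :header-row and :rows (file name hypothetical):

;; Grab the header and the first data row.
(let [{:keys [header-row rows]} (first (raw-row-iterable "example.csv"))]
  {:header (vec header-row)
   :first-row (some-> (first rows) vec)})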

rows->dataset

(rows->dataset {:keys [header-row? parser-fn parser-scan-len bad-row-policy
                       skip-bad-rows?]
                :or {header-row? true parser-scan-len 100}
                :as options}
               row-seq)

Given a sequence of string[] rows, parse into columnar data.
See csv->columns.
This method is useful if you have another way of generating sequences of
string[] row data.
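
This makes it possible to parse in-memory data, for example (the row contents
here are illustrative):

;; Build columnar data from string[] rows produced by hand.
(rows->dataset {:header-row? true}
               [(into-array String ["name" "age"])
                (into-array String ["alice" "30"])
                (into-array String ["bob" "41"])])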

rows->n-row-sequences

(rows->n-row-sequences row-seq)
(rows->n-row-sequences options row-seq)
(rows->n-row-sequences {:keys [header-row?] :or {header-row? true}} n row-seq)

Used for parallelizing loading of a csv.  Returns N sequences that feed from a
single sequence of rows.  Experimental - not the most effective way of speeding
up loading.

Type-hinting your columns and providing specific parsers for datetime types like:
(ds/->dataset input {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})
may have a larger effect than parallelization in most cases.

Loading multiple files in parallel will also have a larger effect than
single-file parallelization in most cases.
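
For example (file name and sequence count are illustrative):

;; Split the row stream four ways for parallel consumption.
(def row-seqs
  (rows->n-row-sequences {:header-row? true} 4
                         (csv->rows "big-file.csv")))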

simple-boolean-parser

(simple-boolean-parser)

simple-col-parser (macro)

(simple-col-parser datatype)

simple-encoded-text-parser

(simple-encoded-text-parser)
(simple-encoded-text-parser encoder)

simple-string-parser

(simple-string-parser)

simple-text-parser

(simple-text-parser)

test-file


write!

(write! output header-string-array row-string-array-seq)
(write! output
        header-string-array
        row-string-array-seq
        {:keys [separator] :or {separator \tab} :as options})
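
For example (output path and row data are hypothetical; per the signature above,
\tab is the default separator):

;; Write a two-column TSV.
(write! "out.tsv"
        (into-array String ["id" "name"])
        [(into-array String ["1" "alice"])
         (into-array String ["2" "bob"])]
        {:separator \tab})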
