
tech.ml.dataset.parse

This file really should be named univocity.clj, but it is for parsing and writing csv and tsv data.

all-parsers clj

attempt-general-parse! clj

(attempt-general-parse! add-fn!
                        parse-fn
                        container
                        add-missing-fn
                        unparsed-data
                        unparsed-indexes
                        str-val)

attempt-simple-parse! clj

(attempt-simple-parse! parse-add-fn!
                       simple-parser
                       container
                       add-missing-fn
                       unparsed-data
                       unparsed-indexes
                       relaxed?
                       str-val)

cheap-missing-value-map clj

(cheap-missing-value-map keys missing-value)

convert-reader-to-strings clj

(convert-reader-to-strings input-rdr)

This function has to take into account bad data and just return missing values in the case where a reader conversion fails.


create-csv-parser clj

(create-csv-parser
  {:keys [header-row? num-rows column-whitelist column-blacklist separator
          n-initial-skip-rows max-chars-per-column max-num-columns]
   :or {header-row? true max-chars-per-column (* 64 1024) max-num-columns 8192}
   :as options})

csv->dataset clj

(csv->dataset input)
(csv->dataset input options)

Non-lazily and serially parse the columns.  Returns a vector of maps of
{
 :name column-name
 :missing long-reader of in-order missing indexes
 :data typed reader/writer of data
 :metadata - optional map with unparsed-indexes and unparsed-values
}
Supports a subset of tech.ml.dataset/->dataset options:
:column-whitelist
:column-blacklist
:n-initial-skip-rows
:num-rows
:header-row?
:separator
:parser-fn
:parser-scan-len
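A minimal usage sketch, assuming a local file path; the file name and the "id" and "score" column names are hypothetical:

```clojure
(require '[tech.ml.dataset.parse :as parse])

;; Parse only two whitelisted columns from the first 1000 rows.
(def columns
  (parse/csv->dataset "data.csv"
                      {:header-row? true
                       :num-rows 1000
                       :column-whitelist ["id" "score"]}))

;; Each entry is a map of {:name ... :missing ... :data ... :metadata ...}.
(map :name columns)
```

Note that, per the docstring, the return value is a vector of column maps rather than a dataset object; use tech.ml.dataset/->dataset when you want a dataset directly.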

csv->rows clj

(csv->rows input)
(csv->rows input options)

Given a csv, produces a sequence of rows.  The csv options from ->dataset
apply here.

options:
:column-whitelist - either sequence of string column names or sequence of column
   indices of columns to whitelist.
:column-blacklist - either sequence of string column names or sequence of column
   indices of columns to blacklist.
:num-rows - Number of rows to read
:separator - Add a character separator to the list of separators to auto-detect.
:max-chars-per-column - Defaults to 4096.  Columns with more characters than this
   will result in an exception.
:max-num-columns - Defaults to 8192.  CSV/TSV files with more columns than this
   will fail to parse.  For more information on this option, please visit:
   https://github.com/uniVocity/univocity-parsers/issues/301
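A sketch of walking the raw rows with the options above; the file name, the \; separator, and the 10-row limit are illustrative:

```clojure
(require '[tech.ml.dataset.parse :as parse])

;; Each row is a string array; vec makes it printable.
(doseq [row (parse/csv->rows "data.csv" {:separator \; :num-rows 10})]
  (println (vec row)))
```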

datetime-formatter-parser clj

(datetime-formatter-parser datatype format-string-or-formatter)

default-column-parser clj

(default-column-parser)

default-parser-seq clj

dtype->missing-val clj macro

(dtype->missing-val datatype)

dtype->parse-fn clj macro

(dtype->parse-fn datatype val)

general-parser clj

(general-parser datatype parse-fn)

make-datetime-simple-parser clj macro

(make-datetime-simple-parser datatype)

make-parser clj

(make-parser parser-fn header-row-name scan-rows)
(make-parser parser-fn
             header-row-name
             scan-rows
             default-column-parser-fn
             simple-parser->parser-fn
             datetime-formatter-fn
             general-parser-fn)

on-parse-failure! clj

(on-parse-failure! str-val
                   cur-idx
                   add-missing-fn
                   unparsed-data
                   unparsed-indexes)

PColumnParser clj protocol

column-data clj

(column-data parser)

Return a map containing
{:data - convertible-to-reader column data.
 :missing - convertible-to-reader array of missing values.}

missing! clj

(missing! parser)

Mark a value as missing.


parse! clj

(parse! parser str-val)

Side-effecting parse the value and store it.  Exceptions escaping from here
will stop the parsing system.

PSimpleColumnParser clj protocol

can-parse? clj

(can-parse? parser str-val)

make-parser-container clj

(make-parser-container parser)

simple-missing! clj

(simple-missing! parser container)

simple-parse! clj

(simple-parse! parser container str-val)

raw-row-iterable clj

(raw-row-iterable input)
(raw-row-iterable input parser)

Returns an iterable that produces a map of
{:header-row - string[]
 :rows - iterable producing string[] rows
}

return-parse-data clj

(return-parse-data container missing unparsed-data unparsed-indexes)

rows->dataset clj

(rows->dataset {:keys [header-row? parser-fn parser-scan-len bad-row-policy
                       skip-bad-rows?]
                :or {header-row? true parser-scan-len 100}
                :as options}
               row-seq)

Given a sequence of string[] rows, parse into columnar data.
See csv->columns.
This method is useful if you have another way of generating sequences of
string[] row data.
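A sketch of feeding hand-built string[] rows to the parser, useful when the rows come from somewhere other than a csv file; the data below is invented:

```clojure
(require '[tech.ml.dataset.parse :as parse])

;; Rows built by hand rather than read from a file.
(def rows
  [(into-array String ["name" "age"])
   (into-array String ["ada" "36"])
   (into-array String ["alan" "41"])])

;; Parse into columnar data; the first row is treated as the header.
(parse/rows->dataset {:header-row? true} rows)
```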

rows->n-row-sequences clj

(rows->n-row-sequences row-seq)
(rows->n-row-sequences options row-seq)
(rows->n-row-sequences {:keys [header-row?] :or {header-row? true}} n row-seq)


Used for parallelizing loading of a csv.  Returns N sequences that feed from a single
sequence of rows.  Experimental - not the most effective way of speeding up
loading.

Type-hinting your columns and providing specific parsers for datetime types like:
(ds/->dataset input {:parser-fn {"date" [:packed-local-date "yyyy-MM-dd"]}})
may have a larger effect than parallelization in most cases.

Loading multiple files in parallel will also have a larger effect than
single-file parallelization in most cases.
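A sketch of splitting the row stream four ways and consuming each split on its own thread; the file name is hypothetical, and how the header row is distributed across the returned sequences is not specified here:

```clojure
(require '[tech.ml.dataset.parse :as parse])

;; Split the raw rows into 4 sequences and, e.g., count rows in parallel.
(let [row-seqs (parse/rows->n-row-sequences {:header-row? true} 4
                                            (parse/csv->rows "data.csv"))]
  (reduce + (pmap count row-seqs)))
```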

simple-boolean-parser clj

(simple-boolean-parser)

simple-col-parser clj macro

(simple-col-parser datatype)

simple-encoded-text-parser clj

(simple-encoded-text-parser)

simple-parser->parser clj

(simple-parser->parser parser-kwd-or-simple-parser)
(simple-parser->parser parser-kwd-or-simple-parser relaxed?)

simple-string-parser clj

(simple-string-parser)

simple-text-parser clj

(simple-text-parser)

test-file clj


write! clj

(write! output header-string-array row-string-array-seq)
(write! output
        header-string-array
        row-string-array-seq
        {:keys [separator] :or {separator \tab} :as options})
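A usage sketch assuming a file path is accepted as the output argument; the file name and data are invented, and the options map overrides the default \tab separator:

```clojure
(require '[tech.ml.dataset.parse :as parse])

;; Write a header and two rows as comma-separated values.
(parse/write! "out.csv"
              (into-array String ["name" "age"])
              [(into-array String ["ada" "36"])
               (into-array String ["alan" "41"])]
              {:separator \,})
```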
