mesalog.api

Clojure only.

infer-parsers
load-csv

infer-parsers^clj

(infer-parsers filename)

(infer-parsers filename parsers-desc)

(infer-parsers filename parsers-desc options)

parsers-desc can be used to specify parsers, with the description for each column containing its data type(s) as well as parser function(s).

For a scalar-valued column, this takes the form [dtype fn], which can (currently) be specified in one of these two ways:

- A default data type, say d, as shorthand for [d (d mesalog.parse.parser/default-coercers)], with the 2nd element being its corresponding default parser function. The value of d must come from (keys mesalog.parse.parser/default-coercers).
- In full, as a two-element tuple of type and (custom) parser, e.g. [:db.type/long #(long (Float/parseFloat %))].
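
As a small illustration of the full form, the sketch below binds the example tuple from the text to a name and applies its parser function to a raw string value. The names long-via-float and parse-fn are hypothetical and only serve the example; nothing here calls into mesalog.

```clojure
;; Full form: a two-element tuple of data type and custom parser function.
;; (`long-via-float` is a hypothetical name used only for this sketch.)
(def long-via-float
  [:db.type/long #(long (Float/parseFloat %))])

;; The second element is an ordinary function of the raw string value.
(def parse-fn (second long-via-float))

(parse-fn "42.0") ; => 42
```

The shorthand form would simply be the data type keyword on its own, provided it is one of (keys mesalog.parse.parser/default-coercers).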

For a vector-valued column (whatever the :db/valueType of its corresponding attribute, if any), the following forms are possible:

- [dtype parse-fn] (not supported for tuples)
- [[dt1 dt2 ...]], if dt1 etc. are all data types having default parsers
- [[dt1 dt2 ...] [pfn1 pfn2 ...]], to specify custom parser functions
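
The three shapes above might look as follows when written out. This is only a sketch of the data layout: the var names are hypothetical, and it assumes that :db.type/long and :db.type/string are among the types with default parsers, which should be checked against (keys mesalog.parse.parser/default-coercers).

```clojure
;; Form 1: a single data type with a single parser function
;; (per the text, not supported for tuples).
(def same-type-desc
  [:db.type/long #(Long/parseLong %)])

;; Form 2: a vector of data types only, relying on default parsers.
;; Assumption: both types have default parsers.
(def default-parsers-desc
  [[:db.type/long :db.type/string]])

;; Form 3: a vector of data types plus a matching vector of custom parsers.
(def custom-parsers-desc
  [[:db.type/long :db.type/string]
   [#(Long/parseLong %) identity]])
```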

parsers-desc can be specified as:

- A map with each element consisting of the following:
  - Key: a valid column identifier (see above)
  - Value: a parser description taking the form described above.
- A vector specifying parsers for consecutive columns, starting from the 1st (though not necessarily ending at the last), with each element again being a parser description of the form above, just like one given as a map value.
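
To make the two container shapes concrete, here is a hedged sketch of a map-style and a vector-style parsers-desc. The column names (:age, :height), the file name in the commented call, and the use of keywords as column identifiers are assumptions made for illustration only.

```clojure
;; Map form: column identifier -> parser description.
;; Assumption: keywordized column names are valid column identifiers.
(def parsers-by-column
  {:age    :db.type/long                                  ; shorthand: default parser
   :height [:db.type/double #(Double/parseDouble %)]})   ; full form: custom parser

;; Vector form: descriptions for the 1st and 2nd columns, in order.
;; Any remaining columns are left unspecified.
(def parsers-by-position
  [:db.type/string
   [:db.type/long #(long (Float/parseFloat %))]])

;; Hypothetical call, shown as a comment because it needs mesalog on the
;; classpath and a real file:
;; (mesalog.api/infer-parsers "people.csv" parsers-by-column)
```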

Please see test namespace mesalog.parser-test for usage examples.

load-csv^clj

(load-csv filename conn)

(load-csv filename conn parsers-desc)

(load-csv filename conn parsers-desc schema-desc)

(load-csv filename conn parsers-desc schema-desc options)

Reads, parses, and loads data from CSV file named filename into a Datahike database via the connection conn, with optional specifications in parsers-desc, schema-desc and options.

Please note that the functionality (API and implementation) documented here, in particular aspects related to schema specification/inference and its interface with parser specification/inference, is still evolving and will undergo changes, possibly breaking, in the future.

Each column represents either an attribute, with the keywordized column name as the default attribute ident, or an element in a tuple. Type and cardinality are automatically inferred, though they sometimes require specification; in particular, cardinality many is only well-defined, and can only be inferred, in the presence of a separate attribute marked as unique (:db.unique/identity or :db.unique/value).
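
The dependence of cardinality many on a unique attribute can be illustrated with a small sketch. It is not mesalog's implementation; it only shows the idea that rows must first be grouped into entities by some unique attribute before one can ask whether an attribute takes several values per entity. The sample rows and the function many-valued? are hypothetical.

```clojure
;; Hypothetical parsed rows: :email plays the role of the unique attribute.
(def rows
  [{:email "a@example.com" :orders 1}
   {:email "a@example.com" :orders 2}
   {:email "b@example.com" :orders 3}])

(defn many-valued?
  "True if any entity (the rows sharing one value of `unique-attr`)
  has more than one distinct value of `attr`."
  [rows unique-attr attr]
  (->> rows
       (group-by unique-attr)
       vals
       (some (fn [entity-rows]
               (> (count (distinct (map attr entity-rows))) 1)))
       boolean))

(many-valued? rows :email :orders) ; => true
(many-valued? rows :email :email)  ; => false
```

Without a unique attribute there is no way to tell that the first two rows describe the same entity, which is why the question is not well-defined in that case.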

Please see the docstring for infer-parsers for detailed information on parsers-desc.

schema-desc can be used to specify schema fully or partially for attributes introduced by filename. It may be:

1. A map, for partial specification: using schema attributes or schema attribute values as keys, each with a collection of attribute idents or keywordized column names as its corresponding value, in the following forms:

Key: Any of :db/isComponent, :db/noHistory, and :db/index
Value: Set of attribute idents
Description: Denotes a schema attribute value of true
Example: {:db/index #{:name}} denotes a :db/index value of true for attribute :name

Key: Any element of the sets :db.type/value, :db.type/cardinality, and :db.type/unique from namespace datahike.schema, except :db.type.install/attribute
Value: Set of attribute idents
Description: The key denotes the corresponding schema attribute value for the attributes named in the value. :db.type/tuple and :db.type/ref attributes have two possible forms of specification. In this form, each attribute must correspond to a self-contained column, i.e. consist of sequences for tuples, and lookup refs or entity IDs for refs. The other form is described below.
Examples:
{:db.type/keyword #{:kw}} denotes :db/valueType :db.type/keyword for attribute :kw.
{:db.cardinality/many #{:orders}} denotes :db/cardinality :db.cardinality/many for :orders.
{:db.unique/identity #{:email}} denotes :db/unique :db.unique/identity for :email.

Key: :db.type/ref
Value: Map of ref-type attribute idents to referenced attribute idents
Description: Each key-value pair maps a ref-type attribute to an attribute which uniquely identifies referenced entities
Example: {:db.type/ref {:parent-station :station-id}} denotes that the ref-type attribute :parent-station references entities with the unique identifier attribute :station-id

Key: :db.type/tuple
Value: Map of tuple attribute ident to sequence of keywordized column names
Description: Each key-value pair denotes a tuple attribute and the columns representing its elements
Example: {:db.type/tuple {:abc [:a :b :c]}} denotes that the tuple attribute :abc consists of elements with values represented in columns :a, :b, and :c

Key: :db.type/compositeTuple (a keyword not used in Datahike, but that serves here as a shorthand to distinguish composite and ordinary tuples)
Value: Map of composite tuple attribute ident to constituent attribute idents (keywordized column names)
Description: Each key-value pair denotes a composite tuple attribute and its constituent attributes (each corresponding to a column)
Example: {:db.type/compositeTuple {:abc [:a :b :c]}} denotes that the composite tuple attribute :abc consists of attributes (with corresponding columns) :a, :b, and :c

2. A vector of maps, of the form used for schema specification in Datahike. This form is not yet well supported: besides :db/ident, :db/cardinality must be specified for each attribute, though the type is inferred if omitted.
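
The two schema-desc shapes can be sketched as plain data. The map below merely combines the individual examples from the descriptions above into one value, on the assumption that several keys may appear in the same map; the vector follows the usual Datahike schema layout with the type left out. The var names are hypothetical.

```clojure
;; Map form (partial specification), assembled from the examples above.
;; Assumption: several forms may be combined in a single map.
(def schema-desc-map
  {:db/index            #{:name}
   :db.type/keyword     #{:kw}
   :db.cardinality/many #{:orders}
   :db.unique/identity  #{:email}
   :db.type/ref         {:parent-station :station-id}
   :db.type/tuple       {:abc [:a :b :c]}})

;; Vector-of-maps form: :db/ident and :db/cardinality are given for every
;; attribute, while :db/valueType is omitted and left to inference.
(def schema-desc-vector
  [{:db/ident       :email
    :db/cardinality :db.cardinality/one
    :db/unique      :db.unique/identity}
   {:db/ident       :orders
    :db/cardinality :db.cardinality/many}])
```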

Lastly, options supports the following keys:

- :batch-size: The number of rows to read and transact per batch. Defaults to 128,000.
- :num-rows: The number of rows in the CSV file.
- :separator: Separator character for CSV row entries. Defaults to ,.
- :parser-sample-size: Number of rows to sample for type (parser) inference. Defaults to Long/MAX_VALUE.
- :vector-delims-use: Whether vector-valued entries are delimited, e.g. by square brackets ([]). Defaults to true.
- :vector-open-char: Left delimiter for vector values, only applicable if :vector-delims-use is true. Defaults to [.
- :vector-close-char: Right delimiter for vector values, only applicable if :vector-delims-use is true. Defaults to ].
- :vector-separator: Separator character for elements in vector-valued entries, analogous to :separator for CSV row entries. Defaults to the value of :separator.
- :include-cols: Predicate for whether a column should be included in the data load. Columns can be specified using valid index values, strings, or keywords. Example: #{1 2 3}.
- :idx->colname: Function taking the 0-based index of a column and returning its name. Defaults to the value at the same index of the column header if present, otherwise (str "column-" idx).
- :colname->ident: Function taking the name of a column and returning a keyword, following the convention that each column represents an attribute whose default ident is the keywordized column name. The returned value is assumed to be the ident for each column representing an attribute, though it can also apply to columns for which that is not the case. Defaults to the keywordized column name, with consecutive spaces replaced by a single hyphen.
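
An options map using some of these keys might look like the sketch below. The concrete values, the function colname->ident-sketch, and the file name in the commented call are hypothetical; the function is only a re-creation of the default behaviour as described above, not the library's own code, and passing separators as characters is an assumption based on the wording "separator character".

```clojure
(require '[clojure.string :as str])

(defn colname->ident-sketch
  "Keywordizes a column name, replacing each run of spaces with one hyphen."
  [colname]
  (keyword (str/replace colname #" +" "-")))

(def options
  {:batch-size       50000                     ; rows per transaction batch
   :separator        \tab                      ; assumption: a character value
   :vector-separator \,                        ; would otherwise default to :separator
   :include-cols     #{1 2 3}                  ; a set works as a predicate on indices
   :idx->colname     (fn [idx] (str "column-" idx))
   :colname->ident   colname->ident-sketch})

(colname->ident-sketch "first name") ; => :first-name

;; Hypothetical call, commented out because it needs mesalog, Datahike, and
;; an open connection `conn`; the empty maps for parsers-desc and
;; schema-desc are an assumption:
;; (mesalog.api/load-csv "stations.tsv" conn {} {} options)
```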
