(infer-parsers filename)
(infer-parsers filename parsers-desc)
(infer-parsers filename parsers-desc options)
`parsers-desc` can be used to specify parsers, with the description for each column containing its data type(s) as well as parser function(s).

For a scalar-valued column, this takes the form `[dtype fn]`, which can (currently) be specified in one of these two ways:

- A default data type, say `d`, as shorthand for `[d (d mesalog.parse.parser/default-coercers)]`, with the 2nd element being its corresponding default parser function. The value of `d` must come from `(keys mesalog.parse.parser/default-coercers)`.
- In full, as a two-element tuple of type and (custom) parser, e.g. `[:db.type/long #(long (Float/parseFloat %))]`.

For a vector-valued column (whatever the `:db/valueType` of its corresponding attribute, if any), the following forms are possible:

- `[dtype parse-fn]` (not supported for tuples)
- `[[dt1 dt2 ...]]`, if `dt1` etc. are all data types having default parsers
- `[[dt1 dt2 ...] [pfn1 pfn2 ...]]`, to specify custom parser functions.

`parsers-desc` can be specified as:

- A map with each element consisting of the following:
  - Key: a valid column identifier (see above)
  - Value: a parser description taking the form described above.
- A vector specifying parsers for consecutive columns, starting from the 1st (though not necessarily ending at the last), with each element again being a parser description taking the form above, just like one given as a map value.

Please see test namespace `mesalog.parser-test` for usage examples.
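The scalar forms and the two container forms of `parsers-desc` described above can be sketched in plain Clojure. This is only an illustration: `example-default-coercers` is a hypothetical stand-in for `mesalog.parse.parser/default-coercers` (whose actual keys and functions may differ), `expand-scalar-desc` is a made-up helper rather than part of the library, and the column names are invented.

```clojure
;; Hypothetical stand-in for mesalog.parse.parser/default-coercers;
;; the real map's keys and parser functions may differ.
(def example-default-coercers
  {:db.type/long   #(Long/parseLong %)
   :db.type/double #(Double/parseDouble %)
   :db.type/string identity})

(defn expand-scalar-desc
  "Illustrative helper (not part of mesalog): expands the shorthand `d`
  to `[d parser-fn]` and passes a full `[dtype fn]` tuple through unchanged."
  [desc]
  (if (keyword? desc)
    [desc (get example-default-coercers desc)]
    desc))

;; Custom parser from the docstring: parse as float, then truncate to long.
(def float-str->long #(long (Float/parseFloat %)))

;; Map form: keyed by column identifier (keywords here, names invented).
(def parsers-by-column
  {:age   :db.type/long                      ; shorthand, default parser
   :score [:db.type/long float-str->long]})  ; full form, custom parser

;; Vector form: parsers for consecutive columns, starting from the first.
(def parsers-by-position
  [:db.type/string
   [:db.type/long float-str->long]])

;; Every description normalised to the full [dtype fn] form.
(def expanded-by-column
  (into {} (map (fn [[col desc]] [col (expand-scalar-desc desc)]))
        parsers-by-column))

(def expanded-by-position
  (mapv expand-scalar-desc parsers-by-position))
```

Either `parsers-by-column` or `parsers-by-position` has the shape the docstring describes for the `parsers-desc` argument; the expansion step is shown only to make the shorthand concrete.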
(load-csv filename conn)
(load-csv filename conn parsers-desc)
(load-csv filename conn parsers-desc schema-desc)
(load-csv filename conn parsers-desc schema-desc options)
Reads, parses, and loads data from CSV file named `filename` into a Datahike database via the connection `conn`, with optional specifications in `parsers-desc`, `schema-desc` and `options`.

*Please note that the functionality (API and implementation) documented here, in particular aspects related to schema specification/inference and its interface with parser specification/inference, is still evolving and will undergo changes, possibly breaking, in the future.*

Each column represents an attribute, with keywordized column name as default attribute ident, or otherwise, an element in a tuple. Type and cardinality are automatically inferred, though they sometimes require specification; in particular, cardinality many is only well-defined, and can only be inferred, in the presence of a separate attribute marked as unique (`:db.unique/identity` or `:db.unique/value`).

Please see the docstring for `infer-parsers` for detailed information on `parsers-desc`.

`schema-desc` can be used to specify schema fully or partially for attributes introduced by `filename`. It may be:

1. A map, for partial specification: using schema attributes or schema attribute values as keys, each with a collection of attribute idents or keywordized column names as its corresponding value, in the following forms:

   *Key:* Any of `:db/isComponent`, `:db/noHistory`, and `:db/index`
   *Value:* Set of attribute idents
   *Description:* Denotes a schema attribute value of `true`
   *Example:* `{:db/index #{:name}}` denotes a `:db/index` value of `true` for attribute `:name`

   *Key:* Any element of the sets `:db.type/value`, `:db.type/cardinality`, and `:db.type/unique` from namespace `datahike.schema`, except `:db.type.install/attribute`
   *Value:* Set of attribute idents
   *Description:* The key denotes the corresponding schema attribute value for the attributes named in the value.
   `:db.type/tuple` and `:db.type/ref` attributes have two possible forms of specification. In this form, each attribute must correspond to a self-contained column, i.e. consist of sequences for tuples, and lookup refs or entity IDs for refs. The other form is described below.
   *Examples:*
   - `{:db.type/keyword #{:kw}}` denotes `:db/valueType` `:db.type/keyword` for attribute `:kw`.
   - `{:db.cardinality/many #{:orders}}` denotes `:db/cardinality` `:db.cardinality/many` for `:orders`.
   - `{:db.unique/identity #{:email}}` denotes `:db/unique` `:db.unique/identity` for `:email`.

   *Key:* `:db.type/ref`
   *Value:* Map of ref-type attribute idents to referenced attribute idents
   *Description:* Each key-value pair maps a ref-type attribute to an attribute which uniquely identifies referenced entities
   *Example:* `{:db.type/ref {:parent-station :station-id}}` denotes that the ref-type attribute `:parent-station` references entities with the unique identifier attribute `:station-id`

   *Key:* `:db.type/tuple`
   *Value:* Map of tuple attribute ident to sequence of keywordized column names
   *Description:* Each key-value pair denotes a tuple attribute and the columns representing its elements
   *Example:* `{:db.type/tuple {:abc [:a :b :c]}}` denotes that the tuple attribute `:abc` consists of elements with values represented in columns `:a`, `:b`, and `:c`

   *Key:* `:db.type/compositeTuple` (a keyword not used in Datahike, but that serves here as a shorthand to distinguish composite and ordinary tuples)
   *Value:* Map of composite tuple attribute ident to constituent attribute idents (keywordized column names)
   *Description:* Each key-value pair denotes a composite tuple attribute and its constituent attributes (each corresponding to a column)
   *Example:* `{:db.type/compositeTuple {:abc [:a :b :c]}}`: the composite tuple attribute `:abc` consists of attributes (with corresponding columns) `:a`, `:b`, and `:c`

2. A vector of maps, of the form used for schema specification in Datahike.
   This form is still not well supported: besides `:db/ident`, the required `:db/cardinality` must be specified for each attribute, though the type is inferred if omitted.

Lastly, `options` supports the following keys:

- `:batch-size`: The number of rows to read and transact per batch (default `128,000`).
- `:num-rows`: The number of rows in the CSV file.
- `:separator`: Separator character for CSV row entries. Defaults to `,`.
- `:parser-sample-size`: Number of rows to sample for type (parser) inference. Defaults to `Long/MAX_VALUE`.
- `:vector-delims-use`: Whether vector-valued entries are delimited, e.g. by square brackets (`[]`). Defaults to `true`.
- `:vector-open-char`: Left delimiter for vector values, only applicable if `:vector-delims-use` is `true`. Default: `[`.
- `:vector-close-char`: Right delimiter for vector values, only applicable if `:vector-delims-use` is `true`. Default: `]`.
- `:vector-separator`: Separator character for elements in vector-valued entries, analogous to `:separator` (default `,`) for CSV row entries. Defaults to the same value as that of `:separator`.
- `:include-cols`: Predicate for whether a column should be included in the data load. Columns can be specified using valid index values, strings, or keywords. Example: `#{1 2 3}`.
- `:idx->colname`: Function taking the 0-based index of a column and returning its name. Defaults to the value at the same index of the column header if present, otherwise `(str "column-" idx)`.
- `:colname->ident`: Function taking the name of a column and returning a keyword, based on the convention of each column representing an attribute, and keywordized column name as default attribute ident. The returned value is assumed to be the corresponding ident for each column representing an attribute, though it can also apply to columns for which that is not the case. Defaults to the keywordized column name, with consecutive spaces replaced by a single hyphen.
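As a rough illustration of the optional arguments described above, the following sketch builds a `schema-desc` map and an `options` map as plain data, and mimics the documented defaults of `:idx->colname` and `:colname->ident`. The attribute and column names are invented, combining several keys in one `schema-desc` map is assumed to be allowed, passing the separator as a character is an assumption, and `make-idx->colname` and `example-colname->ident` are hypothetical helpers that approximate the documented behaviour rather than reproduce the library's implementation.

```clojure
(require '[clojure.string :as str])

;; Partial schema specification in map form (names invented for illustration).
(def schema-desc
  {:db/index            #{:name}
   :db.unique/identity  #{:station-id}
   :db.cardinality/many #{:orders}
   :db.type/ref         {:parent-station :station-id}
   :db.type/tuple       {:coords [:lat :lon]}})

;; Hypothetical approximation of the documented :idx->colname default:
;; the header value at that index if present, otherwise "column-<idx>".
(defn make-idx->colname [header]
  (fn [idx]
    (get header idx (str "column-" idx))))

;; Hypothetical approximation of the documented :colname->ident default:
;; keywordized column name, consecutive spaces replaced by a single hyphen.
(defn example-colname->ident [colname]
  (keyword (str/replace colname #" +" "-")))

;; Options map using keys from the list above. Passing the separator as a
;; character is an assumption based on the "separator character" wording.
(def options
  {:batch-size         10000
   :separator          \tab
   :parser-sample-size 1000
   :include-cols       #{0 1 2}
   :colname->ident     example-colname->ident})

;; With a Datahike connection `conn` and a parsers description in hand,
;; these would be passed in the positions shown in the signature:
;; (load-csv "stations.tsv" conn parsers-desc schema-desc options)
```

Since `:include-cols` is described as a predicate, a set such as `#{0 1 2}` works because Clojure sets are callable on their members.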