Loads CSV data into Datahike with a single function call.
(require '[datahike.api :as d]
         '[datahike-csv-loader.core :as dcsv])

(dcsv/load-csv "data.csv")
;; or
(def cfg {:store ...})
(dcsv/load-csv "data.csv" cfg)
;; or (map contents elided here and described below)
(dcsv/load-csv "data.csv" cfg {:schema [{:db/ident :name
                                         ...}
                                        ...]
                               :ref-map {...}
                               :tuple-map {...}
                               :composite-tuple-map {...}})
;; or (ditto)
(dcsv/load-csv "data.csv" cfg {:schema {:unique-id #{...}
                                        ...}
                               :ref-map {...}
                               :tuple-map {...}
                               :composite-tuple-map {...}})
Reads, parses, and loads data from data.csv into the Datahike database with the (optionally specified) config cfg, with likewise optional schema-related options for the corresponding attributes. Each column represents either an attribute, with the keywordized column name as its ident, or an element in a heterogeneous or homogeneous tuple (more on tuples below).
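To make "keywordized column name" concrete, here is a minimal sketch in plain Clojure. The helper column->ident is hypothetical and is not part of datahike-csv-loader's API; the loader's own normalisation of header cells may differ.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not the library's API): turns a CSV header cell
;; into an attribute ident by trimming whitespace and keywordizing it.
(defn column->ident [column-name]
  (keyword (str/trim column-name)))

(def header ["student/name" "student/id"])

(mapv column->ident header)
;; => [:student/name :student/id]
```

Note that a slash in the column name becomes the namespace separator of the resulting keyword.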
If cfg is omitted, and the last argument:
1. is also omitted, or specifies none of :schema, :ref-map, and :composite-tuple-map, cfg is inferred to be {:schema-flexibility :read}.
2. specifies at least one of :schema, :ref-map, and :composite-tuple-map, cfg is inferred to be {}, i.e. the default.
:schema
in the last argument can be specified in one of two ways:
1. Full specification, i.e. as Datahike transaction data.
2. Partial specification, via a map similar to the one returned by datahike.api/reverse-schema, albeit with slightly different keys, each having a set of attribute idents as the corresponding value. Available options:
Key | Description |
---|---|
:unique-id | :db/unique value :db.unique/identity |
:unique-val | :db/unique value :db.unique/value |
:index | :db/index value true |
:cardinality-many | :db/cardinality value :db.cardinality/many |
Ref- and tuple-valued attributes, i.e. those with :db/valueType :db.type/ref or :db.type/tuple, are however specified separately, via :ref-map, :tuple-map, or :composite-tuple-map, each a map as follows:
Key | Description |
---|---|
:ref-map | :db.type/ref attribute idents to referenced attribute idents |
:composite-tuple-map | Composite :db.type/tuple attribute idents to constituent attribute idents |
:tuple-map | Other (homogeneous, heterogeneous) :db.type/tuple attribute idents to constituent attribute idents |
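The correspondence between partial-specification keys and ordinary schema entries can be sketched as follows. This is only an illustration built from the first table above: expand-partial-schema and partial-key->schema-entry are hypothetical names, and the loader's actual expansion (including inferred value types and defaults) is not shown.

```clojure
;; Hypothetical mapping from partial-specification keys to the schema
;; attribute/value pair each one stands for, following the table above.
(def partial-key->schema-entry
  {:unique-id        [:db/unique :db.unique/identity]
   :unique-val       [:db/unique :db.unique/value]
   :index            [:db/index true]
   :cardinality-many [:db/cardinality :db.cardinality/many]})

(defn expand-partial-schema
  "Returns a map of attribute ident -> schema map built from a partial
  specification. Value types and other defaults are left out."
  [partial-schema]
  (reduce (fn [acc [k idents]]
            (let [[attr v] (partial-key->schema-entry k)]
              (reduce (fn [m ident]
                        (update m ident merge {:db/ident ident, attr v}))
                      acc
                      idents)))
          {}
          partial-schema))

(expand-partial-schema {:unique-id        #{:student/id}
                        :cardinality-many #{:student/course}})
;; => {:student/id     {:db/ident :student/id, :db/unique :db.unique/identity}
;;     :student/course {:db/ident :student/course, :db/cardinality :db.cardinality/many}}
```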
Each file is assumed to represent attributes for one entity "type", whether new or existing: e.g. a student with columns student/name and student/id. This also means that attribute data for a single "type" can be loaded from multiple files: for example, another file with columns student/id and student/course can be loaded later. As indicated above, :schema can take two forms: full specification as Datahike transaction data, or partial specification via a map similar to the one returned by d/reverse-schema. For example, a value of #{:user/email :user/account-id} for the key :unique-id indicates that the attributes in the set are unique identifiers. Unspecified schema attribute values are either defaults or inferred from the data given: for instance, :db/valueType is inferred for all attributes except those of type :db.type/ref and :db.type/tuple. Note also that only one cardinality-many attribute is allowed per file for semantic reasons. Examples in the rest of this document use the partial specification style for brevity.
load-csv also handles data for attributes already present in the schema, e.g. if a file with identical or overlapping column names was loaded earlier. In that case the corresponding columns should be left out of :schema, although they would be excluded anyway from any schema transaction before the data proper is loaded. That said, this behaviour hasn't yet been tested, so caution is advised.
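The exclusion described above amounts to filtering schema entries by ident. A minimal sketch, assuming the set of existing idents has already been obtained from the database; new-attributes-only is a hypothetical helper, not the loader's own code.

```clojure
(defn new-attributes-only
  "Hypothetical helper: drops schema entries whose :db/ident is already
  defined, so that only new attributes reach the schema transaction."
  [existing-idents schema]
  (remove #(contains? existing-idents (:db/ident %)) schema))

(new-attributes-only
 #{:student/id}
 [{:db/ident :student/id, :db/unique :db.unique/identity}
  {:db/ident :student/course, :db/cardinality :db.cardinality/many}])
;; => ({:db/ident :student/course, :db/cardinality :db.cardinality/many})
```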
Data in a reference-valued attribute column must consist of domain identifier values (i.e. values of an attribute with :db.unique/identity) for entities already present in the database; these are automatically converted into entity IDs. For example:
(d/transact conn [{:db/ident :course/id
                   :db/unique :db.unique/identity
                   ...}])
(d/transact conn [{:course/id "CMSC101"
                   :course/name "Intro. to CS"}
                  ...])
(dcsv/load-csv "students.csv" cfg {:schema {:unique-id #{:student/id}
                                            :cardinality-many #{:student/course}}
                                   :ref-map {:student/course :course/id}})
;; values for :student/course will consist of their corresponding course entity IDs
With CSV contents such as:
student/id | student/course |
---|---|
1 | CMSC101 |
1 | MATH101 |
1 | MUSI101 |
2 | PHYS101 |
2 | ... |
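Conceptually, rows like those above collapse into one entity per student, with the course identifiers replaced by entity IDs. The sketch below illustrates that idea in plain Clojure only: rows->entities is a hypothetical helper, and course-id->eid is a made-up lookup table standing in for the database (the entity IDs in it are invented for illustration).

```clojure
;; Parsed rows, as they might look after keywordizing the header.
(def rows
  [{:student/id 1 :student/course "CMSC101"}
   {:student/id 1 :student/course "MATH101"}
   {:student/id 1 :student/course "MUSI101"}
   {:student/id 2 :student/course "PHYS101"}])

;; Made-up entity IDs standing in for a database lookup on :course/id.
(def course-id->eid
  {"CMSC101" 101, "MATH101" 102, "MUSI101" 103, "PHYS101" 104})

(defn rows->entities
  "Hypothetical helper: merges rows sharing the same id-attr value into one
  entity map, resolving each ref-attr value with resolve-ref."
  [rows id-attr ref-attr resolve-ref]
  (->> rows
       (group-by id-attr)
       (sort-by key)
       (mapv (fn [[id group]]
               {id-attr  id
                ref-attr (into #{} (map (comp resolve-ref ref-attr)) group)}))))

(rows->entities rows :student/id :student/course course-id->eid)
;; => [{:student/id 1, :student/course #{101 102 103}}
;;     {:student/id 2, :student/course #{104}}]
```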
Support for loading entity IDs directly may be added if such use cases are reported.
First: an introduction to tuples for the uninitiated.
load-csv can load data from multiple columns into any of the three kinds of tuples available in Datahike (as in Datomic, for which the documentation just linked to is written): composite, heterogeneous, and homogeneous. Composite tuples are automatically created and transacted by the database when their constituent attributes are transacted (and retained independent of the tuple attribute); heterogeneous and homogeneous tuples consist instead of user-created vectors, with no independent constituent attributes.
In addition to :schema specification as necessary, the ident of each tuple and the columns belonging to it need to be specified. This should be done via :composite-tuple-map or :tuple-map as appropriate, with key-value pairs each consisting of a tuple attribute ident and a vector of corresponding attribute idents (for composite tuples) or keywordized column names (for other tuple types). Whether each entry in :tuple-map is homogeneous or heterogeneous is inferred from the value types of the corresponding column data.
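The homogeneous-versus-heterogeneous distinction can be sketched as a check on the element value types. The helper tuple-kind is hypothetical and only illustrates the rule; it assumes the element types have already been determined, and it uses the :db/tupleType and :db/tupleTypes schema attributes for the two cases.

```clojure
(defn tuple-kind
  "Hypothetical helper: given the value types of a tuple's elements,
  returns the schema entry describing it as homogeneous (all elements
  share one type) or heterogeneous (types differ)."
  [element-types]
  (if (apply = element-types)
    {:db/tupleType (first element-types)}
    {:db/tupleTypes (vec element-types)}))

(tuple-kind [:db.type/double :db.type/double])
;; => {:db/tupleType :db.type/double}

(tuple-kind [:db.type/string :db.type/long])
;; => {:db/tupleTypes [:db.type/string :db.type/long]}
```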
For example, roughly working off this schema definition, to create a composite tuple attribute from attributes (columns) student/id, course/id, and semester/year+season:
(dcsv/load-csv "data.csv" cfg {:composite-tuple-map
                               {:reg/semester+course+student [:student/id :course/id :semester/year+season]}
                               ...})
This results in four separate attributes; a full :schema specification should include all of them (whereas a partial specification should of course only feature attributes as needed).
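For orientation, the tuple attribute's own entry in a full specification might look roughly like the sketch below. The :db/cardinality value is an assumption, and the entries for the three constituent attributes are deliberately omitted because their value types depend on the data.

```clojure
;; Sketch of the composite tuple attribute's schema entry.
;; :db/cardinality here is an assumption, not taken from the original text.
(def composite-attr
  {:db/ident       :reg/semester+course+student
   :db/valueType   :db.type/tuple
   :db/tupleAttrs  [:student/id :course/id :semester/year+season]
   :db/cardinality :db.cardinality/one})

;; Idents that a full :schema specification would need entries for:
;; the tuple attribute itself plus each of its constituents.
(def required-idents
  (into [(:db/ident composite-attr)] (:db/tupleAttrs composite-attr)))
;; => [:reg/semester+course+student :student/id :course/id :semester/year+season]
```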
Another example, of creating a homogeneous tuple using columns station/lat and station/lon:
(dcsv/load-csv "data.csv" cfg {:tuple-map {:station/coordinates [:station/lat :station/lon]}
                               ...})
Here, the data in these columns are merged and transacted only as :station/coordinates; a full :schema specification need only include the tuple attribute.
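The merging step can be pictured as replacing the constituent columns of each row with a single vector-valued attribute. A minimal sketch in plain Clojure follows; merge-tuple-columns is a hypothetical helper, and the :station/id column and the coordinate values are made up for illustration.

```clojure
(defn merge-tuple-columns
  "Hypothetical helper: for each entry in tuple-map, removes the listed
  columns from the row and adds one attribute holding their values as a
  vector, in the order given."
  [tuple-map row]
  (reduce-kv (fn [r tuple-ident cols]
               (-> (apply dissoc r cols)
                   (assoc tuple-ident (mapv row cols))))
             row
             tuple-map))

(merge-tuple-columns
 {:station/coordinates [:station/lat :station/lon]}
 {:station/id "S1" :station/lat 1.5 :station/lon 103.8})
;; => {:station/id "S1", :station/coordinates [1.5 103.8]}
```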
As implied above, the :db/valueType of tuple elements is inferred unless specified in :schema.
datahike-csv-loader currently doesn't support:
We plan to address these shortcomings, and any others that arise, if they prove to be substantial.
Copyright © 2022 Yee Fay Lim.
Distributed under the Eclipse Public License version 1.0.