datahike-csv-loader

Loads CSV data into Datahike (see also its GitHub repository) with a single function call.

Usage

(require '[datahike.api :as d]
         '[datahike-csv-loader.core :as dcsv])

(d/create-database)
(def conn (d/connect))
(dcsv/load-csv conn "data.csv")

;; or (map contents elided here and described below)
(def col-schema {...})
(dcsv/load-csv conn "data.csv" col-schema)

Reads, parses, and loads data from data.csv into the database pointed to by conn. Schema for the corresponding attributes can optionally be specified in the map col-schema. Each column represents either an attribute, with the keywordized column name as its ident, or an element of a heterogeneous or homogeneous tuple (more on tuples below).
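As a minimal sketch of what "keywordized column name" means, the snippet below turns header cells into attribute idents with Clojure's built-in keyword function. The helper column->ident and the sample row are hypothetical, and the loader's own normalization of header cells may differ.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of datahike-csv-loader: turn a CSV header
;; cell into an attribute ident. The loader's actual rules may differ.
(defn column->ident
  [column-name]
  (keyword (str/trim column-name)))

(def header ["student/name" "student/id"])

;; The part before the slash becomes the keyword namespace.
(def idents (mapv column->ident header))

;; Pair each cell of a (made-up) data row with its column's ident.
(def row (zipmap idents ["Ada" "1"]))
```

Here idents evaluates to [:student/name :student/id], so a column named student/id ends up as the attribute :student/id.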

col-schema expects a set of attribute idents as the value of each key, except for :ref and :tuple, which take maps. The available options are:

| Key | Description |
| --- | --- |
| :unique-id | :db/unique value :db.unique/identity |
| :unique-val | :db/unique value :db.unique/value |
| :index | :db/index value true |
| :cardinality-many | :db/cardinality value :db.cardinality/many |
| :ref | Map of :db/valueType :db.type/ref attribute idents to referenced attribute idents |
| :tuple | Map of :db/valueType :db.type/tuple attribute idents to constituent attribute idents |

Each file is assumed to represent attributes for one entity "type", whether new or existing: e.g. a student with columns student/name, student/id. This also means that attribute data for a single "type" can be loaded from multiple files: for example, another file with columns student/id and student/course can be loaded later. Attribute schema can be partially specified via col-schema: for example, a value of #{:user/email :user/acct-id} for the key :unique-id indicates that the attributes in the set are unique identifiers. That said, except with :db.type/ref and :db.type/tuple, :db/valueType is inferred. Note also that only one cardinality-many attribute is allowed per file for semantic reasons.
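To illustrate how the set-valued keys relate to attribute schema, here is a rough sketch that derives the partial schema a col-schema implies for a single attribute. The helper partial-attr-schema and the attribute :user/role are hypothetical and only mirror the options table; this is not the library's implementation, and :db/valueType is left out because the loader infers it.

```clojure
;; Set-valued col-schema keys and the schema entry each one implies,
;; copied from the options table.
(def option->schema-entry
  {:unique-id        [:db/unique :db.unique/identity]
   :unique-val       [:db/unique :db.unique/value]
   :index            [:db/index true]
   :cardinality-many [:db/cardinality :db.cardinality/many]})

;; Hypothetical helper, not part of the library: collect the schema entries
;; of every option whose set contains the given attribute ident.
(defn partial-attr-schema
  [col-schema ident]
  (reduce-kv (fn [schema option [k v]]
               (if (contains? (get col-schema option #{}) ident)
                 (assoc schema k v)
                 schema))
             {:db/ident ident}
             option->schema-entry))

(def col-schema
  {:unique-id        #{:user/email :user/acct-id}
   :cardinality-many #{:user/role}}) ; :user/role is a made-up attribute

(partial-attr-schema col-schema :user/email)
;; => {:db/ident :user/email, :db/unique :db.unique/identity}
```

An attribute that appears in none of the sets, such as a hypothetical :user/name, yields only {:db/ident :user/name}; everything else about it would be left to inference.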

Ref-valued attributes

Data in a reference-valued attribute column must consist of domain identifier (i.e. an attribute with :db.unique/identity) values for entities already present in the database; these are automatically converted into entity IDs. For example:

(d/transact conn [{:db/ident :course/id
                   :db/unique :db.unique/identity
                   ...}])
(d/transact conn [{:course/id "CMSC101"
                   :course/name "Intro. to CS"}
                   ...])
(dcsv/load-csv conn "students.csv" {:unique-id #{:student/id}
                                    :cardinality-many #{:student/course}
                                    :ref {:student/course :course/id}})
;; values for :student/course will consist of their corresponding course entity IDs 

With CSV contents such as:

| student/id | student/course |
| --- | --- |
| 1 | CMSC101 |
| 1 | MATH101 |
| 1 | MUSI101 |
| 2 | PHYS101 |
| 2 | ... |

Support for loading entity IDs directly can be added if observations of such use cases in the wild are reported.
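To make the conversion described in this section more concrete, the sketch below rewrites each value in a ref column as a lookup ref, i.e. a pair of a unique attribute ident and a value, which is one way a domain identifier can be tied to an existing entity. Whether the loader uses lookup refs or resolves entity IDs by querying is an assumption left open here; the helper with-lookup-refs is hypothetical, and cell values are kept as strings for simplicity.

```clojure
;; Rows as they might look after parsing students.csv (values kept as strings).
(def rows
  [{:student/id "1" :student/course "CMSC101"}
   {:student/id "1" :student/course "MATH101"}
   {:student/id "1" :student/course "MUSI101"}
   {:student/id "2" :student/course "PHYS101"}])

;; The :ref entry of col-schema from the example above.
(def ref-map {:student/course :course/id})

;; Hypothetical helper: replace each ref column's value with a lookup ref
;; of the form [referenced-ident value].
(defn with-lookup-refs
  [ref-map row]
  (reduce-kv (fn [m attr target]
               (if (contains? m attr)
                 (update m attr #(vector target %))
                 m))
             row
             ref-map))

(def resolved-rows (mapv #(with-lookup-refs ref-map %) rows))

(first resolved-rows)
;; => {:student/id "1", :student/course [:course/id "CMSC101"]}
```

This only works if an entity with the given :course/id value already exists, which matches the requirement stated at the start of this section.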

Tuple attributes

First: an introduction to tuples for the uninitiated.

datahike-csv-loader supports the three kinds of tuples available in Datahike (as in Datomic, for which the tuple documentation referenced above is written): composite, heterogeneous, and homogeneous. They should be specified in a map, with each tuple attribute ident as a key and a vector of constituent attribute idents as its value. For example, roughly working off this schema definition and supposing a CSV file with columns student/id, course/id, and semester/year+season:

(def col-schema {:tuple {:reg/semester+course+student
                         [:student/id :course/id :semester/year+season]}
                 ...})

And another, supposing a CSV file including columns station/lat and station/lon:

(def col-schema {:tuple {:station/coordinates [:station/lat :station/lon]}
                 ...})

As with :db/valueType in general, the :db/valueType of each tuple element is inferred. The kind of each tuple is also inferred, using rules illustrated by the following Clojure-ish pseudocode:

(if (tuple-ident (:unique-id col-schema))
  :composite
  (if (->> (tuple-ident (:tuple col-schema))
           (map valtype)
           (apply =))
    :homogeneous
    :heterogeneous))
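The pseudocode above can be turned into a small runnable sketch. The function tuple-kind, the valtype map of inferred value types, and the tuple :student/enrollment are hypothetical stand-ins; in the loader, value types come from inspecting the data rather than from a hand-written map.

```clojure
;; Hypothetical inferred value types per column. This map only stands in
;; for the loader's own type inference.
(def valtype
  {:station/lat :db.type/double
   :station/lon :db.type/double
   :student/id  :db.type/long
   :course/id   :db.type/string})

;; Classify a tuple attribute following the rules in the pseudocode:
;; listed under :unique-id => composite; otherwise homogeneous when all
;; constituent value types are equal, else heterogeneous.
;; Assumes tuple-ident has at least one constituent under :tuple.
(defn tuple-kind
  [col-schema tuple-ident]
  (cond
    (contains? (get col-schema :unique-id #{}) tuple-ident)
    :composite

    (apply = (map valtype (get-in col-schema [:tuple tuple-ident])))
    :homogeneous

    :else
    :heterogeneous))

(tuple-kind {:tuple {:station/coordinates [:station/lat :station/lon]}}
            :station/coordinates)
;; => :homogeneous
```

Under the same rules, a tuple mixing :student/id and :course/id would be classified as :heterogeneous with the types assumed above, and any tuple ident that also appears in the :unique-id set would be classified as :composite.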

Note that the schema definitions of these tuple types imply that columns belonging to a composite tuple will be individually retained as attributes (with the tuple being automatically transacted by the database), while those belonging to the other kinds will be subsumed into their respective tuples. For example, the data in columns station/lat and station/lon would be merged into the tuple attribute :station/coordinates, and the database would not contain the attributes :station/lat and :station/lon.
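The effect on a single row of a heterogeneous or homogeneous tuple can be sketched as follows. The helper merge-tuple-columns, the column :station/id, and the coordinate values are hypothetical; the sketch shows only the shape of the resulting data, not how the loader builds it.

```clojure
;; Hypothetical helper: fold the constituent columns of a non-composite
;; tuple into one tuple value and drop the constituents from the row.
(defn merge-tuple-columns
  [row tuple-ident constituents]
  (-> (apply dissoc row constituents)
      (assoc tuple-ident (mapv row constituents))))

;; A made-up parsed row; :station/id is not part of the tuple.
(def row {:station/id "S1" :station/lat 52.52 :station/lon 13.405})

(def merged
  (merge-tuple-columns row :station/coordinates [:station/lat :station/lon]))
;; merged is {:station/id "S1", :station/coordinates [52.52 13.405]}
```

For a composite tuple no such merge would be needed, since the constituent columns stay in the row as ordinary attributes and the database derives the tuple value itself.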

Current limitations

datahike-csv-loader currently:

  1. Assumes that the columns of any given CSV file represent attributes that do not yet exist in the database, i.e. it isn't possible to add data for existing attributes.
  2. Apart from any specification passed in col-schema to load-csv, automatically infers attribute schema, i.e. complete user specification of the schema isn't supported.
  3. Doesn't support batched loading.

We plan to address these shortcomings, and any others that arise, if they prove to be substantial.

License

Copyright © 2022 Yee Fay Lim.

Distributed under the Eclipse Public License version 1.0.
