Data cleaning and exploration utilities for datajure. Standalone functions that operate on datasets directly and thread naturally. Not part of dt — these complement it for common data preparation tasks.
(clean-column-names dataset)
Clean column names: lowercase, replace spaces and special characters with hyphens, collapse consecutive hyphens, and strip leading and trailing hyphens. "Some Ugly Name!" → :some-ugly-name
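The renaming rules can be sketched for a single name in plain Clojure. This is only an illustration, not datajure's implementation: `clean-name` is a hypothetical helper, and it assumes that any character other than a-z and 0-9 counts as "special" (the docstring does not say, for example, how underscores are treated).

```clojure
(require '[clojure.string :as str])

(defn clean-name
  "Hypothetical helper: applies the documented cleaning rules to one
  column name (string or keyword) and returns a keyword.
  Assumption: anything outside a-z and 0-9 is a special character."
  [col-name]
  (-> (str/lower-case (name col-name))
      ;; A run of spaces/special chars becomes one hyphen, which also
      ;; covers the 'collapse consecutive hyphens' rule.
      (str/replace #"[^a-z0-9]+" "-")
      ;; Strip leading and trailing hyphens.
      (str/replace #"^-+|-+$" "")
      keyword))

(clean-name "Some Ugly Name!") ; → :some-ugly-name
```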
(coerce-columns dataset col-type-map)
Bulk type coercion. col-type-map is {col-kw datatype-kw ...}.
Example: (coerce-columns ds {:year :int64 :mass :float64})
(describe dataset)
(describe dataset cols)
Descriptive statistics for dataset columns. Returns a dataset with one row per column: :column, :datatype, :n, :n-missing, :mean, :sd, :min, :p25, :median, :p75, :max. Non-numeric columns show nil for the statistics. The optional second argument selects columns (a vector of keywords).
(drop-constant-columns dataset)
Remove columns where all values are identical (zero variance). Note: columns with 0 or 1 rows are always kept. A single observation has no variance by definition, but that does not mean the column is constant across observations.
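The constant-column rule, including the 0-or-1-row exception, can be sketched on a plain map of column keyword to vector of values. The names `constant-column?` and `drop-constant` are hypothetical and the map stands in for a real dataset; this is not datajure's code.

```clojure
(defn constant-column?
  "Hypothetical helper: true only when there are at least two values
  and all of them are equal. Columns with 0 or 1 values are never
  reported as constant, mirroring the documented note."
  [values]
  (and (> (count values) 1)
       (apply = values)))

(defn drop-constant
  "Hypothetical helper: removes constant columns from a
  {col-kw [values ...]} map (a stand-in for a dataset)."
  [col-map]
  (into {}
        (remove (fn [[_ values]] (constant-column? values)))
        col-map))

(drop-constant {:country ["NZ" "NZ" "NZ"] :year [2019 2020 2021]})
;; → {:year [2019 2020 2021]}

(drop-constant {:country ["NZ"]})
;; → {:country ["NZ"]}  (one row, so the column is kept)
```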
(duplicate-rows dataset)
(duplicate-rows dataset cols)
Returns a dataset containing only the duplicate rows. The optional second argument specifies the subset of columns to check for duplicates.
(mark-duplicates dataset)
(mark-duplicates dataset cols)
Adds a boolean :duplicate? column. The optional second argument specifies the subset of columns to check for duplicates.
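How the two duplicate helpers relate can be sketched on a vector of row maps. The functions `mark-dups` and `dup-rows` are hypothetical stand-ins, not datajure's implementation. One loud assumption: every row whose key occurs more than once is flagged, including the first occurrence. The docstrings do not say whether the first occurrence counts as a duplicate, so the library may behave differently.

```clojure
(defn mark-dups
  "Hypothetical helper: adds :duplicate? to each row map.
  Assumption: all members of a repeated group are flagged,
  including the first occurrence."
  ([rows] (mark-dups rows nil))
  ([rows cols]
   (let [;; Compare whole rows, or only the selected columns.
         row-key (fn [row] (if (seq cols) (select-keys row cols) row))
         counts  (frequencies (map row-key rows))]
     (mapv (fn [row]
             (assoc row :duplicate? (> (counts (row-key row)) 1)))
           rows))))

(defn dup-rows
  "Hypothetical helper: keeps only the rows flagged by mark-dups
  and drops the temporary :duplicate? flag again."
  ([rows] (dup-rows rows nil))
  ([rows cols]
   (->> (mark-dups rows cols)
        (filter :duplicate?)
        (mapv #(dissoc % :duplicate?)))))

(def rows [{:id 1 :x "a"} {:id 2 :x "a"} {:id 1 :x "a"}])

(mapv :duplicate? (mark-dups rows))      ; → [true false true]
(mapv :duplicate? (mark-dups rows [:x])) ; → [true true true]
(dup-rows rows)                          ; → [{:id 1, :x "a"} {:id 1, :x "a"}]
```

Passing a column subset widens what counts as a duplicate: the second row differs in :id, so it is only flagged when the check is restricted to :x.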