* `dtype-next` - `tech.v3` namespaces across the board as opposed to `tech.v2` along with `tech.ml`.
* The SLF4J log level can be adjusted via:

```clojure
(tech.ml.dataset.utils/set-slf4j-log-level :info)
```
* `tech.ml.dataset.reductions` - Currently very beta, but in general large reductions will reduce to Java streams, as these have parallelization possibilities that sequences do not have; for instance, you can get a parallel stream out of a hash map.
* `tech.ml.dataset/csv->dataset-seq` - Fixed an issue where the input stream was closed before the sequence was completely consumed.
* `tech.libs.arrow/write-dataset-seq-to-stream!` - Given a sequence of datasets, write an arrow stream with one record-batch for each dataset.
* `tech.libs.arrow/stream->dataset-seq-copying` - Given an arrow stream, return a sequence of datasets, one for each arrow data record.
* `tech.libs.arrow/stream->dataset-seq-inplace` - Given an arrow stream, return a sequence of datasets constructed in-place on memory-mapped data. Expects to be used within a `tech.resource/stack-resource-context` but accepts options for `tech.v2.datatype.mmap/mmap-file`.
* `tech.libs.arrow/visualize-arrow-stream` - Memory-maps a file and returns the arrow structure in a way that prints nicely at the REPL. Useful for exploring an arrow file and quickly seeing the low-level structure.
* `tech.ml.dataset/csv->dataset-seq` - Given a potentially large csv, parse it into a sequence of datasets. These datasets are guaranteed to share a schema, so an efficient way to write really large arrow files is to use this function along with `tech.libs.arrow/write-dataset-seq-to-stream!`.
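For instance, a minimal sketch of streaming a large csv into a single arrow file. The argument order for `write-dataset-seq-to-stream!` and the namespace aliasing are assumptions here; check the namespace docs before relying on it:

```clojure
(require '[tech.ml.dataset :as ds]
         '[tech.libs.arrow :as arrow])

;; Lazily parse the csv into a sequence of schema-sharing datasets.
(def ds-seq (ds/csv->dataset-seq "big-file.csv"))

;; Write one record-batch per dataset (argument order assumed).
(arrow/write-dataset-seq-to-stream! ds-seq "big-file.arrow")
```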
* `tech.ml.dataset/fill-range-replace` - Given a numeric or date column, interpolate the column such that differences between successive values are smaller than a given cutoff. Use replace-missing functionality on all other columns to fill in values for the generated rows.
* `tech.ml.dataset/replace-missing` - A subset of replace-missing from tablecloth is implemented.
* Datasets now implement `IPersistentCollection`: `(seq dataset)` now returns a sequence of map entries, whereas it used to return columns. It does mean, however, that you can destructure datasets in `let` statements to get the columns back and use `clojure.core/assoc`, `clojure.core/dissoc`, `contains?`, etc. Some of the core Clojure functions, such as `select-keys`, will change your dataset into a normal Clojure persistent map, so beware.
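A small sketch of the map-like behavior described above (the column names are hypothetical):

```clojure
(require '[tech.ml.dataset :as ds])

(def ds-ex (ds/->dataset [{:a 1 :b 2} {:a 3 :b 4}]))

;; (seq ds-ex) yields map entries, so map-style destructuring recovers columns
(let [{:keys [a b]} ds-ex]
  [(vec a) (vec b)])
;; => [[1 3] [2 4]]

(contains? ds-ex :a)                 ;; => true
(ds/column-names (dissoc ds-ex :b))  ;; => (:a)

;; beware: select-keys hands back a plain persistent map, not a dataset
(type (select-keys ds-ex [:a]))
```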
* `:encoded-text` - When read, this will appear to be a string column; however, the user has a choice of encodings and utf-8 is the default. This is useful when you need a particular encoding for a column. It is roughly twice as efficient by default as a normal string encoding (utf-8 vs. utf-16).
* `nth` and `map` on packed datetime columns (or using them as functions) return datetime objects as opposed to their packed values. This means that if you ask a packed datetime column for an object reader you get back an unpacked value.
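For example (a sketch; the printed value is illustrative and the csv path matches the example further below):

```clojure
(def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))

;; :date is stored packed, but nth hands back a java.time object
(nth (stocks :date) 0)
;; => a java.time.LocalDate such as 2000-01-01, not the packed integer value
```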
* `nth` - Columns cache the generic reader used for nth queries, and all `tech.v2.datatype` readers support nth and count natively in their base Java interface implementations.
* `tech.ml.dataset.text.bag-of-words` - Contains code to convert a dataset with a text field into a dataset with document ids and a document-id->token-idx dataset.
* `left-join-asof` - Implementation of algorithms from pandas' `merge_asof`.
* `concat-copying` - This is much faster when you want to concatenate many things, at the cost of copying the data and thus potentially increasing the working set size in memory.
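A minimal sketch, assuming `concat-copying` takes datasets as varargs the way `tech.ml.dataset/concat` does:

```clojure
(def a (ds/->dataset [{:x 1} {:x 2}]))
(def b (ds/->dataset [{:x 3} {:x 4}]))

;; eager, copying concatenation; faster than the lazy concat for many inputs
(ds/concat-copying a b)
```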
* `tech.ml.dataset.string-table/string-table-from-strings`.
* `tech.datatype` now supports persistent vectors made via `clojure.core/vector-of`. `vector-of` is a nice middle ground between raw persistent vectors and Java arrays and may be a simple path for many users into typed storage and datasets.
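For instance, a sketch of building a dataset from `vector-of` columns (the column names are hypothetical):

```clojure
;; vector-of gives unboxed, typed storage while remaining a persistent vector
(def typed-ds
  (ds/->dataset {:xs (vector-of :double 1.0 2.0 3.0)
                 :ys (vector-of :long 4 5 6)}))

;; the typed storage should be reflected in the column metadata (e.g. :float64)
(meta (typed-ds :xs))
```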
* `->dataset` breaking changes - There is now an efficient conversion to/from smile dataframes.
* `->dataset` converts a smile dataframe to a dataset.
* `dataset->smile-dataframe` converts a dataset to a smile dataframe. Columns that are reader based will be copied into Java arrays. To enable predictable behavior a new function was added.
* `ensure-array-backed` - Ensure each column in the dataset has a zero-copy conversion to a Java array enabled by `tech.v2.datatype/->array`.
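A sketch of the intended flow, assuming `ensure-array-backed` lives in the main `tech.ml.dataset` namespace and takes just the dataset:

```clojure
(require '[tech.v2.datatype :as dtype])

(def arr-ds (ds/ensure-array-backed (ds/->dataset [{:a 1 :b 2.0} {:a 3 :b 4.0}])))

;; columns can now be handed to Java code as primitive arrays without copying
(dtype/->array (arr-ds :a))
```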
* `invert-string->number` - The pipeline function `string->number` stores a string table in the column metadata. Using this metadata, invert the string->number operation, returning the column back to its original state. This metadata is `:label-map`, which is a map from column-data to number.
* `dtype/clone` works correctly for arrays.
* `tech.ml.dataset.column/scan-data-for-missing` fix.
* `dataset->str`.
* `tech.ml.dataset.column/stats` was wrong for columns with missing values.
* `tech.v2.datatype` was causing double-read on boolean readers.
* `max-num-columns` - Added because csv and tsv files with more than 512 columns were failing to parse. The new default is 8192.
* `n-initial-skip-rows` works with xlsx spreadsheets.
* `assoc` and `dissoc` are implemented in the main dataset namespace.
* `tech.v2.datatype.functional` functions are updated to be more permissive about their inputs and cast the result to the appropriate datatype.
* `tech.v2.datatype.functional` will now change the datatype appropriately on a lot of unary math operations. So, for instance, calling sin, cos, log, or log1p on an integer reader will now return a floating-point reader. These methods used to throw.
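For example (a small sketch; the exact result container may differ):

```clojure
(require '[tech.v2.datatype.functional :as dfn])

;; integer input now yields a floating-point reader instead of throwing
(dfn/sin [0 1 2])
;; => roughly (0.0 0.8414709848078965 0.9092974268256817)
```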
* `tech.ml.dataset/column-cast` - Changes the column datatype via an optionally provided cast function. This function is powerful - it will correctly convert packed types to their string representation, it will use the parsing system on string columns, and it uses the same complex datatype argument as `tech.ml.dataset.column/parse-column`:
```clojure
user> (doc ds/column-cast)
-------------------------
tech.ml.dataset/column-cast
([dataset colname datatype])
Cast a column to a new datatype. This is never a lazy operation. If the old
and new datatypes match and no cast-fn is provided then dtype/clone is called
on the column.
colname may be a scalar or a tuple of [src-col dst-col].
datatype may be a datatype enumeration or a tuple of
[datatype cast-fn] where cast-fn may return either a new value,
the :tech.ml.dataset.parse/missing, or :tech.ml.dataset.parse/parse-failure.
Exceptions are propagated to the caller. The new column has at least the
existing missing set if no attempt returns :missing or :cast-failure.
:cast-failure means the value gets added to metadata key :unparsed-data
and the index gets added to :unparsed-indexes.
If the existing datatype is string, then tech.ml.datatype.column/parse-column
is called.
Casts between numeric datatypes need no cast-fn but one may be provided.
Casts to string need no cast-fn but one may be provided.
Casts from string to anything will call tech.ml.dataset.column/parse-column.
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [5 3]:
| :symbol | :date | :price |
|---------+------------+--------|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user> (take 5 (stocks :price))
(39.81 36.35 43.22 28.37 25.45)
user> (take 5 ((ds/column-cast stocks :price :string) :price))
("39.81" "36.35" "43.22" "28.37" "25.45")
user> (take 5 ((ds/column-cast stocks :price [:int32 #(Math/round (double %))]) :price))
(40 36 43 28 25)
user>
user> (-> (ds/->dataset [{:a 1} {:b 2.0} {:a 2 :b 3.0}])
(ds/column-map
:summed
(fn ^double [^double lhs ^double rhs]
(+ lhs rhs))
:a :b))
_unnamed [3 3]:
| :a | :b | :summed |
|----+-------+---------|
| 1 | | |
| | 2.000 | |
| 2 | 3.000 | 5.000 |
user> (tech.ml.dataset.column/missing
(*1 :summed))
#{0,1}
```
* `tech.v2.datatype/typed-reader-map`, where the result datatype is derived from the input datatypes of the input readers. The result of map-fn is unceremoniously coerced to this datatype:

```clojure
user> (-> (ds/->dataset [{:a 1.0} {:a 2.0}])
(ds/update-column
:a
#(dtype/typed-reader-map (fn ^double [^double in]
(if (< in 2.0) (- in) in))
%)))
_unnamed [2 1]:
| :a |
|--------|
| -1.000 |
|  2.000 |
```
* `unroll-column` takes an optional argument `:indexes?` that will record the source index in the entry the unrolled data came from.
* `.addAll`.
* `tech.datatype` - all readers are marked as sequential.
* `unroll-column` - Given a column that may contain either iterable or scalar data, unroll it so it only contains scalar data, duplicating rows.
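For example (a minimal sketch):

```clojure
(ds/unroll-column (ds/->dataset [{:a 1 :b [2 3]}
                                 {:a 2 :b [4 5]}])
                  :b)
;; => a 4-row dataset where :a is duplicated and :b holds only scalars
```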
* `tech.ml.dataset.column/unique` and especially `tech.ml.dataset.pipeline/string->number`.
* The `tech.v2.datatype` namespace has a new function - `make-reader` - that reifies a reader of the appropriate type. This allows you to make new columns that have nontrivial translations and datatypes much more easily than before.
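For instance, a sketch assuming `make-reader` takes a datatype, an element count, and a body expression with an implicit `idx` binding:

```clojure
(require '[tech.v2.datatype :as dtype])

;; reify a typed reader of 5 float64 elements computed from the index
(dtype/make-reader :float64 5 (* 2.0 idx))
;; => behaves like [0.0 2.0 4.0 6.0 8.0]
```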
* The `tech.v2.datatype` namespace has a new function - `->typed-reader` - that typecasts the incoming object into a reader of the appropriate datatype. This means that `.read` calls will be strongly typed, which is useful for building up a set of typed variables before using `make-reader` above.
* `tech.datatype` added a method to transform a reader into a persistent-vector-like object that derives from `clojure.lang.APersistentVector` and thus benefits from the excellent equality and hash semantics of persistent vectors.
* `columnwise-concat` - A far simpler version of dplyr's https://tidyr.tidyverse.org/reference/pivot_longer.html. This is implemented efficiently in terms of indexed reader concatenation and as such should work on tables of any size.
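A minimal sketch (the names of the generated columns are assumptions here):

```clojure
(def wide (ds/->dataset [{:name "a" :q1 1 :q2 2}
                         {:name "b" :q1 3 :q2 4}]))

;; pivot the :q1/:q2 columns into long form
(ds/columnwise-concat wide [:q1 :q2])
;; => a 4-row dataset holding the pivoted column name and its value,
;;    with :name duplicated for each row
```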
* For functions that take the dataset last (and are thus appropriate for `->>`), any options must be passed before the dataset. The same is true for the set of functions that are dataset-first. We will be more strict about this from now on.
* `tech.v2.datatype.bitmap/bitmap-value->bitmap-map` - This is used for replace-missing type operations.
* `brief` now does not return missing values. Double or float NaN or INF values from a mapseq result in maps with fewer keys.
* `brief` overrides this to provide defaults to get more information.
* `unique-by` returns indexes in order.
* `->>` operators.
* `tech.datatype` with upgraded and fewer dependencies.
* `:missing-nil?` false as an option.
* `brief` function added to the main namespace so you can get a nice brief description of your dataset when working from the REPL. This prints out better than `descriptive-stats`.
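For example (a sketch; the exact keys in the output maps are illustrative):

```clojure
(ds/brief (ds/->dataset [{:a 1 :b 2.0} {:a 3 :b 4.5}]))
;; => a sequence of per-column summary maps, e.g. minimum, maximum, mean
```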
* `->` versions of sort added so you can sort in `->` pathways.
* `column->dataset` - Map a transform function over a column and return a new dataset from the result. It is expected the transform function returns a map.
* `drop-rows`, `select-rows`, `drop-columns` - more granular select calls.
* `append-columns` - Append a list of columns to a dataset. Used with `column->dataset`.
* `column-labeled-mapseq` - Create a sequence of maps with `:value` and `:label` members. This flattens the dataset by producing Y maps per row instead of 1 map per row, where the maps themselves are labeled with the value in their `:value` member. This is useful for building vega charts.
* `->distinct-by-column` - Take the first row where a given key is present. The arrow form of this indicates the dataset is the first argument.
* `->sort-by`, `->sort-by-column` - Forms of these functions for use in `(->)` dataflows.
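A small sketch of the threading-friendly forms, assuming the dataset-first argument order that the arrow prefix indicates:

```clojure
(-> (ds/->dataset [{:sym "a" :price 3.0}
                   {:sym "b" :price 1.0}])
    (ds/->sort-by-column :price))
;; => the same dataset with rows ordered by :price
```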
* `interpolate-loess` - Produce a new column from a given pair of columns using loess interpolation to create the column. The interpolator is saved as metadata on the new column.
* `tech.ml.dataset.column/parse-column` - Given a string column that failed to parse for some reason, you can force the system to attempt to parse it using, for instance, relaxed parsing semantics where failures simply record the failure in metadata.