CSV parsing now supports a :comment-char option that defaults to #. Lines that begin with this character are ignored.
Fix for issue 304 - n-initial-skip-rows not respected when parsing a csv file.
Experimental fix for issue 305 - replace-missing with :down or :up should leave values missing when the initial replacement fails instead of trying the opposite direction. This may leave datasets with some missing values.
Switched to the new csv processing system in dtype-next for parsing CSVs. This eliminates
a source of more or less unfixable issues regarding univocity and it should be nearly
identical in performance while using less memory. The new system also supports
processing that efficiently allows you to load a CSV into a sequence of datasets based
on row counts.
The univocity-based processing system will still be kept around as there may be files
that load significantly faster, or that only load correctly, with the univocity processing
system.
Upgrade to dtype-next to make (ds/filter-column ds col identity) consistent w/r/t missing
values across numeric and object datatypes.
drop-missing has a 2-arg variant that takes a dataset and column name. This is a much
faster pathway than (ds/filter-column ds col identity) for dropping missing values.
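A minimal sketch of the two-arg variant, assuming a dataset ds with a :price column containing missing values:

(require '[tech.v3.dataset :as ds])

;; drop only the rows where :price is missing - faster than
;; (ds/filter-column ds :price identity)
(ds/drop-missing ds :price)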
New print options and a bug fix for issue 266 - printing.
The first ... last style is now the default as I think it is generally more useful than just first or last.
Skipped a version due to bug in this system.
issue 295 - new-column exported from api had
incorrect signature.
issue 294 - arrow files with lz4 dependent-block
encoding fail with the jpountz (lz4-java) decoder. The only sane resolution here is to use the C lz4 library decoding system
while we work through these issues upstream.
Support for reading/writing csv, tsv, edn, json, bzip2 and zip files. Zip files
are only read when there is a single zip entry in them. bzip2 requires the user
to require tech.v3.dataset.bzip2 in order to work. See namespace documentation.
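For illustration, a hedged sketch of the bzip2 pathway (the file path is hypothetical and this assumes ->dataset dispatches on the .bz2 extension once the namespace is loaded):

;; bzip2 support must be required explicitly before use
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.bzip2])

(ds/->dataset "data/events.csv.bz2")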
column-map is no longer lazy when an explicit datatype is provided. The result
is now generated immediately in parallel. Laziness can be achieved via the
dtype-next emap api along with assoc.
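The lazy alternative mentioned above, sketched with dtype-next's elementwise map plus assoc (column and function names are illustrative):

(require '[tech.v3.datatype :as dtype])

;; emap returns a lazy, typed reader; assoc attaches it as a column
;; without eagerly realizing it
(assoc ds :price-2x (dtype/emap #(* 2.0 %) :float64 (ds :price)))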
Defaulting :strings-as-text? to false for the multiple dataset pathway as
support for delta dictionaries was only recently solidified in the Arrow SDK
itself.
Fixing issue 287 - dataset corrupt after
nippy serialization. This of course had nothing to do with nippy but was caused by a bug in
the dataset->data pathway.
Non-backward-compatible fix to the rolling API's :comp-fn optional argument - the
parameters to the function are reversed so that things like clojure.core/- work.
Small upgrade to dtype-next with a more flexible new-array-of-structs definition
and documentation.
See unit tests
for how to convert an array of structs into a dataset.
Disable automatic file-backed-text because mmap is broken on M1 Macs. The fix for
this is moving to JDK 17, btw, where it is a normal API call that works fine. Don't
expect this to come back.
Intermediate versions before this - lots of micro-optimizations to row-mapcat and
some micro-optimizations to group-by-column-agg.
Support for LocalTime datatype. Parquet and Arrow support this conversion. Arrow files
will read localtime back in as the datatype :time-microseconds. Users can use
:local-time as a parser datatype and there is support for parsing some simple
variations of local-time data.
row-mapcat has an option to produce a sequence of datasets. This flows naturally into
group-by-column-agg. Keep this in mind as it keeps the size of the working set in memory
fairly low.
Main api namespaces are code-generated to ensure discoverability. Namespaces affected are
tech.v3.datatype, tech.v3.datatype.functional, tech.v3.datatype.datetime, tech.v3.dataset,
tech.v3.dataset.metamorph.
clj-kondo bindings and a mostly clean linting pass, failing only in 2 places on my dev machine.
Working with borkdude to deal with the small number of current failures.
Upgrade to latest dtype-next - fix for ternary <,<=,>,>= in dfn namespace.
dtype-next's main api now includes efficient in-place reverse.
reverse-rows - reverse the order of the rows of the dataset.
select-missing - select only rows where one of the columns has a missing value.
The high-performance aggregations in the reductions namespace now support a specialized
filter argument to filter out a row index very late in the process.
min-n-by-column - find the minimum N rows by column - uses Guava's min-max heap under the covers.
Sorting the result is an efficient way to get a sorted top-N-type operation.
Changed the default concatenation pathway to be copying. This often just
works better and results in much faster processing pipelines.
Fixed a few issues with packed datatypes.
Changed extend-column-with-empty so that it copies data. I am less sure about this
change but it fixed an issue with packed datatypes at the cost that joins are often
no longer in place. So if you get OOM errors now doing certain joins this change
is the culprit and we should back off and set it back to what it was.
row-map - map a function across the rows of the dataset (represented as maps). The result
should itself be a map and the dataset created from these maps will be merged back into
the original ds.
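A hedged sketch, assuming the two-argument dataset/function form described above and a keyword :price column:

;; each row arrives as a map; the returned map is merged back into that row
(ds/row-map ds (fn [{:keys [price]}]
                 {:price-log (Math/log price)}))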
tech.v3.dataset.reductions/group-by-column-agg can take a tuple of column names in addition
to a single column name. In the case of a tuple the grouping will be the vector of column
values evaluated in object space (so missing will be nil).
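A hedged sketch of the tuple form, assuming the grouping-columns/aggregation-map/dataset-sequence argument order and reducer constructors like mean and row-count (column names are illustrative):

(require '[tech.v3.dataset.reductions :as ds-reduce])

;; group on the vector of [:symbol :year] values evaluated in object space
(ds-reduce/group-by-column-agg
 [:symbol :year]
 {:n-rows    (ds-reduce/row-count)
  :avg-price (ds-reduce/mean :price)}
 [ds])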
major fix for odd?, even?, etc. in tech.v3.datatype.functional.
head,tail can accept numbers larger than row-count.
dtype-next tech.v3.datatype.functional namespace now has vectorized versions of
sum, dot-product, magnitude-squared, and distance that it will use if the input
is backed by a double array and if jdk.incubator.vector module is enabled.
New accessors - rows, row-at - both work in sequence-of-maps space. -1 indexes for
row-at return data indexed from the end so (row-at ds -1) returns the last dataset
row.
When accessing columns via ifn interface - (col idx), negative numbers index from
the end so for instance -1 retrieves the last value in the column.
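For example, using the stocks dataset from the column-cast example further down:

(ds/rows stocks)        ;; the dataset as a sequence of maps
(ds/row-at stocks -1)   ;; the last row, as a map
((stocks :price) -1)    ;; the last value in the :price column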
Large and potentially destabilizing optimization: in some cases argops/argfilter can
return a range if the filtered region is contiguous, and
new columns then sit on sub-buffers of other columns as opposed to indexed buffers.
A sub-buffer doesn't pay the same indexing costs and is still capable of accessing the
underlying data (such as a double array), whereas an indexed buffer cannot faithfully return
the underlying data. This can dramatically reduce indexing costs for certain operations
and allows System/arraycopy and friends to be used for further operations.
Parquet documentation to address a logging slowdown. If writing parquet files is
unreasonably slow then please read the documentation on logging. The java
parquet implementation logs so much it slows things down 5x-10x.
Data types and missing values are much more aggressively inferred - which is
O(n-rows) - throughout the api. There is a new API to disable the inference: either
pass something that is already a column or pass in a map with the keys
:tech.v3.dataset/data and :tech.v3.dataset/force-datatype?.
Put another way, if the input to #{assoc ds/update-column ds/add-column ds/add-or-update-column} is already a column
(see tech.v3.dataset.column/new-column), or if :tech.v3.dataset/force-datatype? is
true and :tech.v3.dataset/data is convertible to a reader, then the data will not
be scanned for datatype or missing values. If the input data is a primitive-typed
container then it will be scanned for missing values alone; anything else is
passed through the object parsing system, which is what is used for sequences of maps,
maps of sequences and spreadsheets.
In this way the system will in general do more work than before - more scans of the
results of things like transducer pathways and persistent vectors - but in return the
dataset's column datatypes should match the user's expectations. If too much time is
being taken up attempting to infer datatypes and missing sets then the user has
the option to pass in explicitly constructed columns or column data representations,
both of which disable the scanning. Once the data is typed, elementwise
mathematical operations of the type in tech.v3.datatype.functional will not
result in further scans of the data.
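For illustration, a sketch of the two ways to bypass the scans described above; the column name and data are hypothetical, and the keyed-map form assumes assoc accepts it directly as column data:

(require '[tech.v3.dataset.column :as ds-col])

;; 1. pass something that is already a column
(assoc ds :score (ds-col/new-column :score (double-array [1.0 2.0 3.0])))

;; 2. pass a map with the force-datatype key set; :data must be reader-convertible
(assoc ds :score #:tech.v3.dataset{:data (double-array [1.0 2.0 3.0])
                                   :force-datatype? true})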
Itemized Changes:
assoc, ds/add-column, ds/update-column, ds/add-or-update-column type operations all
upgraded such that datatype and missing are inferred much more frequently.
column-map - now scans the result to infer the datatype if one is not provided, as opposed to assuming the result is the
widest of the input column types. Also, users can provide their own function to calculate the missing set, as opposed to
the default behavior of taking the union of the input columns' missing sets.
Issue 233 - Poi xlsx parser can now autodetect dates. Note that fastexcel is the default
xlsx parser so in order to parse xlsx files using poi use tech.v3.libs.poi/workbook->datasets.
PR 232 - Option - :disable-comment-skipping? - to disable comment skipping in csv files.
Using builder model for parquet both for forward compatibility and so we can set an output stream
as opposed to a file path. This allows a graal native pathway to work with parquet.
Graal-native friendly mmap pathways (no requiring-resolve; you have to explicitly set the implementation in your main.clj file).
Parquet write pathway update to make more standard and more likely to work with future versions of parquet. This means, however, that there will
no longer be a direct correlation between number of datasets and number of record batches in a parquet file as the standard pathway takes care
of writing out record batches when a memory constraint is triggered. So if you save a dataset you may get a parquet file back that contains
a sequence of datasets. There are many parquet options, see the documentation for
ds-seq->parquet.
All statistical/reduction summations now use Kahan's compensated summation. This makes summation
much more accurate for very large streams of data.
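Kahan summation carries a running compensation term so that low-order bits lost when adding a small value to a large accumulator are added back on the next step. A standalone sketch of the idea (not the library's internal implementation):

(defn kahan-sum
  "Compensated summation over a sequence of doubles."
  ^double [xs]
  (loop [xs (seq xs) sum 0.0 c 0.0]
    (if xs
      (let [y (- (double (first xs)) c) ;; re-add the error from the previous step
            t (+ sum y)]
        ;; (t - sum) recovers the part of y that made it into t; subtracting y
        ;; leaves the low-order bits that were lost, to be compensated next time.
        (recur (next xs) t (- (- t sum) y)))
      sum)))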
Issue 220 - confusing behavior on dataset creation. This may result in different
behavior than was expected previously when using maps of columns as dataset constructors.
tech.v3.dataset.reductions namespace now includes direct aggregations
including group-by aggregations and also Ted Dunning's t-digest algorithm for
probabilistic cdf and quantile estimation.
group-by is now done with a LinkedHashMap, thus the keys are ordered in terms
of when they are first found in the data. This is useful for operations such as re-indexing
a previously sorted dataset.
Added a large-dataset reduction namespace: tech.ml.dataset.reductions. Currently
very beta, but in general large reductions will reduce to java streams as these have
parallelization possibilities that sequences do not have; for instance you can
get a parallel stream out of a hash map.
tech.libs.arrow/write-dataset-seq-to-stream! - Given a sequence of datasets, write
an arrow stream with one record-batch for each dataset.
tech.libs.arrow/stream->dataset-seq-copying - Given an arrow stream, return a
sequence of datasets, one for each arrow data record.
tech.libs.arrow/stream->dataset-seq-inplace - Given an arrow stream, return a
sequence of datasets constructed in-place on memory mapped data. Expects to be
used within a tech.resource/stack-resource-context but accepts options for
tech.v2.datatype.mmap/mmap-file.
tech.libs.arrow/visualize-arrow-stream - memory-maps a file and returns the arrow
structure in a way that prints nicely to the REPL. Useful for exploring an arrow
file and quickly seeing the low level structure.
tech.ml.dataset/csv->dataset-seq - Given a potentially large csv, parse it into
a sequence of datasets. These datasets are guaranteed to share a schema and so
an efficient form of writing really large arrow files is to use this function
along with tech.libs.arrow/write-dataset-seq-to-stream!.
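A hedged sketch of that combination - the argument order for write-dataset-seq-to-stream! is assumed from its description, and the file paths are hypothetical:

(require '[tech.ml.dataset :as ds]
         '[tech.libs.arrow :as arrow])

;; parse one large csv as a sequence of fixed-schema datasets, then write
;; one arrow record batch per dataset
(let [ds-seq (ds/csv->dataset-seq "big-file.csv")]
  (arrow/write-dataset-seq-to-stream! ds-seq "big-file.arrow"))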
Proper arrow support. In-place or accelerated copy pathway into the jvm.
tech.libs.arrow exposes a few functions to dive through arrow files and
produce datasets. Right now only the stream file format is supported.
Copying is supported via their blessed API. In-place is supported by
a more or less clean room implementation using memory mapped files. There
will be a blog post on this soon.
tech.datatype has a new namespace, tech.v2.datatype.mmap that supports memory
mapping files and direct memory access for address spaces (and files) larger
than the java nio 2GB limit for memory mapping and nio buffers.
Issue 116 - tech.ml.dataset/fill-range-replace - Given a numeric or date column,
interpolate the column such that differences between successive values are smaller
than a given cutoff. Use replace-missing functionality on all other columns
to fill in values for generated rows.
Issue 115 - tech.ml.dataset/replace-missing Subset of replace-missing from
tablecloth implemented.
Datasets implement IPersistentMap. This changes the meaning of (seq dataset):
whereas it used to return columns, it now returns a sequence of map entries.
It does mean, however, that you can destructure datasets in let statements to
get the columns back and use clojure.core/[assoc,dissoc], contains?, etc.
Some of the core Clojure functions, such as select-keys, will change your dataset
into a normal clojure persistent map, so beware.
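For example, with the stocks dataset used further down:

;; destructure columns directly in a let binding
(let [{:keys [symbol price]} stocks]
  [(first symbol) (first price)])

(contains? stocks :price)              ;; => true
(dissoc stocks :date)                  ;; dataset without the :date column
(type (select-keys stocks [:price]))   ;; a plain persistent map, not a dataset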
There is a new parse type: :encoded-text. When read, this will appear to be a
string column however the user has a choice of encodings and utf-8 is the default.
This is useful when you need a particular encoding for a column. It is roughly
twice as efficient by default as a normal string encoding (utf-8 vs. utf-16).
nth, map on packed datetime columns (or using them as functions)
returns datetime objects as opposed to their packed values. This means that if you
ask a packed datetime column for an object reader you get back an unpacked value.
Better support of nth. Columns cache the generic reader used for nth queries
and all tech.v2.datatype readers support nth and count natively in base java
interface implementations.
New namespace - tech.ml.dataset.text.bag-of-words that contains code to convert
a dataset with a text field into a dataset with document ids and a
document-id->token-idx dataset.
Include logback-classic as a dependency as smile.math brings in slf4j and this causes
an error if some implementation of slf4j isn't included thus breaking things like
cljdoc.
Experimental options for parsing text (:encoded-text) when dealing with
large text fields.
Bugfix release - We now do not ever parse to float32 numbers by default. This was
silently causing data loss. The cost of this is that files are somewhat larger and
potentially we need to have an option to set the default sequence of datatypes
attempted during data parsing.
Added concat-copying. This is much faster when you want to concatenate many
things at the cost of copying the data and thus potentially increasing the working
set size in memory.
Saving to nippy is much faster because there is a new function to efficiently
construct a string table from a reader of strings:
tech.ml.dataset.string-table/string-table-from-strings.
Issue-94 - Ragged csv data loads automatically now.
Issue-87 - Printing double numbers is much better.
Fixed saving tsv files - was writing out csv files.
Fixed writing packed datatypes - was writing integers.
Added parallelized loading of csv - helps a bit but only if parsing
is really expensive, so only when lots of datetime types or something
of that nature.
tech.datatype now supports persistent vectors made via clojure.core.vector-of.
vector-of is a nice middle ground between raw persistent vectors and java arrays
and may be a simple path for many users into typed storage and datasets.
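For example, a typed vector flows straight into a typed column (a minimal sketch):

;; vector-of produces a persistent vector backed by unboxed storage
(ds/->dataset {:a (vector-of :double 1.0 2.0 3.0)
               :b (vector-of :long 4 5 6)})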
Upgraded smile to latest version (2.4.0). This is a very new API so if
you are relying transitively on smile via dataset this may have broken your
systems. Smile 1.4.X and smile 2.X are very different interfaces so this
is important to get in before releasing a 2.0 version of dataset.
There is now an efficient conversion to/from smile dataframes.
->dataset converts a smile dataframe to a dataset.
dataset->smile-dataframe converts a dataset to a smile dataframe.
Columns that are reader based will be copied into java arrays. To enable
predictable behavior a new function was added.
ensure-array-backed - ensure each column in the dataset has a zero-copy
conversion to a java array enabled by tech.v2.datatype/->array.
invert-string->number - The pipeline function string->number stores a string
table in the column metadata. Using this metadata, invert the string->number
operation returning the column back to its original state. This metadata is
:label-map which is a map from column-data to number.
UUIDs are now supported as datatypes. This includes parsing them from strings
out of csv and xlsx files and as a fully supported object in mapseq pathways.
tech.v2.datatype was causing double-read on boolean readers.
issue-72 - added max-num-columns because csv and tsv files with more than 512
columns were failing to parse. New default is 8192.
issue-70 - The results of any join have two maps in their metadata -
:left-column-names - map of original left-column-name->new-column-name.
:right-column-names - map of original right-column-name->new-column-name.
issue-67 - Various tech.v2.datatype.functional functions are updated to be
more permissive about their inputs and cast the result to the appropriate
datatype.
issue-65 - datetimes in mapseqs were partially broken.
tech.v2.datatype.functional will now change the datatype appropriately on a
lot of unary math operations. So for instance calling sin, cos, log, or log1p
on an integer reader will now return a floating point reader. These methods used
to throw.
A subtle bug in the ->reader method defined for object arrays meant that sometimes
attempting math on object columns would fail.
tech.ml.dataset/column-cast - Changes the column datatype via an optionally
provided cast function. This function is powerful - it will correctly convert
packed types to their string representation, it will use the parsing system on
string columns and it uses the same complex datatype argument as
tech.ml.dataset.column/parse-column:
user> (doc ds/column-cast)
-------------------------
tech.ml.dataset/column-cast
([dataset colname datatype])
Cast a column to a new datatype. This is never a lazy operation. If the old
and new datatypes match and no cast-fn is provided then dtype/clone is called
on the column.
colname may be a scalar or a tuple of [src-col dst-col].
datatype may be a datatype enumeration or a tuple of
[datatype cast-fn] where cast-fn may return either a new value,
the :tech.ml.dataset.parse/missing, or :tech.ml.dataset.parse/parse-failure.
Exceptions are propagated to the caller. The new column has at least the
existing missing set if no attempt returns :missing or :cast-failure.
:cast-failure means the value gets added to metadata key :unparsed-data
and the index gets added to :unparsed-indexes.
If the existing datatype is string, then tech.ml.datatype.column/parse-column
is called.
Casts between numeric datatypes need no cast-fn but one may be provided.
Casts to string need no cast-fn but one may be provided.
Casts from string to anything will call tech.ml.dataset.column/parse-column.
user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
#'user/stocks
user> (ds/head stocks)
test/data/stocks.csv [53]:
| :symbol | :date | :price |
|---------+------------+--------|
| MSFT | 2000-01-01 | 39.81 |
| MSFT | 2000-02-01 | 36.35 |
| MSFT | 2000-03-01 | 43.22 |
| MSFT | 2000-04-01 | 28.37 |
| MSFT | 2000-05-01 | 25.45 |
user> (take 5 (stocks :price))
(39.81 36.35 43.22 28.37 25.45)
user> (take 5 ((ds/column-cast stocks :price :string) :price))
("39.81" "36.35" "43.22" "28.37" "25.45")
user> (take 5 ((ds/column-cast stocks :price [:int32 #(Math/round (double %))]) :price))
(40 36 43 28 25)
user>
renamed 'column-map' to 'column-name->column-map'. This is a public interface change
and we do apologize!
added 'column-map' which maps a function over one or more columns. The result column
has a missing set that is the union of the input columns' missing sets.
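A hedged sketch - the exact signature (result column name, mapping function, then a vector of source columns) is assumed from the description above:

;; :summed is missing wherever :a or :b is missing
(ds/column-map ds :summed
               (fn [a b] (+ a b))
               [:a :b])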
[issue-64] - more tests revealed more problems with concat with different column
types.
added tech.v2.datatype/typed-reader-map where the result datatype is derived
from the input datatypes of the input readers. The result of map-fn is
unceremoniously coerced to this datatype.
Cleaned up the tech.datatype widen datatype code so it models a proper type graph
with clear unification rules (where the parents are equal, else :object).
[issue-64] - concat columns with different datatypes does a widening. In addition,
there are tested pathways to change the datatype of a column without changing the
missing set.
unroll-column takes an optional argument :indexes? that will record the source
index in the entry the unrolled data came from.
tech.v2.datatype namespace has a new function - make-reader - that reifies
a reader of the appropriate type. This allows you to make new columns that have
nontrivial translations and datatypes much more easily than before.
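For example - a minimal sketch, assuming the element index is exposed as idx inside the read expression as in later dtype-next versions:

(require '[tech.v2.datatype :as dtype])

;; a lazily computed, strongly typed reader of 5 doubles
(def squares (dtype/make-reader :float64 5 (* idx idx 1.0)))
(nth squares 3) ;; => 9.0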
tech.v2.datatype namespace has a new function - ->typed-reader - that typecasts the incoming object into a reader of the appropriate datatype.
This means that .read calls will be strongly typed and is useful for building up a set
of typed variables before using make-reader above.
Issue 52 - CSV columns with empty column names get named after their index. Before they would cause
an exception.
tech.datatype added a method
to transform a reader into a persistent-vector-like object that derives from
clojure.lang.APersistentVector and thus gains benefit from the excellent equality
and hash semantics of persistent vectors.
Fixed #57 - BREAKING PUBLIC API CHANGES - We are getting more strict on the API - if
a function is dataset-last (thus appropriate for ->>) then any options must be
passed before the dataset. Same is true for the set of functions that are dataset
first. We will be more strict about this from now on.
Parsing datetime types now works if the column starts with missing values.
An efficient formulation of java.util.Map is introduced for when you have
a bitmap of keys and a single value:
tech.v2.datatype.bitmap/bitmap-value->bitmap-map. This is used for
replace-missing type operations.
brief now does not return missing values. Double or float NaN or INF values
from a mapseq result in maps with fewer keys.
Set of columns used for default descriptive stats is reduced to original set as
this fits on a small repl nicely. Possible to override. brief overrides this
to provide defaults to get more information.
unique-by returns indexes in order.
Fixed #51 - mapseq parsing now follows proper number tower.
Optimized filter. Record of optimization is on
zulip.
Synopsis is a speedup of like 10-20X depending on how much work you want to do :-).
The base filter pathway has a speedup of around 2-4X.
Updated descriptive stats to provide a list of distinct elements for categorical
columns of length less than 21.
Updated mapseq system to provide nil values for missing data as opposed to the
specific column datatype's missing value indicator. This can be overridden
by passing in :missing-nil? false as an option.
Added brief function to main namespace so you can get a nice brief description
of your dataset when working from the REPL. This prints out better than
descriptive-stats.
column->dataset - map a transform function over a column and return a new
dataset from the result. It is expected the transform function returns a map.
drop-rows, select-rows, drop-columns - more granular select calls.
append-columns - append a list of columns to a dataset. Used with column->dataset.
column-labeled-mapseq - Create a sequence of maps with :value and :label members.
This flattens the dataset by producing Y maps per row instead of 1 map per row,
where the maps themselves are labeled with the value in their :value member. This
is useful for building vega charts.
->distinct-by-column - take the first row where a given key is present. The arrow
form of this indicates the dataset is the first argument.
->sort-by, ->sort-by-column - Forms of these functions for using in (->)
dataflows.
interpolate-loess - Produce a new column from a given pair of columns using loess
interpolation to create the column. The interpolator is saved as metadata on the
new column.
Support for parsing and working with durations. Strings that look like times -
"00:00:12" will be parsed into hh:mm:ss durations. The value can have a negative
sign in front. This is in addition to the duration's native serialization string
type.
Added short test for tensors in datasets. This means that the venerable print-table
is no longer enough as it doesn't account for multiline strings and thus datasets
with really complex things will not print correctly for a time.
Various fixes related to parsing and working with open data.
tech.ml.dataset.column/parse-column - given a string column that failed to parse for
some reason, you can force the system to attempt to parse it using, for instance,
relaxed parsing semantics where failures simply record the failure in metadata.
Relaxed parsing in general is supported across all input types.