Source: Clojurians Zulip #tech.ml.dataset.dev > "index structures in Columns - scope" thread
Date: Late 2024 (December)
Discussion about index structures for tablecloth.time, exploring whether we need tree-based structures (like pandas DateTimeIndex) or simpler approaches for time-based slicing and lookups. The conversation was prompted by two needs:
Quote:
"You only need tree structures if you are adding values ad-hoc or removing them - usually with datasets we aren't adding/removing rows but rebuilding the index all at once. Just sorting the dataset and using binary search will outperform most/all tree structures in this scenario as it is faster to sort than to construct trees. Binary search performs as well as any tree search for queries and range queries."
Rationale:
Performance validation (Harold):
"I'm actually getting surprisingly good performance out of
java.util.Collections/binarySearch+tech.v3.dataset/row-map, I can decorate the example dataset above in less than 1s (and the real-world dataset that has more than 1M rows also at a rate greater than 1M rows/s)."
For categorical/hash-based lookups (e.g., lookup by ID):
(defn obtain-index [ds colname]
(let [m (ds/group-by-column->indexes ds colname)]
(fn [v]
(map (partial ds/row-at ds) (get m v)))))
(def lookup (obtain-index ds :id))
(lookup "a") ; returns rows where :id == "a"
Properties:
group-by-column->indexes already exists in tech.ml.datasetFor ordered/temporal lookups (e.g., time slicing):
Approach:
(tc/select-rows ds (range start-idx end-idx))Key operations:
java.util.Collections/binarySearch or custom binary search for lower/upper bounds:sorted? true optimization hint)Based on Chris's guidance and Harold's validation:
What we need:
What we DON'T need:
index-by)Problem: 1M row time series dataset; for each row, find the index of the row closest to one week later.
Solution: Binary search per row: >1M rows/s performance using java.util.Collections/binarySearch
Takeaway: Binary search is fast enough even when doing 1M searches on 1M rows.
Harold discovered that pandas indexing has type inconsistencies - the type returned by a lookup varies based on the number of matching elements:
s = pd.Series(range(3), index=list("aab"))
type(s.loc["a"]) # <class 'pandas.core.series.Series'> (2 matches)
type(s.loc["b"]) # <class 'numpy.int64'> (1 match)
Similarly for DataFrames:
type(df.loc["a"]) # <class 'pandas.core.frame.DataFrame'> (multiple matches)
type(df.loc["b"]) # <class 'pandas.core.series.Series'> (single match)
Our decision: Avoid this inconsistency. Our indexing functions should return consistent types.
Harold noted:
"I think indexing in particular is a bit of a sprawling topic, and there isn't yet a great consensus on what to do for it from a functional data science perspective."
"I have been reading Pandas' source in this area, and it's a bit bonkers. I would estimate there is about as much code in Pandas doing indexing stuff as there is all of the code in TMD"
The removal was part of simplification - keeping tech.ml.dataset focused on core functionality, with extensions built outside.
Daniel suggested an advanced technique for searching multiple sorted values in a sorted dataset simultaneously, where searchers "learn from each other":
Papers mentioned:
Concept: When searching for many sorted values (e.g., t+week for all t), maintain multiple binary search ranges that collapse when pointers collide. Could be beneficial when values are dense.
Status: Interesting optimization for future, but standard binary search is sufficient for now.
jsa mentioned interval trees as potentially useful for grouping/considering intervals as "the same."
Harold mentioned Postgres BRIN indexes (Block Range Index) as useful for time series:
"For time series we've had great experience w/ Postgres' BRIN indexes"
Status: Not needed for initial implementation given binary search performance.
Ethan clarified what tablecloth.time was about:
The indexing mechanism was secondary; the real value was making time operations concise and idiomatic.
Harold outlined a pragmatic approach to feature development:
"Our policy is always to implement the first version of something inside a specific leaf (client) project that needs it. Then, if that thing is later needed elsewhere, it can be lifted into a library that those projects share."
"What I'd really like to see is examples of powerful uses of indexing in Pandas that lead to enviable solutions to actual problems. Then we can look at those and see what features they imply we need."
Implication for tablecloth.time: Start simple (explicit column args, binary search), iterate based on real use cases.
Will Cohen was working on geospatial index structures in parallel. Discussion touched on:
:geometry vs :object)add-object-datatype!satisfies? (Chris: "unnecessarily slow")Relevance: Demonstrates dtype-next's extensibility for custom spatial types, but not directly applicable to time indexing.
Pandas approach:
Our approach (tablecloth.time):
(slice-by ds :time-col start end)tech.v3.datatype.argops/arggroupds/group-by-column->indexes, ds/row-atjava.util.Collections/binarySearch and java.util.Arrays/binarySearchBased on the discussion, here's what we should implement:
index-by) per tablecloth consistency{:sorted? true} optimization hintjava.util.Collections/binarySearch) or write custom lower/upper bound searchds/group-by-column->indexes directly)Can you improve this documentation?Edit on GitHub
cljdoc builds & hosts documentation for Clojure/Script libraries
| Ctrl+k | Jump to recent docs |
| ← | Move to previous article |
| → | Move to next article |
| Ctrl+/ | Jump to the search field |