Occasionally, we find functions from other languages and libraries that solve certain problems succinctly and elegantly, but are unfortunately absent in Spark. Geni aims to incorporate many such functions in order to speed up the feedback loop and improve the REPL experience.
This doc discusses a few interesting examples of such imported idioms. It is by no means exhaustive. See the cljdoc entry for the zero-one.geni.foreign-idioms namespace for the complete set.
Sometimes it's convenient to know the shape of a dataframe:
(g/shape dataframe)
=> [13580 21]
The function value-counts returns the counts of unique values. Unlike in Pandas, this operation does not require a single column:
(-> dataframe
(g/select :SellerG :Suburb)
g/value-counts
(g/limit 5)
g/show)
; +-------+--------------+-----+
; |SellerG|Suburb |count|
; +-------+--------------+-----+
; |Biggin |Richmond |101 |
; |Nelson |Essendon |97 |
; |Nelson |Brunswick |96 |
; |Barry |Reservoir |90 |
; |Buxton |Bentleigh East|88 |
; +-------+--------------+-----+
The function cut bins values into discrete intervals:
(-> dataframe
(g/select {:price-bins (g/cut :Price [8e5 1e6 1.2e6])})
g/value-counts
g/show)
; +---------------------------+-----+
; |price-bins |count|
; +---------------------------+-----+
; |Price[-Infinity, 800000.0] |5479 |
; |Price[1200000.0, Infinity] |4328 |
; |Price[800000.0, 1000000.0] |2358 |
; |Price[1000000.0, 1200000.0]|1415 |
; +---------------------------+-----+
At times, it may be convenient to do the binning based on the quantiles. Enter qcut:
(-> dataframe
(g/select {:price :Price
:bins (g/qcut :Price [0.1 0.5 0.9])})
(g/group-by :bins)
(g/agg {:min (g/min :price)
:mean (g/mean :price)
:max (g/max :price)
:count (g/count "*")})
g/show)
; +---------------+---------+------------------+---------+-----+
; |bins |min |mean |max |count|
; +---------------+---------+------------------+---------+-----+
; |Price[0.0, 0.1]|85000.0 |392448.1684981685 |480000.0 |1365 |
; |Price[0.1, 0.5]|480500.0 |697254.5108695652 |903000.0 |5428 |
; |Price[0.5, 0.9]|904000.0 |1270420.509825528 |1850000.0|5445 |
; |Price[0.9, 1.0]|1851000.0|2511148.2026825636|9000000.0|1342 |
; +---------------+---------+------------------+---------+-----+
Or simply just, say, five equal bins:
(-> dataframe
(g/select {:price :Price
:bins (g/qcut :Price 5)})
(g/group-by :bins)
(g/agg {:min (g/min :price)
:mean (g/mean :price)
:max (g/max :price)
:count (g/count "*")})
g/show)
; +---------------+---------+------------------+---------+-----+
; |bins |min |mean |max |count|
; +---------------+---------+------------------+---------+-----+
; |Price[0.0, 0.2]|85000.0 |469287.6683562636 |600000.0 |2762 |
; |Price[0.2, 0.4]|600500.0 |700734.7062937063 |800000.0 |2717 |
; |Price[0.4, 0.6]|800001.0 |911216.2482269504 |1040000.0|2679 |
; |Price[0.6, 0.8]|1041000.0|1240686.2796330275|1450000.0|2725 |
; |Price[0.8, 1.0]|1451000.0|2071079.9484612532|9000000.0|2697 |
; +---------------+---------+------------------+---------+-----+
NumPy's clip can sometimes be pretty handy:
(-> dataframe
(g/select {:land-size :LandSize
:clipped (g/clip :LandSize 100 200)})
(g/limit 10)
g/show)
; +---------+-------+
; |land-size|clipped|
; +---------+-------+
; |202.0 |200.0 |
; |156.0 |156.0 |
; |134.0 |134.0 |
; |94.0 |100.0 |
; |120.0 |120.0 |
; |181.0 |181.0 |
; |245.0 |200.0 |
; |256.0 |200.0 |
; |0.0 |100.0 |
; |220.0 |200.0 |
;+---------+-------+
Users of TechAscent's dataset may be more familiar with the ->dataset function for dataframe creation:
(-> [{:a 1 :b 2} {:a 2 :c 3}]
g/->dataset
g/show)
; +---+----+----+
; |a |b |c |
; +---+----+----+
; |1 |2 |null|
; |2 |null|3 |
; +---+----+----+
And it works with file paths and options:
(-> "test/resources/melbourne_housing_snapshot.parquet"
(g/->dataset {:n-records 5 :column-whitelist [:Price
:Suburb
:LandSize]})
g/show)
; +---------+----------+--------+
; |Price |Suburb |LandSize|
; +---------+----------+--------+
; |1480000.0|Abbotsford|202.0 |
; |1035000.0|Abbotsford|156.0 |
; |1465000.0|Abbotsford|134.0 |
; |850000.0 |Abbotsford|94.0 |
; |1600000.0|Abbotsford|120.0 |
; +---------+----------+--------+
Can you improve this documentation?Edit on GitHub
cljdoc builds & hosts documentation for Clojure/Script libraries
| Ctrl+k | Jump to recent docs |
| ← | Move to previous article |
| → | Move to next article |
| Ctrl+/ | Jump to the search field |