Geni is designed primarily to be a good data-analysis tool that is optimised for frequent and rapid feedback from the data. The core design of Geni is informed by our personal experience working as data scientists, a job that requires asking many questions about the data and writing countless queries for it.
This is important when an idea randomly pops up, and we would like to know the answer here and now. The key here is to have a dataframe library that is accessible through a fast-starting REPL from any directory on the terminal. Geni's answer to this is the Geni CLI, which is essentially an executable script that starts a Clojure REPL (as well as an nREPL server), requires Geni namespaces and instantiates a SparkSession in parallel.
With the sub-optimal startup times of Clojure and Spark, Geni is clearly handicapped compared to R and Python. On our machine, the startup times are as follows:
Library/Language | Startup Time (s) | Command |
---|---|---|
R | 0.2 | time bash -c "exit \| R --no-save" |
Python | 0.3 | time bash -c "exit \| ipython" |
Geni | 7.3 | time bash -c "exit \| geni" |
Spark Shell | 8.4 | time bash -c "echo sys.exit \| spark-shell" |
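To reproduce this kind of measurement on another machine, the timings can be scripted. The following is a minimal sketch rather than the method used for the table: the `time_startup` helper and the `COMMANDS` mapping are hypothetical names, the pipelines are copied from the table, and taking the fastest of several runs is an assumption that differs from a single `time` invocation.

```python
import shutil
import subprocess
import time

# Executable name -> shell pipeline, copied from the table above.
COMMANDS = {
    "R": "exit | R --no-save",
    "ipython": "exit | ipython",
    "geni": "exit | geni",
    "spark-shell": "echo sys.exit | spark-shell",
}


def time_startup(pipeline, runs=3):
    """Return the fastest wall-clock time in seconds over `runs` launches."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Run the pipeline through bash, as the table's commands do,
        # and discard whatever the REPL prints while starting up.
        subprocess.run(["bash", "-c", pipeline],
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    for executable, pipeline in COMMANDS.items():
        # Skip tools that are not on the PATH instead of timing a failure.
        if shutil.which(executable) is None:
            print(f"{executable}: not installed, skipped")
            continue
        print(f"{executable}: {time_startup(pipeline):.1f} s")
```

Absolute numbers will vary with hardware, JVM version and disk cache state, so only the relative ordering is likely to carry over.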
Geni is clearly not as fast-starting as R and Python, but it is still good to use for sub-one-minute tasks. To illustrate this, suppose that we are working with the Melbourne housing dataset stored in data/melbourne.parquet, and we would like to know which region has the highest mean house price. Consider the following Python and Clojure snippets:
Python-Pandas:

    $ ipython
    ...
    In [1]: import pandas as pd

Clojure-Geni:

    $ geni
    ...
    geni-repl (user) λ (def df (g/read-parquet! "data/melbourne.parquet"))
    #'user/df
    geni-repl (user) λ (-> df
                           (g/group-by :Regionname)
                           (g/agg {:price (g/mean :Price)})
                           (g/sort :price)
                           g/show)
    +--------------------------+------------------+
    |Regionname                |price             |
    +--------------------------+------------------+
    |Western Victoria          |397523.4375       |
    |Northern Victoria         |594829.268292683  |
    |Eastern Victoria          |699980.7924528302 |
    |Western Metropolitan      |866420.5200135686 |
    |Northern Metropolitan     |898171.0822622108 |
    |South-Eastern Metropolitan|922943.7844444445 |
    |Eastern Metropolitan      |1104079.6342624065|
    |Southern Metropolitan     |1372963.3693290735|
    +--------------------------+------------------+
    ...
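For reference, a Pandas query equivalent to the Geni snippet might look like the sketch below. This is not the original Python snippet from the comparison; `mean_price_by_region` is a hypothetical helper name, and the column names `Regionname` and `Price` are taken from the Geni query. Reading Parquet with Pandas also assumes that a Parquet engine such as pyarrow is installed.

```python
import pandas as pd


def mean_price_by_region(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of Price per Regionname, sorted ascending like the Geni output."""
    return (
        df.groupby("Regionname", as_index=False)
          .agg(price=("Price", "mean"))   # named aggregation: output column "price"
          .sort_values("price")
          .reset_index(drop=True)
    )


# Usage in an IPython session, with the path mentioned in the text:
#   df = pd.read_parquet("data/melbourne.parquet")
#   print(mean_price_by_region(df))
```

Because the result is sorted in ascending order, the region with the highest mean price appears in the last row, as in the Geni output.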
After timing a personal run, the Python-Pandas version took 24 seconds, whereas the Clojure-Geni version took 34 seconds. The Python-Pandas combination has a small edge for sub-one-minute tasks, but the Clojure-Geni combination has all the Clojure REPL facilities including tight text-editor integrations. These make for a better REPL experience for bigger tasks.
One downside to the Python-Pandas combination is that Pandas is single-threaded. This means that Pandas can be much slower than multi-threaded libraries on easily parallelisable tasks. To illustrate this point, consider a dummy retail dataset with 24 million transactions and over one million customers. Suppose that we would like to know how many transactions the top brands have:
Python-Pandas:

    $ ipython
    ...
    In [1]: import pandas as pd

Clojure-Geni:

    $ geni
    ...
    λ (time
        (-> (g/read-parquet! "data/dummy_retail")
            (g/select "brand-id")
            g/value-counts
            g/show))
    +--------+------+
    |brand-id|count |
    +--------+------+
    |0       |238757|
    |1       |236314|
    |2       |233277|
    |3       |231845|
    |4       |229180|
    |5       |226255|
    |6       |225069|
    |7       |222698|
    |8       |220850|
    |9       |217840|
    +--------+------+
    "Elapsed time: 3447.6941 msecs"
    ...
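A Pandas counterpart of this query could be sketched as follows. Again, this is an illustration and not the original snippet: `top_brand_counts` is a hypothetical helper, the `brand-id` column name comes from the Geni query, and the default of ten rows is an assumption chosen to mirror the rows shown in the output.

```python
import time

import pandas as pd


def top_brand_counts(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Number of transactions per brand, most frequent first, top `n` only."""
    # value_counts sorts by count in descending order by default.
    return df["brand-id"].value_counts().head(n)


# Usage in an IPython session, timing the read and the count together:
#   start = time.perf_counter()
#   df = pd.read_parquet("data/dummy_retail", columns=["brand-id"])
#   print(top_brand_counts(df))
#   print(f"Elapsed time: {time.perf_counter() - start:.1f} s")
```

Loading only the `brand-id` column keeps the comparison close to the Geni version, which also selects that single column before counting.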
In this case, we see around a 3.7x speedup for a very simple query. For more substantial queries, the speedups are typically greater, even up to 73x. See the simple performance benchmark post for a more detailed treatment.