Geni (/gɜni/ or "gurney" without the r) is a Clojure library that wraps Apache Spark. The name means "fire" in Javanese.
WARNING! This library is still unstable. Some information here may be outdated. Do not use it in production just yet! See Flambo and Sparkling for more mature alternatives.
Geni is designed to provide an idiomatic Spark interface for Clojure without the hassle of Java or Scala interop. Geni uses Clojure's ->
threading macro as the main way to compose Spark's Dataset and Column operations in place of the usual method chaining in Scala. It also provides a greater degree of dynamism by allowing args of mixed types such as columns, strings and keywords in a single function invocation. See the docs section on Geni semantics for more details.
Docs:
Geni Cookbook:
All examples below use the Melbourne housing market data available for free on Kaggle.
Spark SQL API for data wrangling:
(require '[zero-one.geni.core :as g])
(g/count dataframe)
=> 13580
(g/print-schema dataframe)
; root
; |-- Suburb: string (nullable = true)
; |-- Address: string (nullable = true)
; |-- Rooms: long (nullable = true)
; |-- Type: string (nullable = true)
; |-- Price: double (nullable = true)
; |-- Method: string (nullable = true)
; |-- SellerG: string (nullable = true)
; |-- Date: string (nullable = true)
; |-- Distance: double (nullable = true)
; |-- Postcode: double (nullable = true)
; |-- Bedroom2: double (nullable = true)
; |-- Bathroom: double (nullable = true)
; |-- Car: double (nullable = true)
; |-- Landsize: double (nullable = true)
; |-- BuildingArea: double (nullable = true)
; |-- YearBuilt: double (nullable = true)
; |-- CouncilArea: string (nullable = true)
; |-- Lattitude: double (nullable = true)
; |-- Longtitude: double (nullable = true)
; |-- Regionname: string (nullable = true)
; |-- Propertycount: double (nullable = true)
(-> dataframe (g/limit 5) g/show)
; +----------+----------------+-----+----+---------+------+-------+---------+--------+--------+--------+--------+---+--------+------------+---------+-----------+---------+----------+---------------------+-------------+
; |Suburb |Address |Rooms|Type|Price |Method|SellerG|Date |Distance|Postcode|Bedroom2|Bathroom|Car|Landsize|BuildingArea|YearBuilt|CouncilArea|Lattitude|Longtitude|Regionname |Propertycount|
; +----------+----------------+-----+----+---------+------+-------+---------+--------+--------+--------+--------+---+--------+------------+---------+-----------+---------+----------+---------------------+-------------+
; |Abbotsford|85 Turner St |2 |h |1480000.0|S |Biggin |3/12/2016|2.5 |3067.0 |2.0 |1.0 |1.0|202.0 |null |null |Yarra |-37.7996 |144.9984 |Northern Metropolitan|4019.0 |
; |Abbotsford|25 Bloomburg St |2 |h |1035000.0|S |Biggin |4/02/2016|2.5 |3067.0 |2.0 |1.0 |0.0|156.0 |79.0 |1900.0 |Yarra |-37.8079 |144.9934 |Northern Metropolitan|4019.0 |
; |Abbotsford|5 Charles St |3 |h |1465000.0|SP |Biggin |4/03/2017|2.5 |3067.0 |3.0 |2.0 |0.0|134.0 |150.0 |1900.0 |Yarra |-37.8093 |144.9944 |Northern Metropolitan|4019.0 |
; |Abbotsford|40 Federation La|3 |h |850000.0 |PI |Biggin |4/03/2017|2.5 |3067.0 |3.0 |2.0 |1.0|94.0 |null |null |Yarra |-37.7969 |144.9969 |Northern Metropolitan|4019.0 |
; |Abbotsford|55a Park St |4 |h |1600000.0|VB |Nelson |4/06/2016|2.5 |3067.0 |3.0 |1.0 |2.0|120.0 |142.0 |2014.0 |Yarra |-37.8072 |144.9941 |Northern Metropolitan|4019.0 |
; +----------+----------------+-----+----+---------+------+-------+---------+--------+--------+--------+--------+---+--------+------------+---------+-----------+---------+----------+---------------------+-------------+
(-> dataframe (g/describe :Landsize :Rooms :Price) g/show)
; +-------+-----------------+------------------+-----------------+
; |summary|Landsize |Rooms |Price |
; +-------+-----------------+------------------+-----------------+
; |count |13580 |13580 |13580 |
; |mean |558.4161266568483|2.9379970544919 |1075684.079455081|
; |stddev |3990.669241109034|0.9557479384215565|639310.7242960163|
; |min |0.0 |1 |85000.0 |
; |max |433014.0 |10 |9000000.0 |
; +-------+-----------------+------------------+-----------------+
(-> dataframe
(g/group-by :Suburb)
(g/agg {:count (g/count "*")
:n-sellers (g/count-distinct :SellerG)
:avg-price (g/int (g/mean :Price))})
(g/order-by (g/desc :count))
(g/limit 5)
g/show)
; +--------------+-----+---------+---------+
; |Suburb |count|n-sellers|avg-price|
; +--------------+-----+---------+---------+
; |Reservoir |359 |18 |690008 |
; |Richmond |260 |22 |1083564 |
; |Bentleigh East|249 |21 |1085591 |
; |Preston |239 |20 |902800 |
; |Brunswick |222 |21 |1013171 |
; +--------------+-----+---------+---------+
(-> dataframe
(g/select {:address :Address
:date (g/to-date :Date "d/MM/yyyy")
:coord (g/struct {:lat :Lattitude :long :Longtitude})})
g/shuffle
(g/limit 5)
g/collect)
=> ({:address "114 Shields St",
:date #inst "2016-05-21T17:00:00.000-00:00",
:coord {:lat -37.7847, :long 144.9341}}
{:address "129 Glenlyon Rd",
:date #inst "2017-05-05T17:00:00.000-00:00",
:coord {:lat -37.7723, :long 144.9694}}
{:address "48 Lyons St",
:date #inst "2016-04-15T17:00:00.000-00:00",
:coord {:lat -37.8955, :long 145.0515}}
{:address "3/31 Clapham St",
:date #inst "2017-05-19T17:00:00.000-00:00",
:coord {:lat -37.7549, :long 144.9979}}
{:address "327 Hull Rd",
:date #inst "2017-09-08T17:00:00.000-00:00",
:coord {:lat -37.78329, :long 145.32271}})
Spark ML example translated from Spark's programming guide:
(require '[zero-one.geni.core :as g])
(require '[zero-one.geni.ml :as ml])
(def training-set
(g/table->dataset
[[0 "a b c d e spark" 1.0]
[1 "b d" 0.0]
[2 "spark f g h" 1.0]
[3 "hadoop mapreduce" 0.0]]
[:id :text :label]))
(def pipeline
(ml/pipeline
(ml/tokenizer {:input-col :text
:output-col :words})
(ml/hashing-tf {:num-features 1000
:input-col :words
:output-col :features})
(ml/logistic-regression {:max-iter 10
:reg-param 0.001})))
(def model (ml/fit training-set pipeline))
(def test-set
(g/table->dataset
[[4 "spark i j k"]
[5 "l m n"]
[6 "spark hadoop spark"]
[7 "apache hadoop"]]
[:id :text]))
(-> test-set
(ml/transform model)
(g/select :id :text :probability :prediction)
g/show)
;; +---+------------------+----------------------------------------+----------+
;; |id |text |probability |prediction|
;; +---+------------------+----------------------------------------+----------+
;; |4 |spark i j k |[0.1596407738787411,0.8403592261212589] |1.0 |
;; |5 |l m n |[0.8378325685476612,0.16216743145233883]|0.0 |
;; |6 |spark hadoop spark|[0.0692663313297627,0.9307336686702373] |1.0 |
;; |7 |apache hadoop |[0.9821575333444208,0.01784246665557917]|0.0 |
;; +---+------------------+----------------------------------------+----------+
More detailed examples can be found here.There is also a one-to-one walkthrough of Chapter 5 of NVIDIA's Accelerating Apache Spark 3.x, which can be found here.
Use Leiningen to create a template of a Geni project:
lein new geni <project-name>
Add the following to your project.clj
dependency:
You would also need to add Spark as provided dependencies. For instance, have the following key-value pair for the :profiles
map:
:provided
{:dependencies [;; Spark
[org.apache.spark/spark-avro_2.12 "3.0.0"]
[org.apache.spark/spark-core_2.12 "3.0.0"]
[org.apache.spark/spark-hive_2.12 "3.0.0"]
[org.apache.spark/spark-mllib_2.12 "3.0.0"]
[org.apache.spark/spark-sql_2.12 "3.0.0"]
[org.apache.spark/spark-streaming_2.12 "3.0.0"]
[com.github.fommil.netlib/all "1.1.2" :extension "pom"]
;; Databases
[mysql/mysql-connector-java "8.0.21"]
[org.postgresql/postgresql "42.2.14"]
[org.xerial/sqlite-jdbc "3.32.3.1"]
;; Optional: Spark XGBoost
[ml.dmlc/xgboost4j-spark_2.12 "1.0.0"]
[ml.dmlc/xgboost4j_2.12 "1.0.0"]]}
You may also need to install libatlas3-base
and libopenblas-base
to use a native BLAS, and install libgomp1
to train XGBoost4j models. When the optional dependencies are not present, the vars to the corresponding functions (such as ml/xgboost-classifier
) will be left unbound.
Copyright 2020 Zero One Group.
Geni is licensed under Apache License v2.0, see LICENSE.
Some code was taken from:
arg-count
.Can you improve this documentation? These fine people already did:
Anthony Khong & arithmoxEdit on GitHub
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close