WARNING! This library is still unstable. Some information here may be outdated. Do not use it in production just yet!

See Flambo and Sparkling for more mature alternatives.

Introduction

geni (/gɜni/ or "gurney" without the r) is a Clojure library that wraps Apache Spark. The name comes from the Javanese word for fire.

Why?

The question is probably not about the choice of Spark, which is easy to justify given its maturity, speed and pleasant API. Rather, why wrap Spark in Clojure when you can use it natively in Scala or through PySpark, its popular Python API?

Clojure is an excellent programming language for data wrangling due to its particular focus on fast feedback, most notably through its REPL. Being hosted on the JVM, Clojure interoperates well with Java (and thus Scala) libraries. However, Spark's pleasant Scala API becomes quite clunky when called from Clojure. Geni aims to provide an ergonomic Spark interface for the Clojure REPL.

One such nuisance is having to wrap variadic arguments, such as extra column names and Spark columns, in Java arrays:

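(import '(org.apache.spark.sql Column functions))

(-> dataframe
    ;; groupBy takes one column name plus a String array for the rest
    (.groupBy "SellerG" (into-array java.lang.String ["Suburb"]))
    ;; agg takes one Column plus a Column array for the rest
    (.agg
      (.as (functions/mean "Price") "mean")
      (into-array Column [(.as (functions/stddev "Price") "std")
                          (.as (functions/min "Price") "min")
                          (.as (functions/max "Price") "max")]))
    .show)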

Geni aims to provide a Spark interface that plays nicely with Clojure's threading macro -> and dynamic types:

(-> dataframe
    (group-by (col "SellerG") "Suburb") ;; Mix Column and string types
    (agg
      (-> (mean "Price") (as "mean"))
      (-> (stddev "Price") (as "std"))  ;; No need to do into-array
      (-> (min "Price") (as "min"))
      (-> (max "Price") (as "max")))
    show)

Another inconvenience is having to deal with Scala sequences:

(->> (.collect dataframe)                 ;; .collect returns an array of Spark rows
     (map #(JavaConversions/seqAsJavaList
             (.toSeq %))))                ;; yields one Java list of values per row
;; The values must still be zipped with the column names to recover row-like maps.

In Geni, (collect dataframe) returns a vector of maps, where the maps serve a similar purpose to Spark rows.
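
For comparison, the manual zipping step mentioned above can be sketched in plain Clojure. This is only an illustration of the idea and not Geni's implementation: the column names and row values are hard-coded stand-ins for what Spark would return, and rows->maps is a hypothetical helper name.

```clojure
;; Hypothetical stand-ins: in real code these would come from the
;; dataframe's column names and its collected rows.
(def column-names ["SellerG" "Suburb" "Price"])
(def rows [["seller-a" "suburb-x" 100.0]
           ["seller-b" "suburb-y" 200.0]])

(defn rows->maps
  "Zips each row's values with the column names to build row-like maps.
  Hypothetical helper, not part of Geni's API."
  [col-names row-seqs]
  (let [ks (map keyword col-names)]
    (mapv #(zipmap ks %) row-seqs)))

(rows->maps column-names rows)
;; => [{:SellerG "seller-a", :Suburb "suburb-x", :Price 100.0}
;;     {:SellerG "seller-b", :Suburb "suburb-y", :Price 200.0}]
```

Keyword keys are one possible choice here; string keys would work just as well by passing the column names to zipmap directly.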

Quick Start

Use Leiningen to create a template of a Geni project:

lein new geni <project-name>

Step into the directory, and run the command lein run!

Installation

Note that Geni wraps Apache Spark 2.4.5 built for Scala 2.12, which has incomplete support for JDK 11. JDK 8 is therefore recommended.

Add the Geni coordinate shown on its Clojars project page to the :dependencies vector in your project.clj.

You also need to add Spark as a provided dependency. For instance, add the following key-value pair to the :profiles map:

:provided
{:dependencies [[org.apache.spark/spark-core_2.12 "2.4.5"]
                [org.apache.spark/spark-hive_2.12 "2.4.5"]
                [org.apache.spark/spark-mllib_2.12 "2.4.5"]
                [org.apache.spark/spark-sql_2.12 "2.4.5"]
                [org.apache.spark/spark-streaming_2.12 "2.4.5"]]}
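
To show where that key-value pair sits, a complete project.clj might look roughly like the sketch below. The project name, project version and Clojure version are placeholder assumptions, and the Geni coordinate is deliberately left as a comment because it should be copied from Clojars.

```clojure
;; Placeholder project name and version (assumptions for illustration).
(defproject my-geni-app "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"] ;; assumed Clojure version
                 ;; Add the Geni coordinate from its Clojars page here.
                 ]
  :profiles
  {:provided
   {:dependencies [[org.apache.spark/spark-core_2.12 "2.4.5"]
                   [org.apache.spark/spark-hive_2.12 "2.4.5"]
                   [org.apache.spark/spark-mllib_2.12 "2.4.5"]
                   [org.apache.spark/spark-sql_2.12 "2.4.5"]
                   [org.apache.spark/spark-streaming_2.12 "2.4.5"]]}})
```

Keeping Spark under :provided means the Spark jars are available during development but are not bundled into the uberjar, which suits deployments where the cluster supplies Spark itself.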

Future Work

Features:

  • Data-oriented queries and pipeline stages.
  • Setup on GCP's Dataproc + guide.
  • Clojure docs.

License

Copyright 2020 Zero One Group.

geni is licensed under Apache License v2.0.

Mentions

Some code was taken from:
