API DOCUMENTATION

Dataframe

Features
  • Unlimited size

    In theory, it supports datasets larger than memory, with no upper limit on size.

  • All native types

    All the datatypes used to store data are native Clojure (or Java) types.

  • From file to file

    IO is integrated into the dataframe. There is no need to write your own input and output functions.

  • Distributed (coming soon)

    Most operations could be distributed across different computers in a cluster. See Onyx for the underlying principle.

  • Lazy operations

    Some operations are not executed immediately. The dataframe pipelines the registered operations together and runs them at computation time.

Basic Information
  • Most operations on the dataframe are performed lazily and executed all at once by compute; the exceptions are sort and join.
  • The dataframe processes data row by row, i.e. one row per vector.
  • The input dataframe can be larger than memory.
  • By default, all columns have the same type: string. You can set a column's type with the predefined type keywords.
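
The lazy, all-at-once model described above can be pictured with plain Clojure transducers. This is only a sketch and does not use Clojask: the sample rows and the three steps are hypothetical stand-ins for a parser, a filter and an operation that would be registered on a dataframe.

```clojure
;; Hypothetical rows: [employee-id salary-as-string], one row per vector.
(def rows [[1 "700"] [2 "900"] [3 "650"]])

;; Composing the steps only builds a recipe; nothing is executed yet.
(def steps
  (comp
    ;; parse the string column into a double
    (map (fn [[id salary]] [id (Double/parseDouble salary)]))
    ;; keep rows with salary not larger than 800
    (clojure.core/filter (fn [[_ salary]] (<= salary 800)))
    ;; negate the salary
    (map (fn [[id salary]] [id (- salary)]))))

;; The analogue of compute: every step runs in a single pass over the rows.
(def result (into [] steps rows))
```

Here `result` only comes into existence when `into` runs, which mirrors how registered operations wait for compute.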
API
  • filter

    Filters the dataframe by rows.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | columns | String / collection of strings | The columns the predicate function is applied to | |
    | predicate | Function | The predicate function that determines whether a row is kept | Must take the same number of arguments as the columns above, in the same order. Only rows for which it returns true are kept. |

    Example

    (filter x "Salary" (fn [salary] (<= salary 800)))
    ;; this statement deletes all the rows that have a salary larger than 800
    (filter x ["Salary" "Department"] (fn [salary dept] (and (<= salary 800) (= dept "computer science"))))
    ;; keeps only people from computer science department with salary not larger than 800
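
To see how such a predicate behaves before registering it, it can be tried on a few in-memory rows. The sketch below is plain Clojure with hypothetical sample data; it uses `clojure.core/filter`, not the Clojask `filter`, and assumes the salary values are already numbers.

```clojure
;; Hypothetical rows standing in for dataframe rows: [Salary Department].
(def sample-rows
  [[700.0 "computer science"]
   [900.0 "computer science"]
   [650.0 "statistics"]])

;; Same shape as the two-column predicate above: one argument per column,
;; in the order the columns are listed.
(defn keep-row? [salary dept]
  (and (<= salary 800) (= dept "computer science")))

;; apply spreads each row over the predicate's arguments.
(def kept (vec (clojure.core/filter #(apply keep-row? %) sample-rows)))
```

Only the first sample row satisfies both conditions, so it is the only one left in `kept`.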
    
  • set-type

    Sets the type of a column, so that values read from that column are of that type.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | type | String | Type of the column | The natively supported types are: int, double, string, date. By default, all columns are of type string. If you need a special parsing function, see set-parser. |
    | column | String | Target column | Should be an existing column within the dataframe |

    Example

    (set-type x "Salary" "double")
    ;; values in the column Salary are now read as doubles
    
  • set-parser

    A more flexible way to set type.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | parser | Function | The parser function that parses a string into another type (or even a string) | The function should take exactly one argument, which is a string, and the parsed type should be serializable. |
    | column | String | Target column | Should be an existing column within the dataframe |

    Example

    (set-parser x "Salary" #(Double/parseDouble %))
    ;; parses all the values in Salary with this function
    ;; (the static method is wrapped in an anonymous function so it can be passed as a value)
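
A custom parser is just a one-argument function from string to value. The sketch below is a hypothetical example in plain Clojure (the name `parse-salary` and the input format with thousands separators are assumptions, not part of Clojask); it returns a Double, which is serializable.

```clojure
(require '[clojure.string :as str])

(defn parse-salary
  "Hypothetical parser: takes one string such as \" 1,200.50 \" and returns a Double."
  [s]
  (-> s
      str/trim                  ; drop surrounding whitespace
      (str/replace "," "")      ; drop thousands separators
      Double/parseDouble))

;; It would then be passed in the same position as the parser above, e.g.
;; (set-parser x "Salary" parse-salary)
```

Testing the parser on a few raw strings first makes it easier to spot values that would throw during computation.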
    
  • operate

    In-place modification of a single column.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | operation | Function | Function to be applied lazily | The function should take exactly one argument, which is the value of the column below |
    | column name | String | Target column | Should be an existing column within the dataframe |

    Example

    (set-type x "Salary" "double")
    (operate x - "Salary")
    ;; takes the negative of column Salary
    
  • operate

    Calculates a result and stores it in a new column.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | operation | Function | Function to be applied lazily | The number of arguments must match the number of column names below, i.e. if the operation function takes two arguments, two column names must be given, and their values are passed to the function in the same order |
    | column name(s) | String or collection of strings | Target columns | Should be existing columns within the dataframe |
    | new column | String | Resultant column | Should be a new column not already in the dataframe |

    Example

    (operate x str ["Employee" "EmployeeName"] "new")
    ;; concatenates the two columns into the "new" column
    
  • group-by

    Groups the dataframe by one or more columns, or by the result of applying a function to a column. Always use together with aggregate.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | groupby-keys | String / Collection | Group-by columns (or functions of columns) | See the Group-by Keys Specification below |

    Example

    (group-by x ["Department" "DepartmentName"])
    ;; group by both columns
    

Group-by Keys Specification

Group-by functions requirements:

  • Take one argument
  • Return type: int / double / string

One general rule is to put the group-by function and its corresponding column name together.

(defn rem10
  "Get the remainder of num divided by 10"
  [num]
  (rem num 10))

(group-by x [rem10 "Salary"])
;; or
(group-by x [[rem10 "Salary"]])

If there is no group-by function, the column name can stand alone.

(group-by x "Salary")
;; or
(group-by x ["Salary"])

You can also group by the combination of keys. (Use the above two rules together)

(group-by x [[rem10 "Salary"] "Department"])
;; or
(group-by x [[rem10 "Salary"] ["Department"]])
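
To picture which rows end up in the same group under a combined key such as `[[rem10 "Salary"] "Department"]`, the sketch below uses `clojure.core/group-by` on hypothetical in-memory rows. It is not Clojask code; the row maps and the helper `group-key` are assumptions made for illustration.

```clojure
(defn rem10
  "Get the remainder of n divided by 10"
  [n]
  (rem n 10))

;; Hypothetical rows keyed by column name.
(def rows
  [{"Salary" 813 "Department" "A"}
   {"Salary" 703 "Department" "A"}
   {"Salary" 815 "Department" "B"}])

;; The combined key: rem10 applied to Salary, plus the raw Department value.
(defn group-key [row]
  [(rem10 (get row "Salary")) (get row "Department")])

(def groups (clojure.core/group-by group-key rows))
```

The first two rows share the key `[3 "A"]` and fall into one group, while the third row forms its own group under `[5 "B"]`.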
  • aggregate

    Aggregates the grouped dataframe with a function. The aggregation function is applied to each registered column in turn.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | aggregation function | Function | Function to be applied to each column | Should take a collection as its argument and return one value, or a collection of values, of a predefined type*. |
    | column name(s) | String or collection of strings | Aggregate columns | Should be existing columns within the dataframe |
    | [new column] | String or collection of strings | Resultant column | Should be new columns not already in the dataframe |

    Example

    (aggregate x clojask/min ["Employee" "EmployeeName"] ["new" "new2"])
    ;; get the min of the two columns grouped by ...
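
A user-defined aggregation function has the same shape: it receives a collection and returns a single value. The sketch below is plain Clojure with a hypothetical function name (`col-range`); it assumes the values it receives are already numeric.

```clojure
(defn col-range
  "Hypothetical aggregation function: takes the collection of values of one
  column within a group and returns max minus min as a double."
  [values]
  (double (- (apply max values) (apply min values))))

;; Trying it on an in-memory collection before using it in an aggregation:
(def sample-range (col-range [700.0 900.0 650.0]))
```

For the sample values, the maximum is 900.0 and the minimum is 650.0, so `sample-range` is 250.0.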
    
  • inner-join / left-join / right-join

    Inner / left / right joins two dataframes on some columns.

    Remarks:

    Join functions are immediate actions: they are executed at once.

    Like compute, they automatically pipeline the registered operations and filters. You can think of a join as first computing both dataframes and then joining them.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe a | Clojask.DataFrame | The operated object | |
    | dataframe b | Clojask.DataFrame | The operated object | |
    | a join keys | String / Collection | The keys of a to be aligned | Find the specification here |
    | b join keys | String / Collection | The keys of b to be aligned | Find the specification here |

Return

A Clojask.JoinedDataFrame

  • Unlike Clojask.DataFrame, it only supports three operations:
    • print-df
    • get-col-names
    • compute
  • This means you cannot further apply complicated operations to a joined dataframe. An alternative is to first compute the result, then read it in as a new dataframe.

Example

(def x (dataframe "path/to/a"))
(def y (dataframe "path/to/b"))

(def z (inner-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"]))
(compute z 8 "path/to/output")
;; inner join x and y

(def z (left-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"]))
(compute z 8 "path/to/output")
;; left join x and y

(def z (right-join x y ["col a 1" "col a 2"] ["col b 1" "col b 2"]))
(compute z 8 "path/to/output")
;; right join x and y
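
The difference between the join types can be illustrated on small in-memory collections. The sketch below is plain Clojure, not Clojask: the helper functions and sample rows are hypothetical, and how Clojask itself fills the columns of unmatched rows is not shown here.

```clojure
;; Hypothetical rows keyed by column name.
(def a [{"id" 1 "name" "Ann"} {"id" 2 "name" "Bob"}])
(def b [{"id" 2 "dept" "CS"} {"id" 3 "dept" "Math"}])

(defn inner-join-rows
  "Keeps only the pairs of rows whose key values match."
  [xs ys kx ky]
  (for [x xs, y ys :when (= (get x kx) (get y ky))]
    (merge x y)))

(defn left-join-rows
  "Keeps every row of xs; rows without a match in ys are kept unmerged."
  [xs ys kx ky]
  (mapcat (fn [x]
            (let [matches (clojure.core/filter #(= (get x kx) (get % ky)) ys)]
              (if (seq matches)
                (map #(merge x %) matches)
                [x])))
          xs))

;; A right join is a left join with the two sides swapped.
(defn right-join-rows [xs ys kx ky]
  (left-join-rows ys xs ky kx))
```

With the sample data, the inner join keeps only id 2, the left join keeps ids 1 and 2, and the right join keeps ids 2 and 3.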
  • reorderCol / renameCol

    Reorders the columns / renames the columns in the dataframe.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe a | Clojask.DataFrame | The operated object | |
    | a columns | Clojure.collection | The new set of column names | For reorderCol, these must be existing headers in dataframe a |

    Example

    (.reorderCol y ["Employee" "Department" "EmployeeName" "Salary"])
    (.renameCol y ["Employee" "new-Department" "EmployeeName" "Salary"])
    
  • sort

    Sorts the dataframe immediately.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | trending list | Collection (seq vector) | Indicates the sort order | Example: ["Salary" "+" "Employee" "-"] means sort by Salary in ascending order and, where Salary is equal, by Employee in descending order |
    | output-directory | String | The output path | |

    Example

    (sort y ["+" "Salary"] "resources/sort.csv")
    ;; sort by Salary in ascending order
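
The ordering described for a trending list with two columns (Salary ascending, then Employee descending on ties) corresponds to the comparator sketched below. This is plain Clojure on hypothetical in-memory rows using `clojure.core/sort`; it is not how Clojask implements its out-of-memory sort.

```clojure
;; Hypothetical rows keyed by column name.
(def rows
  [{"Employee" 1 "Salary" 500}
   {"Employee" 2 "Salary" 300}
   {"Employee" 3 "Salary" 500}])

(defn salary-asc-employee-desc
  "Compares by Salary ascending; ties are broken by Employee descending."
  [r1 r2]
  (let [c (compare (get r1 "Salary") (get r2 "Salary"))]
    (if (zero? c)
      (compare (get r2 "Employee") (get r1 "Employee")) ; swapped for descending
      c)))

(def sorted-employees
  (mapv #(get % "Employee")
        (clojure.core/sort salary-asc-employee-desc rows)))
```

Employee 2 has the lowest salary and comes first; employees 3 and 1 tie on salary and appear in descending employee order.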
    
  • compute

    Computes the result. The previously registered lazy operations are executed as a pipeline, i.e. the result of one operation becomes the argument of the next.

    | Argument | Type | Function | Remarks |
    | --- | --- | --- | --- |
    | dataframe | Clojask.DataFrame | The operated object | |
    | num of workers | int (max 8) | The number of worker instances (excluding the input and output nodes) | Uses Onyx as the distributed platform |
    | output path | String | The path of the output csv file | The file may or may not already exist. |
    | [exception] | boolean | Whether an exception during calculation causes termination | Useful for debugging or detecting empty fields |
    | [select] | String / Collection of strings | The names of the columns to select, similar to SELECT in SQL. It is best to check get-col-names first for the available names. | select and exclude cannot both be specified |
    | [exclude] | String / Collection of strings | The names of the columns to exclude | select and exclude cannot both be specified |

    Example

    (compute x 8 "../resources/test.csv" :exception true)
    ;; computes all the pre-registered operations
    
    (compute x 8 "../resources/test.csv" :select "col a")
    ;; only select column a
    
    (compute x 8 "../resources/test.csv" :select ["col b" "col a"])
    ;; select two columns, column b and column a in order
    
    (compute x 8 "../resources/test.csv" :exclude ["col b" "col a"])
    ;; select all columns except column b and column a, other columns are in order
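
The column ordering described for select and exclude can be sketched on a single in-memory row. The helpers below are hypothetical and written in plain Clojure; they only illustrate the behaviour stated in the comments above, not Clojask internals.

```clojure
;; Hypothetical header and one row keyed by column name.
(def header ["col a" "col b" "col c"])
(def row {"col a" 1 "col b" 2 "col c" 3})

(defn select-cols
  "Returns the values of the requested columns, in the requested order."
  [row cols]
  (mapv #(get row %) cols))

(defn exclude-cols
  "Returns the values of all columns except cols, in header order."
  [header row cols]
  (let [dropped (set cols)]
    (select-cols row (remove dropped header))))

(def selected (select-cols row ["col b" "col a"]))
(def remaining (exclude-cols header row ["col b" "col a"]))
```

Selecting `["col b" "col a"]` yields the values in that order, while excluding the same two columns leaves only the value of `"col c"`.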
    

Contributors: Yuchen Liu & Angel Woo