Pegasus is a highly modular, durable, and scalable crawler for Clojure.

Parallelism is achieved with core.async. Durability is achieved with durable-queue and LMDB.

A blog post on how pegasus works: [link]


Leiningen dependencies:

Clojars Project

A few example crawls:

This one crawls 20 documents from my blog.

URLs are extracted using enlive selectors.

(ns your.namespace
  (:require [pegasus.core :refer [crawl]]
            [pegasus.dsl :refer :all])
  (:import (java.io StringReader)))

(defn crawl-sp-blog
  []
  (crawl {:seeds [""]
          :user-agent "Pegasus web crawler"
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"})) ;; store all crawl data in /tmp/sp-blog-corpus/

(defn crawl-sp-blog-custom-extractor
  []
  (crawl {:seeds [""]
          :user-agent "Pegasus web crawler"
          :extractor (defextractors
                       ;; follow links to individual articles
                       (extract :at-selector [:article :header :h2 :a]
                                :follow :href
                                :with-regex #"")
                       ;; follow pagination links
                       (extract :at-selector [:ul.pagination :a]
                                :follow :href
                                :with-regex #""))
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))
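
To make the selector-based extraction concrete, here is a minimal sketch of what `:at-selector` plus `:follow :href` roughly amounts to, written against enlive directly rather than Pegasus. The HTML snippet, the `example.org` URLs, the namespace name and the `extract-hrefs` helper are all made up for illustration. Pegasus's own extractor may differ in its details, for instance in how it resolves relative URLs or applies `:with-regex`.

```clojure
(ns your.namespace.enlive-sketch
  (:require [net.cgrand.enlive-html :as html])
  (:import (java.io StringReader)))

;; Hypothetical page body, standing in for a fetched document.
(def sample-page
  (str "<article><header><h2>"
       "<a href=\"http://example.org/post-1\">Post 1</a>"
       "</h2></header></article>"
       "<ul class=\"pagination\">"
       "<li><a href=\"http://example.org/page/2\">Next</a></li>"
       "</ul>"))

(defn extract-hrefs
  "Returns the :href attribute of every node in `body` matched by the
  enlive `selector`. Nodes without an href are skipped."
  [body selector]
  (->> (html/select (html/html-resource (StringReader. body)) selector)
       (keep #(get-in % [:attrs :href]))))

;; article links
(extract-hrefs sample-page [:article :header :h2 :a])

;; pagination links
(extract-hrefs sample-page [:ul.pagination :a])
```

Each selector is a vector of steps read left to right as "descendant of", so `[:ul.pagination :a]` matches any anchor nested inside a `ul` with class `pagination`.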

If you want more control and would rather avoid the DSL, you can use the underlying machinery directly. Here's an example that uses XPath to extract links.

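(ns your.namespace
  (:require [org.bovinegenius.exploding-fish :as uri]
            [net.cgrand.enlive-html :as html]
            [pegasus.core :refer [crawl]]
            ;; assumed: the namespace that defines the pipeline-component protocol
            [pegasus.process :as process]
            [clj-xpath.core :refer [$x $x:text xml->doc]]))

(deftype XpathExtractor []
  ;; assumed: protocol name and its methods (initialize, run, clean)
  process/PipelineComponentProtocol

  (initialize
    [this config]
    config)

  (run
    [this obj config]
    (when (= ""
             (-> obj :url uri/host))
      (let [url (:url obj)
            ;; parse the response body as XML; nil if parsing fails
            resource (try (-> obj
                              :body
                              xml->doc)
                          (catch Exception e nil))
            ;; extract the articles
            articles (map
                      :text
                      (try ($x "//item/link" resource)
                           (catch Exception e nil)))]
        ;; add extracted links to the supplied object
        (merge obj
               {:extracted articles}))))

  (clean
    [this config]
    nil))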

(defn crawl-sp-blog-xpaths
  []
  (crawl {:seeds [""]
          :user-agent "Pegasus web crawler"
          :extractor (->XpathExtractor)
          :corpus-size 20 ;; crawl 20 documents
          :job-dir "/tmp/sp-blog-corpus"}))

;; start crawling
(crawl-sp-blog-xpaths)
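
The XPath `//item/link` used above selects the `<link>` child of every `<item>` element, which is where an RSS feed keeps its article URLs. As a rough illustration of what that query yields, here is a small sketch that runs it with clj-xpath on an inline feed. The feed contents, the `example.org` URLs, the namespace name and the `feed-links` helper are invented for this example and are not part of Pegasus.

```clojure
(ns your.namespace.xpath-sketch
  (:require [clj-xpath.core :refer [$x:text* xml->doc]]))

;; Hypothetical RSS body, standing in for a fetched feed.
(def sample-feed
  (str "<rss><channel>"
       "<item><title>Post 1</title><link>http://example.org/post-1</link></item>"
       "<item><title>Post 2</title><link>http://example.org/post-2</link></item>"
       "</channel></rss>"))

(defn feed-links
  "Returns the text of every //item/link node in the XML string `body`.
  $x:text* returns zero or more matches, so an empty feed yields no links."
  [body]
  ($x:text* "//item/link" (xml->doc body)))

(feed-links sample-feed)
```

A crawler extractor built this way only needs to hand the resulting URLs back under the key the pipeline expects, as the `:extracted` entry does in the example above.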


Copyright © 2015-2018 Shriphani Palakodety

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

