Skyscraper is implemented using core.async and by default, scraping happens asynchronously on multiple threads. There are several entry-point functions to invoke scraping; they all take the same arguments (see the docstring of the first one for details), but differ in how they return results.
skyscraper.core/scrape
: lazy sequence of contextsThis is the only mode that was available in Skyscraper versions prior to 0.3. The scrape
function returns a lazy sequence of contexts that are leaves in the scraping tree. After fully consuming this sequence, the underlying core.async channel will be closed and the worker threads terminated.
skyscraper.core/scrape!
: imperative, eagerWhile scrape
tries to provide a functional, sequence-based interface, scrape!
takes the stance that scraping is a side-effectful process. It is eager (returns after scraping has completed) and returns nil
. There are two ways to actually access the scraping results:
:db
or :db-file
options. This will cause scrape!
to output a SQLite database. See details.:leaf-chan
or :item-chan
options: core.async channels to which vectors of leaf contexts or interim contexts, respectively, will be delivered.skyscraper.dev/scrape
: for interactive use in the REPLThis one is described in Development Mode.
skyscraper.traverse/launch
: advancedScraping in Skyscraper is actually implemented atop a more primitive abstraction: parallelized context tree traversal (see the docstring on skyscraper.traverse
). All other scraping functions are implemented on top of launch
. Use it in the rare cases when you want strict control over when to terminate scraping.
Scraping spins up a number of worker threads that actually perform scraping (parsing HTML, running processors, and sometimes downloading data – see below). The number of worker threads is configurable: by default, it’s 4, but you can override it by the :parallelism
setting.
By default, Skyscraper downloads pages using clj-http’s async request facility. This means that the downloads actually happen on threads other than the ones managed by Skyscraper. You can change this by specifying :download-mode :sync
.
When :download-mode
is :async
mode, there can be more simultaneous downloads than Skyscraper’s worker threads. Use :max-connections
to limit that number.
Use :parallelism 1 :download-mode :sync
to simulate sequential scraping flow à la Skyscraper 0.2. This is useful when you want to precisely control the order in which your pages will be scraped.
Can you improve this documentation?Edit on GitHub
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close