
Updating scraped sites

Once you have successfully scraped a site in full, you will often want to update the scraped data periodically, redownloading and rescraping only what is necessary. Skyscraper offers several options to assist you.

The brute-force way: wiping cache

If you’re using either the HTML cache or the processed cache (see Caching), then Skyscraper will reuse the already downloaded data for further processing. This means that rerunning a successful scrape with caching enabled will not trigger any HTTP requests, even if the original site has changed.

The most obvious (but also slowest) way to proceed is to clear the cache (e.g., rm -r ~/skyscraper-data/cache), forcing Skyscraper to redownload everything.

The on-demand way: `:update` and `:updatable`

You can mark some processors as :updatable. These will typically correspond to non-leaf nodes of your scraping tree.

(defprocessor :landing-page
  :cache-template "mysite/index"
  :updatable true
  :process-fn …)

The value for :updatable can be either true (meaning “always update”), false (meaning “never update” – the default), or a function that should take a context and decide whether to update.
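As an illustration of the function form, here is a minimal sketch of a predicate that asks for an update only on the first page of a paginated listing. The name first-page? and the :page context key are hypothetical: they assume your own processors put a page number into the context.

```clojure
;; Hypothetical predicate for use as an :updatable value.
;; Assumption: an upstream processor stores the page number under :page;
;; Skyscraper does not add this key by itself.
(defn first-page?
  "Returns true when the context describes the first page of a listing."
  [context]
  (= 1 (:page context)))

(first-page? {:page 1 :url "https://example.com/list?page=1"}) ; => true
(first-page? {:page 7 :url "https://example.com/list?page=7"}) ; => false
```

Such a function could then be supplied in place of true, as in :updatable first-page?.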

Setting :updatable has no effect on its own. However, when you invoke one of the scraping entry-points with :update set to true, Skyscraper will force re-downloading and re-processing of every updatable page.
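A minimal sketch of such an invocation, assuming the :landing-page processor is already defined and the HTML cache is enabled. The seed URL and the names seed and scrape-with-update are placeholders for illustration, not part of Skyscraper.

```clojure
(require '[skyscraper.core :as sky])

;; Placeholder seed: one context pointing at the updatable landing page.
;; Assumption: a :landing-page processor has been defined elsewhere.
(def seed
  [{:url "https://example.com/" :processor :landing-page}])

(defn scrape-with-update
  "Runs a cached scrape, forcing updatable pages to be re-downloaded
  and re-processed. doall realizes the lazy sequence of contexts."
  []
  (doall (sky/scrape seed :html-cache true :update true)))
```

Calling (scrape-with-update) without :update true would instead serve the landing page from the cache.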

The optimization: `:uncached-only`

Regardless of whether :update is enabled or not, Skyscraper normally processes the whole site (some of it potentially coming from the cache). Sometimes, you want to prune the scraping tree to uncached or updatable pages only, so that scraping only yields contexts corresponding to pages that are actually new.

The :uncached-only option to scrape does exactly that.

Be aware that in this mode scraping can do too little: pruning a page from the scraping tree also means pruning the entire subtree rooted at that page. Use it judiciously.
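To make the caveat concrete, here is a toy model of the pruning in plain Clojure. It is not Skyscraper's implementation: the visit function and the :page, :cached?, :updatable? and :children keys are invented for illustration, and the model assumes :update is enabled so that updatable pages are kept.

```clojure
;; Toy model only. Each node is a page; a page that is cached and not
;; updatable is pruned in uncached-only mode, together with its subtree.
(defn visit
  "Returns the names of the pages that would be processed."
  [{:keys [page cached? updatable? children]} uncached-only?]
  (if (and uncached-only? cached? (not updatable?))
    []
    (into [page] (mapcat #(visit % uncached-only?) children))))

(def site
  {:page "index" :cached? true :updatable? true
   :children [{:page "category-a" :cached? true
               :children [{:page "new-item" :cached? false}]}
              {:page "category-b" :cached? false
               :children [{:page "item-b1" :cached? false}]}]})

(visit site false) ; => ["index" "category-a" "new-item" "category-b" "item-b1"]
(visit site true)  ; => ["index" "category-b" "item-b1"]
```

In the second call, "new-item" is never reached even though it is uncached, because its parent "category-a" is cached and not updatable. Marking such intermediate pages as :updatable is one way to avoid missing new pages beneath them.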
