skyscraper.core


allows?

(allows? m1 m2)

True if all keys in m1 that are also in m2 have equal values in both maps.
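
The rule above can be illustrated in plain Clojure. This is a minimal sketch that re-implements the documented behaviour under the hypothetical name allows-sketch?; it is not Skyscraper's actual source.

```clojure
;; Illustrative re-implementation of the documented rule,
;; not Skyscraper's own code. The name allows-sketch? is made up.
(defn allows-sketch?
  [m1 m2]
  (every? (fn [[k v]]
            ;; keys missing from m2 impose no constraint
            (or (not (contains? m2 k))
                (= v (get m2 k))))
          m1))

(allows-sketch? {:a 1 :b 2} {:a 1 :c 3}) ; => true  (shared key :a agrees)
(allows-sketch? {:a 1} {:a 2})           ; => false (shared key :a differs)
(allows-sketch? {:a 1} {})               ; => true  (no shared keys)
```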

default-download-error-handler

(default-download-error-handler error options context)

By default, when clj-http returns an error (e.g., when the server returns 4xx or 5xx), Skyscraper will call this function to determine what to do next. This handler causes Skyscraper to retry up to :retries times (see scrape) for 5xx status codes, and to throw an exception otherwise.

default-options

Default scraping options.

defprocessor

(defprocessor name & {:as args})

Registers a processor named name with arguments args.

name should be a keyword. args, optional keys and values, may include:

  • :process-fn – a function that takes a resource and a parent context, and returns a sequence of child contexts (corresponding to the scraped resource). Alternatively, it can return one context only, in which case it will be wrapped in a sequence.
  • :cache-template – a string specifying the template for cache keys. Ignored when :cache-key-fn is specified.
  • :cache-key-fn – a function taking the context and returning the cache key. Overrides :cache-template. Useful when mere templating does not suffice.
  • :url-fn – a one-argument function taking the context and returning the URL to visit. By default, Skyscraper just extracts the value under the :url key from the context.
  • :updatable – a boolean (false by default). When true, the pages accessed by this processor are considered to change often. When Skyscraper is run in update mode (see the :update option of scrape), these pages will be re-downloaded and re-processed even if they are present in the HTML or processed caches, respectively.
  • :parse-fn – a custom function that will be used to produce Enlive resources from downloaded documents. This can be useful, for instance, if you want to use reaver rather than Enlive; if you are scraping something other than HTML (e.g., PDFs via a custom parser); or when you’re scraping malformed HTML and need interim fixup steps before parsing.
  • :skyscraper.db/columns – a vector of keys that are supposed to exist in the resulting contexts; the corresponding values will be emitted as a database row when :db or :db-file is supplied as a scrape argument.
  • :skyscraper.db/key-columns – a vector of keys that, when supplied, will be used to upsert records into the database; together they are treated as a unique key against which existing database records are matched.
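
A processor definition using a few of these keys might look roughly like the sketch below. The processor names, the CSS selector, the :section key, and the cache template string are all hypothetical. It also assumes that child contexts name their next processor under a :processor key, which this page does not state.

```clojure
(require '[skyscraper.core :as core]
         '[net.cgrand.enlive-html :as html])

;; Hypothetical process function: one child context per navigation link.
(defn landing-links
  [resource context]
  (for [a (html/select resource [:ul.nav :li :a])]
    {:section   (html/text a)
     :url       (get-in a [:attrs :href])
     ;; assumption: :processor selects the processor for the child page
     :processor :section-page}))

;; Register the processor under a keyword name, as described above.
(core/defprocessor :landing-page
  :cache-template "example/landing"   ; made-up cache key template
  :process-fn landing-links)
```

Keeping the process function as a named var makes it easy to try it on a parsed page at the REPL before wiring it into a scrape.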

html-cache-dir

Local copies of downloaded HTML files go here.

initialize-options

(initialize-options options)

Initializes scraping options, ensuring that the caches are instances of CacheBackend, and a db is present if :db-file was supplied.

initialize-seed

(initialize-seed {:keys [download-mode pipeline] :as options} seed)

Ensures the seed is a seq and sets up internal keys.

merge-urls

(merge-urls url new-url)

Fills the missing parts of new-url (which can be either absolute, root-relative, or relative) with corresponding parts from url (an absolute URL) to produce a new absolute URL.
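
The three kinds of new-url can be illustrated with java.net.URI, which performs a similar resolution. This is only a sketch of the idea under the hypothetical name resolve-url-sketch; merge-urls may be implemented differently and may treat edge cases differently.

```clojure
(import 'java.net.URI)

;; Illustrative stand-in, not Skyscraper's merge-urls.
(defn resolve-url-sketch
  "Resolves new-url against the absolute URL url using java.net.URI."
  [url new-url]
  (str (.resolve (URI. url) (URI. new-url))))

(def base "https://example.com/docs/index.html")

(resolve-url-sketch base "page2.html")                  ; relative
;; => "https://example.com/docs/page2.html"
(resolve-url-sketch base "/about")                      ; root-relative
;; => "https://example.com/about"
(resolve-url-sketch base "https://other.example.org/x") ; absolute
;; => "https://other.example.org/x"
```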

output-dir

All Skyscraper output, either temporary or final, goes under here.

parse-enlive

(parse-enlive headers body)

Parses a byte array as an Enlive resource.

parse-reaver

(parse-reaver headers body)

Parses a byte array as a JSoup/Reaver document.

parse-string

(parse-string headers body)
(parse-string headers body try-html?)

Parses body, a byte array, as a string, using the encoding given by the content-type in headers. If try-html? is true, tries to look for the encoding in the <meta http-equiv> tag in body.
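
The header-based part of this can be sketched in plain Clojure. The sketch below is an assumption-laden illustration rather than Skyscraper's code: it assumes headers is a plain map with a lower-case "content-type" string key, falls back to UTF-8 when no charset is given, and ignores the <meta http-equiv> lookup entirely. The name decode-body-sketch is hypothetical.

```clojure
;; Illustrative only: decode a byte array using the charset named in
;; the content-type header, defaulting to UTF-8 (an assumed default).
(defn decode-body-sketch
  [headers body]
  (let [content-type (get headers "content-type" "")
        charset      (or (second (re-find #"(?i)charset=([\w-]+)" content-type))
                         "UTF-8")]
    (String. ^bytes body ^String charset)))

(decode-body-sketch {"content-type" "text/html; charset=ISO-8859-1"}
                    (.getBytes "caf\u00e9" "ISO-8859-1"))
;; => "café"
```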

processed-cache-dir

Cache storing the interim results of processing HTML files.

respond-with

(respond-with response {:keys [pipeline] :as options} context)

Call this function from download-error-handler to continue scraping as if download had succeeded.
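
A custom handler built on this function could look like the following sketch. It makes several assumptions that this page does not confirm: that error is an exception whose ex-data carries the clj-http response map with a :status key, and that response should be a map with :headers and a byte-array :body. The handler name skip-missing-pages is hypothetical.

```clojure
(require '[skyscraper.core :as core])

;; Hypothetical handler: treat 404 as an empty page and fall back to
;; the default behaviour for every other error.
(defn skip-missing-pages
  [error options context]
  ;; assumption: the HTTP status is available via ex-data
  (let [status (:status (ex-data error))]
    (if (= status 404)
      ;; assumption: response is a map of :headers and a byte-array :body
      (core/respond-with {:headers {"content-type" "text/html"}
                          :body    (.getBytes "<html></html>")}
                         options context)
      (core/default-download-error-handler error options context))))
```

Such a handler would be passed to scrape as the :download-error-handler option.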

run-processor

(run-processor processor-name document)
(run-processor processor-name document context)

Runs a processor named by processor-name on document.

scrape

(scrape seed & {:as options})

Runs scraping on seed (an initial context or sequence of contexts), returning a lazy sequence of leaf contexts.

options may include the ones supported by skyscraper.traverse/launch, as well as:

  • :conn-mgr-options – Skyscraper will create a clj-http connection manager with these options (a sync or async one, depending on :download-mode) and use it across all HTTP requests it makes. See clj-http.conn-mgr/make-reusable-conn-manager and clj-http.conn-mgr/make-reusable-async-conn-manager for details on the options you can specify here.
  • :db – a clojure.java.jdbc compatible db-spec that, when passed, will cause scraping to generate a SQL database of results. See doc/db.md for a walkthrough. Only supports SQLite.
  • :db-file – an alternative to :db, a filename or path that will be used to construct a SQLite db-spec.
  • :download-error-handler – a function called when clj-http returns an error when downloading; see doc/error-handling.md for details.
  • :download-mode – can be :async (default) or :sync. When async, Skyscraper will use clj-http's asynchronous mode to make HTTP requests.
  • :html-cache – the HTTP cache to use. Can be an instance of CacheBackend, a string (meaning a directory to use for a filesystem cache), nil or false (meaning no cache), or true (meaning a filesystem cache in the default location, html-cache-dir). Defaults to nil.
  • :http-options – a map of additional options that will be passed to clj-http.core/request.
  • :max-connections – maximum number of HTTP requests that can be active at any time.
  • :only – prunes the scrape tree to only include matching contexts; this can be a map (specifying to only include records whose values, if present, coincide with the map) or a predicate (meaning to filter contexts on it).
  • :parse-fn – a function that takes a map of HTTP headers and a byte array containing the downloaded document, and returns a parsed representation of that document. Skyscraper provides parse-string, parse-enlive, and parse-reaver out of the box. Defaults to parse-enlive.
  • :processed-cache – the processed cache to use. Same possible values as for :html-cache. Defaults to nil.
  • :request-fn – the HTTP request function to use. Defaults to clj-http.core/request. Skyscraper relies on the API of clj-http, so only override this if you know what you're doing.
  • :retries – maximum number of times that Skyscraper will retry downloading a page until it gives up. Defaults to 5.
  • :sleep – sleep this many milliseconds before each request, or a niladic fn that returns a number of milliseconds. Useful for throttling. It's probably best to set :parallelism to 1 together with this.
  • :uncached-only – prune the scrape tree, yielding only the nodes that haven't been scraped yet. See doc/updates.md.
  • :update – run in update mode (see doc/updates.md).
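
Putting several of these options together, a call might look like the sketch below. The processor name, seed URL, cache template, and option values are hypothetical, and the :processor key in the seed is an assumption about how a context names its processor. The call is wrapped in a function so that loading the code does not start any downloads.

```clojure
(require '[skyscraper.core :as core]
         '[net.cgrand.enlive-html :as html])

;; Hypothetical single-level processor: returns one leaf context
;; holding the page title.
(core/defprocessor :front-page
  :cache-template "example/front"
  :process-fn (fn [resource context]
                {:title (some-> (html/select resource [:title])
                                first
                                html/text)}))

(def seed
  ;; assumption: :processor selects the processor for this URL
  [{:url "https://example.com/" :processor :front-page}])

(defn run-example
  "Scrapes the seed and realizes the lazy sequence of leaf contexts."
  []
  (doall
   (core/scrape seed
                :html-cache true      ; filesystem cache in the default location
                :download-mode :sync
                :retries 3
                :parallelism 1        ; suggested above when :sleep is used
                :sleep 500)))         ; wait 500 ms before each request
```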

scrape!

(scrape! seed & {:as options})

Like scrape, but eager: terminates after scraping has succeeded. Returns nil. Pass :db, :db-file, :leaf-chan, or :item-chan to access scraped data.

options are the same as in scrape.

signal-error

(signal-error error context)

Call this function from download-error-handler to cause scraping to signal an error.
