(allows? m1 m2)
True if all keys in m1 that are also in m2 have equal values in both maps.
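The rule described above can be sketched in plain Clojure. This is a hypothetical re-implementation for illustration only; the name `allows-sketch?` is not part of Skyscraper, and the library's own code may differ.

```clojure
;; Hypothetical sketch of the documented rule; not Skyscraper's source.
(defn allows-sketch?
  "True if every key of m1 that is also present in m2 maps to an equal value."
  [m1 m2]
  (every? (fn [[k v]]
            (or (not (contains? m2 k))   ; key absent from m2: no constraint
                (= v (get m2 k))))       ; key shared: values must agree
          m1))

;; Shared key :state agrees; the other keys are not shared, so they are ignored.
(allows-sketch? {:state "NY" :page 1} {:state "NY" :title "x"})

;; Shared key :state disagrees.
(allows-sketch? {:state "NY"} {:state "CA"})
```

Keys that appear in only one of the maps never cause a mismatch, which is why an empty map is compatible with anything.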
(default-download-error-handler error options context)
By default, when clj-http returns an error (e.g., when the server returns 4xx or 5xx), Skyscraper will call this function to determine what to do next. This handler causes Skyscraper to retry up to `retries` times for 5xx status codes, and to throw an exception otherwise.
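The decision described above might be modelled as a small pure function. This is only a sketch: `retry-decision-sketch`, its arguments, and the `:retry`/`:throw` return values are hypothetical and do not mirror the real handler's signature, and treating an exhausted retry budget as a throw is an assumption.

```clojure
;; Hypothetical model of the default policy; not the library's implementation.
(defn retry-decision-sketch
  "Returns :retry for a 5xx status while attempts remain, :throw otherwise.
   Assumption: once max-retries is reached, a 5xx is treated like any other error."
  [status attempts-so-far max-retries]
  (if (and (integer? status)
           (<= 500 status 599)
           (< attempts-so-far max-retries))
    :retry
    :throw))

(retry-decision-sketch 503 0 5) ; server error, attempts remain
(retry-decision-sketch 404 0 5) ; client error, never retried
(retry-decision-sketch 503 5 5) ; server error, budget used up
```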
(defprocessor name & {:as args})
Registers a processor named `name` with arguments `args`.

`name` should be a keyword. `args`, optional keys and values, may include:

- `:process-fn` – a function that takes a resource and a parent context, and returns a sequence of child contexts (corresponding to the scraped resource). Alternatively, it can return one context only, in which case it will be wrapped in a sequence.
- `:cache-template` – a string specifying the template for cache keys. Ignored when `:cache-key-fn` is specified.
- `:cache-key-fn` – a function taking the context and returning the cache key. Overrides `:cache-template`. Useful when mere templating does not suffice.
- `:url-fn` – a one-argument function taking the context and returning the URL to visit. By default, Skyscraper just extracts the value under the `:url` key from the context.
- `:updatable` – a boolean (false by default). When true, the pages accessed by this processor are considered to change often. When Skyscraper is run in update mode (see below), these pages will be re-downloaded and re-processed even if they had been present in the HTML or processed caches, respectively.
- `:parse-fn` – a custom function that will be used to produce Enlive resources from downloaded documents. This can be useful, for instance, if you want to use reaver rather than Enlive; if you are scraping something other than HTML (e.g., PDFs via a custom parser); or when you're scraping malformed HTML and need interim fixup steps before parsing.
- `:skyscraper.db/columns` – a vector of keys that are supposed to exist in the resulting contexts; the corresponding values will be emitted as a database row when `:db` or `:db-file` is supplied as a scrape argument.
- `:skyscraper.db/key-columns` – a vector of keys that, when supplied, will be used to upsert records to the database and treated as a unique key to match existing database records against.
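Two of the defaults listed above (wrapping a single returned context in a sequence, and falling back to the `:url` key when no `:url-fn` is given) can be sketched in plain Clojure. The helper names and the representation of a processor as a plain map are hypothetical, chosen only for illustration; they are not Skyscraper's internals.

```clojure
;; Hypothetical helpers illustrating the documented defaults.
(defn ensure-contexts-sketch
  "Wraps a single context (a map) in a vector; leaves a sequence untouched."
  [result]
  (if (map? result) [result] result))

(defn url-for-sketch
  "Uses the processor's :url-fn when present, otherwise the keyword :url
   as a function, i.e. the value under :url in the context."
  [processor context]
  ((get processor :url-fn :url) context))

(ensure-contexts-sketch {:title "only child"})
(ensure-contexts-sketch [{:title "a"} {:title "b"}])

(url-for-sketch {} {:url "http://example.com/page"})
(url-for-sketch {:url-fn (fn [ctx] (str "http://example.com/item/" (:id ctx)))}
                {:id 42})
```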
Local copies of downloaded HTML files go here.
(initialize-options options)
Initializes scraping options, ensuring that the caches are instances of [[CacheBackend]], and a db is present if `:db-file` was supplied.
(initialize-seed {:keys [download-mode pipeline] :as options} seed)
Ensures the seed is a seq and sets up internal keys.
(merge-urls url new-url)
Fills the missing parts of new-url (which can be either absolute, root-relative, or relative) with corresponding parts from url (an absolute URL) to produce a new absolute URL.
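The three cases mentioned above (absolute, root-relative, and relative) behave much like standard URI resolution. The sketch below uses `java.net.URI` from the JDK rather than Skyscraper itself; `merge-urls-sketch` is a hypothetical name, and the library's own function may treat edge cases differently.

```clojure
;; Hypothetical stand-in built on java.net.URI; not Skyscraper's merge-urls.
(defn merge-urls-sketch
  "Resolves new-url against the absolute URL url and returns a string."
  [^String url ^String new-url]
  (str (.resolve (java.net.URI. url) new-url)))

;; Relative: replaces the last path segment of the base.
(merge-urls-sketch "http://example.com/a/b.html" "c.html")
;; => "http://example.com/a/c.html"

;; Root-relative: keeps scheme and host, replaces the whole path.
(merge-urls-sketch "http://example.com/a/b.html" "/x/y.html")
;; => "http://example.com/x/y.html"

;; Absolute: returned unchanged.
(merge-urls-sketch "http://example.com/a/b.html" "https://other.org/z")
;; => "https://other.org/z"
```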
All Skyscraper output, either temporary or final, goes under here.
(parse-enlive headers body)
Parses a byte array as an Enlive resource.
(parse-reaver headers body)
Parses a byte array as a JSoup/Reaver document.
(parse-string headers body)
(parse-string headers body try-html?)
Parses `body`, a byte-array, as a string encoded with content-type provided in `headers`. If `try-html?` is true, tries to look for encoding in the `<meta http-equiv>` tag in `body`.
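The header-driven part of this decoding can be sketched in plain Clojure. Everything here is an assumption made for illustration: the function names are hypothetical, the header is assumed to live under the lowercase string key `"content-type"`, and the fallback to UTF-8 when no charset is declared is a choice of this sketch, not documented behaviour. The `<meta http-equiv>` lookup is not covered.

```clojure
;; Hypothetical sketch; not Skyscraper's parse-string.
(defn charset-from-content-type-sketch
  "Extracts the charset parameter from a Content-Type value,
   falling back to UTF-8 (an assumption of this sketch)."
  [content-type]
  (or (some->> content-type
               (re-find #"(?i)charset=\"?([^\s;\"]+)")
               second)
      "UTF-8"))

(defn decode-body-sketch
  "Decodes the byte array body using the charset named in headers."
  [headers ^bytes body]
  (String. body
           ^String (charset-from-content-type-sketch (get headers "content-type"))))

;; Bytes encoded as ISO-8859-1 are decoded correctly when the header says so.
(decode-body-sketch {"content-type" "text/html; charset=ISO-8859-1"}
                    (.getBytes "caf\u00e9" "ISO-8859-1"))
```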
Cache storing the interim results of processing HTML files.
(respond-with response {:keys [pipeline] :as options} context)
Call this function from `download-error-handler` to continue scraping as if download had succeeded.
(run-processor processor-name document)
(run-processor processor-name document context)
Runs a processor named by processor-name on document.
(scrape seed & {:as options})
Runs scraping on seed (an initial context or sequence of contexts), returning a lazy sequence of leaf contexts.

`options` may include the ones supported by [[skyscraper.traverse/launch]], as well as:

- `:conn-mgr-options` – Skyscraper will create a clj-http connection manager with these options (a sync or async one, depending on `:download-mode`) and use it across all HTTP requests it makes. See [[clj-http.conn-mgr/make-reusable-conn-manager]] and [[clj-http.conn-mgr/make-reusable-async-conn-manager]] for details on the options you can specify here.
- `:db` – a clojure.java.jdbc compatible db-spec that, when passed, will cause scraping to generate a SQL database of results. See `doc/db.md` for a walkthrough. Only supports SQLite.
- `:db-file` – an alternative to `:db`, a filename or path that will be used to construct a SQLite db-spec.
- `:download-error-handler` – a function called when clj-http returns an error when downloading; see `doc/error-handling.md` for details.
- `:download-mode` – can be `:async` (default) or `:sync`. When async, Skyscraper will use clj-http's asynchronous mode to make HTTP requests.
- `:html-cache` – the HTTP cache to use. Can be an instance of `CacheBackend`, a string (meaning a directory to use for a filesystem cache), `nil` or `false` (meaning no cache), or `true` (meaning a filesystem cache in the default location, [[html-cache-dir]]). Defaults to `nil`.
- `:http-options` – a map of additional options that will be passed to [[clj-http.core/request]].
- `:max-connections` – maximum number of HTTP requests that can be active at any time.
- `:only` – prunes the scrape tree to only include matching contexts; this can be a map (specifying to only include records whose values, if present, coincide with the map) or a predicate (meaning to filter contexts on it).
- `:parse-fn` – a function that takes a map of HTTP headers and a byte array containing the downloaded document, and returns a parsed representation of that document. Skyscraper provides [[parse-string]], [[parse-enlive]], and [[parse-reaver]] out of the box. Defaults to [[parse-enlive]].
- `:processed-cache` – the processed cache to use. Same possible values as for `:html-cache`. Defaults to `nil`.
- `:request-fn` – the HTTP request function to use. Defaults to [[clj-http.core/request]]. Skyscraper relies on the API of clj-http, so only override this if you know what you're doing.
- `:retries` – maximum number of times that Skyscraper will retry downloading a page until it gives up. Defaults to 5.
- `:sleep` – sleep this many milliseconds before each request, or a niladic fn that returns a number of milliseconds. Useful for throttling. It's probably best to set `:parallelism` to 1 together with this.
- `:uncached-only` – prune the scrape tree, yielding only the nodes that haven't been scraped yet. See `doc/updates.md`.
- `:update` – run in update mode (see `doc/updates.md`).
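As an illustration of the two forms `:sleep` accepts, the sketch below normalizes either a number or a niladic function into a delay in milliseconds. `sleep-millis-sketch` is a hypothetical helper, not a Skyscraper function, and treating a missing value as zero delay is an assumption.

```clojure
;; Hypothetical helper; shows how a number-or-function option can be resolved.
(defn sleep-millis-sketch
  "Returns the delay in milliseconds for a :sleep value that is either
   a number, a niladic function returning a number, or nil (assumed: no delay)."
  [sleep]
  (cond
    (nil? sleep) 0
    (fn? sleep)  (sleep)
    :else        sleep))

(sleep-millis-sketch 250)                          ; fixed delay
(sleep-millis-sketch (fn [] (+ 100 (rand-int 50)))) ; jittered delay in [100, 150)
(sleep-millis-sketch nil)                          ; no throttling
```

Passing a function makes it possible to vary the delay per request, for example to add jitter when throttling.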
(scrape! seed & {:as options})
Like `scrape`, but eager: terminates after scraping has succeeded. Returns nil. Pass `:db`, `:db-file`, `:leaf-chan`, or `:item-chan` to access scraped data.

`options` are the same as in `scrape`.
(signal-error error context)
Call this function from `download-error-handler` to cause scraping to signal an error.