- A new option, `:use-http-headers-from-content`, that can be set to `false` to disable charset detection based on the HTML response body.
- A new function, `cached-document`, for accessing a previous (cached) version of a page downloaded while in update mode.

This release corrects the issue in 0.3.3 that caused its pom.xml to not include dependencies, but is otherwise the same.

- Support for a `:skyscraper/description` key on contexts. These descriptions will be logged when downloading, instead of the URL, and won't be propagated to child contexts (see the sketch after this list).
- A fix for when `scrape!` is used and one of the processors throws an exception.
- A fix involving the `Content-type` header.
- `:skyscraper.traverse/priority` is no longer propagated to child contexts.
- A fix involving `:cache-template`.
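
A minimal sketch of the description feature: a hypothetical index processor attaching `:skyscraper/description` to the contexts it emits (the URLs and processor names are made up):

```clojure
(require '[skyscraper.core :refer [defprocessor]])

;; Hypothetical index processor: each child context carries a
;; :skyscraper/description, so the download log shows "Page 1",
;; "Page 2", ... instead of the raw URLs.
(defprocessor :index
  :process-fn (fn [res context]
                (for [n (range 1 11)]
                  {:url (str "https://example.com/page/" n)
                   :skyscraper/description (str "Page " n)
                   :processor :page})))
```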

- `parse-fn` is now expected to take three arguments, the third being the context. The aim of this change is to support cases where the HTML is known to be malformed and needs context-aware preprocessing before parsing. Built-in parse-fns have been updated to take the additional argument. (See the first sketch after this list.)
- Cache backends are now expected to implement `java.io.Closeable` in addition to `CacheBackend`. Built-in backends have been updated to include no-op `close` methods. (See the second sketch after this list.)
- Skyscraper now ignores `:skyscraper.db/key-columns` when creating the DB from scratch. There is also a new option, `:ignore-db-keys`, to force this at all times.
- A fix for when `:download-mode` is set to `:sync`.
- A fix involving `:retries`.
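
A sketch of a context-aware `parse-fn` under the new three-argument contract. The assumption here, modelled on the built-in parse-fns, is that the first two arguments are the response headers and the raw body; the `:broken-html?` flag and the cleanup itself are hypothetical:

```clojure
(require '[clojure.string :as str]
         '[net.cgrand.enlive-html :as enlive])

;; A three-argument parse-fn. Assumption: headers is a map and body a
;; byte array, as with the built-in parse-fns; :broken-html? is a
;; hypothetical flag set on contexts whose pages need cleanup.
(defn my-parse-fn [headers body context]
  (let [html (String. ^bytes body "UTF-8")
        html (if (:broken-html? context)
               (str/replace html "</html><html>" "") ; illustrative fix-up
               html)]
    (enlive/html-resource (java.io.StringReader. html))))
```

Such a function can then be supplied per-page or globally, e.g. `(scrape seed :parse-fn my-parse-fn)`.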
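
For the cache-backend change, a sketch of a toy in-memory backend with the required no-op `close`. The `CacheBackend` method names and signatures used below (`save-blob`, `load-blob`) are assumptions; consult `skyscraper.cache` for the actual protocol:

```clojure
(ns my.scraper.cache
  (:require [skyscraper.cache :as cache])
  (:import (java.io Closeable)))

;; A toy in-memory backend. The CacheBackend method names below are
;; assumptions; the point is the extra Closeable implementation.
(defrecord AtomCache [store]
  cache/CacheBackend
  (save-blob [_ key blob metadata] (swap! store assoc key {:blob blob :meta metadata}))
  (load-blob [_ key] (get @store key))
  Closeable
  (close [_] nil)) ; no-op, like the built-in backends

(defn atom-cache [] (->AtomCache (atom {})))
```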

- In addition to the `scrape` function that returns a lazy sequence of nodes, there is an alternative, non-lazy, imperative interface (`scrape!`) that treats producing new results as side-effects (see the first sketch after this list).
- `:parse-fn` and `:http-options` can now be provided either per-page or globally. (Thanks to Alexander Solovyov for the suggestion.)
- A change involving `process-fn`.
- The `skyscraper` namespace has been renamed to `skyscraper.core`.
- `defprocessor` now takes a keyword name, and registers a function in the global registry instead of defining it. This means that it's no longer possible to call one processor from another: if you need that, define `process-fn` as a named function (see the second sketch after this list).
- `:processor` keys are now expected to be keywords.
- `scrape` no longer guarantees the order in which the site will be scraped. In particular, two different invocations of `scrape` are not guaranteed to return the scraped data in the same order. If you need that guarantee, set `parallelism` and `max-connections` to 1 (shown in the first sketch below).
- `get-cache-keys` has been removed. If you want the same effect, include `:cache-key` in the desired contexts.
- `:only` now doesn't barf on keys not appearing in seed.
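
A sketch contrasting the two interfaces, with a made-up seed and processor name; the keyword-style option passing is assumed from the examples elsewhere in this changelog:

```clojure
(require '[skyscraper.core :refer [scrape scrape!]])

;; Hypothetical seed; assumes a :stories processor has been defined.
(def seed [{:url "https://example.com/", :processor :stories}])

;; Lazy interface: returns a lazy seq of result contexts.
(take 5 (scrape seed))

;; Imperative interface: runs the scrape eagerly, for its side effects,
;; rather than returning a lazy seq.
(scrape! seed)

;; For reproducible ordering across runs, disable concurrency:
(scrape seed :parallelism 1 :max-connections 1)
```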
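
And a sketch of the new `defprocessor` shape: a keyword name plus a named `process-fn` for logic that needs to be shared between processors (the extraction logic is hypothetical):

```clojure
(require '[skyscraper.core :refer [defprocessor]])

;; A named function, so other code (or another processor) can call it too.
(defn extract-links [res context]
  ;; hypothetical extraction logic
  (for [i (range 3)]
    {:url (str (:url context) "/item/" i), :processor :item}))

;; Registers the processor under a keyword instead of defining a var.
(defprocessor :listing
  :process-fn extract-links)

;; Contexts now refer to processors by keyword:
;; {:url "https://example.com", :processor :listing}
```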

- A new cache backend: `MemoryCache`.
- `download` now supports arbitrarily many retries.
- Added `get-cache-keys`.
- `scrape` and friends can now accept a keyword as the first argument.
- Support for overriding the cache key (via the `:cache-key` key in the context).
- New `scrape` options: `:only` and `:postprocess`.
- `scrape-csv` now accepts an `:all-keys` argument and has been rewritten using a helper function, `save-dataset-to-csv`.
- Added `scrape-csv`.
- Processors can now be marked as `:updatable`, and `scrape` now has an `:update` option (see the first sketch below).
- New `scrape` option: `:retries`.
- A fix for a potential `OutOfMemoryError` (`scrape` no longer holds onto the head of the lazy seq it produces).
- The `processed-cache` option to `scrape` now works as advertised.
- New `scrape` option: `:html-cache`. (Thanks to ayato-p.)
- New `defprocessor` clauses: `:url-fn` and `:cache-key-fn` (see the second sketch below).
- A change involving the `:url` key.
- Processors (`process-fn` functions) can now access the current context.
- A change involving the `decode-body-headers` feature.
- `scrape` now supports a `http-options` argument to override HTTP options (e.g., timeouts); see the last sketch below.
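
A sketch of update mode, assuming `:updatable` is given as a `defprocessor` clause and `:update` as a `scrape` option, as described above; the processor and URLs are made up:

```clojure
(require '[skyscraper.core :refer [defprocessor scrape]])

;; An updatable processor: pages it produces may change over time,
;; so update mode will re-download them instead of trusting the cache.
(defprocessor :news-index
  :updatable true
  :process-fn (fn [res context]
                ;; hypothetical extraction
                [{:url "https://example.com/latest", :processor :article}]))

;; Re-run an existing scrape, refreshing the updatable pages:
(scrape [{:url "https://example.com", :processor :news-index}]
        :update true)
```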
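
A sketch of the two new `defprocessor` clauses. That both functions receive the context, and the exact keys used, are assumptions here:

```clojure
(require '[skyscraper.core :refer [defprocessor]])

(defprocessor :user-page
  ;; Assumed contract: derive the URL from the context rather than
  ;; requiring a ready-made :url key.
  :url-fn (fn [context]
            (str "https://example.com/users/" (:user-id context)))
  ;; Assumed contract: derive the cache key from the context.
  :cache-key-fn (fn [context]
                  (str "users/" (:user-id context)))
  :process-fn (fn [res context]
                [{:scraped true}]))
```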
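
Finally, a sketch of overriding HTTP options globally. The timeout option names shown are clj-http-style and are an assumption about the underlying HTTP client:

```clojure
(require '[skyscraper.core :refer [scrape]])

;; Global HTTP options for every download; timeouts in milliseconds.
;; The option names are an assumption about the underlying client.
(scrape [{:url "https://example.com", :processor :stories}]
        :http-options {:socket-timeout 10000
                       :connection-timeout 5000})
```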