By default, Skyscrapers parses the downloaded documents into Enlive resources. However, this can be overridden by supplying a custom parse function.
Parse functions take a byte array (a binary blob), a map of HTTP headers (clj-http’s header map), and the current context, and should return a parsed representation of the document. Out of the box, Skyscraper provides three such functions:
parse-enlive
– parses the blob as a HTML document with Enlive and returns an Enlive resource (a seq of maps);parse-reaver
– parses the blob as a HTML document with Reaver and returns an instance of org.jsoup.nodes.Document
;parse-string
– parses the blob as a string.As you can see, the parsed representation can be anything, as long as your processors can work with it. The output of the parse function will be fed to the processor’s :process-fn
as the first argument.
You can specify a parse function in two ways:
:parse-fn
option to scrape
or scrape!
;:parse-fn
clause in defprocessor
.If both are specified, the per-processor definition prevails. See examples/tree_species.clj
for an example of a scraper that uses Reaver, and
overridable-parse-fn-test
in test/skyscraper/real_test.clj
for an example of a per-processor scraping function.
You can implement custom parse functions yourself, e.g., for parsing PDF document, CSV sheets, etc. If you are parsing a text-based format, call parse-string
first and
process the resulting string – this will ensure that the blob is interpreted with a correct encoding as specified in the headers. Both parse-enlive
and parse-reaver
do this.
You might be wondering why. After all, parsing should be generic and not depend on the particular place you’re in, right?
That’ll be the case 99% of the time — and indeed the built-in parse functions just ignore the third argument — but sometimes you’ll need the context. For example, in one of my scrapers, I need to parse content that is essentially HTML, but retrieved from an archive that sometimes returns garbled data: random sequences of bytes that needs to be cut off before proceeding with normal HTML parsing. Sometimes this falls in the middle of a HTML tag. I solve this by having a standalone per “cut-off list” that a custom parse function uses, based on the context, to remove the garbage.
Can you improve this documentation?Edit on GitHub
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close