Skyscraper can automatically emit the result of scraping in the form of a relational database. One table will be created for each database-enabled processor, with each row corresponding to an output context created by that processor. Each row will be automatically assigned an ID, and parent-child relationships will be modelled as foreign keys.
The only database management system currently supported is SQLite (Skyscraper comes with a dependency on SQLite), but support for other DBMSs is planned for future versions of Skyscraper.
To specify that a table should be generated for a given processor, add a :skyscraper.db/columns
option to defprocessor
. For example:
(defprocessor :users
:process-fn (fn [res ctx]
[{:name "John", :surname "Doe"}])
:skyscraper.db/columns [:name :surname])
(For the sake of clarity and focus, this document doesn’t rely on any particular site’s structure. Our example processors will return hardcoded data.)
Now, when you invoke Skyscraper like this:
(scrape! [{:url "http://example.com", :processor :users}]
:db-file "/tmp/demo.sqlite")
Skyscraper will create a SQLite database in the given file, containing one table named users
with four columns: two textual ones that you have specified, and two additional integer ones named id
and parent
. That is, it will conform to the following schema:
CREATE TABLE users (id integer primary key, parent integer, name text, surname text);
The id
will be an internal, autogenerated primary key. It is not guaranteed to be stable – it is possible for two identical invocations of Skyscraper to generate different tables.
The parent
column will be described below.
Let us expand our worked example. Consider the following processor definitions:
(defprocessor :users
:process-fn (fn [res ctx]
[{:name "John", :surname "Doe", :url "/", :processor :accounts}])
:skyscraper.db/columns [:name :surname])
(defprocessor :accounts
:process-fn (fn [res ctx]
[{:bank-account "0123-4567"}
{:bank-account "8888-9999"}])
:skyscraper.db/columns [:bank-account])
Running scrape!
as above will now generate the following database:
sqlite> select * from users;
id = 1
parent =
name = John
surname = Doe
sqlite> select * from accounts;
id = 1
parent = 1
bank_account = 0123-4567
id = 2
parent = 1
bank_account = 8888-9999
Because, in the scrape tree, nodes corresponding to the :accounts
processor are children of those of :users
, the parent
column in the account
user table references the id
in users
.
Note that this database doesn’t contain redundant data, but you can still easily obtain user data for each account by simply JOIN
ing the tables together.
There’s a gotcha: if you re-run Skyscraper with the above settings, it will duplicate the already existing records in the database. This is because, normally, there is no way tell whether the newly-scraped records correspond to data we already have or not. For example, the records may differ in some details (e.g., timestamps), but still refer to the same entity.
Therefore, you have to be explicit about which fields uniquely identify a context for a given DB-enabled processor. For instance:
(defprocessor :users
:process-fn (fn [res ctx]
[{:name "John", :surname "Doe", :phone "123-4567"}])
:skyscraper.db/columns [:name :surname :phone]
:skyscraper.db/key-columns [:name :surname])
In this case, rather than bluntly executing an INSERT
for each encountered row, Skyscraper will only insert the row when it doesn’t already exist in the DB. If it does exist, Skyscraper will update it when necessary. For example, imagine John’s phone number changes:
(defprocessor :users
:process-fn (fn [res ctx]
[{:name "John", :surname "Doe", :phone "765-4321"}])
:skyscraper.db/columns [:name :surname :phone]
:skyscraper.db/key-columns [:name :surname])
A repeated invocation of Skyscraper will now produce a database with the updated record.
Note: If you use this feature, Skyscraper will create an unique index for every occurrence of :skyscraper.db/key-columns
, slowing down inserting (as SQLite needs to update the index on every insert). You can pass the :ignore-db-keys true
option to revert to plain INSERT
s; if the database didn’t exist prior to scraping then Skyscraper will do that automatically. It is currently advisable to use :skyscraper.db/key-columns
for small scrapes only, and to regenerate the DB from scratch each time for larger scrapes.
Often, you have a paginated structure where a page contains a number of records you’re interested in, plus a link to the next page. A natural approach is to have the processor return the "next-page" context along with the actual records. So the return value of scraping /page/1
might look like this:
[{:name "John Doe", :url "/person/1", :processor :person}
{:name "Jane Smith", :url "/person/2", :processor :person}
;; more records
{:page 2, :url "/page/2", :processor :page}]
If you DB-enable such a processor (presumably with the :name
column), Skyscraper will duly emit a null-name row for the last context. There are several ways to cope with this:
page
processor returning just the records of interest.NOT NULL
in your queries where appropriate. In general, it is a good idea to treat the Skyscraper-generated database as an interim step of your scraping flow, and have a cleaning step further downstream (the data structure you’re trying to recreate will sometimes not correspond faithfully to the scraping structure).Can you improve this documentation?Edit on GitHub
cljdoc is a website building & hosting documentation for Clojure/Script libraries
× close