Liking cljdoc? Tell your friends :D

Database integration

Introduction

Skyscraper can automatically emit the result of scraping in the form of a relational database. One table will be created for each database-enabled processor, with each row corresponding to an output context created by that processor. Each row will be automatically assigned an ID, and parent-child relationships will be modelled as foreign keys.

The only database management system currently supported is SQLite (Skyscraper comes with a dependency on SQLite), but support for other DBMSs is planned for future versions of Skyscraper.

To specify that a table should be generated for a given processor, add a :skyscraper.db/columns option to defprocessor. For example:

(defprocessor :users
  :process-fn (fn [res ctx]
                [{:name "John", :surname "Doe"}])
  :skyscraper.db/columns [:name :surname])

(For the sake of clarity and focus, this document doesn’t rely on any particular site’s structure. Our example processors will return hardcoded data.)

Now, when you invoke Skyscraper like this:

(scrape! [{:url "http://example.com", :processor :users}]
         :db-file "/tmp/demo.sqlite")

Skyscraper will create a SQLite database in the given file, containing one table named users with four columns: two textual ones that you have specified, and two additional integer ones named id and parent. That is, it will conform to the following schema:

CREATE TABLE users (id integer primary key, parent integer, name text, surname text);

The id will be an internal, autogenerated primary key. It is not guaranteed to be stable – it is possible for two identical invocations of Skyscraper to generate different tables.

The parent column will be described below.

Tree structure

Let us expand our worked example. Consider the following processor definitions:

(defprocessor :users
  :process-fn (fn [res ctx]
                [{:name "John", :surname "Doe", :url "/", :processor :accounts}])
  :skyscraper.db/columns [:name :surname])

(defprocessor :accounts
  :process-fn (fn [res ctx]
                [{:bank-account "0123-4567"}
                 {:bank-account "8888-9999"}])
  :skyscraper.db/columns [:bank-account])

Running scrape! as above will now generate the following database:

sqlite> select * from users;
     id = 1
 parent =
   name = John
surname = Doe

sqlite> select * from accounts;
          id = 1
      parent = 1
bank_account = 0123-4567

          id = 2
      parent = 1
bank_account = 8888-9999

Because, in the scrape tree, nodes corresponding to the :accounts processor are children of those of :users, the parent column in the account user table references the id in users.

Note that this database doesn’t contain redundant data, but you can still easily obtain user data for each account by simply JOINing the tables together.

Key columns and updating

There’s a gotcha: if you re-run Skyscraper with the above settings, it will duplicate the already existing records in the database. This is because, normally, there is no way tell whether the newly-scraped records correspond to data we already have or not. For example, the records may differ in some details (e.g., timestamps), but still refer to the same entity.

Therefore, you have to be explicit about which fields uniquely identify a context for a given DB-enabled processor. For instance:

(defprocessor :users
  :process-fn (fn [res ctx]
                [{:name "John", :surname "Doe", :phone "123-4567"}])
  :skyscraper.db/columns [:name :surname :phone]
  :skyscraper.db/key-columns [:name :surname])

In this case, rather than bluntly executing an INSERT for each encountered row, Skyscraper will only insert the row when it doesn’t already exist in the DB. If it does exist, Skyscraper will update it when necessary. For example, imagine John’s phone number changes:

(defprocessor :users
  :process-fn (fn [res ctx]
                [{:name "John", :surname "Doe", :phone "765-4321"}])
  :skyscraper.db/columns [:name :surname :phone]
  :skyscraper.db/key-columns [:name :surname])

A repeated invocation of Skyscraper will now produce a database with the updated record.

Tips and caveats

Often, you have a paginated structure where a page contains a number of records you’re interested in, plus a link to the next page. A natural approach is to have the processor return the "next-page" context along with the actual records. So the return value of scraping /page/1 might look like this:

[{:name "John Doe", :url "/person/1", :processor :person}
 {:name "Jane Smith", :url "/person/2", :processor :person}
 ;; more records
 {:page 2, :url "/page/2", :processor :page}]

If you DB-enable such a processor (presumably with the :name column), Skyscraper will duly emit a null-name row for the last context. There are several ways to cope with this:

  • You could change your scraping structure, having a top-level processor that just scans the pagination producing links to pages, and the page processor returning just the records of interest.
  • Alternatively, you could just bite the bullet and accept this situation, remembering to insert NOT NULL in your queries where appropriate. In general, it is a good idea to treat the Skyscraper-generated database as an interim step of your scraping flow, and have a cleaning step further downstream (the data structure you’re trying to recreate will sometimes not correspond faithfully to the scraping structure).

Can you improve this documentation?Edit on GitHub

cljdoc is a website building & hosting documentation for Clojure/Script libraries

× close