An implementation of a crawler for Ads.txt files written in Clojure.
IAB Tech Lab released a specification for Ads.txt files. See https://iabtechlab.com/adstxt/.
Along with the specification they released a reference crawler written in Python. The repository for that project is https://github.com/InteractiveAdvertisingBureau/adstxtcrawler.
This project demonstrates a crawler for Ads.txt files written in Clojure. As of this writing, this project differs slightly from the Python project. The Python project defaults to saving its data to a SQLite database, while this project defaults to sending its output to STDOUT and its errors to STDERR. Saving the data to a SQLite database is supported but optional.
Build the project with the lein uberjar command.
$ lein uberjar
To use this project as a library in a Clojure project add the following to your :dependencies
$ java -jar adstxt-crawler-standalone.jar [options] [domains]
-t FILE, --targets=FILE
list of domains to crawler ads.txt from
-d FILE, --database=FILE
database to dump crawled data into
Optionally, you can pass a series of domains to process on the command line.
The targets file is required and the database file is optional. If you do not supply a database name, the program writes its data to STDOUT and its errors to STDERR.
The targets file is simply a list of domains and URLs to crawl. For each line, the crawler will extract the domain and make a request to
The data returned will be parsed to ignore blank and commented lines. Each valid line will be parsed according to the Ads.txt specification.
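The two steps above can be sketched roughly as follows. This is an illustration in Python, not this project's Clojure code. The function names `target_to_url` and `parse_adstxt` and the dictionary keys are hypothetical. The `/ads.txt` path at the root of the domain and the comma-separated record layout (ad system domain, publisher account ID, relationship type, optional certification authority ID) follow the IAB specification. The use of plain `http` is an assumption, and the sketch simply skips variable declarations such as `contact=` and does not handle extension fields.

```python
from urllib.parse import urlparse


def target_to_url(line):
    """Turn one targets-file line (a bare domain or a full URL) into an ads.txt URL."""
    text = line.strip()
    if not text:
        return None
    # urlparse only recognizes the host when a "//" separator is present,
    # so bare domains get one prepended.
    parsed = urlparse(text if "://" in text else "//" + text)
    host = parsed.hostname
    if not host:
        return None
    # Assumption: plain http is used here for illustration.
    return "http://" + host + "/ads.txt"


def parse_adstxt(body):
    """Parse the text of an ads.txt file into a list of record dictionaries."""
    records = []
    for raw in body.splitlines():
        # Drop everything after '#', which covers full-line and trailing comments.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3:
            # Not a data record (for example a variable declaration); skipped here.
            continue
        records.append({
            "exchange_domain": fields[0].lower(),
            "account_id": fields[1],
            "relationship": fields[2].upper(),
            "cert_authority_id": fields[3] if len(fields) > 3 else None,
        })
    return records
```

A caller would fetch the URL returned by `target_to_url` with any HTTP client and pass the response text to `parse_adstxt`.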
This project has optional support for saving data to a local SQLite database. To use it, install sqlite3.
Then, to create the initial database, run the following command.
$ sqlite3 database.db < ./sql/create.sql
After building the project using lein uberjar, pass the example target-domains.txt file included in the doc directory using the -t option.
$ java -jar ./target/uberjar/adstxt-crawler-standalone.jar -t ./doc/target-domains.txt
For a larger example, see the file ./doc/top-100-programmatic-domains.txt.
To process this file, you can either run the following command.
$ java -jar ./target/uberjar/adstxt-crawler-standalone.jar -t ./doc/top-100-programmatic-domains.txt >results.csv 2>err.log
Or run the run-100.sh file in the scripts directory. The run-100.sh script will process the results and produce a few summary files.
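For readers who want a quick summary without the script, the sketch below counts rows per value of one column in a CSV file such as results.csv. It is written in Python for illustration and is not what run-100.sh does. The function name `count_by_column` is hypothetical, and the column layout of the crawler's output is not described here, so which column to count (the first, by default) is an assumption to check against your own output.

```python
import csv
from collections import Counter


def count_by_column(lines, column=0):
    """Count how many CSV rows share each value in the given column."""
    counts = Counter()
    for row in csv.reader(lines):
        # Skip blank or short rows that have no such column.
        if len(row) > column:
            counts[row[column].strip()] += 1
    return counts
```

To use it, open results.csv with `newline=""`, pass the file object to `count_by_column`, and call `most_common()` on the result.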
Lastly, for those who want to just see some results, you can visit the following repository which contains the output files from a recent Top 100
Create your initial database using the following command. This example creates a database called adstxt.db.
$ sqlite3 adstxt.db < ./sql/create.sql
Then, to run the Top 100 domains as an example, use the following command.
$ java -jar ./target/uberjar/adstxt-crawler-standalone.jar -t ./doc/top-100-programmatic-domains.txt -d adstxt.db
You'll notice that errors are still printed to the terminal, but the data is saved into the database.
To verify the database you can dump the table with the following command.
$ echo 'select * from adstxt;' | sqlite3 adstxt.db
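The same check can be done from a script. The following is a minimal sketch using Python's standard sqlite3 module. The helper name `dump_table` is hypothetical. The table name adstxt comes from the query above, and no assumption is made about its columns: they are read from the cursor.

```python
import sqlite3


def dump_table(conn, table="adstxt"):
    """Return (column_names, rows) for every row in the given table."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute("SELECT * FROM " + quoted)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()
```

To use it, open the database with `sqlite3.connect("adstxt.db")`, pass the connection to `dump_table`, and close the connection when done.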
Also, you can open the database with
$ java -jar ./target/uberjar/adstxt-crawler-standalone.jar washingtonpost.com ibm.com businessinsider.com
For background information on the project please review some recent blog posts on the project.
Copyright © 2017 Brad Lucas
Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.