Using crawdad
crawdad is a simple, yet powerful alternative for scraping in a distributed, persistent manner (backed by Redis). It can do simple things, like generating site maps. It can also do complicated things, like extracting all the quotes from every page on a quotes website (tutorial follows).
Install
First, get Docker, which will be used to run Redis.
Then you can simply download crawdad:
$ wget https://github.com/schollz/crawdad/releases/download/v3.0.0/crawdad_3.0.0_linux_amd64.zip
$ unzip crawdad*zip
$ sudo mv crawdad*amd64 /usr/local/bin/crawdad
Unlike many other scraping frameworks, crawdad is a single binary that has no dependencies.
Configure
For scraping, crawdad requires creating a pluck configuration file, quotes.toml.
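A configuration for quotes.toscrape.com might look something like the following sketch (the quote unit follows the description below; the exact field names and the second, author unit are assumptions based on pluck's TOML format):

# quotes.toml (sketch)

[[pluck]]
# First unit: find 'span class="text"', then '>', then capture until '<'.
name = "quote"
activators = ['span class="text"', '>']
deactivator = '<'
limit = -1  # assumed to mean "no limit on the number of captures"

[[pluck]]
# Assumed second unit for the author name, following the same pattern.
name = "author"
activators = ['small class="author"', '>']
deactivator = '<'
limit = -1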
pluck is a language-agnostic way of extracting structured data from text without HTML/CSS/Regex. Essentially, pluck is configured the way you would tell a friend to grab the data.
For example, the first pluck unit describes how you would get the quote text from quotes.toscrape.com. Starting from the beginning of the source, you look for the string “span class="text"” (called an activator). Once that is found, you look for a “>”, the next activator. Then you capture all the characters until a “<” is seen (the deactivator). This will allow you to collect all the quotes.
All of the pluck units are applied simultaneously, extracting their data from every HTML page crawled by crawdad.
Run
First, start Redis with Docker:
$ docker run -d -p 6374:6379 redis
and then start crawdad:
$ time crawdad -p 6374 -set -u http://quotes.toscrape.com -pluck quotes.toml -include '/page/' -exclude '/tag/'
0.12s user 0.03s system 5% cpu 2.666 total
The -set flag tells crawdad to create some new settings with a URL (-u), a pluck configuration (-pluck), and some inclusions/exclusions (-include/-exclude). The inclusions and exclusions ensure that only the /page links will be followed (in order to compare with scrapy).
Extract data
The data from crawdad can be parsed in the same way as scrapy's by first dumping it:
$ crawdad -p 6374 -done done.json
The data, done.json, contains each URL as a key and the data extracted from it. It still needs to be parsed, which can be done lickety-split in 12 lines of Python:
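Something along these lines does the job (a sketch; it assumes done.json maps each URL to its pluck results and that the quote text sits under a "quote" key matching the name in quotes.toml):

import json

# Load the dump produced by: crawdad -p 6374 -done done.json
with open("done.json") as f:
    data = json.load(f)

quotes = []
for url, extracted in data.items():
    # Assumed shape: the captured quote text lives under the "quote" key,
    # either as a single string or as a list of strings.
    found = extracted.get("quote", [])
    if isinstance(found, str):
        found = [found]
    quotes.extend(found)

# Write a flat list of quotes, comparable to scrapy's quotes.json output.
with open("quotes.json", "w") as f:
    json.dump(quotes, f, indent=2)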
crawdad bonuses
crawdad has some other mighty benefits as well. Once initiated, you can run another crawdad on a different machine:
$ crawdad -p 6374 -s X.X.X.X
This will start crawling using the same parameters as the first crawdad, but will pull from the queue. Thus, you can easily make a distributed crawler.
Also, since it is backed by Redis, crawdad is resilient to interruptions and lets you restart from the point where it was interrupted. Try it!
Comparison with scrapy
Here I will compare scraping the same site, quotes.toscrape.com with crawdad (my creation) and scrapy (the popular framework for scraping).
scrapy is powerful, but complicated. Let's follow the tutorial to get a baseline for how scrapy should run.
Install
First install scrapy's dependencies (there are a lot of them) and then scrapy itself:
$ sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
$ sudo -H python3 -m pip install --upgrade scrapy
Once you get it installed you can check the version:
$ scrapy -version
Scrapy 1.4.0 - project: quotesbot
Configure
Actually, I will just use scrapy's tutorial project, quotesbot, to skip building it myself:
$ git clone https://github.com/scrapy/quotesbot.git
$ cd quotesbot
scrapy is not simple. It requires > 40 lines of Python code spread across several different files (items.py, pipelines.py, settings.py, spiders/toscrape-css.py).
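For reference, the heart of the project, the spider, looks roughly like this (a sketch in the spirit of quotesbot's CSS spider, not the exact file):

import scrapy

class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Pull the text, author, and tags out of each quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
                "tags": quote.css("div.tags > a.tag::text").extract(),
            }
        # Follow the pagination link, if there is one.
        next_page = response.css("li.next > a::attr(href)").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page))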
Run
Let's run it and time the result:
$ time scrapy crawl toscrape-xpath -o quotes.json
1.06s user 0.08s system 29% cpu 3.877 total
scrapy is about 10-30% slower than crawdad, plus it cannot easily be run in a distributed, persistent way.
Written on 11 October 2017. Categories: coding, golang.