
Background Web Scraping using Delayed Job and Nokogiri in Ruby on Rails

posted May 9, 2015, 8:43 AM by Hadi Setiawan   [ updated Aug 15, 2016, 11:26 PM by Surya Wang ]

Web scraping is used when you want to access a site's resources but there's no API available. Web scraping can be achieved in many ways: you can simply copy and paste the content, or, if there's a lot of data to gather, you can automate the process. For example, you may want to collect the tutorial information available on this site; since there's no API, you can scrape the web pages instead.

But before we go any further, some of you may wonder whether it's ethical to do so. I suggest that you consult the terms of use of your target website before you begin scraping. For the sake of learning, we'll choose this very site as our scraping target.

The data we'll collect from this site is the tutorials, which are available here. Specifically, we'll retrieve each tutorial's title, content, published date, and author.

If you're wondering how we'll automate the scraping process, worry not: it's actually pretty simple, as long as you have some knowledge of CSS or XPath selectors and, of course, the Rails framework. The process is just like how we'd normally browse a site, except that this time an HTTP client retrieves the HTML, and we extract the specific data using CSS or XPath selectors.

Since it will take some time to scrape all the data, we'll make the process run in the background. To achieve this, we'll use a gem called Delayed Job. We use this gem because it lets us run nearly any code in the background, which makes it really easy to use.

Preparation

We'll generate the model, view, and controller using the Rails generators, and then configure the routes. But the first thing we must do is create a new Rails project.

Command Line
D:\> rails new railstutorial

After the project is created, we'll add the Delayed Job gem. To do so, edit the Gemfile located inside the project folder. The Gemfile declares which gems your project depends on. Inside the file you'll find a number of gems already listed; these are the default dependencies. To add another dependency, append the gem name and, optionally, a version constraint. It's good practice to constrain the version of a new gem, as we don't want our application to break if a newer version isn't backward compatible.

Gemfile
...
gem 'delayed_job_active_record', '~> 4.0'

As seen above, we add a gem called delayed_job_active_record with the pessimistic constraint '~> 4.0', which allows any 4.x version (at least 4.0, but below 5.0). Now save the Gemfile, and let's install the new gem using Bundler.

Command Line
D:\railstutorial> bundle install

Bundler works as the dependency manager for Ruby projects; if you've used PHP, Bundler is to Ruby what Composer is to PHP. Now start the Rails server and the default landing page should appear.
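If you're ever unsure what a pessimistic constraint such as '~> 4.0' actually matches, you can check with Ruby's built-in Gem classes:

```ruby
require 'rubygems'   # provides Gem::Requirement and Gem::Version

req = Gem::Requirement.new('~> 4.0')

a = req.satisfied_by?(Gem::Version.new('4.0.2'))  # => true  (any 4.x works)
b = req.satisfied_by?(Gem::Version.new('4.9'))    # => true
c = req.satisfied_by?(Gem::Version.new('5.0'))    # => false (major bump excluded)
```

A tighter constraint like '~> 4.0.0' would pin the minor version as well, allowing only 4.0.x releases.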

Model

There are three models we need: Tutorial, Author, and one more required by Delayed Job. The Tutorial model will have an author, title, content, and published date. Let's use the Rails generators to create the boilerplate for all the models, including the one for Delayed Job.

Command Line
D:\railstutorial> rails generate model author name:string
D:\railstutorial> rails generate model tutorial author:references title:string content:text{16383} published_at:datetime
D:\railstutorial> rails generate delayed_job:active_record

Thanks to the Rails generators, all the models required for this tutorial are created. Not only the models: the generators also create migration files for them. Let's run the migrations so that the related tables are created.

Command Line
D:\railstutorial> rake db:migrate

Invoking the command creates the tables for the models. The default database configuration uses SQLite, and the database file is located in the db folder.

View and Controller

After generating the models, we'll use the Rails generator to create a controller, TutorialController. This controller will have three actions: index to list all tutorials, view to show a specific tutorial, and queue to enqueue a scraping job.

Command Line
D:\railstutorial> rails generate controller Tutorial index view queue

By invoking the command above, Rails generates the tutorial controller, default views, and a few other files. The controller's three actions are only stubs for now. Route entries were also added when the controller was generated.

Route Configuration

We'll configure the routes to be simpler than the default {controller}/{action} convention. Below is the routing configuration we want.

config\routes.rb
Rails.application.routes.draw do
    get '/' => 'tutorial#index'

    get '/tutorial/:id' => 'tutorial#view'

    get '/queue' => 'tutorial#queue'
end

Scraping With Nokogiri And Delayed Job

With the preparation done, we'll now write the scraping logic. We'll divide it into two parts: the first scrapes the listing page, and the other retrieves an individual tutorial's content. TutorialService will scrape the listing page, and the Tutorial model will scrape the individual tutorial.

app\services\tutorial_service.rb
require 'open-uri'

class TutorialService
    def scrape_listing
        base_url = 'http://portal.bluejack.binus.ac.id'
        listing_url = '/system/app/pages/subPages?path=/tutorials&offset='
        page_no = 0
        page_count = 50
        has_next = false

        begin
            url = base_url + listing_url + (page_no * page_count).to_s
            listing = Nokogiri::HTML(open(url))
            page_no += 1

            listing.css('.sites-table a').each do |tutorial_el|
                scrape(base_url + (tutorial_el.attr :href))
            end

            # split pagination info, e.g. from "1-50 of 54" to ["1", "50", "54"]
            pagination_info = listing.at_css('.sites-pagination-info').content.split(/ of |-/)
            has_next = pagination_info[1].to_i < pagination_info[2].to_i
        end while has_next
    end

    def scrape url
        t = Tutorial.new.scrape url

        t.save
    end
    handle_asynchronously :scrape
end

The scrape_listing method retrieves the listing and then schedules the scraping of each individual tutorial in the background, repeating as long as there are more pages. We know whether there are more tutorials by comparing the paging information available at the bottom of the page. Luckily this element has a unique class, .sites-pagination-info.
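The pagination parsing is easiest to see in isolation; given the page's "1-50 of 54" text, the split yields three numbers we can compare:

```ruby
# The listing page shows e.g. "1-50 of 54"; split on "-" and " of ".
info = '1-50 of 54'
parts = info.split(/ of |-/)        # ["1", "50", "54"]

# More pages remain as long as the last item shown (50)
# is below the total count (54).
has_next = parts[1].to_i < parts[2].to_i   # true
```

On the final page the two numbers are equal, the comparison is false, and the loop ends.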

Notice the handle_asynchronously :scrape declaration at the bottom of the class: it makes the scrape method run in the background whenever it's invoked, as it is from scrape_listing. The method itself just creates a new instance of the Tutorial model, invokes its scrape method, and saves the result. Let's take a look at the scraping logic for an individual tutorial below.

app\models\tutorial.rb
require 'open-uri'

class Tutorial < ActiveRecord::Base
    belongs_to :author

    def scrape(url)
        title_selector = '#sites-page-title'
        author_selector = '//*[@id="afterPageTitleHideDuringEdit"]/text()[3]'
        content_selector = '#sites-canvas-main-content div[dir="ltr"] *'
        published_date_selector = '#afterPageTitleHideDuringEdit span'

        doc = Nokogiri::HTML(open(url), nil, 'utf-8')
        title = doc.at_css(title_selector).content
        content = doc.css(content_selector)
        published_at = doc.at_css(published_date_selector).content
        author_name = doc.at_xpath(author_selector).content.gsub(/[[:space:]]/, ' ').strip[3..-1]

        tutorial = Tutorial.find_by(title: title)
        unless tutorial.nil?
            self.id = tutorial.id
            @new_record = false
        end

        author = Author.find_by(name: author_name)
        author = Author.create(name: author_name) if author.nil?

        self.title = title
        self.content = content
        self.author = author
        self.published_at = published_at

        self
    end
end

The scrape method receives one parameter: the URL to scrape. The key to scraping is picking the correct selector, which is why I said at the beginning that this shouldn't be hard if you have some knowledge of CSS or XPath selectors. Take title_selector, for example: it selects the element with the id sites-page-title. The selectors must be adapted to the page you're scraping, so a different site means different selectors.

The third parameter of Nokogiri::HTML is important: it tells the parser which character encoding to use when decoding the data returned by open(url). This site uses UTF-8, so that's the encoding we pass.

Because we may need to scrape the same URL again in the future (the author may have updated the tutorial), we can't blindly insert the result into the database. We should insert the tutorial only if no tutorial with the same title exists yet; otherwise we update the existing one. Hence the Tutorial.find_by(title: title) lookup: if a tutorial with that title exists, we set the current instance's id to the found tutorial's id and mark the current instance as not a new record. This makes save perform an update instead of inserting a new row.
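The update-or-insert flow can be illustrated without ActiveRecord. In this sketch the array stands in for the tutorials table, and all names are hypothetical:

```ruby
# "table" stands in for the tutorials table; each hash is a row.
def upsert(table, attrs)
  existing = table.find { |row| row[:title] == attrs[:title] }
  if existing
    existing.merge!(attrs)                    # same id, row is updated
  else
    table << attrs.merge(id: table.size + 1)  # new id, row is inserted
  end
  table
end

table = [{ id: 1, title: 'Old tutorial', content: 'v1' }]
upsert(table, title: 'Old tutorial', content: 'v2')  # updates row 1 in place
upsert(table, title: 'New tutorial', content: 'v1')  # inserts row 2
```

The model's version does the same thing through ActiveRecord's persistence machinery: keeping the existing id turns save into an UPDATE rather than an INSERT.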

Finally, at the end of the method we return the instance itself, which makes method chaining possible.

The Controller

We'll add more code to the generated controller. index lists all tutorials, view fetches the tutorial with the specified id, and queue enqueues a background job to fetch the latest tutorials.

app\controllers\tutorial_controller.rb
class TutorialController < ApplicationController
    def index
        @tutorials = Tutorial.all
    end

    def view
        @tutorial = Tutorial.find params[:id]
    end

    def queue
        TutorialService.new.delay.scrape_listing
        flash[:message] = 'Scraping queued!'
        redirect_to :action => :index
    end
end

The code is pretty self-explanatory. Note that the queue action redirects back to the index and flashes a session message describing what happened.

The View

We'll modify the views to show the tutorial listing and content.

app\views\layouts\application.html.erb
<!DOCTYPE html>
<html>
<head>
    <title>Railstutorial</title>
    <%= stylesheet_link_tag    'application', media: 'all', 'data-turbolinks-track' => true %>
    <%= javascript_include_tag 'application', 'data-turbolinks-track' => true %>
    <%= csrf_meta_tags %>
</head>
<body>
    <header>
        <h1><a href="/">Tutorial</a></h1>
    </header>
    <%= yield %>
</body>
</html>

There isn't much change in the application layout; we just add a link to the home page for easier access.

app\views\tutorial\index.html.erb
<a href="/queue" class="button">Queue</a>
<% if flash[:message] %>
    <span class="message"><%= flash[:message] %></span>
<% end %>
<ul class="listing">
    <% @tutorials.each do |t| %>
    <li>
        <div><a href="/tutorial/<%= t.id %>"><%= t.title %></a></div>
        <div><%= t.author.name %>, last retrieved at <%= datetime t.updated_at %></div>
    </li>
    <% end %>
</ul>
app\views\tutorial\view.html.erb
<div class="tutorial">
    <h1 class="title"><%= @tutorial.title %></h1>
    <p class="author"><%= @tutorial.author.name %>, published at <%= date @tutorial.published_at %></p>
    <div class="content"><%= raw @tutorial.content %></div>
</div>

The datetime and date methods called in index and view are defined in the application helper file. Their purpose is to print formatted dates.
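With the formats used in the helpers, a timestamp renders like this (the sample date is arbitrary):

```ruby
d = Time.new(2015, 5, 9, 8, 43)

with_date = d.strftime '%d %B %Y'           # "09 May 2015"
with_time = d.strftime '%d %B %Y %I:%M %p'  # "09 May 2015 08:43 AM"
```

%d is the zero-padded day, %B the full month name, %I the 12-hour clock, and %p the AM/PM marker.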

app\helpers\application_helper.rb
module ApplicationHelper
    def date d 
        d.strftime '%d %B %Y'
    end

    def datetime d
        d.strftime '%d %B %Y %I:%M %p'
    end
end

The views are done, but we'll add some styles to make them look a little better.

app\assets\stylesheets\application.css
body {
    font-size: 14px;
    font-family: arial, sans-serif;
}

a {
    text-decoration: none;
}
a:hover {
    text-decoration: underline;
}

.button {
    color: #fafafa;
    background-color: #E71354;
    border: 1px solid maroon;
    border-radius: 10%;
    padding: 5px 8px;
    box-shadow: 0px 1px 0px rgba(255, 255, 255, 0.3) inset, 0px 1px 1px rgba(100, 100, 100, 0.3);
}
.button:hover {
    text-decoration: none;
}

.message {
    padding: 5px 8px;
    color: #666;
    border: 1px solid #ccc;
    background-color: #eee;
}
app\assets\stylesheets\tutorial.scss
.listing {
    list-style: none;
    padding: 0;

    li {
        margin: 10px 0;
        padding: 15px;
        background-color: #eee;
        border: 1px solid #ddd;

        a {
            font-size: 1.3em;
        }
    }
}

.tutorial {
    .content {
        border: 1px solid #eee;
        background-color: #fefefe;
        padding: 15px;
    }
}

Deployment

To start using the application, start the Rails server as usual (make sure you've migrated the database first). Besides the server, you also need to start the Delayed Job worker process. To do so, run the following two commands in two separate consoles.

Command Line
D:\railstutorial> rails server
D:\railstutorial> rake jobs:work

The worker process handles any queued task in the background, polling for new tasks at a fixed interval. When a task fails, e.g. because an exception was thrown, the worker increments the task's attempt counter; the higher the count, the longer the delay before the task is retried.
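At the time of writing, Delayed Job's default backoff reschedules a failed job roughly (attempts to the fourth power, plus 5) seconds into the future; check the gem's source for your version, as this may differ:

```ruby
# Delayed Job's default retry delay (as of delayed_job 4.x):
# 5 seconds plus attempts to the fourth power.
def retry_delay(attempts)
  (attempts**4) + 5
end

delays = (1..5).map { |n| retry_delay(n) }  # [6, 21, 86, 261, 630]
```

So the first retry comes after a few seconds, but by the fifth failure the job waits over ten minutes, which keeps a persistently broken job from hammering the target site.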

Here's what the result looks like.

Tutorial listing
Tutorial content
Queueing a job
Delayed Job worker process

Conclusion

All in all, use scraping only when you want to use a site's resources and there's no API for it, and most importantly make sure it does not violate the content owner's terms of use. From a technical viewpoint, scraping content from a site like this is easy, especially with the Rails framework. Delayed Job is also a wonderful gem, letting us run a lengthy task in the background with ease.

I've provided the material used in this tutorial; you can download it from the attachment below.

railstutorial.zip (36k)