    Ruby Screen-Scraper in 60 Seconds

    I often find myself trying to automate content extraction from a saved HTML file or a remote server. I've tried a number of approaches over the years, but the dynamic duo of Hpricot and Firebug blew me away - this is by far the fastest way to get what you want without compromising flexibility. Hpricot is an extremely powerful Ruby-based HTML parser, and Firebug is arguably the best on-the-fly development add-on for Firefox. Now, I said it will take you about 60 seconds. I lied, it should take less. Let's get right to it.
    Introducing open-uri

    Ruby comes with a very flexible, production-ready library that wraps all HTTP/HTTPS connections into a single method call: open. Among other things, open-uri will gracefully handle HTTP redirects, allow you to specify custom headers, and even work with FTP addresses. In other words, all the dirty work is already done, but you should still check the RDoc. I'll let the code speak for itself:

    require 'rubygems'
    require 'open-uri'

    @url = "http://www.igvita.com/blog"
    @response = ''

    # open-uri RDoc: http://stdlib.rubyonrails.org/libdoc/open-uri/rdoc/index.html
    open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}",
               "From" => "email@addr.com",
               "Referer" => "http://www.igvita.com/blog/") { |f|
      puts "Fetched document: #{f.base_uri}"
      puts "\t Content Type: #{f.content_type}\n"
      puts "\t Charset: #{f.charset}\n"
      puts "\t Content-Encoding: #{f.content_encoding}\n"
      puts "\t Last Modified: #{f.last_modified}\n\n"

      # Save the response body
      @response = f.read
    }
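
    One more thing worth knowing: open-uri raises OpenURI::HTTPError whenever the server answers with a 4xx or 5xx status, so a scraper that runs unattended usually wraps the call. A minimal sketch, reusing @url from above:

    begin
      open(@url, "User-Agent" => "Ruby/#{RUBY_VERSION}") { |f| @response = f.read }
    rescue OpenURI::HTTPError => e
      # e.io.status holds the status code and message, e.g. ["404", "Not Found"]
      puts "Request failed: #{e.io.status.join(' ')}"
    end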


    Firebug kung-foo

    Now that we have the document, we need to pull out the content that interests us - usually this is the tedious part, built on regular expressions, stream parsers, and so on. Instead, we're going to sidestep all of these issues and let Firebug do its magic. First, install the extension, then, while on this page, click in the bottom-right corner of your browser to bring it up. It should ask you if you want to enable Firebug (hint: say yes). You should now be greeted with the Firebug panel showing the page's HTML source.

    For the sake of an example, assume that we want to extract three things out of this very page: some quoted text (sample below), number of comments, and the list of my latest posts found at the bottom of this page. Here is an example of quoted text:

    So which came first, the parser, which will extract this, or this quote? - Extract me!

    In your Firebug window, click "Inspect" and hover your mouse over the quote. You will notice that Firebug navigates to the exact part of the DOM tree (HTML source code) as you do this. When you put your mouse over the quote, you should see the corresponding blockquote element highlighted in the source.

    Here's the trick: right-click on the selected blockquote element in your Firebug window and select "Copy XPath". This will provide you with the exact drill-down path for the DOM tree. In our case your clipboard should contain: "/html/body/div[2]/div/div/blockquote".
    Hpricot magic

    It is at this point that Hpricot comes into the picture, and you have probably guessed it already - it supports XPath. All we need to do is pass our HTML to it to build the internal tree, and then we're ready to go:

    require 'hpricot'

    # Hpricot RDoc: http://code.whytheluckystiff.net/hpricot/
    doc = Hpricot(@response)

    # Retrieve the number of comments
    # - Hover your mouse over the 'X Comments' heading at the end of this article
    # - Copy the XPath and confirm that it's the same as shown below
    puts (doc/"/html/body/div[3]/div/div/h2").inner_html

    # Pull out the first quote on the page
    # - Note that we don't have to use the full XPath, we can simply search for all quotes
    # - Because this search can return more than one element, we only look at 'first'
    puts (doc/"blockquote/p").first.inner_html

    # Pull out all other posted stories and the date each was posted
    # - This search will return multiple elements
    # - We print the date, then the article name beside it
    (doc/"/html/body/div[4]/div/div[2]/ul/li/a/span").each do |article|
      puts "#{article.inner_html} :: #{article.next_node.to_s}"
    end



    As you can see I provided a few other examples, but the idea is simple: open Firebug, navigate to the component you want to extract, copy the XPath, paste it right into Hpricot's search function, and print out the results. How simple is that? I should also mention that Hpricot is not limited to XPath - it understands CSS-style selectors too, as the sketch below shows - nor do these examples cover all of its functionality, so I strongly encourage you to check the official Hpricot page for more tips and tricks.
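
    For example, the quote above can be grabbed with a plain CSS-style selector instead of a full XPath. A quick sketch - the 'ul.posts' class selector and the href printing are my own illustration, not taken from this page's actual markup:

    # CSS-style selectors work alongside XPath in Hpricot
    doc = Hpricot(@response)

    # Same quote as before, using at() to fetch only the first match
    puts doc.at("blockquote p").inner_html

    # Hypothetical example: every link inside a 'posts' list, with its text and href
    (doc/"ul.posts a").each do |link|
      puts "#{link.inner_text} => #{link['href']}"
    end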

    Download
    screen-scraper.rb (Combined final code)


    Running our screen-scraper produces:

    Fetched document: http://www.igvita.com/blog/
        Content Type: text/html
        Charset: utf-8
        Content-Encoding:
        Last Modified:

    No Comments
    So which came first, the parser, which will extract this, or this quote? - Extract me!
    04.02 :: Ruby Screen-Scraper in 60 Seconds
    31.01 :: World News With Geographic Heatmaps
    27.01 :: Correlating Netflix and IMDB Datasets
    ...

    Copy, paste, done. Now you have no excuse to put off that custom RSS generator you always wanted.
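
    In fact, the pieces above are most of that generator already: Ruby ships with rss/maker, which can turn the scraped article list into a feed. A rough sketch, reusing doc and @url from earlier - the XPath is the article-list one from above, and the channel text is just placeholder wording:

    require 'rss/maker'

    feed = RSS::Maker.make("2.0") do |maker|
      maker.channel.title       = "Scraped headlines"             # placeholder title
      maker.channel.link        = @url
      maker.channel.description = "Articles extracted with Hpricot"

      # Reuse the article links located earlier (same XPath, minus the trailing /span)
      (doc/"/html/body/div[4]/div/div[2]/ul/li/a").each do |link|
        item       = maker.items.new_item
        item.title = link.inner_text
        item.link  = link['href']
      end
    end

    puts feed.to_s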

    Regin Gaarsmand and Harish Mallipeddi posted PHP and Python equivalents of this method. Awesome!

