QuarkRuby: Firequark : quick html screen scraping
Firequark is an extension to Firebug that aids the process of HTML screen scraping. Firequark automatically extracts the CSS selector for one or more HTML nodes from a web page using Firebug (a web development plugin for Firefox). The generated CSS selector can be given as input to HTML screen scrapers such as Scrapi to extract information. Firequark is built to unleash the power of CSS selectors for HTML screen scraping.
HTML screen scraping is a common technique for extracting specific, useful elements from a web page. Independent of programming language, to extract an element from a web page one needs to know its exact location or a key that uniquely identifies it. There are two approaches to uniquely identifying an element: XPath and CSS selectors.
Firebug has built-in functionality for generating the XPath of an HTML element, and Ilya Grigorik has written a good article on using XPath for HTML screen scraping. Firequark extends Firebug to generate CSS selectors for elements on a web page.
Example case: Let's take a practical example where you want to scrape Amazon.com. My goal is to get the product name, price and rating for all the products on the Amazon point-and-shoot camera catalog page. I will use this example in the screencast and the explanation below.
Why Firequark?
XPath vs. CSS Selector
When Firebug already provides XPath, why do I need CSS selectors? XPath is great for scraping XML documents, but for (X)HTML documents it runs into issues such as:
- Different parsers generate different XPaths for the same element, depending on how they handle broken markup: badly nested tags, errors in HTML pages, custom tags, etc.
- Firefox adds a <tbody> tag to table nodes whether or not the HTML page has one, which makes it difficult to decide whether to keep the <tbody> step in the XPath.
CSS selectors do not suffer from these problems because the CSS selector of an HTML node is based on properties of the node itself and of its neighboring nodes. Attributes of the element and its neighbors, such as id and class, are used to build the selector.
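To make the <tbody> problem concrete, here is a minimal sketch using only Python's standard library. The markup, the class name `price` and the helper `first_text` are made up for illustration, and ElementTree's limited XPath stands in for a real CSS selector engine: the attribute lookup below only approximates what a selector like `td.price` would do.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same table as the author wrote it, and as a
# browser DOM would show it after inserting <tbody>.
AS_WRITTEN = "<html><body><table><tr><td class='price'>$199</td></tr></table></body></html>"
AS_IN_BROWSER = (
    "<html><body><table><tbody><tr><td class='price'>$199</td></tr>"
    "</tbody></table></body></html>"
)

# Position-based path, as it would be copied from the browser DOM.
POSITIONAL = "./body/table/tbody/tr/td"
# Attribute-based lookup, the idea behind a selector such as "td.price".
BY_CLASS = ".//td[@class='price']"


def first_text(markup, path):
    """Return the text of the first node matching path, or None."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


documents = (AS_WRITTEN, AS_IN_BROWSER)
positional_results = [first_text(doc, POSITIONAL) for doc in documents]
by_class_results = [first_text(doc, BY_CLASS) for doc in documents]

print(positional_results)  # the positional path misses the table without <tbody>
print(by_class_results)    # the attribute lookup finds the cell in both
```

The positional path only works on the variant of the document it was copied from, while the attribute-based lookup does not care whether the <tbody> level is present.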
It's difficult to find the CSS selector for an HTML node by hand. It's a trial-and-error process until you find the right combination of rules. In the worst case the CSS selector would be the XPath itself. In our example case, it's even more difficult to manually find one CSS selector that matches all 24 camera products, plus one each for their attributes such as camera name, price and rating (4 in total).
I am a big fan of Scrapi, an HTML scraping toolkit in Ruby, because it supports bundle scraping. Bundle scraping refers to extracting multiple attributes of an object from a web page in one parse. Bundle scraping is well defined at Qscraper: a hpricot interface to scrapi.
Continuing with the example case, the object is a camera product (there are 24 such objects on the page), and price, rating, product name, etc. are attributes of the camera object. One way of extracting the attributes is to separately get lists of product names, prices and ratings from the web page and then combine these lists. But how do you combine them? What if one of the camera products is not rated, or Amazon does not provide its price?
Firequark is really powerful in solving this problem. Using Firequark, first get one CSS selector that identifies all the camera objects on the page; these contain the attribute information (name, price and rating) inside them. Mark a camera object as the parent and find the CSS selectors of the attributes relative to the parent. Give these CSS selectors as input to an HTML screen scraper that supports bundle scraping, such as Scrapi, and bingo! Our screencast below explains this case in detail.
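The difference between combining separate lists and scraping relative to a parent can be sketched in a few lines. This is not Scrapi and not Firequark's output: it is a standard-library Python illustration on a made-up three-product fragment, with hypothetical class names (`product`, `name`, `price`, `rating`) and ElementTree paths standing in for CSS selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog fragment: the second camera has no price.
PAGE = """
<div id="results">
  <div class="product"><a class="name">Camera A</a><span class="price">$199</span><span class="rating">4.5</span></div>
  <div class="product"><a class="name">Camera B</a><span class="rating">3.0</span></div>
  <div class="product"><a class="name">Camera C</a><span class="price">$149</span><span class="rating">4.0</span></div>
</div>
""".strip()

root = ET.fromstring(PAGE)

# Approach 1: scrape each attribute as a separate page-wide list, then zip.
names = [el.text for el in root.findall(".//a[@class='name']")]
prices = [el.text for el in root.findall(".//span[@class='price']")]
misaligned = list(zip(names, prices))  # Camera B silently gets Camera C's price


def text_or_none(parent, path):
    """Look up path relative to parent; a missing attribute becomes None."""
    el = parent.find(path)
    return None if el is None else el.text


# Approach 2 (bundle): select each parent object, then read its attributes
# with paths that are relative to that parent.
cameras = [
    {
        "name": text_or_none(product, "./a[@class='name']"),
        "price": text_or_none(product, "./span[@class='price']"),
        "rating": text_or_none(product, "./span[@class='rating']"),
    }
    for product in root.findall("./div[@class='product']")
]

print(misaligned)
for camera in cameras:
    print(camera)
```

With the separate lists, the missing price shifts every later price up by one product and the last camera is dropped. With the parent-relative lookup, the gap stays attached to the camera it belongs to.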
Click here to view the screencast [format:avi, size:6.6MB]
Note: Avoid using the CSS selector functionality on the first 2-3 products on a page. The top 2-3 products are usually displayed differently (for example as top sellers or top rated) with the same id, which makes it hard to get good, simple CSS selectors.
Even in our screencast we analyze the 4th product on the page to get a simple CSS selector.
Click here to install, this will overwrite your current Firebug installation (built on Firebug v1.05).
(FF3 users) Click here to install, this will overwrite your current Firebug installation (built on Firebug 1.2).
(FF3.5 users) Click here to install, this will overwrite your current Firebug installation (built on Firebug 1.4.1 and Firefox 3.5).
Note: if you are doing Firebug development in your working Firefox, please back up your work before installing Firequark.
Firequark adds four new functions to each node element in the HTML source tab of Firebug. They appear in the menu shown when you click on an HTML node in that tab:
[Screenshots: current menu, new menu]
- Get U CSS Selector: Gets a CSS selector that uniquely identifies the selected element.
- Get CSS Selector: Gets one CSS selector for a group of elements on a web page. When you select this option, it prompts you for how many such elements exist on the page. In our example case, right-click on a camera object node, enter 24 in the prompt box (there are 24 camera objects on the page), press Enter, and you will get one CSS selector that extracts all 24 camera objects.
- Mark as parent: Used in bundle scraping to mark an object as the parent. Follow it with Get U CSS Selector on the parent's attributes, which then returns the CSS selector of each attribute relative to the parent object.
- Unmark parentnode: Unmarks the object as parent.
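The element count that Get CSS Selector asks for is also a handy sanity check for any selector you end up with. The sketch below is not Firequark's algorithm, which is not described here; it is a hypothetical helper, `pick_selector`, that tries candidate paths from general to specific and keeps the first one matching exactly the expected number of nodes. The markup and class names are invented, and ElementTree paths again stand in for CSS selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing: a differently styled top seller, two regular
# products and an advertisement.
PAGE = """
<div id="results">
  <div class="topseller">Top seller</div>
  <div class="product">Camera A</div>
  <div class="product">Camera B</div>
  <div class="ad">Sponsored</div>
</div>
""".strip()


def pick_selector(root, candidates, expected_count):
    """Return the first candidate path that matches exactly expected_count nodes."""
    for path in candidates:
        if len(root.findall(path)) == expected_count:
            return path
    return None


root = ET.fromstring(PAGE)
candidates = ["./div", "./div[@class='product']"]

# We expect two regular products on this made-up page.
chosen = pick_selector(root, candidates, expected_count=2)
print(chosen)
```

Here the bare `./div` path is rejected because it also matches the top seller and the advertisement, which is the same reason the note above recommends starting from a regular product rather than one of the first few.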