• Web Scraping Ajax and Javascript Sites « Data Big Bang Blog


    Web Scraping Ajax and Javascript Sites

    Introduction

    Most crawling frameworks used for scraping cannot handle sites that rely on Javascript or Ajax. Their scope is limited to sites that deliver their main content without scripting. It is tempting to connect a crawler directly to a Javascript engine, but that is not easy to do: browser behavior is too complex for such a simple connection to work, so you need a fully functional browser with good DOM support. The list of resources at the end of this article explores the alternatives in more depth.

    There are several ways to scrape a site that contains Javascript:

    Embed a web browser within an application and simulate a normal user.
    Remotely connect to a web browser and automate it from a scripting language.
    Use special-purpose add-ons to automate the browser.
    Use a framework/library to simulate a complete browser.

    Each of these alternatives has its pros and cons. For example, using a complete browser consumes a lot of resources, especially when we need to scrape websites with many pages.

    In this post we’ll give a simple example of how to scrape a website that uses Javascript. We will use the HtmlUnit library to simulate a browser. Since HtmlUnit runs on the JVM, we will use Jython, an excellent implementation of Python for the JVM. The resulting code is very clear and focuses on solving the problem rather than on programming language details.
    Setting up the environment
    Prerequisites

    JRE or JDK.
    Download the latest version of Jython from http://www.jython.org/downloads.html.
    Run the .jar file and install it in your preferred directory (e.g., /opt/jython).
    Download the htmlunit compiled binaries from http://sourceforge.net/projects/htmlunit/files/.
    Unzip the htmlunit archive to your preferred directory.

    Crawling example

    We will scrape the Gartner Magic Quadrant pages at http://www.gartner.com/it/products/mq/mq_ms.jsp. If you look at the list of documents, the links are Javascript code instead of hyperlinks with http URLs. This may be to reduce crawling, or just to open a popup window. Either way, it is a very convenient page to illustrate the solution.
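    To see why a plain crawler gets stuck on this kind of page, here is a minimal sketch in plain Python 3 (not Jython) that uses only the standard library. The HTML fragment, the openDoc function name, and the LinkCollector class are invented for illustration and are not taken from the Gartner page. The parser finds the anchors, but most of their href values are script calls, so there is no URL to fetch.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modeled on a table whose links run scripts.
SAMPLE_HTML = """
<table id="mqtable">
  <tr><td><a href="javascript:openDoc('12345')">Report A</a></td></tr>
  <tr><td><a href="javascript:openDoc('67890')">Report B</a></td></tr>
  <tr><td><a href="/about.html">About</a></td></tr>
</table>
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def split_links(html):
    """Return (followable, script_only) lists of href values."""
    collector = LinkCollector()
    collector.feed(html)
    followable = []
    script_only = []
    for href in collector.hrefs:
        if href.lower().startswith("javascript:"):
            script_only.append(href)  # needs a Javascript engine to resolve
        else:
            followable.append(href)   # an ordinary crawler can request this
    return followable, script_only


followable, script_only = split_links(SAMPLE_HTML)
print("Followable:", followable)
print("Script only:", script_only)
```

    A crawler that only follows the first list would never reach the two reports, which is the gap a simulated browser is meant to close.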
    gartner.py
    import com.gargoylesoftware.htmlunit.WebClient as WebClient
    import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

    def main():
        webclient = WebClient(BrowserVersion.FIREFOX_3_6)  # creating a new webclient object.
        url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
        page = webclient.getPage(url)  # getting the url
        articles = page.getByXPath("//table[@id='mqtable']//tr/td/a")  # getting all the hyperlinks

        for article in articles:
            print "Clicking on:", article
            subpage = article.click()  # click on the article link
            title = subpage.getByXPath("//div[@class='title']")  # get title
            summary = subpage.getByXPath("//div[@class='summary']")  # get summary
            if len(title) > 0 and len(summary) > 0:
                print "Title:", title[0].asText()
                print "Summary:", summary[0].asText()
            # break

    if __name__ == '__main__':
        main()
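    The XPath expressions used in gartner.py can be tried offline before pointing them at a live site. The following is a rough sketch in plain Python 3 (not Jython) using xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup, so it is not a substitute for HtmlUnit. The sample document is made up; only the three expressions come from the script, with a leading "." added because ElementTree does not accept absolute paths.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample that mimics the structure the script expects.
SAMPLE = """
<html><body>
  <table id="mqtable">
    <tr><td><a href="#">Doc 1</a></td></tr>
    <tr><td><a href="#">Doc 2</a></td></tr>
  </table>
  <table id="other">
    <tr><td><a href="#">Ignored</a></td></tr>
  </table>
  <div class="title">Sample title</div>
  <div class="summary">Sample summary</div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same expressions as in gartner.py, made relative to the root element.
links = root.findall(".//table[@id='mqtable']//tr/td/a")
titles = root.findall(".//div[@class='title']")
summaries = root.findall(".//div[@class='summary']")

link_texts = [a.text for a in links]
print("Links:", link_texts)
if titles and summaries:
    print("Title:", titles[0].text)
    print("Summary:", summaries[0].text)
```

    Only the anchors inside the table with id "mqtable" are selected; the link in the second table is left out by the attribute predicate.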
    run.sh
    /opt/jython/jython -J-classpath "htmlunit-2.8/lib/*" gartner.py
    Final notes

    This article is just a starting point for moving beyond simple crawlers, and it points the way for further research. Because the page is simple, it makes a clear example of how Javascript scraping works. You will need to do your own homework to crawl more complex pages or to add multithreading for better performance. In a demanding crawling scenario many more things must be taken into account, but that is a subject for future articles.
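    As one possible shape for the multithreading mentioned above, here is a minimal sketch in plain Python 3 using concurrent.futures. Under Jython the equivalent would probably be built on the threading module or on java.util.concurrent instead. The scrape_page function is a hypothetical stand-in that does no network access; in a real crawler it would hold the per-page work, and, as far as I know, each worker should create its own WebClient because an instance is meant to be used from a single thread.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_page(url):
    """Hypothetical stand-in for the per-page work (fetch, click, extract).

    It only returns a placeholder record so the sketch runs offline.
    """
    return {"url": url, "title": "title of " + url}


def scrape_all(urls, max_workers=4):
    """Run scrape_page over all URLs on a small thread pool.

    pool.map returns results in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_page, urls))


# Placeholder URLs; example.com is reserved for documentation use.
urls = ["http://example.com/page%d" % i for i in range(5)]
results = scrape_all(urls)
for record in results:
    print(record["url"], "->", record["title"])
```

    A small, fixed pool size keeps the load on the target site bounded, which matters as much as raw speed when crawling someone else's server.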

    If you want to be polite, don’t forget to read the robots.txt file before crawling…
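    Checking robots.txt can be automated. The sketch below uses Python 3's standard urllib.robotparser module (plain Python, not Jython). The robots.txt content, the example.com URLs, and the crawler name are all invented so the example runs without network access; a real crawler would load the file from the site instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site root, for example with set_url() followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "example-crawler"  # made-up crawler name

allowed_public = parser.can_fetch(USER_AGENT, "http://example.com/index.html")
allowed_private = parser.can_fetch(USER_AGENT, "http://example.com/private/report.html")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests, if declared

print("index.html allowed:", allowed_public)
print("private/report.html allowed:", allowed_private)
print("crawl delay:", delay)
```

    Calling can_fetch before each request, and sleeping for the declared crawl delay when there is one, covers the basic courtesy the note above asks for.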
    Resources

    HtmlUnit
    Crowbar web scraping environment
    Google Chrome remote debugging shell from Python
    Selenium web application testing system – Watir – Sahi – Windmill Testing Framework
    Internet Explorer automation
    jSSh Javascript Shell Server for Mozilla
    http://trac.webkit.org/wiki/QtWebKit
    Embedding Gecko
    Opera Dragonfly
    PyAuto: Python Interface to Chromium’s automation framework
    Related questions on Stack Overflow
    Scrapy
    EnvJS: Simulated browser environment written in Javascript
    Setting up Headless XServer and CutyCapt on Ubuntu
    CutyCapt: Capture WebKit’s rendering of a web page.
    Google Webmaster blog: A spider’s view of Web 2.0
    OpenQA
    Python Webkit DOM Bindings
    Berkelium Browser
    uBrowser
    Using HtmlUnit on .NET for Headless Browser Automation (using IKVM)
    Zombie.js
    PhantomJS
    PyPhantomJS
    Web Inspector Remote
    Offscreen/Headless Mozilla Firefox (via @brutuscat)
    Web Scraping with Google Spreadsheets and XPath
    Web Scraping with YQL and Yahoo Pipes

    Photo taken by xiffy


  • Original article: https://www.cnblogs.com/lexus/p/2225109.html