Python libraries for web scraping
- philips
2011.10.07 8:21
- Python (http://www.python.org/) is a simple, powerful programming language. FMiner (http://www.fminer.com/) is developed in Python and uses PySide (http://www.pyside.org/) for its core scraping features. Besides PySide, Python has many libraries for web scraping (screen scraping); this article lists the common Python libraries for web extraction.
Web scraping framework
Scrapy: http://scrapy.org/
Scrapy is a fast, high-level web crawling and web scraping (screen scraping) framework, used to crawl websites and to parse and extract structured data from their pages. It can be used for a wide range of purposes, such as data mining, automated testing, and site monitoring.
Page downloading libraries
urllib: http://docs.python.org/library/urllib.html
urllib2: http://docs.python.org/library/urllib2.html
They are standard libraries in Python and handle the general job of downloading web pages.
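A minimal download sketch with the standard library (the URL and User-Agent header are illustrative; note that in Python 3 the urllib/urllib2 functionality was merged into urllib.request):

```python
import urllib.request  # in Python 2, this lived in the urllib and urllib2 modules

# Build a request with a custom User-Agent, as many sites reject the default one
req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "my-scraper/0.1"},
)
# html = urllib.request.urlopen(req).read()  # performs the actual fetch (needs network)
print(req.get_full_url())
```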
PycURL: http://pycurl.sourceforge.net/
PycURL is a Python interface to libcurl; it can be used to fetch objects identified by a URL from a Python program, much like the urllib module. Because its core, libcurl, is written in C, it is very fast and supports a wide range of features.
mechanize: http://wwwsearch.sourceforge.net/mechanize/
mechanize provides stateful programmatic web browsing in Python. It can simulate a web browser, but it does not use a real browser core and cannot handle JavaScript.
twill: http://twill.idyll.org/
Twill is a simple language that allows users to browse through the web from a command-line interface. With twill, you can navigate through Web sites that use forms, cookies, and most standard web features. Twill supports automated web testing and has a simple Python interface.
Page parser
BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects like screen scraping. It is very easy to use for small Python web scraping projects, and selecting elements works much like running queries against the document.
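A minimal sketch of that query-style selection (the HTML snippet is invented for illustration; modern releases of the library ship as the beautifulsoup4 package):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <a href="/page1">First</a>
  <a href="/page2">Second</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all() returns every matching tag; tag["href"] reads an attribute
links = [a["href"] for a in soup.find_all("a")]
print(links)  # ['/page1', '/page2']
```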
lxml: http://lxml.de/
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. lxml.html can parse an HTML page into a DOM tree and select nodes with XPath. Early versions of FMiner used it as a core module, but it was replaced with PySide in order to handle pages that contain JavaScript.
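A minimal sketch of the parse-then-XPath workflow described above (the HTML snippet and XPath expressions are illustrative):

```python
import lxml.html  # pip install lxml

html = "<html><body><h1>News</h1><p class='story'>Hello</p></body></html>"

# fromstring() parses the HTML into a DOM tree
tree = lxml.html.fromstring(html)
# xpath() evaluates an XPath expression against the tree
headline = tree.xpath("//h1/text()")[0]
story = tree.xpath("//p[@class='story']/text()")[0]
print(headline, story)  # News Hello
```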
re: http://docs.python.org/library/re.html
re is Python's standard regular-expression library. You can use regular expressions to extract page content, but writing them can get very complex.
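A minimal sketch of regex-based extraction (the HTML snippet and pattern are illustrative, and the pattern is deliberately naive):

```python
import re  # standard library

html = '<a href="/about">About</a> <a href="/contact">Contact</a>'

# Grab the href value of each anchor tag. Real-world HTML varies in
# quoting, spacing, and attribute order, which is why such patterns
# grow complex quickly.
links = re.findall(r'<a href="([^"]+)">', html)
print(links)  # ['/about', '/contact']
```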
Browser core
PyQt: http://www.riverbankcomputing.co.uk/software/pyqt/intro
PyQt is a set of Python bindings for Nokia's Qt application framework; it has been developed for a long time and is very mature. It contains the WebKit module, which can browse web pages and be used for web extraction. It is available under the GNU GPL (v2 and v3) and a commercial license.
PySide: http://www.pyside.org/
The PySide project provides LGPL-licensed Python bindings for the Qt cross-platform application and UI framework. It also contains the WebKit module, and its LGPL license is why FMiner chose it.
Pamie: http://pamie.sourceforge.net/
Pamie stands for Python Automated Module for Internet Explorer. Its main use is testing web sites: you automate the Internet Explorer client using the Pamie scripting language. It uses the IE COM interface as its core and is aimed mainly at web testing; to do screen scraping with it, you need extra work to extract the page's content, and some JavaScript code is required.