• Installing and Using Scrapy


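    This article walks through installing and using Scrapy, a widely used web-crawling tool, and covers common problems that come up when running Scrapy together with their fixes. I hope it helps with your learning.

    Introduction to Scrapy

    Scrapy is a fast, high-level screen-scraping and web-crawling framework for crawling websites and extracting structured data from their pages. It serves a wide range of purposes, from data mining to monitoring and automated testing.

    Official site: http://www.scrapy.org/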

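    Installing Python 2.7

    Official site: http://www.python.org/

    Download: http://www.python.org/ftp/python/2.7.3/python-2.7.3.msi

    Install Python

    Install directory: D:\Python27

    Add the environment variable

    Steps omitted; the setting is under System Properties -> Advanced -> Environment Variables -> System Variables -> Path -> Edit. Add the Python install directory and its Scripts subdirectory to Path.

    Verify the environment variable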

    T:\>set Path
    Path=C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;D:\Rational\common;D:\Rational\ClearCase\bin;D:\Python27;D:\Python27\Scripts
    PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH
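
The same check can be scripted. The following is a minimal sketch in plain Python, assuming the install directory D:\Python27 used in this article; the helper name missing_from_path and the sample Path value are made up for illustration.

```python
def missing_from_path(path_value, wanted, sep=";"):
    """Return the entries of `wanted` that are absent from a PATH-style string."""
    # Windows paths are case-insensitive, so compare lowercased entries
    # with any trailing backslash removed.
    present = set(p.strip().rstrip("\\").lower() for p in path_value.split(sep))
    return [w for w in wanted if w.rstrip("\\").lower() not in present]

wanted = [r"D:\Python27", r"D:\Python27\Scripts"]
# Illustrative value; on a real machine pass os.environ.get("PATH", "") instead.
sample = r"C:\WINDOWS\system32;C:\WINDOWS;D:\Python27;D:\Python27\Scripts"
print(missing_from_path(sample, wanted))
```

An empty list means both directories are on the path.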

    Verify Python

    T:\>python
    Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> exit()
    
    T:\>
    

    Installing Scrapy

    Official site: http://scrapy.org/

    Download: http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.4.tar.gz

    Extraction: steps omitted

    Installation:

    T:\Scrapy-0.14.4>python setup.py install
    
    ……
    Installing easy_install-2.7-script.py script to D:\Python27\Scripts
    Installing easy_install-2.7.exe script to D:\Python27\Scripts
    Installing easy_install-2.7.exe.manifest script to D:\Python27\Scripts
    
    Using d:\python27\lib\site-packages
    Finished processing dependencies for Scrapy==0.14.4
    
    T:\Scrapy-0.14.4>

    Verify the installation:

    T:\>scrapy
    Scrapy 0.14.4 - no active project
    
    Usage:
      scrapy <command> [options] [args]
    
    Available commands:
      fetch         Fetch a URL using the Scrapy downloader
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy
    
    Use "scrapy <command> -h" to see more info about a command
    
    T:\>

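    Generating a project

    Scrapy provides a tool that generates a project skeleton. The generated project contains a few preset files to which you add your own code.
    Open a command prompt and run scrapy startproject tutorial. The generated project has a structure like this: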
    tutorial/
        scrapy.cfg
        tutorial/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py
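
To confirm that startproject produced the expected files, a small check like the one below can help. It is only a sketch: it assumes the layout listed above, with the package marker files named __init__.py, and missing_files is a hypothetical helper rather than anything Scrapy provides.

```python
import os

# Files that `scrapy startproject tutorial` is expected to create,
# relative to the outer tutorial/ directory.
EXPECTED = [
    "scrapy.cfg",
    os.path.join("tutorial", "__init__.py"),
    os.path.join("tutorial", "items.py"),
    os.path.join("tutorial", "pipelines.py"),
    os.path.join("tutorial", "settings.py"),
    os.path.join("tutorial", "spiders", "__init__.py"),
]

def missing_files(project_root, isfile=os.path.isfile):
    """Return the expected files that do not exist under project_root.

    `isfile` can be swapped out to try the function without touching the disk.
    """
    return [rel for rel in EXPECTED
            if not isfile(os.path.join(project_root, rel))]

if __name__ == "__main__":
    # Run from the directory that contains the generated project.
    print(missing_files("tutorial"))
```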

    scrapy.cfg is the project's configuration file.
    Spiders you write yourself go in the spiders directory. Create a file named dmoz.py there.

    Its content is as follows:

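    from scrapy.spider import BaseSpider
    
    class DmozSpider(BaseSpider):
        name = "dmoz"
        # Should cover the domain of start_urls; links to other domains are filtered out.
        allowed_domains = ["pangjiuzala.github.io"]
        start_urls = [
            "http://pangjiuzala.github.io/"
        ]
    
        def parse(self, response):
            # Name the file after the second-to-last "/"-separated piece of the URL.
            filename = response.url.split("/")[-2]
            # Save the raw page body; the with block closes the file.
            with open(filename, 'wb') as f:
                f.write(response.body)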

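    The name attribute matters: no two spiders may share the same name.
    start_urls is where the spider starts crawling; it may contain several URLs.
    parse is the callback that is invoked by default once the spider has downloaded a page, so avoid using this name for your own unrelated methods.
    After the spider has fetched a URL, it calls parse and passes it a response argument that holds the content of the downloaded page. Inside parse you can extract data from that page. The code above simply saves the page content to a file.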
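
The file name comes from splitting the URL on "/" and taking the second-to-last piece. A quick illustration in plain Python (no Scrapy needed) follows; filename_for is a hypothetical helper that mirrors the expression used in parse, and the example.com URLs are made up.

```python
def filename_for(url):
    """Mirror the spider's expression: the second-to-last "/"-separated piece."""
    return url.split("/")[-2]

# A URL ending in "/" yields its last path segment, or the host name
# when there is no path at all.
print(filename_for("http://pangjiuzala.github.io/"))         # pangjiuzala.github.io
print(filename_for("http://example.com/archives/2015/07/"))  # 07
# Without the trailing slash the result shifts one piece to the left.
print(filename_for("http://example.com/archives/2015/07"))   # 2015
```

Pages whose URLs share the same second-to-last piece would overwrite each other, so this naming scheme only suits small experiments.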
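
    Start crawling

    Open a command prompt, change to the generated project's root directory tutorial/, and run scrapy crawl dmoz, where dmoz is the spider's name.

    Parsing page content

    Scrapy offers a convenient way to extract data from a page, based on HtmlXPathSelector: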

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    
    class DmozSpider(BaseSpider):
        name = "dmoz"
        # should cover the domain of start_urls
        allowed_domains = ["pangjiuzala.github.io"]
        start_urls = [
            "http://pangjiuzala.github.io/"
        ]
    
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # every li directly under a ul
            sites = hxs.select('//ul/li')
            for site in sites:
                # extract() returns a list of the matched strings
                title = site.select('a/text()').extract()
                link = site.select('a/@href').extract()
                desc = site.select('text()').extract()
                print title, link, desc

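    HtmlXPathSelector uses XPath to parse the data:
    //ul/li selects every li element that is a direct child of a ul element
    a/@href selects the href attribute of every a element
    a/text() selects the text of the a element
    a[@href="abc"] selects every a element whose href attribute is abc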
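
These expressions can be tried without running a crawl. The sketch below uses Python's built-in xml.etree.ElementTree rather than Scrapy's selector. ElementTree supports only a subset of XPath (paths and [@attr='value'] predicates, but not text() or @href steps), so text and attributes are read through .text, .tail and .get() instead. The HTML fragment and the extract_rows helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small, well-formed fragment standing in for a downloaded page.
PAGE = """
<html><body><ul>
  <li><a href="/tags/Java/">Java</a> posts about Java</li>
  <li><a href="abc">ABC</a> another entry</li>
</ul></body></html>
""".strip()

root = ET.fromstring(PAGE)

def extract_rows(root):
    """Return (title, link, desc) for every li under a ul."""
    rows = []
    for li in root.findall(".//ul/li"):   # roughly //ul/li
        a = li.find("a")                  # the a child of this li
        title = a.text                    # roughly a/text()
        link = a.get("href")              # roughly a/@href
        desc = (a.tail or "").strip()     # li text that follows the a element
        rows.append((title, link, desc))
    return rows

rows = extract_rows(root)
print(rows)
# Attribute predicate, roughly a[@href="abc"]:
abc_links = [a.text for a in root.findall(".//a[@href='abc']")]
print(abc_links)
```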
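    We can store the extracted data in objects that Scrapy understands, and Scrapy can then save those objects for us, so we do not have to write the data to files ourselves. To do this, add classes to items.py that describe the data we want to keep: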

    from scrapy.item import Item, Field
    
    class DmozItem(Item):
        # one Field per value the spider extracts
        title = Field()
        link = Field()
        desc = Field()
    

    Then, in the spider's parse method, store the extracted data in DmozItem objects.

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from tutorial.items import DmozItem
    
    class DmozSpider(BaseSpider):
        name = "dmoz"
        # should cover the domain of start_urls
        allowed_domains = ["pangjiuzala.github.io"]
        start_urls = [
            "http://pangjiuzala.github.io/"
        ]
    
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//ul/li')
            items = []
            for site in sites:
                # one item per li; each field holds the list returned by extract()
                item = DmozItem()
                item['title'] = site.select('a/text()').extract()
                item['link'] = site.select('a/@href').extract()
                item['desc'] = site.select('text()').extract()
                items.append(item)
            return items

    When running scrapy from the command line, two extra options make it write the items returned by parse to a JSON file:
    scrapy crawl dmoz -o items.json -t json
    items.json is placed in the project's root directory.
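
Once the export exists it can be loaded with the standard json module. Because extract() returns a list, every field holds a list of strings, possibly an empty one. The following is a minimal sketch: the sample data is invented and stands in for a real items.json, and first_or_none is a hypothetical helper.

```python
import json

# Stand-in for the contents of items.json; a real file would be loaded
# with json.load() on the opened file instead.
SAMPLE = '''[
 {"title": ["Java"], "link": ["/tags/Java/"], "desc": []},
 {"title": [], "link": [], "desc": ["\\n"]}
]'''

def first_or_none(values):
    """Return the first element of an extract()-style list, or None if empty."""
    return values[0] if values else None

items = json.loads(SAMPLE)
titles = [first_or_none(item["title"]) for item in items]
print(titles)
```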
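
    Letting Scrapy follow all the links on a page

    In the examples above, Scrapy only fetched the URLs listed in start_urls. Usually, though, we want Scrapy to discover the links on a page and then fetch the content behind them as well. To do this, extract the links you need inside parse, build Request objects from them, and return them; Scrapy then fetches those links automatically. The code looks roughly like this: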

    from scrapy.http import Request
    from scrapy.spider import BaseSpider
    
    # Skeleton only: MyItem, item_urls and item_details_url are placeholders to fill in.
    class MySpider(BaseSpider):
        name = 'myspider'
        start_urls = (
            'http://example.com/page1',
            'http://example.com/page2',
        )
    
        def parse(self, response):
            # collect `item_urls`
            for item_url in item_urls:
                yield Request(url=item_url, callback=self.parse_item)
    
        def parse_item(self, response):
            item = MyItem()
            # populate `item` fields
            yield Request(url=item_details_url, meta={'item': item},
                          callback=self.parse_details)
    
        def parse_details(self, response):
            # the item started in parse_item arrives through response.meta
            item = response.meta['item']
            # populate more `item` fields
            return item

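    parse is the default callback. It yields Request objects, and Scrapy fetches the corresponding pages automatically; each time a page arrives, parse_item is called. parse_item in turn yields a Request, which Scrapy fetches before calling parse_details with the result.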
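
The control flow can be seen in a plain-Python simulation. This is not Scrapy's engine: the FakeRequest class, the FAKE_PAGES table and the crawl loop below are simplified stand-ins invented for illustration. They show only how callbacks yield requests and how meta carries a half-filled item on to the next callback.

```python
from collections import deque

class FakeRequest(object):
    """Simplified stand-in for a request: URL, callback and meta dict."""
    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}

# Fake "web": URL -> page body (the links or the text found there).
FAKE_PAGES = {
    "start": ["item/1", "item/2"],
    "item/1": "details/1", "item/2": "details/2",
    "details/1": "first", "details/2": "second",
}

def parse(url, body, meta):
    for item_url in body:                    # one request per item link
        yield FakeRequest(item_url, parse_item)

def parse_item(url, body, meta):
    item = {"url": url}                      # partially filled item
    yield FakeRequest(body, parse_details, meta={"item": item})

def parse_details(url, body, meta):
    item = meta["item"]                      # item handed over via meta
    item["details"] = body
    yield item

def crawl(start_url):
    """Process requests first-in first-out; collect everything that is not a request."""
    queue, items = deque([FakeRequest(start_url, parse)]), []
    while queue:
        req = queue.popleft()
        for result in req.callback(req.url, FAKE_PAGES[req.url], req.meta):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items

print(crawl("start"))
```

Each item is complete only after the last callback in the chain has run, which is why it travels in meta instead of being returned early.
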
    To make this kind of work easier, Scrapy provides another spider base class, CrawlSpider, which makes following links automatically straightforward.

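    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector
    
    class TorrentItem(Item):
        # fields filled in by parse_torrent
        url = Field()
        name = Field()
        description = Field()
        size = Field()
    
    class MininovaSpider(CrawlSpider):
        name = 'mininova.org'
        allowed_domains = ['mininova.org']
        start_urls = ['http://www.mininova.org/today']
        # the first rule only follows links; the second passes pages to parse_torrent
        rules = [Rule(SgmlLinkExtractor(allow=[r'/tor/\d+'])),
                 Rule(SgmlLinkExtractor(allow=[r'/abc/\d+']), 'parse_torrent')]
        def parse_torrent(self, response):
            x = HtmlXPathSelector(response)
            torrent = TorrentItem()
            torrent['url'] = response.url
            torrent['name'] = x.select("//h1/text()").extract()
            torrent['description'] = x.select("//div[@id='description']").extract()
            torrent['size'] = x.select("//div[@id='info-left']/p[2]/text()[2]").extract()
            return torrent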

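    Compared with BaseSpider, the new class has an extra rules attribute. It is a list that can hold several Rule objects, each describing which links should be crawled and which should not. The Rule class is documented at http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
    A rule may or may not have a callback. When it has none, Scrapy simply follows all the links it matches.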
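
The allow patterns are regular expressions tested against each extracted URL. The short sketch below uses Python's re module and assumes the intended patterns are /tor/\d+ and /abc/\d+ (digits after the path prefix). The URLs are made up, and matching_rule is a hypothetical helper that only imitates rule selection; it is not part of Scrapy.

```python
import re

# (pattern, callback name) pairs in rule order; None means "just follow".
RULES = [(r"/tor/\d+", None), (r"/abc/\d+", "parse_torrent")]

def matching_rule(url):
    """Return (pattern, callback) of the first rule whose pattern is found in url."""
    for pattern, callback in RULES:
        if re.search(pattern, url):
            return pattern, callback
    return None

print(matching_rule("http://www.mininova.org/tor/2657665"))
print(matching_rule("http://www.mininova.org/abc/42"))
print(matching_rule("http://www.mininova.org/today"))
```

The raw-string prefix r keeps the backslash in \d intact, which matters because without it the pattern would look for a literal letter d.
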

    Using pipelines.py

    In pipelines.py we can add classes that filter out unwanted items or save items to a database.

    from scrapy.exceptions import DropItem
    
    class FilterWordsPipeline(object):
        """A pipeline for filtering out items which contain certain words in their
        description"""
        # put all words in lowercase
        words_to_filter = ['politics', 'religion']
    
        def process_item(self, item, spider):
            # DmozItem stores the description in its desc field
            text = unicode(item['desc']).lower()
            for word in self.words_to_filter:
                if word in text:
                    raise DropItem("Contains forbidden word: %s" % word)
            # no forbidden word found: pass the item on
            return item

    If an item does not meet the requirements, the pipeline raises an exception and the item is not written to the JSON file.
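
The drop-or-pass behaviour can be sketched without Scrapy. Here DropItem is a locally defined exception and run_pipeline is a hypothetical driver loop standing in for what the framework does; plain dicts with a desc field stand in for items.

```python
class DropItem(Exception):
    """Local stand-in for the exception a pipeline raises to discard an item."""

WORDS_TO_FILTER = ["politics", "religion"]

def process_item(item):
    """Return the item unchanged, or raise DropItem if a forbidden word appears."""
    text = " ".join(item.get("desc", [])).lower()
    for word in WORDS_TO_FILTER:
        if word in text:
            raise DropItem("Contains forbidden word: %s" % word)
    return item

def run_pipeline(items):
    """Keep only the items that process_item lets through."""
    kept = []
    for item in items:
        try:
            kept.append(process_item(item))
        except DropItem:
            pass  # dropped items never reach the output
    return kept

items = [{"title": ["A"], "desc": ["Local Politics today"]},
         {"title": ["B"], "desc": ["Python books"]}]
kept = run_pipeline(items)
print(kept)
```
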
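    To enable a pipeline, settings.py has to be modified as well.
    Add this line:
    ITEM_PIPELINES = ['tutorial.pipelines.FilterWordsPipeline']
    Now run scrapy crawl dmoz -o items.json -t json and the items that fail the check are filtered out. An items.json file is generated in the tutorial directory.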

    Saving data to a MySQL database

    Table creation statement

    create table book ( title char(15) not null, link varchar(50) COLLATE gb2312_chinese_ci DEFAULT NULL);

    Garbled Chinese text

    If Chinese text comes out garbled, set the database collation to gb2312_chinese_ci.

    Configure pipelines.py

    Add the following code:

    from twisted.enterprise import adbapi
    import MySQLdb
    import MySQLdb.cursors
    
    class MySQLStorePipeline(object):
        def __init__(self):
            # Twisted connection pool, so inserts do not block the crawler
            self.dbpool = adbapi.ConnectionPool('MySQLdb',
                db = 'test',    # database name
                user = 'root',  # database user
                passwd = '',    # database password
                cursorclass = MySQLdb.cursors.DictCursor,
                charset = 'utf8',
                use_unicode = False
            )
    
        def process_item(self, item, spider):
            # run the insert in a pool thread and pass the item on unchanged
            self.dbpool.runInteraction(self._conditional_insert, item)
            return item
    
        def _conditional_insert(self, tx, item):
            # title and link are parallel lists; insert one row per pair
            if item.get('title'):
                for title, link in zip(item['title'], item['link']):
                    tx.execute('insert into book values (%s, %s)', (title, link))
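
The insert logic can be exercised without a MySQL server. The sketch below swaps in Python's built-in sqlite3 for MySQLdb (so the placeholder is ? rather than %s) and uses an in-memory database. The table mirrors the book table from the article without the collation clause, and insert_item is a hypothetical helper.

```python
import sqlite3

def insert_item(conn, item):
    """Insert one row per (title, link) pair; return the number of rows added."""
    # zip stops at the shorter list, so a missing link cannot raise IndexError.
    rows = list(zip(item.get("title", []), item.get("link", [])))
    # Parameterized queries leave the quoting of values to the driver.
    conn.executemany("insert into book values (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("create table book (title char(15) not null, link varchar(50))")
added = insert_item(conn, {"title": ["Java", "GC"],
                           "link": ["/tags/Java/", "/tags/GC/"]})
stored = conn.execute("select title, link from book order by rowid").fetchall()
print(added, stored)
```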

    Configure settings.py

    Add the following code:

    ITEM_PIPELINES = ['tutorial.pipelines.MySQLStorePipeline']
    

    Run scrapy crawl dmoz. The scraped titles and links are then inserted into the book table.

    Syntax errors when running the Python script

    Solution:
    http://www.crifan.com/python_syntax_error_indentationerror/comment-page-1/

    The generated JSON file uses Unicode escapes by default

    For example:

    [{"title": ["\u4e3b\u9875"], "tag": [], "link": ["/"], "desc": []},
    {"title": ["\u6587\u7ae0\u5217\u8868"], "tag": [], "link": ["/archives"], "desc": []},
    {"title": [], "tag": [], "link": [], "desc": ["\n \t\t\t\t\t\n\t\t\t\t\t", "\n\t\t\t\t\t\n\t\t\t\t\t"]},
    {"title": ["Java"], "tag": [], "link": ["/tags/Java/"], "desc": []},
    {"title": ["\u7b97\u6cd5"], "tag": [], "link": ["/tags/\u7b97\u6cd5/"], "desc": []},
    {"title": ["\u6570\u636e\u6316\u6398"], "tag": [], "link": ["/tags/\u6570\u636e\u6316\u6398/"], "desc": []},
    {"title": ["\u7269\u8054\u7f51"], "tag": [], "link": ["/tags/\u7269\u8054\u7f51/"], "desc": []},
    {"title": ["C++"], "tag": [], "link": ["/tags/C/"], "desc": []},
    {"title": ["openHAB"], "tag": [], "link": ["/tags/openHAB/"], "desc": []},
    {"title": ["\u4e91\u8ba1\u7b97"], "tag": [], "link": ["/tags/\u4e91\u8ba1\u7b97/"], "desc": []},
    {"title": ["C"], "tag": [], "link": ["/tags/C/"], "desc": []},
    {"title": ["\u79fb\u52a8\u4e92\u8054\u7f51"], "tag": [], "link": ["/tags/\u79fb\u52a8\u4e92\u8054\u7f51/"], "desc": []},
    {"title": ["GC"], "tag": [], "link": ["/tags/GC/"], "desc": []},
    {"title": ["\u5927\u6570\u636e"], "tag": [], "link": ["/tags/\u5927\u6570\u636e/"], "desc": []},
    {"title": ["\u5fae\u535a"], "tag": [], "link": ["http://weibo.com/jiayou087"], "desc": ["\n            \n            \t", "\n            \n          "]},
    {"title": ["CSDN"], "tag": [], "link": ["http://blog.csdn.net/pangjiuzala"], "desc": ["\n            \n            \t", "\n            \n          "]},
    {"title": ["July 2015"], "tag": [], "link": ["/archives/2015/07/"], "desc": []}]

    Solution:

    Configure pipelines.py

    Add the following code:

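    import json
    import codecs
    
    class JsonWriterPipeline(object):
    
        def __init__(self):
            # open the output file as UTF-8 so non-ASCII text is written readably
            self.file = codecs.open('items.json', 'w', encoding='utf-8')
    
        def process_item(self, item, spider):
            # one JSON object per line
            line = json.dumps(dict(item)) + "\n"
            # turn \uXXXX escapes back into characters; note that this also
            # expands \n, \t and \" inside the values
            self.file.write(line.decode('unicode_escape'))
            return item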
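
An alternative that avoids the unicode_escape round trip is to ask json.dumps not to escape non-ASCII characters in the first place, via its ensure_ascii=False option. The following is a sketch in plain Python, separate from the pipeline above; to_json_line is a hypothetical helper.

```python
import json

def to_json_line(item):
    """Serialize one item as a JSON line, keeping non-ASCII text readable."""
    # ensure_ascii=False leaves non-ASCII characters as they are, while
    # quotes, newlines and tabs inside values stay escaped, so the line
    # remains valid JSON.
    return json.dumps(item, ensure_ascii=False) + "\n"

item = {"title": [u"\u4e3b\u9875"]}   # the two-character title from the listing
escaped = json.dumps(item)            # default output uses \uXXXX escapes
readable = to_json_line(item)         # same data with the characters themselves
print(escaped)
```

The readable line still needs to be written through a file opened with a UTF-8 encoding, as the pipeline does with codecs.open.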

    Configure settings.py

    Add the following code:

    ITEM_PIPELINES = ['tutorial.pipelines.JsonWriterPipeline']
    

    The converted data looks like this:

    [{"title": ["主页"], "tag": [], "link": ["/"], "desc": []},
    {"title": ["文章列表"], "tag": [], "link": ["/archives"], "desc": []},
    {"title": [], "tag": [], "link": [], "desc": ["
     					
    					", "
    					
    					"]},
    {"title": ["Java"], "tag": [], "link": ["/tags/Java/"], "desc": []},
    {"title": ["算法"], "tag": [], "link": ["/tags/算法/"], "desc": []},
    {"title": ["数据挖掘"], "tag": [], "link": ["/tags/数据挖掘/"], "desc": []},
    {"title": ["物联网"], "tag": [], "link": ["/tags/物联网/"], "desc": []},
    {"title": ["C++"], "tag": [], "link": ["/tags/C/"], "desc": []},
    {"title": ["openHAB"], "tag": [], "link": ["/tags/openHAB/"], "desc": []},
    {"title": ["云计算"], "tag": [], "link": ["/tags/云计算/"], "desc": []},
    {"title": ["C"], "tag": [], "link": ["/tags/C/"], "desc": []},
    {"title": ["移动互联网"], "tag": [], "link": ["/tags/移动互联网/"], "desc": []},
    {"title": ["GC"], "tag": [], "link": ["/tags/GC/"], "desc": []},
    {"title": ["大数据"], "tag": [], "link": ["/tags/大数据/"], "desc": []},
    {"title": ["微博"], "tag": [], "link": ["http://weibo.com/jiayou087"], "desc": ["
                
                	", "
                
              "]},
    {"title": ["CSDN"], "tag": [], "link": ["http://blog.csdn.net/pangjiuzala"], "desc": ["
                
                	", "
                
              "]},
    {"title": ["July 2015"], "tag": [], "link": ["/archives/2015/07/"], "desc": []}]

    Using Beautiful Soup

    Further details:
    http://kevinkelly.blog.163.com/blog/static/21390809320133185748442/

  • Original article: https://www.cnblogs.com/ainima/p/6331875.html