Web Scraping with Scrapy

Getting Started with Scrapy

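Scrapy is built on Twisted, an event-driven networking framework with strong support for asynchronous I/O. On top of it, Scrapy provides high-performance asynchronous downloading, request queuing, and persistence hooks; distributed crawling is usually added through extensions such as scrapy-redis.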

Installing Scrapy

On Linux, install it directly from the command line: pip install scrapy

On Windows:

　　- pip3 install wheel

　　- Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

　　　　- The Twisted wheel you download must match the version of your current Python interpreter; otherwise pip reports an error like this:

　　　　　　Twisted-18.9.0-cp37-cp37m-win_amd64.whl is not a supported wheel on this platform.

　　- Change into the download directory and run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl (use the filename of the wheel you actually downloaded)

　　- pip3 install pywin32

　　- pip3 install scrapy
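
The "not a supported wheel" error above usually comes down to the tags embedded in the wheel's filename. As a rough sketch (assuming CPython; interpreter_tag and wheel_python_tag are hypothetical helper names, not part of pip), the Python tag of a wheel can be compared with the running interpreter like this:

```python
import sys


def interpreter_tag(version_info=sys.version_info):
    """Return the CPython tag for an interpreter version, e.g. 'cp37' for 3.7."""
    return "cp{}{}".format(version_info[0], version_info[1])


def wheel_python_tag(filename):
    """Return the python tag of a wheel filename (third field from the end)."""
    stem = filename[:-len(".whl")]
    return stem.split("-")[-3]


wheel = "Twisted-18.9.0-cp37-cp37m-win_amd64.whl"
print("wheel built for :", wheel_python_tag(wheel))
print("this interpreter:", interpreter_tag())
print("python tag match:", wheel_python_tag(wheel) == interpreter_tag())
```

The platform tag (win_amd64 versus win32) has to match as well; this sketch only checks the Python version part.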

Scrapy Project Structure

Create a project with scrapy startproject projectname.

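　　- scrapy.cfg　　basic configuration for the Scrapy project

　　- items.py　　data templates used to structure scraped data, similar to a Django Model

　　- middlewares.py　　custom middlewares

　　- pipelines.py　　persistence handling for scraped data

　　- settings.py　　configuration, e.g. crawl depth, concurrency, download delay

　　- spiders　　spider directory; create spider files and write parsing rules here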
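
The settings mentioned for settings.py (crawl depth, concurrency, download delay) correspond to ordinary module-level variables. A minimal fragment might look like the following; the values are illustrative assumptions, not recommendations from the original post:

```python
# settings.py (fragment) -- illustrative values only
DEPTH_LIMIT = 3           # maximum crawl depth; 0 (the default) means no limit
CONCURRENT_REQUESTS = 16  # maximum number of concurrent requests (16 is Scrapy's default)
DOWNLOAD_DELAY = 0.5      # seconds to wait between requests to the same website
```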

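Creating a spider:

　　- Change into the project directory

　　- scrapy genspider <spider name> <start URL to crawl>　　(for example: scrapy genspider test_1 www.baidu.com)

　　- A file named after the spider (test_1.py in this example) is created in the spiders directory, with the following content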

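import scrapy


class Test1Spider(scrapy.Spider):
    name = 'test_1'  # spider name
    allowed_domains = ['www.baidu.com']  # domains allowed to be crawled; requests to other domains are skipped
    start_urls = ['http://www.baidu.com/']  # URLs where crawling starts

    # Callback invoked once the start URL has been fetched. The response
    # argument is the response object returned for that request.
    # The return value must be an iterable or None.
    def parse(self, response):
        pass  # response.text  returns the response content as a str
              # response.body  returns the response content as bytes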
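
The "iterable or None" rule means a callback can either return a list or be written as a generator with yield. A stand-alone sketch of the two shapes, using a plain dict as a stand-in for the response object (no Scrapy involved; parse_as_list, parse_as_generator and fake_response are hypothetical names):

```python
def parse_as_list(response):
    """Build every result first, then return them together."""
    return [{"url": response["url"], "length": len(response["text"])}]


def parse_as_generator(response):
    """Hand results over one at a time; a generator is also an iterable."""
    yield {"url": response["url"], "length": len(response["text"])}


# A plain dict standing in for a real response object.
fake_response = {"url": "http://www.baidu.com/", "text": "<html></html>"}

print(list(parse_as_list(fake_response)))
print(list(parse_as_generator(fake_response)))
```

The generator form is the one that scales to callbacks producing both items and follow-up requests.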

Running the spider:

　　- scrapy crawl <spider name>

　　- scrapy crawl <spider name> --nolog　　# suppress the log output

Example 1

Edit settings.py

USER_AGENT="Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360EE"
ROBOTSTXT_OBEY=False    # do not obey robots.txt (the "gentlemen's agreement")

Open the spider file

import scrapy

class QiuSpider(scrapy.Spider):
    name = "qiusp"
    allowed_domains = ['www.qiushibaike.com']  # domain names only, not full URLs
    start_urls = ['https://www.qiushibaike.com/']

    def parse(self, response):
        # xpath is a method of the response object
        odiv = response.xpath('//div[@id="content-left"]/div')
        content_list = []  # holds the parsed data
        for div in odiv:
            # xpath() returns a list of Selector objects; call extract() to get the data
            author = div.xpath('.//div[@class="author clearfix"]/a/h2/text()')[0].extract()
            content = div.xpath('.//div[@class="content"]/span/text()')[0].extract()
            # wrap the parsed content in a dict
            dic = {'author': author, 'content': content}
            # store the record in content_list
            content_list.append(dic)
        return content_list
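
Indexing with [0] raises IndexError whenever an XPath expression matches nothing, which would abort parsing of the whole page. Scrapy's selector lists also offer extract_first(), which returns None in that case. The difference can be sketched with plain lists (first_or_default is a hypothetical helper that mimics that behavior, not a Scrapy function):

```python
def first_or_default(values, default=None):
    """Return the first element, or default when the list is empty."""
    return values[0] if values else default


matched = ["some author"]  # what a successful match might look like
unmatched = []             # what an XPath with no match yields

print(first_or_default(matched))
print(first_or_default(unmatched))
print(first_or_default(unmatched, default=""))

try:
    unmatched[0]
except IndexError as exc:
    print("plain indexing fails:", exc)
```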

Example 2

spider

import scrapy

from bossPro.items import BossproItem
class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']

    url = 'https://www.zhipin.com/c101010100/?query=python爬虫&page=%d&ka=page-2'
    page = 1
    # parse the page and persist the items through the pipeline
    def parse(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()').extract_first()
            salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()
            company = li.xpath('.//div[@class="company-text"]/h3/a/text()').extract_first()

            # instantiate an item object
            item = BossproItem()
            # store all parsed data in the item object
            item['job_name'] = job_name
            item['salary'] = salary
            item['company'] = company

            # submit the item to the pipeline
            yield item

        if self.page <= 3:
            print('if branch executed!')
            self.page += 1
            new_url = self.url % self.page
            print(new_url)
            # send the next request manually
            yield scrapy.Request(url=new_url, callback=self.parse)
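
Because page starts at 1 and is incremented before each new request, the condition page <= 3 leads to pages 2, 3 and 4 being requested. The sketch below replays that counter logic without Scrapy and builds the URLs with a percent-encoded query. URL_TEMPLATE, pages_requested and build_url are hypothetical names, and the template is a simplified version of the one in the spider; str.format is used here because an encoded query contains % characters that would clash with %-style formatting.

```python
from urllib.parse import quote

# Simplified template (assumption): only the query and page parameters are kept.
URL_TEMPLATE = "https://www.zhipin.com/c101010100/?query={query}&page={page}"


def pages_requested(start_page=1, limit=3):
    """Replay the spider's counter: increment first, then request that page."""
    page = start_page
    pages = []
    while page <= limit:
        page += 1
        pages.append(page)
    return pages


def build_url(page, query="python爬虫"):
    """Fill the template, percent-encoding the non-ASCII query text."""
    return URL_TEMPLATE.format(query=quote(query), page=page)


for page in pages_requested():
    print(build_url(page))
```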

items.py

items.py declares the fields of the data we want to scrape: here, the job name, the salary, and the company name.

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BossproItem(scrapy.Item):
    # define the fields for your item here like:
    job_name = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()

pipelines.py

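pipelines.py processes the items returned by the spiders; here we write them to a file and to Redis.

Any number of pipeline classes can be defined, but they run in a fixed order, so each one is given a weight in settings.py: the smaller the number, the higher the priority, and the earlier it runs.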

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

from redis import Redis


class BossproPipeline(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started......')
        self.fp = open('./boss.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        print('Spider finished......')
        self.fp.close()

    # Called once for every item the spider submits to the pipeline.
    # item is the item object received by the pipeline.
    def process_item(self, item, spider):
        # print(item)
        self.fp.write(item['job_name'] + ':' + item['salary'] + ':' + item['company'] + '\n')
        return item  # pass the item on to the next pipeline class


class redisPileLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        # print(item)
        dic = {
            'name': item['job_name'],
            'salary': item['salary'],
            'company': item['company']
        }
        # Redis stores strings, so serialize the dict before pushing it
        self.conn.lpush('boss', json.dumps(dic, ensure_ascii=False))
        return item
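
Redis list entries are stored as byte strings, and redis-py 3.0 and later reject a raw dict passed to lpush, so a record like the one built above is generally serialized first. A minimal round-trip sketch using the standard json module (the sample record is made up for illustration, and no Redis connection is involved):

```python
import json

# Made-up sample record with the same keys as the dict built in the pipeline.
record = {"name": "Python爬虫工程师", "salary": "15k-25k", "company": "Example Co."}

# What would be handed to lpush: a JSON string, with non-ASCII text kept readable.
payload = json.dumps(record, ensure_ascii=False)
print(payload)

# What a consumer would do after reading the entry back from the list.
restored = json.loads(payload)
print(restored["company"])
```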

settings.py

USER_AGENT="Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360EE"
ROBOTSTXT_OBEY=False    # do not obey robots.txt (the "gentlemen's agreement")

ITEM_PIPELINES = {
   'bossPro.pipelines.BossproPipeline': 300,
   'bossPro.pipelines.redisPileLine': 301,
}
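
The weights in ITEM_PIPELINES decide the order in which the pipeline classes receive each item, and each class hands its return value to the next one. The following stand-alone sketch imitates that ordering and chaining with plain functions; execution_order, run_chain and the two stage functions are hypothetical stand-ins, not Scrapy internals:

```python
ITEM_PIPELINES = {
    "bossPro.pipelines.redisPileLine": 301,
    "bossPro.pipelines.BossproPipeline": 300,
}


def execution_order(pipelines):
    """Sort pipeline paths by weight: the smallest number runs first."""
    return sorted(pipelines, key=pipelines.get)


def run_chain(stages, item):
    """Feed the item through each stage; every stage receives the previous return value."""
    for stage in stages:
        item = stage(item)
    return item


def write_to_file(item):
    """Stand-in for a process_item that returns the item."""
    return item


def forgets_to_return(item):
    """Stand-in for a process_item without a return statement."""
    pass


print(execution_order(ITEM_PIPELINES))
print(run_chain([write_to_file], {"job_name": "demo"}))
print(run_chain([forgets_to_return, write_to_file], {"job_name": "demo"}))
```

The last line prints None: a stage that does not return the item leaves every later stage with nothing to work on, which is why each process_item should end with return item.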

Original post: https://www.cnblogs.com/cuiyuanzhang/p/9493496.html