• scrapy


    Installing and getting started with scrapy

      Scrapy is a full-featured crawling framework. It depends on Twisted and implements concurrent crawling internally through an event-loop mechanism.

      Download and install:

     - Win:
        下载:http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
                    
        pip3 install wheel   
        pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl  # if the 64-bit wheel fails to install, try the 32-bit one
                    
        pip3 install pywin32
                    
        pip3 install scrapy 
    
     - Linux:
       pip3 install scrapy
    

      

        What is Twisted, and how does it differ from requests?
        requests is a Python module that sends HTTP requests while impersonating a browser.
            - wraps a socket to send the request
            
        Twisted is an asynchronous, non-blocking networking framework built on an event loop.
            - wraps a socket to send the request
            - completes concurrent requests in a single thread
            PS: three related terms
                - non-blocking: never wait
                - asynchronous: callbacks
                - event loop: keep looping and checking the state of pending work.
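
        The same three terms show up in the minimal sketch below. It uses Python's built-in asyncio rather than Twisted (an assumption made only to keep the example short; Scrapy itself runs on Twisted), but the pattern is the same: one thread, an event loop, non-blocking waits, and callbacks.

    import asyncio

    async def fetch(url):
        # non-blocking: "await" hands control back to the event loop instead of waiting
        await asyncio.sleep(1)          # stand-in for a network round trip
        return 'response from %s' % url

    def handle_response(task):
        # asynchronous: the event loop fires this callback when the task finishes
        print(task.result())

    async def main():
        urls = ['http://example.com/1', 'http://example.com/2', 'http://example.com/3']
        tasks = []
        for url in urls:
            t = asyncio.ensure_future(fetch(url))
            t.add_done_callback(handle_response)
            tasks.append(t)
        # event loop: keeps checking the state of every task until all are done
        await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())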
    

      

        Components and the execution flow? (a skeleton spider illustrating this flow follows the list)
        - The engine locates the spider to run, calls its start_requests method, and gets back an iterator.
        - Iterating over it produces Request objects; each Request wraps the URL to visit and a callback function.
        - All Request objects (tasks) are placed in the scheduler, waiting to be downloaded by the downloader.
        - The downloader pulls tasks (Request objects) from the scheduler; when a download finishes, the callback runs.
        - Back in the spider's callback you can:
            yield Request()
            yield Item()

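        The skeleton below is only a rough illustration of that cycle; the spider name, start URL, and xpath are placeholder assumptions. It shows where yield Request (back into the scheduler) and yield Item / dict (on to the pipelines) fit in.

    import scrapy
    from scrapy.http import Request

    class FlowDemoSpider(scrapy.Spider):
        name = 'flow_demo'                      # placeholder name
        start_urls = ['http://example.com/']    # placeholder start URL

        def parse(self, response):
            # yield data -> handed to the item pipelines
            yield {'url': response.url}
            # yield a Request -> put back into the scheduler for the downloader
            next_page = response.xpath('//a/@href').extract_first()
            if next_page:
                yield Request(url=response.urljoin(next_page), callback=self.parse)
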

      Basic commands

        # create a project
        scrapy  startproject xdb
        
        cd xdb
        
        # create spiders
        scrapy genspider chouti chouti.com
        scrapy genspider cnblogs cnblogs.com
        
        # run a spider
        scrapy crawl chouti
        scrapy crawl chouti --nolog 

      HTML parsing: xpath

    	- response.text 
    	- response.encoding
    	- response.body 
    	- response.request
    	# response.xpath('//div[@href="x1"]/a').extract_first()
    	# response.xpath('//div[@href="x1"]/a').extract()
    	# response.xpath('//div[@href="x1"]/a/text()').extract()
    	# response.xpath('//div[@href="x1"]/a/@href').extract()
    

       Issuing further requests: yield a Request object

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['chouti.com']
        start_urls = ['http://chouti.com/']
    
        def parse(self, response):
            # print(response,type(response)) # response object
            # print(response.text)
            """
            from bs4 import BeautifulSoup
            soup = BeautifulSoup(response.text,'html.parser')
            content_list = soup.find('div',attrs={'id':'content-list'})
            """
            # search the descendants for the div with id=content-list
            f = open('news.log', mode='a+')
            item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
            for item in item_list:
                text = item.xpath('.//a/text()').extract_first()
                href = item.xpath('.//a/@href').extract_first()
                print(href,text.strip())
                f.write(href + '\n')
            f.close()
    
            page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
            for page in page_list:
                from scrapy.http import Request
                page = "https://dig.chouti.com" + page
                yield Request(url=page,callback=self.parse) # https://dig.chouti.com/all/hot/recent/2
    

      Note: if you hit encoding errors while crawling, try adding the lines below

    # import sys,os,io
    # sys.stdout=io.TextIOWrapper(sys.stdout.buffer,encoding='gb18030')
    

      If the spider never runs parse, change this entry in the settings file

    ROBOTSTXT_OBEY = False
    

      

      The flow above has two drawbacks: 1. every request opens and then closes the connection/file again; 2. responsibilities are mixed, with parsing and storage living in the same code.

      To address these two problems, scrapy provides persistence support.

    Persistence: pipeline/items

      1. Define a pipeline class; this is where the storage logic goes

    class XdbPipeline(object):
        def process_item(self, item, spider):
            return item

      2. Define an Item class; this declares the fields the pipeline will receive

    class XdbItem(scrapy.Item):
         href = scrapy.Field()
         title = scrapy.Field()
    

      3. Register it in settings

    ITEM_PIPELINES = {
        'xdb.pipelines.XdbPipeline': 300,
    }

      Each time the spider yields an Item object, process_item is called once.
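
      For example, a minimal sketch of the spider side (it assumes the XdbItem above lives in xdb/items.py and reuses the xpath from the earlier chouti example):

    # inside the spider class:
    from xdb.items import XdbItem

    def parse(self, response):
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            # each yield below triggers exactly one process_item call
            yield XdbItem(title=text, href=href)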

            Writing the pipeline:

    	'''
    	What happens under the hood:
    	1. Check whether the XdbPipeline class defines from_crawler
    		if it does:
    			obj = XdbPipeline.from_crawler(....)
    		if not:
    			obj = XdbPipeline()
    	2. obj.open_spider()
    	
    	3. obj.process_item() is called once per item, over and over
    	
    	4. obj.close_spider()
    	'''
    from scrapy.exceptions import DropItem
    
    class FilePipeline(object):
    
    	def __init__(self,path):
    		self.f = None
    		self.path = path
    
    	@classmethod
    	def from_crawler(cls, crawler):
    		"""
    		Called at initialization time to create the pipeline object
    		:param crawler:
    		:return:
    		"""
    		print('File.from_crawler')
    		path = crawler.settings.get('HREF_FILE_PATH')
    		return cls(path)
    
    	def open_spider(self,spider):
    		"""
    		Called when the spider starts running
    		:param spider:
    		:return:
    		"""
    		print('File.open_spider')
    		self.f = open(self.path,'a+')
    
    	def process_item(self, item, spider):
    		# f = open('xx.log','a+')
    		# f.write(item['href'] + '\n')
    		# f.close()
    		print('File',item['href'])
    		self.f.write(item['href'] + '\n')
    		
    		# return item  	# pass the item on to the next pipeline's process_item method
    		raise DropItem()  # later pipelines' process_item methods will not run
    
    	def close_spider(self,spider):
    		"""
    		Called when the spider is closed
    		:param spider:
    		:return:
    		"""
    		print('File.close_spider')
    		self.f.close()
    

       Note: pipelines are shared by every spider. To customize behaviour for one spider, use the spider
            parameter and branch on it yourself (see the sketch below). For pipeline persistence: from_crawler
            supplies the output path, open_spider opens the file/connection, close_spider closes it, and
            process_item does the actual persisting. Returning item hands it to the next pipeline's
            process_item method; raising DropItem() stops later pipelines' process_item from running.
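
      A minimal sketch of per-spider customization (the spider names are the ones created with genspider earlier; the branching logic is an illustrative assumption):

    def process_item(self, item, spider):
        # the pipeline is shared, so branch on which spider produced the item
        if spider.name == 'chouti':
            self.f.write(item['href'] + '\n')
        elif spider.name == 'cnblogs':
            pass  # handle cnblogs items differently here
        return item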

    Deduplication rules

            Write the filter class

    from scrapy.dupefilter import BaseDupeFilter
    from scrapy.utils.request import request_fingerprint
    
    class XdbDupeFilter(BaseDupeFilter):
    
    	def __init__(self):
    		self.visited_fd = set()
    
    	@classmethod
    	def from_settings(cls, settings):
    		return cls()
    
    	def request_seen(self, request):
    		fd = request_fingerprint(request=request)
    		if fd in self.visited_fd:
    			return True
    		self.visited_fd.add(fd)
    
    	def open(self):  # can return deferred
    		print('start')
    
    	def close(self, reason):  # can return a deferred
    		print('end')
    
    	# def log(self, request, spider):  # log that a request has been filtered
    	#     print('log')
    

      Configuration

            # replace the default dedup filter
            # DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
            DUPEFILTER_CLASS = 'xdb.dupefilters.XdbDupeFilter'

       Using it in the spider

    class ChoutiSpider(scrapy.Spider):
    	name = 'chouti'
    	allowed_domains = ['chouti.com']
    	start_urls = ['https://dig.chouti.com/']
    
    	def parse(self, response):
    		print(response.request.url)
    		# item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
    		# for item in item_list:
    		#     text = item.xpath('.//a/text()').extract_first()
    		#     href = item.xpath('.//a/@href').extract_first()
    
    		page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
    		for page in page_list:
    			from scrapy.http import Request
    			page = "https://dig.chouti.com" + page
    			# yield Request(url=page,callback=self.parse,dont_filter=False) # https://dig.chouti.com/all/hot/recent/2
    			yield Request(url=page,callback=self.parse,dont_filter=True) # https://dig.chouti.com/all/hot/recent/2
    

             Note:
                - put the correct logic in request_seen
                - dont_filter=False
                
                To get deduplication, define a custom dupefilter class and do the dedup work in its
                request_seen method; also make sure dont_filter is False when yielding the Request
                (False is already the default).

    Depth and priority

            - Depth
                - starts at 0
                - each time a Request is yielded, its depth is the parent request's depth + 1
                setting: DEPTH_LIMIT caps the crawl depth
            - Priority
                - a request's download priority -= depth * DEPTH_PRIORITY
                setting: DEPTH_PRIORITY

       Getting the current depth: response.meta.get("depth", 0)
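
       A minimal settings sketch (the numbers are illustrative assumptions):

    # settings.py
    DEPTH_LIMIT = 3      # drop requests deeper than 3 levels
    DEPTH_PRIORITY = 1   # positive value: deeper requests get a lower priority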

    Setting cookies

      Method 1: carry and parse cookies yourself (the methods below live inside ChoutiSpider, with Request imported from scrapy.http)

        cookie_dict = {}
        def parse(self, response):
    
            # carry/parse approach
            # pull the cookies out of the response headers into a CookieJar object
            from scrapy.http.cookies import CookieJar
            from urllib.parse import urlencode
            cookie_jar = CookieJar()
            cookie_jar.extract_cookies(response, response.request)
            # parse the cookies from the jar into a plain dict
            for k, v in cookie_jar._cookies.items():
                for i, j in v.items():
                    for m, n in j.items():
                        self.cookie_dict[m] = n.value
    
            yield Request(
                url="https://dig.chouti.com/login",
                method="POST",
                # body can be concatenated by hand or built with urlencode
                body="phone=8613121758648&password=woshiniba&oneMonth=1",
                cookies=self.cookie_dict,
                headers={
                    "Content-Type":'application/x-www-form-urlencoded; charset=UTF-8'
                },
                callback=self.check_login
            )
    
        def check_login(self, response):
            print(response.text)
            yield Request(
                url="https://dig.chouti.com/all/hot/recent/1",
                cookies=self.cookie_dict,
                callback=self.index
            )
    
        def index(self, response):
            news_list = response.xpath("//div[@id='content-list']/div[@class='item']")
            for new in news_list:
                link_id = new.xpath(".//div[@class='part2']/@share-linkid").extract_first()
                yield Request(
                    url="http://dig.chouti.com/link/vote?linksId=%s"%(link_id, ),
                    method="POST",
                    cookies=self.cookie_dict,
                    callback=self.check_result
                )
    
            page_list = response.xpath("//div[@id='dig_lcpage']//a/@href").extract()
            for page in page_list:
                page = "https://dig.chouti.com" + page
                yield Request(url=page, callback=self.index)
    
        def check_result(self, response):
            print(response.text)

      Method 2: meta

    meta={'cookiejar': True}
    
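      A minimal sketch of the meta approach (Scrapy's built-in CookiesMiddleware keeps a separate cookie jar per 'cookiejar' meta value, so the same key must be passed along on follow-up requests; COOKIES_ENABLED must not be turned off):

    def start_requests(self):
        # start a named cookie jar for this session
        yield Request(url='https://dig.chouti.com/', meta={'cookiejar': True},
                      callback=self.parse)

    def parse(self, response):
        # reuse the same jar so the login cookies stick to later requests
        yield Request(url='https://dig.chouti.com/login',
                      method='POST',
                      body='phone=8613121758648&password=woshiniba&oneMonth=1',
                      headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.check_login)
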

      

    start_urls

            The scrapy engine takes what start_requests returns (a list/generator of Requests), wraps it
            in an iterator, and puts it into the scheduler; the downloader then pulls Request objects from
            the scheduler via __next__.

             - Customization: start_requests can fetch its URLs from redis, or set a proxy via os.environ
               beforehand (a redis sketch follows the example code below)

            - Internal principle:
            """
            How the scrapy engine gets the start URLs from the spider:
                1. call start_requests and take its return value
                2. v = iter(return value)
                3. 
                    req1 = v.__next__()
                    req2 = v.__next__()
                    req3 = v.__next__()
                    ...
                4. all the requests are put into the scheduler
                
            """

            - Example

    import scrapy
    from scrapy.http import Request

    class ChoutiSpider(scrapy.Spider):
    	name = 'chouti'
    	allowed_domains = ['chouti.com']
    	start_urls = ['https://dig.chouti.com/']
    	cookie_dict = {}
    	
    	def start_requests(self):
    		# option 1: yield each Request
    		for url in self.start_urls:
    			yield Request(url=url)
    		# option 2: return a list of Requests
    		# req_list = []
    		# for url in self.start_urls:
    		#     req_list.append(Request(url=url))
    		# return req_list

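      A minimal sketch of the redis customization mentioned above (the redis key, host, and port are illustrative assumptions; requires the redis package):

    import redis
    import scrapy
    from scrapy.http import Request

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'

        def start_requests(self):
            # hypothetical redis list that holds the start URLs
            conn = redis.Redis(host='127.0.0.1', port=6379)
            while True:
                url = conn.lpop('chouti:start_urls')
                if not url:
                    break
                yield Request(url=url.decode('utf-8'), callback=self.parse)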
            

    Proxies

            Question: how do you add a proxy in scrapy?
                - environment variables: set the proxy in os.environ inside start_requests, before the crawl starts
                - meta: set the meta attribute when yielding the Request
                - custom downloader middleware: set it in process_request; this is how you get random proxy rotation

      Built-in proxy support: just set the proxy in os.environ when the spider starts, before any request is sent.

    class ChoutiSpider(scrapy.Spider):
    	name = 'chouti'
    	allowed_domains = ['chouti.com']
    	start_urls = ['https://dig.chouti.com/']
    	cookie_dict = {}
    
    	def start_requests(self):
    		import os
    		os.environ['HTTPS_PROXY'] = "http://root:woshiniba@192.168.11.11:9999/"
    		os.environ['HTTP_PROXY'] = '19.11.2.32'
    		for url in self.start_urls:
    			yield Request(url=url,callback=self.parse)

      Setting the proxy via meta: set the meta attribute when yielding the Request

    class ChoutiSpider(scrapy.Spider):
    	name = 'chouti'
    	allowed_domains = ['chouti.com']
    	start_urls = ['https://dig.chouti.com/']
    	cookie_dict = {}
    
    	def start_requests(self):
    		for url in self.start_urls:
    			yield Request(url=url,callback=self.parse,meta={'proxy':'http://root:woshiniba@192.168.11.11:9999/'})
    

      Custom downloader middleware: add the proxy in process_request; this is where you can implement random proxy selection

    import base64
    import random
    from six.moves.urllib.parse import unquote
    
    try:
        from urllib2 import _parse_proxy
    except ImportError:
        from urllib.request import _parse_proxy
    from six.moves.urllib.parse import urlunparse
    from scrapy.utils.python import to_bytes
    
    
    class XdbProxyMiddleware(object):
    
        def _basic_auth_header(self, username, password):
            user_pass = to_bytes(
                '%s:%s' % (unquote(username), unquote(password)),
                encoding='latin-1')
            return base64.b64encode(user_pass).strip()
    
        def process_request(self, request, spider):
            PROXIES = [
                "http://root:woshiniba@192.168.11.11:9999/",
                "http://root:woshiniba@192.168.11.12:9999/",
                "http://root:woshiniba@192.168.11.13:9999/",
                "http://root:woshiniba@192.168.11.14:9999/",
                "http://root:woshiniba@192.168.11.15:9999/",
                "http://root:woshiniba@192.168.11.16:9999/",
            ]
            url = random.choice(PROXIES)
    
            orig_type = ""
            proxy_type, user, password, hostport = _parse_proxy(url)
            proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))
    
            if user:
                creds = self._basic_auth_header(user, password)
            else:
                creds = None
            request.meta['proxy'] = proxy_url
            if creds:
                request.headers['Proxy-Authorization'] = b'Basic ' + creds

      

    Selectors and parsing

    html = """<!DOCTYPE html>
    <html>
        <head lang="en">
            <meta charset="UTF-8">
            <title></title>
        </head>
        <body>
            <ul>
                <li class="item-"><a id='i1' href="link.html">first item</a></li>
                <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
                <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
            </ul>
            <div><a href="llink2.html">second item</a></div>
        </body>
    </html>
    """
    
    from scrapy.http import HtmlResponse
    from scrapy.selector import Selector
    
    response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')
    
    
    # hxs = Selector(response)
    # hxs.xpath()
    response.xpath('//a[@id="i1"]/text()').extract_first()  # e.g. returns "first item"

    Downloader middleware

      In process_request you can do any of the following:

    • return an HtmlResponse object: the download is skipped, but process_response still runs
    • return a Request object: a new request is scheduled instead
    • raise IgnoreRequest: the current request is discarded and process_exception runs
    • modify the request in place, e.g. set the User-Agent

      Writing the middleware:

    from scrapy.http import HtmlResponse
    from scrapy.http import Request
    
    class Md1(object):
    	@classmethod
    	def from_crawler(cls, crawler):
    		# This method is used by Scrapy to create your spiders.
    		s = cls()
    		return s
    
    	def process_request(self, request, spider):
    		# Called for each request that goes through the downloader
    		# middleware.
    
    		# Must either:
    		# - return None: continue processing this request
    		# - or return a Response object
    		# - or return a Request object
    		# - or raise IgnoreRequest: process_exception() methods of
    		#   installed downloader middleware will be called
    		print('md1.process_request',request)
    		# 1. return a Response
    		# import requests
    		# result = requests.get(request.url)
    		# return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)
    		# 2. return a Request
    		# return Request('https://dig.chouti.com/r/tec/hot/1')
    
    		# 3. raise an exception
    		# from scrapy.exceptions import IgnoreRequest
    		# raise IgnoreRequest
    
    		# 4. modify the request (*)
    		# request.headers['user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    
    		pass
    
    	def process_response(self, request, response, spider):
    		# Called with the response returned from the downloader.
    
    		# Must either;
    		# - return a Response object
    		# - return a Request object
    		# - or raise IgnoreRequest
    		print('m1.process_response',request,response)
    		return response
    
    	def process_exception(self, request, exception, spider):
    		# Called when a download handler or a process_request()
    		# (from other downloader middleware) raises an exception.
    
    		# Must either:
    		# - return None: continue processing this exception
    		# - return a Response object: stops process_exception() chain
    		# - return a Request object: stops process_exception() chain
    		pass
    

       Configuration

    DOWNLOADER_MIDDLEWARES = {
       #'xdb.middlewares.XdbDownloaderMiddleware': 543,
    	# 'xdb.proxy.XdbProxyMiddleware':751,
    	'xdb.md.Md1':666,
    	'xdb.md.Md2':667,
    }
    

      Applications (a user-agent sketch follows below):
         - user-agent
         - proxies
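
      For example, a minimal sketch of a random user-agent middleware (the class name, agent strings, and module path are illustrative assumptions; enable it in DOWNLOADER_MIDDLEWARES just like Md1 above):

    import random

    class RandomUserAgentMiddleware(object):
        # hypothetical list; fill in real browser strings as needed
        USER_AGENTS = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36',
        ]

        def process_request(self, request, spider):
            # pick a different user-agent for every outgoing request
            request.headers['User-Agent'] = random.choice(self.USER_AGENTS)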

    Spider middleware

    • process_start_requests runs only once, at spider startup, before the downloader middleware
    • process_spider_input runs after the downloader middleware has finished, just before the callback is invoked
    • process_spider_output runs after the callback has finished

      Writing it:

    class Sd1(object):
    	# Not all methods need to be defined. If a method is not defined,
    	# scrapy acts as if the spider middleware does not modify the
    	# passed objects.
    
    	@classmethod
    	def from_crawler(cls, crawler):
    		# This method is used by Scrapy to create your spiders.
    		s = cls()
    		return s
    
    	def process_spider_input(self, response, spider):
    		# Called for each response that goes through the spider
    		# middleware and into the spider.
    
    		# Should return None or raise an exception.
    		return None
    
    	def process_spider_output(self, response, result, spider):
    		# Called with the results returned from the Spider, after
    		# it has processed the response.
    
    		# Must return an iterable of Request, dict or Item objects.
    		for i in result:
    			yield i
    
    	def process_spider_exception(self, response, exception, spider):
    		# Called when a spider or process_spider_input() method
    		# (from other spider middleware) raises an exception.
    
    		# Should return either None or an iterable of Response, dict
    		# or Item objects.
    		pass
    
    	# runs only once, when the spider starts.
    	def process_start_requests(self, start_requests, spider):
    		# Called with the start requests of the spider, and works
    		# similarly to the process_spider_output() method, except
    		# that it doesn’t have a response associated.
    
    		# Must return only requests (not items).
    		for r in start_requests:
    			yield r

      Configuration

    SPIDER_MIDDLEWARES = {
       # 'xdb.middlewares.XdbSpiderMiddleware': 543,
    	'xdb.sd.Sd1': 666,
    	'xdb.sd.Sd2': 667,
    }

      Applications:
        - depth
        - priority

    Custom commands

      Running a single spider:

    import sys
    from scrapy.cmdline import execute
    
    if __name__ == '__main__':
        execute(["scrapy","crawl","chouti","--nolog"])

            - Running all spiders (a sketch of crawlall.py follows this list):
                - create a directory (any name, e.g. commands) at the same level as spiders
                - create a crawlall.py file inside it (the file name becomes the command name)
                - add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py, e.g. 'xdb.commands'
                - run the command from the project directory: scrapy crawlall
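
      A minimal sketch of commands/crawlall.py (it follows the usual ScrapyCommand pattern; treat the exact body as an assumption and adjust for your Scrapy version):

    # settings.py: COMMANDS_MODULE = 'xdb.commands'
    from scrapy.commands import ScrapyCommand

    class Command(ScrapyCommand):
        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Run every spider in the project'

        def run(self, args, opts):
            # list all registered spiders and schedule each of them
            spider_list = self.crawler_process.spider_loader.list()
            for name in spider_list:
                self.crawler_process.crawl(name)
            self.crawler_process.start()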

    Signals

      Hook into the extension points the framework reserves for you to add custom behaviour

    from scrapy import signals


    class MyExtend(object):
        def __init__(self):
            pass
    
        @classmethod
        def from_crawler(cls, crawler):
            self = cls()
    
            crawler.signals.connect(self.x1, signal=signals.spider_opened)
            crawler.signals.connect(self.x2, signal=signals.spider_closed)
    
            return self
    
        def x1(self, spider):
            print('open')
    
        def x2(self, spider):
            print('close')
    

        Configuration

    EXTENSIONS = {
        'xdb.ext.MyExtend':666,
    }
    

      

