• scrapy


    Reference article: https://www.cnblogs.com/liuqingzheng/articles/10261760.html

    1. Installation:

    #Windows platform
        1. pip3 install wheel  #after this, packages can be installed from .whl files; wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs
        2. pip3 install lxml
        3. pip3 install pyopenssl
        4. download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
        5. download the Twisted wheel file from http://www.lfd.uci.edu/~gohlke/pythonlibs/  #search for "twisted" and pick the build matching your Python version
        6. run: pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
        7. pip3 install scrapy

    #Linux platform
        1. pip3 install scrapy
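
    A quick way to verify the installation (a minimal sanity check, not part of the original steps) is to import Scrapy from a plain Python shell and print its version:

    import scrapy
    print(scrapy.__version__)   # prints the installed Scrapy version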

    2. Command line

    #1 View help
        scrapy -h
        scrapy <command> -h
    
    #2 There are two kinds of commands: Project-only commands must be run from inside a project directory, while Global commands can be run anywhere
        Global commands:
            startproject * #create a project
            genspider *   #create a spider, e.g. cd myscrapy, then: scrapy genspider tmall www.tmall.com  (the generated skeleton is sketched at the end of this section)
            settings     #when run inside a project directory, prints that project's settings
            runspider    #run a standalone python spider file without creating a project
            shell        #scrapy shell <url>  interactive debugging, e.g. to check whether a selector rule is correct
            fetch        #fetch a single page independently of the project; the request headers can also be inspected
            view         #download the page and open it in a browser, useful for spotting which data is loaded via ajax
            version *     #scrapy version shows the Scrapy version; scrapy version -v also shows the versions of its dependencies
        Project-only commands:
            crawl *        # scrapy crawl tmall  run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in the settings
            check        #check the project for syntax errors
            list         #list the spiders contained in the project
            edit         #open a spider in an editor, rarely used
            parse        #scrapy parse <url> --callback <callback>  use this to verify that a callback function works correctly
            bench        #scrapy bench  run a quick benchmark / stress test
    
    #3 Official documentation
        https://docs.scrapy.org/en/latest/topics/commands.html
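
    For reference, running scrapy genspider tmall www.tmall.com produces a minimal spider skeleton roughly like the one below (the exact template may differ slightly between Scrapy versions):

    import scrapy

    class TmallSpider(scrapy.Spider):
        name = 'tmall'
        allowed_domains = ['www.tmall.com']
        start_urls = ['http://www.tmall.com/']

        def parse(self, response):
            pass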

    3. Project structure and a brief intro to spider applications

    project_name/
       scrapy.cfg
       project_name/
           __init__.py
           items.py
           pipelines.py
           settings.py
           spiders/
               __init__.py
               spider1.py
               spider2.py
               spider3.py

    File descriptions:

    • scrapy.cfg  the project's main configuration, used when deploying Scrapy; the spider-related settings live in settings.py.
    • items.py    data models for structured data, similar to Django's Model
    • pipelines    data-processing behaviour, e.g. persisting the structured data
    • settings.py configuration file, e.g. crawl depth, concurrency, download delay, etc. Note: setting names must be UPPERCASE or they are silently ignored; the correct form is USER_AGENT='xxxx' (see the settings sketch below)
    • spiders      the spiders directory: create spider files here and write the crawl rules

    Note: spider files are usually named after the target site's domain.
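
    As a rough illustration of the settings mentioned above, a minimal settings.py fragment might look like this (the values are placeholders, not recommendations):

    #settings.py (fragment)
    ROBOTSTXT_OBEY = False           # do not honour robots.txt (required for the crawl example below)
    USER_AGENT = 'Mozilla/5.0 ...'   # must be UPPERCASE; lowercase setting names are ignored
    CONCURRENT_REQUESTS = 16         # number of concurrent requests
    DOWNLOAD_DELAY = 1               # delay (seconds) between downloads
    DEPTH_LIMIT = 3                  # maximum crawl (recursion) depth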

    #tmall.py

    # -*- coding: utf-8 -*-
    import scrapy
    from urllib.parse import urlencode
    from ..items import MyprojectItem


    class TmallSpider(scrapy.Spider):
        name = 'tmall'
        allowed_domains = ['www.tmall.com', 'httpbin.org']   # domains the spider is allowed to crawl
        # The parent class's start_requests() iterates over start_urls; since
        # start_requests() is overridden below, start_urls is not needed here.
        # start_urls = ['http://httpbin.org/get']

        # per-spider settings: custom request headers
        custom_settings = {
            'DEFAULT_REQUEST_HEADERS': {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': 'en',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
            }
        }

        def __init__(self, pro=None, *args, **kwargs):
            super(TmallSpider, self).__init__(*args, **kwargs)
            self.params = {
                'q': pro,          # search keyword, passed in with `-a pro=...`
                'totalPage': 1,
                'jumpto': 1,
            }
            self.start_urls = 'http://list.tmall.com/search_product.htm?' + urlencode(self.params)

        # override the parent class's start_requests
        def start_requests(self):
            yield scrapy.Request(url=self.start_urls, callback=self.get_totallpage, dont_filter=True)  # dont_filter=True disables request de-duplication

        # callback for the first request (by default the first callback is parse): read the total page count
        def get_totallpage(self, response):
            res = response.css('[name="totalPage"]::attr(value)').extract_first()
            self.params['totalPage'] = int(res)
            for i in range(1, self.params['totalPage'] + 1):
                self.params['jumpto'] = i
                url = 'http://list.tmall.com/search_product.htm?' + urlencode(self.params)
                yield scrapy.Request(url=url, callback=self.get_info, dont_filter=True)

        # parse every product on a listing page
        def get_info(self, response):
            elements = response.css('[class="product "]')
            for element in elements:
                title = element.css('[class="productTitle"] a::attr(title)').extract_first()
                price = element.css('[class="productPrice"] em::attr(title)').extract_first()
                print(title, price)
                item = MyprojectItem()
                item['title'] = title   # Item fields must be set dict-style, not as attributes
                item['price'] = price
                yield item
    #items.py
    import scrapy

    class MyprojectItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()
        price = scrapy.Field()
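
    Note that scrapy.Item behaves like a dict: fields declared with scrapy.Field() are read and written as item['title'] / item['price'], not as attributes, which is why the spider above assigns them dict-style.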

    Configuring proxy IPs:

    #middlewares.py
    import requests            # used below to talk to the local proxy pool
    from scrapy import signals

    class MyprojectDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            # return None
    
            # fetch a proxy IP from the local proxy pool and attach it to the request
            proxy =  requests.get(url='http://127.0.0.1:5010/get').text
            request.meta['proxy'] = 'http://' + proxy
            return None
    
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
    
            # Must either;
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response
    
        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.
    
            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
    
            # if the IP got blocked, remove it from the pool, fetch a new one and re-issue the request
            old_ip = request.meta['proxy'].split('//')[1]
            requests.get('http://127.0.0.1:5010/delete/?proxy={}'.format(old_ip))
    
            proxy = requests.get(url='http://127.0.0.1:5010/get').text
            request.meta['proxy'] = 'http://' + proxy
            return request
    
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
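
    The middleware only takes effect once it is registered in settings.py. A minimal sketch, assuming the project is named myproject and the class lives in myproject/middlewares.py (the module generated by startproject):

    #settings.py
    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.MyprojectDownloaderMiddleware': 543,
    }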

    Run script:

    #create run.py in the project root directory
    from scrapy.cmdline import execute
    execute(['scrapy', 'crawl', 'tmall', '-a', 'pro=男装', '--nolog'])  # --nolog suppresses log output
    # execute(['scrapy', 'crawl', 'tmall','-a','pro=男装'])
    # execute(['scrapy', 'crawl', 'tmall'])
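
    Running run.py is equivalent to executing scrapy crawl tmall -a pro=男装 --nolog from the project directory; the -a key=value pair is forwarded to the spider's __init__ as the pro argument.
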
  • Original post: https://www.cnblogs.com/HZLS/p/11551299.html