• Configuring a Custom Scrapy Template When Creating a Python Project

    1. Locate the Scrapy custom template files

    <Python installation directory>\Lib\site-packages\scrapy\templates\project\module
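
    If the install path is hard to track down, the directory can also be printed from Python itself. The sketch below only assumes that Scrapy is importable and that its templates ship inside the package, which is how the standard distribution lays them out:

    # Print the location of Scrapy's bundled project templates.
    import os
    import scrapy

    template_dir = os.path.join(os.path.dirname(scrapy.__file__),
                                'templates', 'project', 'module')
    print(template_dir)  # e.g. ...\site-packages\scrapy\templates\project\module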

     

    2. Write the custom template files

    settings.py.tmpl: the configuration file for the project

    # Scrapy settings for $project_name project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = '$project_name'
    
    SPIDER_MODULES = ['$project_name.spiders']
    NEWSPIDER_MODULE = '$project_name.spiders'
    
    '''
    Scrapy provides 5 log levels:
    CRITICAL - critical errors
    ERROR    - regular errors
    WARNING  - warning messages
    INFO     - informational messages
    DEBUG    - debugging messages
    '''
    LOG_LEVEL = 'WARNING'
    
    '''
    Some sites do not want to be visited by crawlers, so they inspect whatever is connecting to them;
    if the visitor looks like an automated program rather than a human, they may block further access.
    To keep the crawler running normally, it needs to disguise its identity, which can be done by
    setting the User Agent (UA) header to look like a regular browser.
    '''
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = '$project_name (+http://www.yourdomain.com)'
    USER_AGENT = 'Mozilla/5.0'
    '''
    Alternative: rotate the user agent by picking one at random (requires adding import random
    at the top of this file). Note that USER_AGENT must be a plain string, so assign the chosen
    value directly instead of wrapping it in a header dict:
    USER_AGENT = random.choice(
        ['Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
         'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
         'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
         'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)',
         'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)',
         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
         'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
         'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
         'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
         'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
         'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)',
         'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
         'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30',
         'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'])
    '''
    
    
    '''
    Obey robots.txt rules
    robots.txt is a file on the site's server that implements the Robots Exclusion Protocol.
    It tells search-engine crawlers which parts of the site they are not wanted to crawl or index.
    After Scrapy starts, it fetches the site's robots.txt first and uses it to decide what it may crawl.
    Since we are not building a search engine, and what we want is sometimes exactly what robots.txt
    forbids, this option is sometimes set to False to refuse to follow the Robots protocol.
    '''
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 1  # delay between downloads to avoid getting banned
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    '$project_name.middlewares.${ProjectName}SpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    '$project_name.middlewares.${ProjectName}DownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    # Disabling an extension: turn off the Telnet console (avoids twisted.internet.error.CannotListenError)
    EXTENSIONS = {
        'scrapy.extensions.telnet.TelnetConsole': None,
    }
    
    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        '$project_name.pipelines.${ProjectName}Pipeline': 300,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
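
    The $project_name and ${ProjectName} tokens above are placeholders that get filled in when scrapy startproject renders the .tmpl files. The rendering itself happens inside Scrapy, so the snippet below is just an illustration of the mechanism using the standard-library string.Template, not Scrapy's actual code; the project name pyProject is the example used later in this post:

    # Illustration of how the $project_name / ${ProjectName} placeholders are substituted.
    from string import Template

    line = Template("ITEM_PIPELINES = {'$project_name.pipelines.${ProjectName}Pipeline': 300}")
    print(line.substitute(project_name='pyProject', ProjectName='PyProject'))
    # -> ITEM_PIPELINES = {'pyProject.pipelines.PyProjectPipeline': 300}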

    pipelines.py.tmpl: the item pipeline class for the project

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    # useful for handling different item types with a single interface
    import pymysql
    
    '''
    @Author: System
    @Date: newTimes
    @Description: TODO pipeline configuration
    '''
    class ${ProjectName}Pipeline:
        '''
        @Author: System
        @Date: newTimes
        @Description: TODO database connection >>> open the connection
        '''
    
        def open_spider(self, spider):
            # Database connection
            self.conn = pymysql.connect(
                host='127.0.0.1',  # server IP
                port=3306,  # server port: an integer, so no quotes needed
                user='root',
                password='123456',
                db='database',
                charset='utf8')
            # Get a cursor object that can execute SQL statements
            self.cursor = self.conn.cursor()  # result sets are returned as tuples by default
            print(spider.name, 'database connection opened, crawl started...')
    
        '''
        @Author: System
        @Date: newTimes
        @Description: TODO database connection >>> close the connection
        '''
    
        def close_spider(self, spider):
            self.cursor.close()
            self.conn.close()
            # self.file.close()
            print(spider.name, 'database connection closed, crawl finished...')
    
        '''
        @Author: System
        @Date: newTimes
        @Description: TODO every item enters this method by default
        '''
    
        def process_item(self, item, spider):
            print("into Pipeline's process_item")
            if spider.name == 'test':  # must match the spider's name attribute (the test spider below is named 'test')
                print("into pipeline if")
                self.save_test(item)
            else:
                print("into pipeline else")
            return item
    
        '''
        @Author: System
        @Date: newTimes
        @Description: TODO save a row to the test table
        '''
    
        def save_test(self, item):
            print("into save_test")
            # Check whether the row already exists in the database; save it only if it does not
            # SQL statement to execute
            sql_count = 'select count(id) from test where name = %s'
            # Execute the SQL with the parameter bound
            self.cursor.execute(sql_count, [item['name']])
            # Fetch a single row of the result
            results = self.cursor.fetchone()
            if results[0] == 0:
                try:
                    '''
                    print(item['name'])
                    print(item['type'])
                    print(item['content'])
                    '''
                    sql = "insert into test(name, type, content) values(%s, %s, %s)"
                    self.cursor.execute(sql, [item['name'], item['type'], item['content']])
                    thisId = self.cursor.lastrowid
                    print('saved to the test table, id: ' + repr(thisId))
                    self.conn.commit()
                except Exception as ex:
                    print("出现如下异常%s" % ex)
                    print('回滚')
                    self.conn.rollback()
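
    Before wiring this into a crawl it can be worth verifying the connection settings on their own. This is a minimal standalone sketch; the host, credentials, database name 'database' and table 'test' are just the placeholder values from the template above and should be replaced with your own:

    # Standalone check of the pymysql settings used in open_spider above.
    import pymysql

    conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                           password='123456', db='database', charset='utf8')
    try:
        with conn.cursor() as cursor:
            cursor.execute('select count(id) from test')
            print('rows in test:', cursor.fetchone()[0])
    finally:
        conn.close()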

    middlewares.py.tmpl: the project's middleware classes (left at the defaults; no changes needed)

    items.py.tmpl: the custom item (entity) classes

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    '''
    @Author: System
    @Date: newTimes
    @Description: TODO custom item fields
    '''
    class ${ProjectName}Item(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass  # placeholder
    
    
    '''
    @Author: System
    @Date: newTimes
    @Description: TODO maps to the test table in the database
    '''
    
    
    class TestItem(scrapy.Item):
        # name (maps to the name column of the test table)
        name = scrapy.Field()
        # type
        type = scrapy.Field()
        # detail content
        content = scrapy.Field()
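
    Scrapy items behave like dictionaries restricted to their declared fields, so TestItem can be filled and read with the field names above. A quick sketch; the import assumes the project is named pyProject, as in the example later in this post:

    # Dict-style access on a scrapy.Item subclass.
    from pyProject.items import TestItem  # assumes the project is named pyProject

    item = TestItem(name='demo', type='some type', content='page body ...')
    print(item['name'], item['type'])
    item['content'] = 'updated body'   # assignment also goes through the field name
    # item['other'] = 1 would raise KeyError, because 'other' is not a declared Field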

    spiders/test.py: an optional test spider (configure it or not, as you wish)

    from urllib.request import urlopen
    
    import scrapy
    from bs4 import BeautifulSoup
    
    import requests
    
    '''
    @Author: System
    @Date: newTimes
    @Description: TODO test spider
    '''
    
    
    class TestSpider(scrapy.Spider):
        # must match the spider name used in run.py
        name = 'test'
        # domains the spider is allowed to crawl
        allowed_domains = ['baidu.com']
        # URL(s) the crawl starts from
        start_urls = ['https://image.baidu.com/']
        # custom variable
        link = 'https://image.baidu.com'
    
        '''
        @Author: System
        @Date: newTimes
        @Description: TODO default callback method
        '''
        def parse(self, response, link=link):
            '''Ways of fetching page data in Python (each assignment below overwrites the previous one;
            they only demonstrate four different approaches) >>> start'''
            # 1st: from the Scrapy response
            data = response.text
            # 2nd: with requests
            data = requests.get(link)
            data = data.text
            # 3rd: with urlopen
            data = urlopen(link).read()
            # Beautiful Soup decodes the input document to Unicode and encodes its output as UTF-8
            data = BeautifulSoup(data, "html.parser")
            # 4th: parse with XPath
            data = response.xpath('//div[@id="endText"]').get()
            # Beautiful Soup decodes the input document to Unicode and encodes its output as UTF-8
            data = BeautifulSoup(data, 'html.parser')
            print(data)
            '''Ways of fetching page data in Python >>> end'''
            # Hand off to the getLinkContent callback
            request = scrapy.Request(link, callback=self.getLinkContent)
            # Pass values along on the request
            request.meta['link'] = link
            request.meta['data'] = data
            yield request
    
        '''
        @Author: System
        @Date: newTimes
        @Description: TODO build and save the data from the given link
        '''
        def getLinkContent(self, response):
            print('starting to fetch the XXX link...')
            print(response.meta['link'])
            content = response.xpath('//div[@id="content"]')
            content = "".join(content.extract())
            # Instantiate TestItem; the field names (name, type, content) are the ones defined in items.py (Ps: import TestItem yourself)
            items = TestItem(name='name',
                             type=1,
                             content=content)
            # yield sends the item on to the pipeline
            yield items
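
    Two notes on the spider above. First, as the comment says, TestItem has to be imported manually, e.g. from pyProject.items import TestItem if the project is named pyProject. Second, the values handed to the callback via request.meta can also be passed with cb_kwargs (available since Scrapy 1.7), which delivers them as ordinary arguments; a sketch of that variant, not the template's own code:

    # Inside TestSpider: passing values to the callback with cb_kwargs
    # instead of request.meta (supported since Scrapy 1.7).
    def parse(self, response):
        link = self.link
        data = response.text
        yield scrapy.Request(link, callback=self.getLinkContent,
                             cb_kwargs={'link': link, 'data': data})

    def getLinkContent(self, response, link, data):
        print('link passed from parse:', link)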

    run.py: a unified launcher script for starting the project's crawls

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    '''
    @Author: System
    @Date: newTimes
    @Description: TODO run multiple spiders in the same process
    '''
    process = CrawlerProcess(get_project_settings())
    '''Start the project and begin crawling'''
    process.crawl('test')
    
    process.start()
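
    Since the docstring in run.py mentions running multiple spiders in one process, the same CrawlerProcess can simply be given several crawl() calls before start(); the second spider name below is hypothetical and stands for any other spider registered in the project:

    # Running several spiders in the same process; 'another_spider' is a hypothetical name.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl('test')
    process.crawl('another_spider')
    process.start()  # blocks here until all crawls are finished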

    3. Test the custom template

    Create a new project named pyProject with Scrapy: scrapy startproject pyProject

    Open the newly created pyProject project in PyCharm

    Points that need to be changed:

    1. The database configuration

    2. The test class has to be imported manually (Ps: the test class is only for reference)

         

    3. Configure the run.py launcher

    Click Add Configurations in the top-right corner of PyCharm

  • Original article: https://www.cnblogs.com/mjtabu/p/13596449.html