• Scrapy usage guide


    Create a Scrapy project:

    scrapy startproject <project_name>

    cd into the project directory, then:

    scrapy genspider <spider_name> www.baidu.com  (the target site's domain)

    Then create the spider file as prompted (the official test site is http://quotes.toscrape.com/).

    Create a launcher file:

    from scrapy.cmdline import execute
    execute(['scrapy','crawl','quotes'])

    Here quotes is the spider name; this launcher file should be created in the Scrapy project's root directory.
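
    For orientation, the skeleton that scrapy genspider produces looks roughly like this (the exact boilerplate varies slightly between Scrapy versions):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # fill in the parsing logic here
            pass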

    CSS selectors:

    response.css('.text::text').extract()

    This extracts the text of every element with class='text'; it returns a list.

    response.css('.text::text').extract_first()

    This takes only the first match and returns a str.

    print(response.css("div span::attr(class)").extract())

    This extracts attribute values, here the class attribute of every span inside a div.
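
    Put together, a short sketch of how these selectors might be used inside the spider's parse method (the dict keys and the div.quote / a.tag selectors are my own choices for the quotes site):

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                # ::text grabs the element's text, ::attr(...) grabs an attribute
                'text': quote.css('span.text::text').extract_first(),
                'tags': quote.css('a.tag::text').extract(),
            }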

    XPath selectors:

    url = response.url+response.xpath('/html/body/div/div[2]/div[1]/div[1]/div/a[1]/@href').extract_first()

    Usage is basically the same as standalone XPath; here a relative href is extracted and concatenated with the site's main URL.

    print(response.xpath("//a[@class='tag']/text()").extract())

    This takes the text inside every <a> tag whose class attribute is 'tag'.

    print(response.url)
    print(response.status)

    Print the URL of this response and its HTTP status code.
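
    For joining relative links, response.urljoin is the usual helper instead of concatenating with response.url by hand; a hedged sketch of following an extracted href (the next-page selector is an assumption for the quotes site):

    # inside parse(): follow the next-page link
    next_href = response.xpath("//li[@class='next']/a/@href").extract_first()
    if next_href:
        # urljoin resolves the relative href against the current page's URL
        yield scrapy.Request(response.urljoin(next_href), callback=self.parse)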

    Exporting the scraped items as JSON:

    scrapy crawl quotes -o quotes.json

    JSON Lines export:

    scrapy crawl quotes -o quotes.jl

    The other built-in formats and remote storage work the same way:

    scrapy crawl quotes -o quotes.csv

    scrapy crawl quotes -o quotes.xml

    scrapy crawl quotes -o quotes.pickle

    scrapy crawl quotes -o quotes.marshal

    scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
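
    The same exports can also be configured permanently instead of on the command line; in recent Scrapy versions (2.1+) this is done with the FEEDS setting in settings.py, for example (the file names here are just an illustration):

    FEEDS = {
        'quotes.json': {'format': 'json', 'encoding': 'utf8'},
        'quotes.csv': {'format': 'csv'},
    }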

    Operations in pipelines.py:

    from scrapy.exceptions import DropItem

    class HelloPipeline(object):
        def __init__(self):
            # maximum length kept for the 'name' field
            self.limit = 50

        def process_item(self, item, spider):
            if item['name']:
                if len(item['name']) > self.limit:
                    # truncate overly long names and mark the cut
                    item['name'] = item['name'][:self.limit].rstrip() + '。。。'
                return item
            else:
                # DropItem must be raised, not returned, to discard the item
                raise DropItem('Missing name in %s' % item)
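
    This pipeline assumes every item has a name field; a minimal items.py that would match (the item class name is an assumption):

    import scrapy

    class HelloItem(scrapy.Item):
        # the only field the pipelines in this post rely on
        name = scrapy.Field()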

    import pymongo

    class MongoPipeline(object):
        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # pull the connection settings from settings.py
            return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                       mongo_db=crawler.settings.get('MONGO_DB'))

        def open_spider(self, spider):
            # connect once when the spider starts
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def process_item(self, item, spider):
            # insert_one replaces the deprecated insert()
            self.db['name'].insert_one(dict(item))
            return item

        def close_spider(self, spider):
            self.client.close()

    Remember to enable the pipelines in settings.py:

    ITEM_PIPELINES = {
       'hello.pipelines.HelloPipeline': 300,
       'hello.pipelines.MongoPipeline': 400,
    }
    MONGO_URI = '127.0.0.1'
    MONGO_DB = 'hello'

    Downloader middleware

    Core methods:

    process_request(self, request, spider)

    Return None: Scrapy keeps processing this request through the remaining middlewares until the downloader produces a response; typically used to modify the request.

    Return a Response: the downloader is skipped and that response is returned directly.

    Return a Request: the returned request is put back into the scheduler queue and handled as a brand-new request.

    Raise IgnoreRequest: the process_exception methods of the installed middlewares are called in turn.

    process_response(self, request, response, spider)

    Return a Request: the returned request is put back into the scheduler queue and handled as a brand-new request.

    Return a Response: the response continues through the remaining middlewares' process_response methods until it reaches the spider.

    process_exception(self, request, exception, spider)

    Return None: Scrapy keeps handling the exception, calling the remaining middlewares' process_exception methods until the default exception handling kicks in.

    Return a Response: the exception is treated as handled and the process_response chain of the installed middlewares starts.

    Return a Request: the returned request is rescheduled and downloaded again later.
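
    For illustration, a hedged sketch of a process_exception that retries a timed-out request once (the middleware name and the retried meta key are my own):

    from twisted.internet.error import TimeoutError

    class RetryOnTimeoutMiddleware(object):
        def process_exception(self, request, exception, spider):
            # returning a Request re-queues it; returning None passes the
            # exception on to the remaining middlewares
            if isinstance(exception, TimeoutError) and not request.meta.get('retried'):
                retry_req = request.replace(dont_filter=True)
                retry_req.meta['retried'] = True
                return retry_req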

    Example: rewrite the middleware so that every request gets a User-Agent header and every returned response has its status code changed to 201.

    In settings.py:

    DOWNLOADER_MIDDLEWARES = {
       'dingdian.middlewares.AgentMiddleware': 543,
    }

    In middlewares.py:

    import random

    class AgentMiddleware(object):
        def __init__(self):
            # pool of User-Agent strings to choose from
            self.user_agent = ['Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0']

        def process_request(self, request, spider):
            # attach a randomly chosen User-Agent; returning None lets the request continue
            request.headers['User-Agent'] = random.choice(self.user_agent)
            print(request.headers)

        def process_response(self, request, response, spider):
            # overwrite the status code, then pass the response on
            response.status = 201
            return response

     

    Two ways to issue a request in Scrapy

    The first:

    import scrapy

    yield scrapy.Request(begin_url, callback=self.first)

    The second:

    from scrapy.http import Request

    yield Request(url, callback=self.first, meta={'thename': pic_name[0]})
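
    Whatever is placed in meta travels with the request and can be read back in the callback; a small sketch of that callback (only the thename key comes from the line above):

    def first(self, response):
        # read back the value attached when the request was created
        pic_name = response.meta['thename']
        print(response.url, pic_name)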

    Making a POST request:

    from scrapy import FormRequest  # the request class Scrapy provides for form submissions such as logins

    formdata = {
        'username': 'wangshang',
        'password': 'a706486'
    }
    yield FormRequest(
        url='http://172.16.10.119:8080/bwie/login.do',
        formdata=formdata,           # sent as the POST body
        callback=self.after_login,   # handle the login response here
    )
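
    A hedged sketch of what the after_login callback might do (the success check is an assumption about this particular login endpoint):

    def after_login(self, response):
        if response.status == 200:
            # Scrapy keeps the session cookies automatically from here on
            self.logger.info('login succeeded')
        else:
            self.logger.error('login failed with status %s', response.status)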

    Adding a proxy IP and request headers in the middleware

    import random

    class UserAgentMiddleware(object):
        def __init__(self):
            self.user_agent = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36','Mozilla/5.0 (Windows NT 10.0; WOW64; rv:58.0) Gecko/20100101 Firefox/58.0']

        def process_request(self, request, spider):
            # rotate the User-Agent header and route the request through a proxy
            request.headers['User-Agent'] = random.choice(self.user_agent)
            request.meta['proxy'] = 'http://' + '175.42.123.111:33995'
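
    As with the earlier example, this middleware only takes effect once it is enabled in settings.py (the project and module path below are assumptions):

    DOWNLOADER_MIDDLEWARES = {
       'hello.middlewares.UserAgentMiddleware': 543,
    }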
