• Spider


    1. Code

    import re
    
    import scrapy
    
    from Fang.items import esf_FangItem
    
    
    class ExampleSpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['www.fang.com']
        start_urls = ['https://www.fang.com/SoufunFamily.htm']
    
        def parse(self, response):
            trs = response.xpath('//div[@id="c02"]//tr')
            province = None
            for tr in trs:
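                # The province cell is blank on continuation rows, so default to '' instead of None.
                province_f = tr.xpath('./td[2]//text()').get(default='')
                # Strip all whitespace; the pattern must be \s, not a literal "s".
                province_f = re.sub(r"\s", "", province_f)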
                if province_f:
                    province = province_f
                cities = tr.xpath('./td[3]/a')
                for i in cities:
                    city = i.xpath('./text()').get()
                    city_url = i.xpath('./@href').get()
                    # print(city, city_url)
                    yield scrapy.Request(url=city_url, callback=self.parse_url, meta={'info': (province, city)})
                #     break
                # break
    
        def parse_url(self, response):
            print(2)

    2. Problem description: when the project runs, parse_url is never called, so 2 is never printed.

      The log shows:

    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bj.fang.com': <GET http://bj.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sh.fang.com': <GET http://sh.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tj.fang.com': <GET http://tj.fang.com/>
    2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'cq.fang.com': <GET http://cq.fang.com/>

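    3. Solution

      A Baidu search showed that the domains of the follow-up requests were being filtered out. With allowed_domains = ['www.fang.com'], only www.fang.com and its subdomains count as on-site, so city URLs such as bj.fang.com are dropped by the offsite middleware before parse_url can run.

      Fixes: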

    Method 1:
      Remove the line allowed_domains = ['www.fang.com'],
      or change it to allowed_domains = ['fang.com'] so that every fang.com subdomain is allowed.
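
      To see why 'fang.com' works where 'www.fang.com' does not, here is a minimal sketch of the kind of host check an offsite filter performs. It is an approximation written for illustration, not Scrapy's actual code; the helper names build_host_regex and is_onsite are hypothetical.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Build a pattern that accepts each allowed domain and any of its subdomains."""
    escaped = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(rf"^(.*\.)?({escaped})$")


def is_onsite(url, allowed_domains):
    """Return True if the URL's host passes this approximate offsite check."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


# bj.fang.com is not www.fang.com or one of its subdomains, so it is rejected.
print(is_onsite("http://bj.fang.com/", ["www.fang.com"]))  # False
# bj.fang.com is a subdomain of fang.com, so it is accepted.
print(is_onsite("http://bj.fang.com/", ["fang.com"]))      # True
```

      Under this check, listing the bare registered domain covers www.fang.com and every city subdomain at once.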

    Method 2:
      Add dont_filter=True to the request (not recommended, because it also bypasses the duplicate-request filter):
      yield scrapy.Request(url=city_url, callback=self.parse_url, meta={'info': (province, city)}, dont_filter=True)
     
     
  • Original article: https://www.cnblogs.com/JackShi/p/12532987.html