• 2017.08.04 Python网络爬虫之Scrapy爬虫实战二 天气预报


    1.项目准备:网站地址:http://quanzhou.tianqi.com/

    2.创建编辑Scrapy爬虫:

    scrapy startproject weather

    scrapy genspider HQUSpider quanzhou.tianqi.com

    项目文件结构如图:

     3.修改Items.py:

    4.修改Spider文件HQUSpider.py:

    (1)先使用命令:scrapy shell http://quanzhou.tianqi.com/   测试和获取选择器:

    (2)试验选择器:打开chrome浏览器,查看网页源代码:

    (3)执行命令查看response结果:

    (4)编写HQUSpider.py文件:

    # -*- coding: utf-8 -*-
    import scrapy
    from weather.items import WeatherItem

    class HquspiderSpider(scrapy.Spider):
    name = 'HQUSpider'
    allowed_domains = ['tianqi.com']
    citys=['quanzhou','datong']
    start_urls = []
    for city in citys:
    start_urls.append('http://'+city+'.tianqi.com/')
    def parse(self, response):
    subSelector=response.xpath('//div[@class="tqshow1"]')
    items=[]
    for sub in subSelector:
    item=WeatherItem()
    cityDates=''
    for cityDate in sub.xpath('./h3//text()').extract():
    cityDates+=cityDate
    item['cityDate']=cityDates
    item['week']=sub.xpath('./p//text()').extract()[0]
    item['img']=sub.xpath('./ul/li[1]/img/@src').extract()[0]
    temps=''
    for temp in sub.xpath('./ul/li[2]//text()').extract():
    temps+=temp
    item['temperature']=temps
    item['weather']=sub.xpath('./ul/li[3]//text()').extract()[0]
    item['wind']=sub.xpath('./ul/li[4]//text()').extract()[0]
    items.append(item)
    return items


    (5)修改pipelines.py我,处理Spider的结果:
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import time
    import os.path
    import urllib2
    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

    class WeatherPipeline(object):
    def process_item(self, item, spider):
    today=time.strftime('%Y%m%d',time.localtime())
    fileName=today+'.txt'
    with open(fileName,'a') as fp:
    fp.write(item['cityDate'].encode('utf-8')+' ')
    fp.write(item['week'].encode('utf-8')+' ')
    imgName=os.path.basename(item['img'])
    fp.write(imgName+' ')
    if os.path.exists(imgName):
    pass
    else:
    with open(imgName,'wb') as fp:
    response=urllib2.urlopen(item['img'])
    fp.write(response.read())
    fp.write(item['temperature'].encode('utf-8')+' ')
    fp.write(item['weather'].encode('utf-8')+' ')
    fp.write(item['wind'].encode('utf-8')+' ')
    time.sleep(1)
    return item


    (6)修改settings.py文件,决定由哪个文件来处理获取的数据:

    (7)执行命令:scrapy crawl HQUSpider

    到此为止,一个完整的Scrapy爬虫就完成了。

















  • 相关阅读:
    C#操作OFFICE一(EXCEL)
    json的使用二(转)
    xpath简介(转载)
    使用LogParser分析IIS网站日志
    WCF Data Contract KnownTypeAttribute转载
    ExtJs中同一个URL构造多个Ext.data.JsonStore 转载
    转载Ext.form.formPanel 与服务器交互 初始化表单
    oracle表数据误删恢复
    JavaScript基础之Array函数
    Json格式类的转换相关代码转载
  • 原文地址:https://www.cnblogs.com/hqutcy/p/7284302.html
Copyright © 2020-2023  润新知