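Goal: scrape the names and addresses (URLs) of China's national newspapers
Link: http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm
Purpose: practice scraping data with Scrapy
Now that we have covered the basics of Scrapy, let's write about the simplest spider possible.
Screenshot of the target page: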
1. Create the Scrapy project
    $ cd ~/code/crawler/scrapyProject
    $ scrapy startproject newSpapers
2. Generate the spider
    $ cd newSpapers/
    $ scrapy genspider nationalNewspaper news.xinhuanet.com
3. Define the item fields to scrape
    $ cat items.py
    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class NewspapersItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        name = scrapy.Field()  # newspaper name
        addr = scrapy.Field()  # newspaper address (URL)
4. Write the spider
    $ cat spiders/nationalNewspaper.py
    # -*- coding: utf-8 -*-
    import scrapy
    from newSpapers.items import NewspapersItem


    class NationalnewspaperSpider(scrapy.Spider):
        name = "nationalNewspaper"
        allowed_domains = ["news.xinhuanet.com"]
        start_urls = ['http://news.xinhuanet.com/zgjx/2007-09/13/content_6714741.htm']

        def parse(self, response):
            # Row 2 of the table inside #Zoom: the national newspapers
            sub_country = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[2]')
            # Row 4: selected here but not used in this example
            sub2_local = response.xpath('//*[@id="Zoom"]/div/table/tbody/tr[4]')
            # Every <a> link inside the national newspapers row
            tags_a_country = sub_country.xpath('./td/table/tbody/tr/td/p/a')
            items = []
            for each in tags_a_country:
                item = NewspapersItem()
                # extract() returns a list of matched strings
                item['name'] = each.xpath('./strong/text()').extract()
                item['addr'] = each.xpath('./@href').extract()
                items.append(item)
            return items
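The extraction loop in parse() can be tried offline on a small snippet. The sketch below is not Scrapy: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and both the fragment SAMPLE_HTML and the helper extract_papers are made up for illustration, with a layout only assumed to resemble the real page. One thing to watch: browsers insert tbody when they display a table, so an XPath copied from the browser's developer tools may contain tbody even when the raw HTML does not. If the spider returns nothing, dropping tbody from the path is worth trying.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment assumed to mimic the target page layout.
SAMPLE_HTML = """
<div id="Zoom">
  <table>
    <tr><td>header</td></tr>
    <tr><td><table><tr><td>
      <p><a href="http://example.com/a"><strong>Paper A</strong></a></p>
      <p><a href="http://example.com/b"><strong>Paper B</strong></a></p>
    </td></tr></table></td></tr>
  </table>
</div>
"""


def extract_papers(markup):
    """Return a list of {'name': ..., 'addr': ...} dicts, one per link."""
    root = ET.fromstring(markup)
    papers = []
    # Same idea as the spider: find every <a> inside a paragraph in a table cell.
    for a in root.findall(".//td/p/a"):
        strong = a.find("strong")
        # Skip links without a <strong> title instead of failing on them.
        if strong is None or not strong.text:
            continue
        papers.append({"name": strong.text.strip(), "addr": a.get("href")})
    return papers


if __name__ == "__main__":
    for paper in extract_papers(SAMPLE_HTML):
        print(paper["name"], paper["addr"])
```

Unlike the spider, this version stores plain strings rather than the lists that extract() returns.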
5. Configure which pipeline handles the scraped items
    $ cat settings.py
    ……
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    ITEM_PIPELINES = {'newSpapers.pipelines.NewspapersPipeline': 100}
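The number assigned to each pipeline in ITEM_PIPELINES sets its order: items pass through the pipelines from lower to higher values, and the values are customarily chosen in the 0-1000 range. Below is a minimal sketch of that ordering in plain Python. The second entry, CleanupPipeline, is hypothetical and is only there to give the sort something to do.

```python
# Settings-style mapping; 'CleanupPipeline' is a hypothetical second pipeline.
ITEM_PIPELINES = {
    'newSpapers.pipelines.NewspapersPipeline': 100,
    'newSpapers.pipelines.CleanupPipeline': 50,
}


def run_order(pipelines):
    """Return pipeline paths sorted by priority value, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


if __name__ == "__main__":
    for path in run_order(ITEM_PIPELINES):
        print(path)
```

With a single pipeline, as in this project, the exact value does not matter.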
6. Write the item pipeline
    $ cat pipelines.py
    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

    import time


    class NewspapersPipeline(object):
        def process_item(self, item, spider):
            # Current date (not used below)
            now = time.strftime('%Y-%m-%d', time.localtime())
            filename = 'newspaper.txt'
            print('=================')
            print(item)
            print('================')
            # Append one "name address" line per item, encoded as UTF-8
            with open(filename, 'a', encoding='utf8') as fp:
                fp.write(item['name'][0] + ' ' + item['addr'][0] + '\n')
            return item
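The write step in process_item can be exercised without running a crawl. The following is a sketch under assumptions, not the author's pipeline: it uses plain dicts in place of NewspapersItem, targets Python 3, and the helper names format_line and append_item are invented here. It also falls back to an empty string when a field's list is empty, a case where item['name'][0] would raise IndexError.

```python
import os
import tempfile


def format_line(item):
    """Build 'name addr' from an item whose fields are lists, as extract() returns."""
    name = item.get('name') or ['']
    addr = item.get('addr') or ['']
    return name[0] + ' ' + addr[0]


def append_item(filename, item):
    """Append one line per item, writing UTF-8 regardless of the system locale."""
    with open(filename, 'a', encoding='utf8') as fp:
        fp.write(format_line(item) + '\n')


if __name__ == "__main__":
    # Write to a temporary directory; the item values are sample data.
    path = os.path.join(tempfile.mkdtemp(), 'newspaper.txt')
    append_item(path, {'name': ['人民日报'], 'addr': ['http://example.com/rmrb']})
    with open(path, encoding='utf8') as fp:
        print(fp.read(), end='')
```

Passing the encoding explicitly avoids depending on the platform default when the names contain Chinese characters.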
7. Run the spider (scrapy crawl nationalNewspaper) and check the results
    $ cat spiders/newspaper.txt
    人民日报 http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
    海外版 http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
    光明日报 http://www.gmw.cn/01gmrb/2007-09/20/default.htm
    经济日报 http://www.economicdaily.com.cn/no1/
    解放军报 http://www.gmw.cn/01gmrb/2007-09/20/default.htm
    中国日报 http://pub1.chinadaily.com.cn/cdpdf/cndy/
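The output can be sanity-checked with a few lines of Python. In the listing above, 光明日报 and 解放军报 carry the same address, which is worth double-checking against the source page. Below is a minimal sketch that splits each line into name and address and reports addresses shared by more than one name. The helper names parse_lines and duplicate_addrs are invented here.

```python
from collections import defaultdict

# The six result lines from the listing above.
OUTPUT = """人民日报 http://paper.people.com.cn/rmrb/html/2007-09/20/node_17.htm
海外版 http://paper.people.com.cn/rmrbhwb/html/2007-09/20/node_34.htm
光明日报 http://www.gmw.cn/01gmrb/2007-09/20/default.htm
经济日报 http://www.economicdaily.com.cn/no1/
解放军报 http://www.gmw.cn/01gmrb/2007-09/20/default.htm
中国日报 http://pub1.chinadaily.com.cn/cdpdf/cndy/
"""


def parse_lines(text):
    """Split each non-empty line into (name, addr) at the first space."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            name, _, addr = line.partition(' ')
            pairs.append((name, addr))
    return pairs


def duplicate_addrs(pairs):
    """Map each address used by more than one name to the list of those names."""
    names_by_addr = defaultdict(list)
    for name, addr in pairs:
        names_by_addr[addr].append(name)
    return {addr: names for addr, names in names_by_addr.items() if len(names) > 1}


if __name__ == "__main__":
    for addr, names in duplicate_addrs(parse_lines(OUTPUT)).items():
        print(addr, names)
```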