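[Intended as a reference for beginners; experienced readers can safely skip it.]
The previous two posts covered installing Scrapy and how it works. This post walks through Scrapy's development workflow using a concrete example. I originally meant to use dirbot, the project that ships with the tutorial, but most readers have probably tried it already and know it well, so I will skip it and use my own first reasonably complete crawler as the example instead.
To crawl a site's content at scale, proxy IPs are an essential resource. Without them, a crawler that also pushes for speed is likely to be noticed by the target site and have its machine banned. Collecting a large pool of usable proxy IPs is therefore our first task.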
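Roughly, this crawler needs to do three things:
1. Scrape proxy IP addresses and port information
2. Verify each proxy and determine whether it is anonymous
3. Persist the usable proxies to a file for later crawlers to use
http://www.cnproxy.com/ is a good source of proxy IPs. A quick look shows 12 pages in total, all with the same layout: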
http://www.cnproxy.com/proxy1.html
…
http://www.cnproxy.com/proxy10.html
http://www.cnproxy.com/proxyedu1.html
http://www.cnproxy.com/proxyedu2.html
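Because the twelve page URLs follow two simple patterns, they can be generated instead of typed out. The following is only a minimal sketch, assuming the numbering shown above (proxy1 to proxy10, then proxyedu1 and proxyedu2); the helper name `build_start_urls` is made up for illustration.

```python
BASE = "http://www.cnproxy.com"


def build_start_urls():
    """Return the 12 listing-page URLs: proxy1..proxy10 plus proxyedu1..proxyedu2."""
    urls = ["%s/proxy%d.html" % (BASE, i) for i in range(1, 11)]
    urls += ["%s/proxyedu%d.html" % (BASE, i) for i in range(1, 3)]
    return urls
```

A list built this way can be assigned directly to a spider's start_urls.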
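With that in place, let's get started:
1. Define the item. Given the requirements, a scraped item should end up with the following fields to be useful:
# The first four values can be read directly from the page
IP address
Port
Protocol type
Geographic location
# The last three values are filled in later by the pipeline and are very useful
Proxy type
Delay
Timestamp
So the definition looks like this: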
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html

from scrapy.item import Item, Field


class ProxyItem(Item):
    # Read directly from the page
    address = Field()
    port = Field()
    protocol = Field()
    location = Field()

    # Filled in later by the pipeline
    type = Field()       # 'anonymity' or 'nonanonymity'
    delay = Field()      # in seconds
    timestamp = Field()  # date on which the proxy was checked
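2. Define the spider
The spider's main job is to set the initial URLs, i.e. the "seeds" (the crawling kind, not the torrent kind).
Then, in the default parse function, XPath makes it easy to pull out the fields we need. For example,
addresses = hxs.select('//tr[position()>1]/td[position()=1]').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
returns a list of the IP addresses, and
locations = hxs.select('//tr[position()>1]/td[position()=4]').re(r'<td>(.*)</td>')
returns a list of the geographic locations.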
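The same IP pattern can be tried outside Scrapy with the standard `re` module. The sketch below is illustrative only: the sample rows are made up and much simpler than the real page, and `is_valid_ipv4` is an extra check that is not part of the original spider. It is included because the pattern also matches strings such as 999.999.999.999.

```python
import re

IP_RE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

# Hypothetical table rows, made up for illustration only.
SAMPLE_ROWS = """
<tr><td>10.0.0.1</td><td>HTTP</td><td>fast</td><td>Beijing</td></tr>
<tr><td>192.168.1.20</td><td>HTTP</td><td>slow</td><td>Shanghai</td></tr>
"""


def extract_addresses(html):
    """Return every dotted-quad string found in the HTML, in document order."""
    return IP_RE.findall(html)


def is_valid_ipv4(text):
    """Check that each of the four parts is a number between 0 and 255."""
    parts = text.split(".")
    return len(parts) == 4 and all(p.isdigit() and int(p) <= 255 for p in parts)
```

Filtering the extracted strings through a check like this keeps obviously malformed addresses out of the later verification step.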
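The only slightly awkward part is the port. The site owner presumably anticipated that the data would be scraped, so the ports are written out with JavaScript, which does make scraping harder. Scrapy, like most crawlers, has no browser JavaScript interpreter and does not execute this code, so in the HTML the crawler receives the port numbers have not been written out yet.
In the page source, each port appears as a script call of the form write(":"+r+d+r+d) rather than as literal digits.
So somewhere near the top of the HTML there must be definitions for variables such as r, and they are not hard to find.
After refreshing a few times, I found that these definitions are not dynamic, so we can keep things simple: take the +r+d+r+d expression straight from the source, replace each + with nothing, r with 8 and d with 0. That gives a map like the one below: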
port_map = {'z':'3','m':'4','k':'2','l':'9','d':'0','b':'5','i':'7','w':'6','r':'8','c':'1','+':''}
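As a quick check of the mapping, the decoding step can be written as a small standalone function. This is a minimal sketch; the name `decode_port` is made up here, and the map is copied from the line above.

```python
PORT_MAP = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0', 'b': '5',
            'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}


def decode_port(expr):
    """Translate an expression such as '+r+d+r+d' into its digits.

    Characters that are not in the map are kept unchanged.
    """
    return "".join(PORT_MAP.get(ch, ch) for ch in expr)
```

With this map, decode_port('+r+d+r+d') gives '8080' and decode_port('+z+c+k+r') gives '3128'.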
The full spider code is as follows:
import re

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector

from proxy.items import ProxyItem


class ProxycrawlerSpider(CrawlSpider):
    name = 'cnproxy'
    allowed_domains = ['www.cnproxy.com']

    # Seeds: proxy1.html .. proxy10.html plus the two proxyedu pages
    start_urls = ['http://www.cnproxy.com/proxy%s.html' % i for i in range(1, 11)]
    start_urls.append('http://www.cnproxy.com/proxyedu1.html')
    start_urls.append('http://www.cnproxy.com/proxyedu2.html')

    # The page writes each port as letters via JavaScript; map them back to digits
    port_map = {'z':'3','m':'4','k':'2','l':'9','d':'0','b':'5','i':'7','w':'6','r':'8','c':'1','+':''}
    ports_re = re.compile(r'write\(":"(.*?)\)')

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        addresses = hxs.select('//tr[position()>1]/td[position()=1]').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
        protocols = hxs.select('//tr[position()>1]/td[position()=2]').re(r'<td>(.*)</td>')
        locations = hxs.select('//tr[position()>1]/td[position()=4]').re(r'<td>(.*)</td>')

        # Decode the obfuscated port expressions found in the raw page body
        ports = []
        for raw_port in self.ports_re.findall(response.body):
            for key, value in self.port_map.items():
                raw_port = raw_port.replace(key, value)
            ports.append(raw_port)

        # zip() stops at the shortest list, so a mismatched row cannot raise IndexError
        items = []
        for address, protocol, location, port in zip(addresses, protocols, locations, ports):
            item = ProxyItem()
            item['address'] = address
            item['protocol'] = protocol
            item['location'] = location
            item['port'] = port
            items.append(item)
        return items
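3. Run the pipeline: filter and check the scraped proxy IPs, then persist them to a file
I will skip the routine validation and go straight to how to verify that a proxy works and whether it is anonymous.
This needs the help of a separate CGI program. Put simply, whether a proxy is anonymous comes down to whether, when relaying a request, it includes the source IP in the request so that the target site can read it. If it does, the proxy is not anonymous, and you need to be more careful when using it.
Non-anonymous proxies usually put the source IP in the X-Forwarded-For header, which PHP exposes as HTTP_X_FORWARDED_FOR. To be more thorough, the CGI script echoes every IP-related value the server can see. The PHP code is as follows: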
<?php
// Echo every IP-related value the server can see, so the caller can check
// whether its own IP leaked through the proxy.
echo "PROXYDETECTATION<br>";

// Values from the process environment
foreach (array('REMOTE_ADDR', 'HTTP_CLIENT_IP') as $key) {
    echo "env_" . $key . "<br>";
    var_dump(getenv($key));
    echo "<br>";
}

// Values from $_SERVER; headers the request did not carry are dumped as NULL
$keys = array(
    'REMOTE_ADDR',
    'HTTP_CLIENT_IP',
    'HTTP_X_FORWARDED_FOR',
    'HTTP_X_FORWARDED',
    'HTTP_X_CLUSTER_CLIENT_IP',
    'HTTP_FORWARDED_FOR',
    'HTTP_FORWARDED',
);
foreach ($keys as $key) {
    echo $key . "<br>";
    var_dump(isset($_SERVER[$key]) ? $_SERVER[$key] : null);
    echo "<br>";
}
?>
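On the client side, the response of this script only has to answer two questions: did the real detection page come back, and does the crawler's own IP appear anywhere in it? The sketch below shows one way to express that in Python. The function name `classify_proxy` is hypothetical, the addresses in the usage note are documentation-range placeholders, and the digit-boundary check is an added precaution so that, for example, 203.0.113.7 is not mistaken for a match inside 203.0.113.70.

```python
import re

MARKER = "PROXYDETECTATION"


def classify_proxy(body, local_ip):
    """Return 'anonymity', 'nonanonymity', or None if the response is unusable."""
    if MARKER not in body:
        return None  # not the detection page, so the proxy cannot be trusted
    # Match the IP only when it is not part of a longer run of digits
    leaked = re.search(r"(?<!\d)" + re.escape(local_ip) + r"(?!\d)", body)
    return "nonanonymity" if leaked else "anonymity"
```

For instance, a body that contains the marker and echoes only the proxy's address is classified as 'anonymity', while one that also echoes the crawler's address in HTTP_X_FORWARDED_FOR is classified as 'nonanonymity'.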
Assume this service is available at http://xxx.xxx.xxx.xxx/apps/proxydetect.php
The pipelines code is then as follows:
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html
import re
import socket
import time
import urllib

from scrapy.exceptions import DropItem


class ProxyPipeline(object):
    #detect_service_url = 'http://xxx.xxx.xxx.xxx:pppp/apps/proxydetect.php'
    detect_service_url = 'http://xxx.xxx.xxx.xxx/apps/proxydetect.php'
    local_ip = 'xxx.xxx.xxx.xxx'
    output_path = '/home/xxx/services_runenv/crawlers/proxy/proxy/data/proxies.txt'
    timeout = 1  # seconds

    def process_item(self, item, spider):
        # Keep only the numeric part of the decoded port
        ports = re.findall(r'\d{1,5}', item['port'])
        if not ports:
            raise DropItem("can not find port in %s" % item['port'])
        item['port'] = ports[0]

        # Fetch the detection page through the proxy and time the round trip
        proxy_ = str('http://%s:%s' % (item['address'], item['port']))
        proxies = {'http': proxy_}
        socket.setdefaulttimeout(self.timeout)
        begin_time = time.time()
        try:
            data = urllib.urlopen(self.detect_service_url, proxies=proxies).read()
        except IOError:
            raise DropItem("download failed, the proxy %s:%s is bad" % (item['address'], item['port']))
        end_time = time.time()

        if not data.strip():
            raise DropItem("data is null, the proxy %s:%s is bad" % (item['address'], item['port']))
        if 'PROXYDETECTATION' not in data:
            raise DropItem("wrong response, the proxy %s:%s is bad" % (item['address'], item['port']))

        # If our own IP shows up in the echoed values, the proxy passed it on
        if self.local_ip in data:
            item['type'] = 'nonanonymity'
        else:
            item['type'] = 'anonymity'
        item['delay'] = str(end_time - begin_time)
        item['timestamp'] = time.strftime('%Y-%m-%d')

        # Append one tab-separated record per usable proxy
        fields = [item['timestamp'], item['address'], item['port'], item['type'], item['delay']]
        line = '\t'.join(str(field) for field in fields) + '\n'
        with open(self.output_path, 'a') as fp:
            fp.write(line)
        return item
Here local_ip is the IP address of the crawling server itself.
4. The last thing to cover is the settings.py configuration file; the code speaks for itself:
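The probing step (send a request through the proxy, time it, treat any failure as a bad proxy) can also be sketched with the Python 3 standard library. This is not the pipeline's own code, only an illustration under assumptions: the names `fetch_via_proxy` and `probe` are made up, and the `fetch` and `clock` parameters exist purely so the logic can be exercised without a network. One deliberate difference is that the timeout is passed per request, since socket.setdefaulttimeout changes the default for every socket created afterwards in the process.

```python
import http.client
import time
import urllib.request


def fetch_via_proxy(url, proxy_url, timeout=1.0):
    """Fetch url through an HTTP proxy and return the decoded body."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy_url})
    )
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def probe(address, port, detect_url, fetch=fetch_via_proxy, clock=time.time):
    """Return (body, delay_in_seconds), or None if the request fails."""
    proxy_url = "http://%s:%s" % (address, port)
    start = clock()
    try:
        body = fetch(detect_url, proxy_url)
    except (OSError, http.client.HTTPException):
        # Connection errors, timeouts and malformed replies all mean "bad proxy"
        return None
    return body, clock() - start
```

A caller would drop the item when probe returns None and otherwise store the measured delay alongside the classification result.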
# Scrapy settings for proxy project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#

BOT_NAME = 'proxy'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['proxy.spiders']
NEWSPIDER_MODULE = 'proxy.spiders'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30

ITEM_PIPELINES = [
    'proxy.pipelines.ProxyPipeline'
]
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_SPIDER = 64
CONCURRENT_SPIDERS = 128

LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/xxx/services_runenv/crawlers/proxy/proxy/log/proxy.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False
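Readers on a more recent Scrapy release may need to adapt a few of these names. To the best of my knowledge, newer versions expect ITEM_PIPELINES to be a dict that maps each pipeline to an order number, and concurrency is set with CONCURRENT_REQUESTS rather than the per-spider settings used above. The fragment below is only a sketch of what an adapted file might look like; the order value 300 and the concurrency figure are assumptions, so check the settings reference for the version you run.

```python
# Hypothetical adaptation of the settings above for a newer Scrapy release.
BOT_NAME = "proxy"

SPIDER_MODULES = ["proxy.spiders"]
NEWSPIDER_MODULE = "proxy.spiders"

DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30

# Dict form: pipeline path -> order (lower numbers run first); 300 is an assumed value
ITEM_PIPELINES = {
    "proxy.pipelines.ProxyPipeline": 300,
}
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS = 64  # assumed replacement for CONCURRENT_REQUESTS_PER_SPIDER

LOG_LEVEL = "DEBUG"
```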
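Finally, a look at the proxy data that was collected: each line of proxies.txt holds the timestamp, address, port, type and delay, separated by tabs.
That's all for this post. The next one will cover how to use proxy IPs as intermediaries to crawl site data at scale with confidence. Good night.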
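Since the file exists to feed later crawlers, here is a minimal sketch of how one might read it back. It assumes the five tab-separated columns written by the pipeline (timestamp, address, port, type, delay); the names `ProxyRecord`, `parse_proxy_lines` and `best_anonymous` are made up for illustration.

```python
from collections import namedtuple

ProxyRecord = namedtuple("ProxyRecord", "timestamp address port type delay")


def parse_proxy_lines(lines):
    """Turn tab-separated lines into ProxyRecord tuples, skipping malformed ones."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue
        timestamp, address, port, type_, delay = fields
        records.append(ProxyRecord(timestamp, address, int(port), type_, float(delay)))
    return records


def best_anonymous(records, limit=10):
    """Return the fastest anonymous proxies, lowest delay first."""
    anonymous = [r for r in records if r.type == "anonymity"]
    return sorted(anonymous, key=lambda r: r.delay)[:limit]
```

A later crawler could call parse_proxy_lines on the open file object and then pick its proxies from best_anonymous.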