Reposted from http://blog.csdn.net/rongyongfeikai2/article/details/7826057
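I once read an article by Robin about anti-crawler measures. He described several approaches: 1. Manual blocking: when a crawler's concurrency is very high, sort the connections on port 80 by count per IP and ban the crawler's IP by hand. 2. Rejecting by User-Agent: for example, a Java program that crawls without setting any headers sends a User-Agent that identifies it as Java (HttpURLConnection, for instance, sends Java/<version>), so the site rejects every request whose User-Agent does not look like a browser's. 3. Blocking based on traffic statistics and log analysis: ban the crawlers that generate unusually heavy traffic. 4. Real-time blocking: if an IP sends requests very frequently within a short period, treat it as a crawler, add it to a blacklist, and stop responding to its later requests.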
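Approach 4 (real-time blocking) can be sketched roughly as follows. This is only an illustration of the idea, not how any particular site implements it: the class name `RealTimeBlocker`, the limit of 3 requests, and the 1-second window are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: blacklist an IP that sends more than maxRequests within windowMillis.
class RealTimeBlocker {
    private final int maxRequests;
    private final long windowMillis;
    // Timestamps of the recent requests of each IP, oldest first
    private final Map<String, Deque<Long>> recent = new HashMap<>();
    private final Set<String> blacklist = new HashSet<>();

    public RealTimeBlocker(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    // Returns true if the request may be served, false if the IP is (or has just been) blacklisted.
    public synchronized boolean allow(String ip, long nowMillis) {
        if (blacklist.contains(ip)) {
            return false;
        }
        Deque<Long> times = recent.computeIfAbsent(ip, k -> new ArrayDeque<>());
        // Forget requests that have fallen out of the time window
        while (!times.isEmpty() && nowMillis - times.peekFirst() >= windowMillis) {
            times.pollFirst();
        }
        times.addLast(nowMillis);
        if (times.size() > maxRequests) {
            blacklist.add(ip);
            recent.remove(ip);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        RealTimeBlocker blocker = new RealTimeBlocker(3, 1000);
        // Five requests 100 ms apart from the same IP
        for (int i = 0; i < 5; i++) {
            System.out.println(blocker.allow("10.0.0.1", i * 100L));
        }
    }
}
```

Passing the timestamp in as a parameter keeps the sketch easy to test; a real filter would more likely call `System.currentTimeMillis()` itself and expire blacklist entries after some time.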
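High-concurrency crawlers really do put heavy load on a site's servers, but sometimes we get rejected even when we only want to fetch a few pages from ITEYE or CSDN. (Fetching a CSDN blog page returns a 403 response.)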
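Clearly, approaches 1, 3, and 4 do not apply to us, because our crawling is neither highly concurrent nor frequent. Approach 2, the User-Agent check, is the real reason our requests are blocked. So the only option is to add request headers that make our crawl look like a browser request. What does a browser actually send when it makes a request?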
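To make the User-Agent check concrete, here is a minimal sketch of what such a filter might look like on the server side. The actual rules used by CSDN or ITEYE are not known; the class `UserAgentFilter`, the token list, and the `statusFor` helper are hypothetical and exist only for illustration.

```java
import java.util.Locale;

// Hypothetical sketch of approach 2: reject requests whose User-Agent does not look like a browser's.
class UserAgentFilter {
    // Prefixes treated as "browser-like"; this list is an assumption, not a real site's rule.
    private static final String[] BROWSER_PREFIXES = {"mozilla/", "opera/"};

    public static boolean looksLikeBrowser(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return false; // a missing User-Agent is treated as a non-browser client
        }
        String ua = userAgent.trim().toLowerCase(Locale.ROOT);
        for (String prefix : BROWSER_PREFIXES) {
            if (ua.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Status code such a filter might answer with: 403 for anything that is not browser-like.
    public static int statusFor(String userAgent) {
        return looksLikeBrowser(userAgent) ? 200 : 403;
    }

    public static void main(String[] args) {
        System.out.println(statusFor("Java/1.6.0_31"));
        System.out.println(statusFor("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"));
    }
}
```

A check this simple is exactly why copying a browser's User-Agent is enough to get past it.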
We can write a small program that listens on port 10086 (pick any port you like):
    package com.JavaUtil.IESimilator;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class IEHeaderTest {

        // Listen on the given port and print the request sent by the first client (e.g. IE)
        public static void dumpRequest(int port) throws IOException {
            // try-with-resources closes the reader, the client socket and the server socket
            try (ServerSocket serverSocket = new ServerSocket(port);
                 Socket client = serverSocket.accept();
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.ISO_8859_1))) {
                String line;
                // The header section of a GET request ends with an empty line. Stop there
                // instead of waiting for end-of-stream, which a keep-alive connection
                // may not send for a long time.
                while ((line = reader.readLine()) != null && !line.isEmpty()) {
                    System.out.println(line);
                }
                // Send a minimal response so the browser does not hang waiting for one
                OutputStream out = client.getOutputStream();
                out.write("HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
                        .getBytes(StandardCharsets.ISO_8859_1));
                out.flush();
            }
        }

        public static void main(String[] args) throws IOException {
            dumpRequest(10086);
        }
    }
Then open http://localhost:10086/ in the browser, and we get the request that IE sends:
GET / HTTP/1.1
Accept: image/jpeg, application/x-ms-application, image/gif, application/xaml+xml, image/pjpeg, application/x-ms-xbap, application/msword, application/vnd.ms-excel, application/vnd.ms-powerpoint, */*
Accept-Language: zh-CN
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET4.0C)
Accept-Encoding: gzip, deflate
Host: localhost:10086
Connection: Keep-Alive
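To work with a captured request like the one above in code, it can be split into a header map. The following is a minimal sketch, not a full HTTP parser: the class `RequestHeaderParser` is hypothetical, header names are kept exactly as sent (HTTP itself treats them case-insensitively), and folded or repeated headers are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turn the raw text of a request into a header-name -> value map.
class RequestHeaderParser {

    public static Map<String, String> parseHeaders(String rawRequest) {
        Map<String, String> headers = new LinkedHashMap<>(); // keeps the order in which headers were sent
        String[] lines = rawRequest.split("\r?\n");
        // lines[0] is the request line ("GET / HTTP/1.1"), so the headers start at index 1
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // an empty line ends the header section
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skip lines that are not "Name: value"
            }
            // Split at the first colon only, so "Host: localhost:10086" keeps its port
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return headers;
    }

    public static void main(String[] args) {
        String raw = "GET / HTTP/1.1\r\n"
                + "Accept-Language: zh-CN\r\n"
                + "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)\r\n"
                + "Host: localhost:10086\r\n"
                + "Connection: Keep-Alive\r\n\r\n";
        System.out.println(parseHeaders(raw).get("User-Agent"));
    }
}
```

The `User-Agent` entry of the resulting map is the value we need to copy into our own requests.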
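As the capture shows, the browser identifies itself through the User-Agent header. As long as we set User-Agent to the same value, our requests are no longer rejected: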
    package com.JavaUtil.IESimilator;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.commons.httpclient.Header;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.cookie.CookiePolicy;
    import org.apache.commons.httpclient.methods.GetMethod;
    import org.apache.commons.httpclient.params.DefaultHttpParams;

    /*
     * author:Tammy Pi
     */
    public class IESimilatorFetchCSDN {

        private HttpClient httpClient = new HttpClient();

        // Fetch a page while sending the same headers as IE; returns null if the request fails
        public String getPage(String url) {
            StringBuilder sb = new StringBuilder();
            GetMethod getMethod = new GetMethod(url);
            // Default headers, copied from the captured IE request
            List<Header> headers = new ArrayList<Header>();
            headers.add(new Header("Accept", "image/jpeg, application/x-ms-application, image/gif, application/xaml+xml, image/pjpeg, application/x-ms-xbap, application/msword, application/vnd.ms-excel, application/vnd.ms-powerpoint, */*"));
            headers.add(new Header("User-Agent", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET CLR 1.1.4322; .NET4.0C)"));
            headers.add(new Header("Connection", "Keep-Alive"));
            // Use the browser-compatible cookie policy to avoid the "cookie rejected" problem
            DefaultHttpParams.getDefaultParams().setParameter("http.protocol.cookie-policy", CookiePolicy.BROWSER_COMPATIBILITY);
            httpClient.getHostConfiguration().getParams().setParameter("http.default-headers", headers);
            try {
                int status = httpClient.executeMethod(getMethod);
                System.out.println("status:" + status);
                // Decode the body with the charset declared by the response
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        getMethod.getResponseBodyAsStream(), getMethod.getResponseCharSet()))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // readLine() strips the line terminator, so add it back
                        sb.append(line).append('\n');
                    }
                }
                // The text is already a decoded Java String; re-encoding it as UTF-8
                // would corrupt pages served in any other charset
                return sb.toString();
            } catch (IOException e) {
                // HttpException is a subclass of IOException, so one handler covers both
                e.printStackTrace();
                return null;
            } finally {
                getMethod.releaseConnection();
            }
        }

        public static void main(String[] args) {
            IESimilatorFetchCSDN similator = new IESimilatorFetchCSDN();
            String rtn = similator.getPage("http://blog.csdn.net/hdhtqq/article/details/6088461");
            System.out.println(rtn);
        }
    }
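The same idea can be tried out without Commons HttpClient and without touching a real site. The sketch below is not the original author's method: it uses only JDK classes (`HttpURLConnection` plus the built-in `com.sun.net.httpserver.HttpServer`), and the local server's rule, "403 unless the User-Agent starts with `Mozilla/`", is an assumption that stands in for a real site's filter. The class and method names (`UserAgentDemo`, `startPickyServer`, `fetchStatus`, `runDemo`) are hypothetical.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

class UserAgentDemo {
    static final String IE_UA = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)";

    // Start a throwaway local server on a free port. Assumed rule: answer 403
    // unless the User-Agent starts with "Mozilla/".
    public static HttpServer startPickyServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            int status = (ua != null && ua.startsWith("Mozilla/")) ? 200 : 403;
            exchange.sendResponseHeaders(status, -1); // -1 means "no response body"
            exchange.close();
        });
        server.start();
        return server;
    }

    // Send a GET and return the status code. With userAgent == null the JDK's default
    // User-Agent (Java/<version>) is sent.
    public static int fetchStatus(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Returns {status with the default User-Agent, status with the IE User-Agent}
    public static int[] runDemo() {
        try {
            HttpServer server = startPickyServer();
            try {
                String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
                return new int[] {fetchStatus(url, null), fetchStatus(url, IE_UA)};
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = runDemo();
        System.out.println("default User-Agent: " + result[0]);
        System.out.println("IE User-Agent:      " + result[1]);
    }
}
```

Against a real site, only `fetchStatus` would be needed; the local server is there so that the effect of the header can be observed in isolation.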