• (2) Simulating a browser to fetch web pages


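    Section 1: Setting the User-Agent request header to simulate a browser

    With HttpClient, we can set the User-Agent request header so that a request looks like it comes from a browser.

    For example, let's request www.tuicool.com

    using the same code as before: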

    package com.javaxk.httpclient.chap02;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class Demo1 {

        public static void main(String[] args) throws Exception {
            CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
            HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the HttpGet instance
            CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
            HttpEntity entity = response.getEntity(); // get the response entity
            System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // print the page content
            response.close(); // close the response
            httpClient.close(); // close the HttpClient
        }

    }

    Returned content:

    Page content: <!DOCTYPE html>
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
    <p>系统检测亲不是真人行为,因系统资源限制,我们只能拒绝你的请求。如果你有疑问,可以通过微博 http://weibo.com/tuicool2012/ 联系我们。</p>
    </body>
    </html>

    (The message says, roughly: the system has detected that you are not behaving like a real person and, because of limited system resources, has to reject your request; if you have questions, contact the site via Weibo.)

    Now let's simulate a browser by setting the User-Agent request header.

    Add this line:

    httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header

    package com.javaxk.httpclient.chap02;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class Demo1 {

        public static void main(String[] args) throws Exception {
            CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
            HttpGet httpGet = new HttpGet("http://www.tuicool.com/"); // create the HttpGet instance
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
            CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
            HttpEntity entity = response.getEntity(); // get the response entity
            System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // print the page content
            response.close(); // close the response
            httpClient.close(); // close the HttpClient
        }

    }

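    Run it again, and the site should now return the real page instead of the rejection message.

    With Firefox's Firebug we can also see the other request headers that a browser sends.

    Each of them can be set as a key/value pair through the setHeader method to make the request look like a browser's.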
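    The setHeader approach extends naturally to several headers at once. The sketch below is only an illustration: the class BrowserHeaders, its method names, and the Accept and Accept-Language values are assumptions made up for this example, not something from the original code. It uses only the JDK so it runs on its own; with HttpClient you would pass httpGet::setHeader as the setter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical helper: a reusable set of browser-like request headers. */
class BrowserHeaders {

    /** Returns illustrative header values; copy the real ones from your own browser. */
    static Map<String, String> defaults() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0");
        // The two values below are assumed examples, not taken from the article.
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3");
        return headers;
    }

    /** Hands every header to the given setter, for example httpGet::setHeader. */
    static void applyTo(BiConsumer<String, String> setter) {
        defaults().forEach(setter);
    }

    public static void main(String[] args) {
        // Print the headers instead of sending a request, so no HttpClient is needed here.
        applyTo((name, value) -> System.out.println(name + ": " + value));
    }
}
```

    In the Demo1 code, a call such as BrowserHeaders.applyTo(httpGet::setHeader) placed before httpClient.execute(httpGet) would set all three headers on the request.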


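    Section 2: Getting the response content type (Content-Type)

    Getting the response Content-Type with HttpClient

    Every response body has a type, given by the Content-Type header.

    In Firefox's Firebug it appears among the response headers.

    We can also read it through the HttpClient API:

    calling getContentType().getValue() on the HttpEntity returns the response content type.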

    package com.javaxk.httpclient.chap02;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class Demo2 {

        public static void main(String[] args) throws Exception {
            CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
            HttpGet httpGet = new HttpGet("http://www.javaxk.com"); // create the HttpGet instance
            //HttpGet httpGet = new HttpGet("http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar"); // alternative target: a jar file
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
            CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
            HttpEntity entity = response.getEntity(); // get the response entity
            System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
            //System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // print the page content
            response.close(); // close the response
            httpClient.close(); // close the HttpClient
        }

    }

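    Output:

    Content-Type:text/html; charset=utf-8

    Ordinary web pages are text/html, and some also carry a charset.

    For example, requesting www.tuicool.com prints:

    Content-Type:text/html; charset=utf-8

    If we request a JavaScript file, for example http://www.javaxk.com/include/dedeajax2.js

    the output is:

    Content-Type:application/javascript

    If we request a binary file, for example http://central.maven.org/maven2/HTTPClient/HTTPClient/0.3-3/HTTPClient-0.3-3.jar

    the output is:

    Content-Type:application/java-archive

    There are many more Content-Type values. What use is this to a crawler? While crawling, we can use the

    Content-Type to select the pages we actually want to fetch and to filter out the responses we do not need.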
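    As a minimal sketch of that filtering idea, the helper below keeps only HTML pages. The class ContentTypeFilter and its method names are hypothetical, and treating application/xhtml+xml as HTML is an assumption of this example. It works on the header value as a plain string, so it runs with the JDK alone.

```java
import java.util.Locale;

/** Hypothetical helper: decide from a Content-Type value whether a response is an HTML page. */
class ContentTypeFilter {

    /** Extracts the media type, e.g. "text/html; charset=utf-8" becomes "text/html". */
    static String mimeType(String contentType) {
        if (contentType == null) {
            return ""; // no Content-Type header was sent
        }
        int semicolon = contentType.indexOf(';');
        String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** True for pages worth parsing; scripts, archives and other files are filtered out. */
    static boolean isHtml(String contentType) {
        String type = mimeType(contentType);
        return type.equals("text/html") || type.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=utf-8")); // true
        System.out.println(isHtml("application/javascript"));   // false
        System.out.println(isHtml("application/java-archive")); // false
    }
}
```

    In the Demo2 code the argument would be entity.getContentType().getValue(). Note that getContentType() returns null when the server sends no Content-Type header, so a real crawler should check for null before calling getValue().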


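    Section 3: Getting the response status

    200 OK
    403 Forbidden (request refused)
    500 Internal server error
    404 Page not found



    Getting the response status with HttpClient

    When HttpClient sends a request to a server,

    a successful request normally returns status code 200.

    Not every request succeeds, however.

    If the requested address does not exist, the server returns 404.

    If the server hits an internal error, it returns 500.

    Some servers have anti-scraping protection: if you collect data too frequently, they return 403 and refuse your requests.

    There are ways around that, of course; using proxy IPs is covered later.

    To get the status code, call getStatusLine().getStatusCode() on the CloseableHttpResponse object: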

    package com.javaxk.httpclient.chap02;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class Demo2 {

        public static void main(String[] args) throws Exception {
            CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
            HttpGet httpGet = new HttpGet("http://www.javaxk.com"); // create the HttpGet instance
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
            CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
            System.out.println("Status:" + response.getStatusLine().getStatusCode()); // print the status code
            HttpEntity entity = response.getEntity(); // get the response entity
            System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
            //System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // print the page content
            response.close(); // close the response
            httpClient.close(); // close the HttpClient
        }

    }

    Output:

    Status:200

    Content-Type:text/html;charset=UTF-8

    Now try a different page, http://www.javaxk.com/a.jsp

    It does not exist,

    so the server returns 404:

    package com.javaxk.httpclient.chap02;

    import org.apache.http.HttpEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class Demo2 {

        public static void main(String[] args) throws Exception {
            CloseableHttpClient httpClient = HttpClients.createDefault(); // create the HttpClient instance
            HttpGet httpGet = new HttpGet("http://www.javaxk.com/a.jsp"); // create the HttpGet instance (nonexistent page)
            httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"); // set the User-Agent request header
            CloseableHttpResponse response = httpClient.execute(httpGet); // execute the HTTP GET request
            System.out.println("Status:" + response.getStatusLine().getStatusCode()); // print the status code
            HttpEntity entity = response.getEntity(); // get the response entity
            System.out.println("Content-Type:" + entity.getContentType().getValue()); // print the response content type
            //System.out.println("Page content: " + EntityUtils.toString(entity, "utf-8")); // print the page content
            response.close(); // close the response
            httpClient.close(); // close the HttpClient
        }

    }

    Output:

    Status:404
    Content-Type:text/html
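
    A crawler usually branches on the status code rather than just printing it. The following is a rough sketch of one possible policy for the codes discussed in this section. The class StatusHandler, the Action names, and the policy itself (process 2xx, back off on 403, retry 5xx later, skip everything else) are assumptions for illustration, not part of the original code.

```java
/** Hypothetical helper: map an HTTP status code to what a crawler might do next. */
class StatusHandler {

    enum Action { PROCESS, SKIP, BACK_OFF, RETRY_LATER }

    /** One possible policy; adjust it to the sites you crawl. */
    static Action actionFor(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return Action.PROCESS;     // e.g. 200: the page can be parsed
        }
        if (statusCode == 403) {
            return Action.BACK_OFF;    // refused, possibly by anti-scraping protection
        }
        if (statusCode >= 500) {
            return Action.RETRY_LATER; // server-side error, may be temporary
        }
        return Action.SKIP;            // e.g. 404: the page does not exist
    }

    public static void main(String[] args) {
        System.out.println(actionFor(200)); // PROCESS
        System.out.println(actionFor(404)); // SKIP
        System.out.println(actionFor(403)); // BACK_OFF
        System.out.println(actionFor(500)); // RETRY_LATER
    }
}
```

    In the Demo2 code the argument would be response.getStatusLine().getStatusCode().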

  • Original article: https://www.cnblogs.com/wishwzp/p/7059040.html