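Reposted from: http://mercymessi.iteye.com/blog/2250161
HttpClient is an Apache toolkit for making HTTP requests.
General flow: create an HttpClient object -> create a request object -> execute it with the HttpClient -> receive a response -> parse the response to extract the information you need.
I. Creating the HttpClient object:
In HttpClient 4.5 the way a client is initialized differs somewhat from earlier versions.
Broadly, there are the following ways: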
- static CloseableHttpClient client = HttpClients.createDefault();
- // Prefer a static field so that every request goes through the same client object and its state (connections, cookies) is preserved
- static CloseableHttpClient httpClient=HttpClients.custom().build();
Both of these create a default HttpClient object. With the second form you can add network options before calling build().
- /**
-  * Build an HttpClient instance customized for your own requests.
-  * Assumes a cookieStore field has already been created (see the CookieStore note below).
-  */
- private static CloseableHttpClient getInstanceClient() {
-     // Option 1: the built-in handler. Pass it to setRetryHandler(...) instead of the custom handler if it is enough.
-     StandardHttpRequestRetryHandler standardHandler = new StandardHttpRequestRetryHandler(5, true);
-     // Option 2: a custom handler that decides case by case.
-     HttpRequestRetryHandler handler = new HttpRequestRetryHandler() {
-         @Override
-         public boolean retryRequest(IOException exception, int retryTimes, HttpContext context) {
-             if (retryTimes > 5) {
-                 // give up once the retry limit is exceeded
-                 return false;
-             }
-             if (exception instanceof SSLException) {
-                 // do not retry SSL failures
-                 return false;
-             }
-             if (exception instanceof UnknownHostException || exception instanceof ConnectTimeoutException
-                     || exception instanceof NoHttpResponseException) {
-                 return true;
-             }
-             HttpClientContext clientContext = HttpClientContext.adapt(context);
-             HttpRequest request = clientContext.getRequest();
-             boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
-             // retry only if the request is considered idempotent, i.e. repeating it has no further side effects
-             return idempotent;
-         }
-     };
-     HttpHost proxy = new HttpHost("127.0.0.1", 80); // proxy IP and port
-     DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);
-     return HttpClients.custom().setRoutePlanner(routePlanner).setRetryHandler(handler)
-             .setConnectionTimeToLive(1, TimeUnit.DAYS).setDefaultCookieStore(cookieStore).build();
- }
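This code configures a network proxy, a retry handler, the maximum lifetime of persistent connections, and a CookieStore in which cookies are kept.
retryHandler: the code shows two approaches. The first, StandardHttpRequestRetryHandler, is the simple one: its first argument is the maximum number of retries and its second argument says whether requests that have already been sent may be retried. The second approach spells out which exceptions trigger a retry and how idempotency and the retry count are taken into account.
routePlanner: HttpClient supports proxies. Create an HttpHost object and pass it to a route planner. The HttpHost constructor takes the proxy IP and port.
CookieStore: a cookieStore object has to be created beforehand. It is initialized like this: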
- CookieStore cookieStore = new BasicCookieStore();
II. Executing a GET request:
Code first:
- /**
-  * Fetch the HTML source of the given url.
-  */
- static RequestConfig config = RequestConfig.custom().setConnectTimeout(6000).setSocketTimeout(6000)
-         .setCookieSpec(CookieSpecs.STANDARD).build(); // timeouts (in ms) and cookie policy
-
- public static String getDemo(String url) {
-     HttpGet get = new HttpGet(url);
-     get.setConfig(config);
-     String html = null;
-     // try-with-resources closes the response and releases the connection
-     try (CloseableHttpResponse response = client.execute(get)) {
-         int statusCode = response.getStatusLine().getStatusCode(); // HTTP status code
-         // print the response headers
-         for (Header header : response.getAllHeaders()) {
-             System.out.println(header);
-         }
-         // The second argument is the charset used when the response does not declare one.
-         // Set it to the site's encoding (usually utf-8; gb2312 here) to avoid garbled text.
-         html = EntityUtils.toString(response.getEntity(), "gb2312");
-         System.out.println(html);
-     } catch (IOException e) {
-         e.printStackTrace();
-     }
-     return html;
- }
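General flow: create an HttpGet object -> execute it with the HttpClient -> parse the returned response to get the content you need.
cookieSpec: the cookie policy. Its value is one of the constants in CookieSpecs. It serves two purposes: 1. If the site's response contains a Set-Cookie header, the default policy may reject the cookie so that it is never stored. Choosing a standards-based policy such as CookieSpecs.STANDARD_STRICT (the code above uses CookieSpecs.STANDARD) avoids this. 2. To ignore cookies altogether, set it to CookieSpecs.IGNORE_COOKIES.
Tip: pay attention to the site's encoding, otherwise the text is easily garbled.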
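To make the encoding tip concrete, here is a minimal JDK-only sketch of what goes wrong when response bytes are decoded with the wrong charset, and why re-encoding as ISO-8859-1 can undo it. The class and method names (`CharsetDemo`, `decode`, `repair`) are hypothetical, and the byte array only stands in for a response body; it is an illustration, not part of HttpClient.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    // Decode raw response bytes with the given charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    // Undo a wrong ISO-8859-1 decode: ISO-8859-1 maps every byte to exactly one
    // char, so the original bytes can be recovered and decoded again.
    static String repair(String misdecoded, Charset realCharset) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), realCharset);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes to keep the source ASCII-only;
        // the bytes stand in for a UTF-8 response body.
        String original = "\u7f16\u7801";
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        String wrong = decode(body, StandardCharsets.ISO_8859_1); // garbled text
        String right = decode(body, StandardCharsets.UTF_8);

        System.out.println(wrong.equals(original));                                 // false
        System.out.println(right.equals(original));                                 // true
        System.out.println(repair(wrong, StandardCharsets.UTF_8).equals(original)); // true
    }
}
```

Decoding with the right charset in the first place is simpler than repairing afterwards; the repair only works because the wrong charset happened to be ISO-8859-1.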
An HttpGet can also be built from a URI:
- URI uri = new URIBuilder()
-         .setScheme("http")
-         .setHost("www.google.com")
-         .setPath("/search")
-         .setParameter("q", "httpclient")
-         .setParameter("btnG", "Google Search")
-         .setParameter("aq", "f")
-         .setParameter("oq", "")
-         .build();
- HttpGet httpget = new HttpGet(uri);
- System.out.println(httpget.getURI()); // http://www.google.com/search?q=httpclient&btnG=Google+Search&aq=f&oq=
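To show how the query parameters end up encoded in the printed URI, here is a rough JDK-only sketch that assembles the same query string by hand with `URLEncoder`. It is not how URIBuilder is implemented; `QueryStringDemo` and `buildUrl` are made-up names for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringDemo {
    // Append form-encoded parameters to a base URL, in insertion order.
    static String buildUrl(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                sb.append(separator)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is always supported by the JDK, so this should not happen
            throw new IllegalStateException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "httpclient");
        params.put("btnG", "Google Search"); // the space becomes '+'
        params.put("aq", "f");
        params.put("oq", "");
        System.out.println(buildUrl("http://www.google.com/search", params));
        // http://www.google.com/search?q=httpclient&btnG=Google+Search&aq=f&oq=
    }
}
```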
III. Executing a POST request:
- /**
-  * Post form data to the given url.
-  */
- public static void postDemo(String url) {
- HttpPost post = new HttpPost(url);
- post.setConfig(config);
- post.setHeader("User-Agent",
- "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36");
- post.setHeader("Connection", "keep-alive");
- List<NameValuePair> list = new ArrayList<NameValuePair>();
- list.add(new BasicNameValuePair("key", "value"));
- list.add(new BasicNameValuePair("key", "value"));
- list.add(new BasicNameValuePair("key", "value"));
- list.add(new BasicNameValuePair("key", "value"));
- list.add(new BasicNameValuePair("key", "value"));
- try {
- HttpEntity entity = new UrlEncodedFormEntity(list, "utf-8");
- post.setEntity(entity);
- HttpResponse response = client.execute(post);
- String responseHtml = EntityUtils.toString(response.getEntity());
- System.out.println(responseHtml);
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
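General flow: create the HttpPost object -> build the form parameters -> set the form as the request entity -> execute and get the response.
IV. Parsing the response:
Get the HTML source: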
- String responseHtml = EntityUtils.toString(response.getEntity());
Get the HTTP status code and status line:
- int statusCode = response.getStatusLine().getStatusCode(); // status code returned by the server, e.g. 200, 400
- System.out.println(response.getProtocolVersion()); // HTTP/1.1
- System.out.println(response.getStatusLine().getReasonPhrase()); // OK
- System.out.println(response.getStatusLine().toString()); // HTTP/1.1 200 OK
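Get the response headers:
- response.getFirstHeader("key"); // the first header named key
- response.getHeaders("key"); // all headers named key, returned as an array
- response.getLastHeader("key"); // the last header named key
Get the InputStream: (downloading some resources, such as captcha images, may require cookies, in which case the download has to go through HttpClient.)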
- InputStream inputStream = response.getEntity().getContent();
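V. Managing cookies:
HttpClient manages cookies automatically by default. To read cookies or send custom ones, set a default CookieStore when the HttpClient object is initialized (see setDefaultCookieStore in the client initialization code).
Get all current cookies: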
- List<Cookie> list = cookieStore.getCookies();// get all cookies
- System.out.println("cookie is:");
- System.out.println("-----------------------");
- for (Cookie cookie : list) {
- System.out.println(cookie);
- }
- System.out.println("-----------------------");
Clear all cookies:
- cookieStore.clear();
Send a custom cookie: (once the object is created, various properties can be set on it.)
- BasicClientCookie cookie = new BasicClientCookie("name", "value");
- // new a cookie
- cookie.setDomain("domain");
- cookie.setExpiryDate(new Date());
- // set the properties of the cookie
- cookieStore.addCookie(cookie);
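Finally, add it to the cookieStore with addCookie. (A cookie with the same name is overwritten; in my view this resembles the put operation of a HashMap.)
VI. Managing headers:
When scraping, you often need to add several headers to a request so that it looks like a normal browser, otherwise the server may recognize the crawler and ban it.
Setting some common headers: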
- post.setHeader("User-Agent",
- "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36");
- post.setHeader("Connection", "keep-alive");
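Note: when downloading resources from some sites, the server checks the referring site and responds accordingly. If the referrer is wrong, the server may reject the request. In that case, just add one header to the request.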
- get1.setHeader("Referer", "http://www.a.com");
- // Build a response by hand and read its headers back
- HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
- response.addHeader("Set-Cookie", "c1=a; path=/; domain=localhost");
- response.addHeader("Set-Cookie", "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
- Header h1 = response.getFirstHeader("Set-Cookie");
- System.out.println(h1); // Set-Cookie: c1=a; path=/; domain=localhost
- Header h2 = response.getLastHeader("Set-Cookie");
- System.out.println(h2); // Set-Cookie: c2=b; path="/", c3=c; domain="localhost"
- Header[] hs = response.getHeaders("Set-Cookie");
- System.out.println(hs.length); // 2
- // Iterate over all headers with the given name
- HeaderIterator it = response.headerIterator("Set-Cookie");
- while (it.hasNext()) {
-     System.out.println(it.next());
- }
- // Parse each header into its individual elements and their parameters
- HeaderElementIterator elemIt = new BasicHeaderElementIterator(
-         response.headerIterator("Set-Cookie"));
- while (elemIt.hasNext()) {
-     HeaderElement elem = elemIt.nextElement();
-     System.out.println(elem.getName() + " = " + elem.getValue());
-     NameValuePair[] params = elem.getParameters();
-     for (int i = 0; i < params.length; i++) {
-         System.out.println(" " + params[i]);
-     }
- }
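HTTP entities
HTTP messages can carry a content entity. HTTP defines two entity-enclosing request methods: PUT and POST. HttpClient distinguishes three kinds of entities according to where their content comes from:
streamed: the content is received from a stream or generated on the fly. In particular, this includes entities received from HTTP responses. Streamed entities are generally not repeatable.
self-contained: the content is in memory or obtained by means independent of a connection or other entity. Self-contained entities are generally repeatable. This kind of entity is mostly used in entity-enclosing HTTP requests.
wrapping: the content is obtained from another entity.
This distinction matters for connection management when content is streamed out of an HTTP response. For request entities that are created by the application and only sent with HttpClient, the difference between streamed and self-contained matters little. In that case, simply treat non-repeatable entities as streamed and repeatable ones as self-contained.
A repeatable entity is one whose content can be read more than once, for example ByteArrayEntity and StringEntity. To read the content, either call HttpEntity#getContent(), which returns a java.io.InputStream, or pass an output stream to HttpEntity#writeTo(OutputStream).
When an entity is obtained from a received message, HttpEntity#getContentType() and HttpEntity#getContentLength() can be used to read common metadata such as the Content-Type and Content-Length headers (if they are available). For text MIME types such as text/plain or text/html, the character encoding is carried in the Content-Type header; HttpEntity#getContentEncoding() returns the Content-Encoding header (for example gzip), not the charset. If the Content-Length header is not available, a length of -1 is returned, and null is returned for the content type. If the Content-Type header is available, a Header object is returned.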
- StringEntity myEntity = new StringEntity("important message",
- ContentType.create("text/plain", "UTF-8"));
- System.out.println(myEntity.getContentType());
- System.out.println(myEntity.getContentLength());
- System.out.println(EntityUtils.toString(myEntity));
- System.out.println(EntityUtils.toByteArray(myEntity).length);
Output:
- Content-Type: text/plain; charset=utf-8
- 17
- important message
- 17
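Since the charset travels inside the Content-Type value, as in the output above, the following JDK-only sketch shows one way such a value could be picked apart. It is a simplified illustration, not HttpClient's own parser; `ContentTypeCharset` and `charsetOf` are hypothetical names, and the fallback charset is an assumption of the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Extract the charset parameter from a Content-Type value such as
    // "text/plain; charset=utf-8"; return the fallback if it is missing or unknown.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=utf-8", StandardCharsets.ISO_8859_1)); // UTF-8
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));                 // ISO-8859-1
    }
}
```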
Ensuring release of low-level resources
- CloseableHttpClient httpclient = HttpClients.createDefault();
- HttpGet httpget = new HttpGet("http://localhost/");
- CloseableHttpResponse response = httpclient.execute(httpget);
- try {
- HttpEntity entity = response.getEntity();
- if (entity != null) {
- InputStream instream = entity.getContent();
- try {
- // do something useful
- } finally {
- instream.close();
- }
- }
- } finally {
- response.close();
- }
- CloseableHttpClient httpclient = HttpClients.createDefault();
- HttpGet httpget = new HttpGet("http://localhost/");
- CloseableHttpResponse response = httpclient.execute(httpget);
- try {
- HttpEntity entity = response.getEntity();
- if (entity != null) {
- InputStream instream = entity.getContent();
- int byteOne = instream.read();
- int byteTwo = instream.read();
- // Do not need the rest
- }
- } finally {
- response.close();
- }
Consuming entity content
- CloseableHttpClient httpclient = HttpClients.createDefault();
- HttpGet httpget = new HttpGet("http://localhost/");
- CloseableHttpResponse response = httpclient.execute(httpget);
- try {
- HttpEntity entity = response.getEntity();
- if (entity != null) {
- long len = entity.getContentLength();
- if (len != -1 && len < 2048) {
- System.out.println(EntityUtils.toString(entity));
- } else {
- // Stream content out
- }
- }
- } finally {
- response.close();
- }
- CloseableHttpResponse response = <...>
- HttpEntity entity = response.getEntity();
- if (entity != null) {
- entity = new BufferedHttpEntity(entity);
- }
Producing entity content
- File file = new File("somefile.txt");
- FileEntity entity = new FileEntity(file,
- ContentType.create("text/plain", "UTF-8"));
- HttpPost httppost = new HttpPost("http://localhost/action.do");
- httppost.setEntity(entity);
HTML forms
- List<NameValuePair> formparams = new ArrayList<NameValuePair>();
- formparams.add(new BasicNameValuePair("param1", "value1"));
- formparams.add(new BasicNameValuePair("param2", "value2"));
- UrlEncodedFormEntity entity = new UrlEncodedFormEntity(formparams, Consts.UTF_8);
- HttpPost httppost = new HttpPost("http://localhost/handler.do");
- httppost.setEntity(entity);
Content chunking
- StringEntity entity = new StringEntity("important message",
- ContentType.create("text/plain", Consts.UTF_8));
- entity.setChunked(true);
- HttpPost httppost = new HttpPost("http://localhost/action.do");
- httppost.setEntity(entity);
Response handlers
- CloseableHttpClient httpclient = HttpClients.createDefault();
- HttpGet httpget = new HttpGet("http://localhost/json");
- ResponseHandler<MyJsonObject> rh = new ResponseHandler<MyJsonObject>() {
- @Override
- public MyJsonObject handleResponse(
- final HttpResponse response) throws IOException {
- StatusLine statusLine = response.getStatusLine();
- HttpEntity entity = response.getEntity();
- if (statusLine.getStatusCode() >= 300) {
- throw new HttpResponseException(
- statusLine.getStatusCode(),
- statusLine.getReasonPhrase());
- }
- if (entity == null) {
- throw new ClientProtocolException("Response contains no content");
- }
- Gson gson = new GsonBuilder().create();
- ContentType contentType = ContentType.getOrDefault(entity);
- Charset charset = contentType.getCharset();
- Reader reader = new InputStreamReader(entity.getContent(), charset);
- return gson.fromJson(reader, MyJsonObject.class);
- }
- };
- MyJsonObject myjson = httpclient.execute(httpget, rh);
The HttpClient interface
- ConnectionKeepAliveStrategy keepAliveStrat = new DefaultConnectionKeepAliveStrategy() {
- @Override
- public long getKeepAliveDuration(
- HttpResponse response,
- HttpContext context) {
- long keepAlive = super.getKeepAliveDuration(response, context);
- if (keepAlive == -1) {
- // Keep connections alive 5 seconds if a keep-alive value
- // has not be explicitly set by the server
- keepAlive = 5000;
- }
- return keepAlive;
- }
- };
- CloseableHttpClient httpclient = HttpClients.custom()
- .setKeepAliveStrategy(keepAliveStrat)
- .build();
HttpClient thread safety
HttpClient resource deallocation
- CloseableHttpClient httpclient = HttpClients.createDefault();
- try {
- <...>
- } finally {
- httpclient.close();
- }
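HTTP execution context
An HttpContext can contain arbitrary objects, so sharing a context between threads is not safe. It is recommended that each thread keep its own HTTP context.
While an HTTP request is being executed, HttpClient automatically adds the following attributes to the context:
an HttpConnection instance, representing the connection between the client and the server
an HttpHost instance, representing the target server to connect to
an HttpRoute instance, representing the complete connection route
an HttpRequest instance, representing the HTTP request. The final HttpRequest object in the execution context represents the state of the message as it was sent. HTTP/1.0 and HTTP/1.1 use relative request URIs by default, but if the request goes through a proxy in non-tunneling mode, the URI is absolute.
an HttpResponse instance, representing the HTTP response
a java.lang.Boolean object, indicating whether the request was successfully sent to the target server
a RequestConfig object, representing the configuration of the HTTP request
a java.util.List<URI> object, representing all redirect locations received while executing the request
The HttpClientContext adaptor class can be used to simplify interaction with the context state.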
- HttpContext context = <...>
- HttpClientContext clientContext = HttpClientContext.adapt(context);
- HttpHost target = clientContext.getTargetHost();
- HttpRequest request = clientContext.getRequest();
- HttpResponse response = clientContext.getResponse();
- RequestConfig config = clientContext.getRequestConfig();
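In the following example, the request configuration set by the first request is kept in the execution context and applied to the subsequent requests that share the same context.
- CloseableHttpClient httpclient = HttpClients.createDefault();
- HttpClientContext context = HttpClientContext.create(); // shared by both requests below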
- RequestConfig requestConfig = RequestConfig.custom()
- .setSocketTimeout(1000)
- .setConnectTimeout(1000)
- .build();
- HttpGet httpget1 = new HttpGet("http://localhost/1");
- httpget1.setConfig(requestConfig);
- CloseableHttpResponse response1 = httpclient.execute(httpget1, context);
- try {
- HttpEntity entity1 = response1.getEntity();
- } finally {
- response1.close();
- }
- HttpGet httpget2 = new HttpGet("http://localhost/2");
- CloseableHttpResponse response2 = httpclient.execute(httpget2, context);
- try {
- HttpEntity entity2 = response2.getEntity();
- } finally {
- response2.close();
- }
HTTP protocol interceptors
The following example shows how a local context can be used to keep processing state across consecutive requests:
- CloseableHttpClient httpclient = HttpClients.custom()
- .addInterceptorLast(new HttpRequestInterceptor() {
- public void process(
- final HttpRequest request,
- final HttpContext context) throws HttpException, IOException {
- AtomicInteger count = (AtomicInteger) context.getAttribute("count");
- request.addHeader("Count", Integer.toString(count.getAndIncrement()));
- }
- })
- .build();
- AtomicInteger count = new AtomicInteger(1);
- HttpClientContext localContext = HttpClientContext.create();
- localContext.setAttribute("count", count);
- HttpGet httpget = new HttpGet("http://localhost/");
- for (int i = 0; i < 10; i++) {
- CloseableHttpResponse response = httpclient.execute(httpget, localContext);
- try {
- HttpEntity entity = response.getEntity();
- } finally {
- response.close();
- }
- }
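Exception handling
HTTP transport safety
Idempotent methods
Automatic exception recovery
HttpClient makes no attempt to recover from any logical or HTTP protocol errors (that is, exceptions derived from HttpException).
HttpClient automatically retries idempotent methods (if the first execution fails).
HttpClient automatically retries methods that fail with a transport exception while the HTTP request is still being transmitted to the target server (that is, the request has not been fully sent).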
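The rules above can be sketched as a small decision function. This is a JDK-only illustration and not HttpClient's actual retry logic: the set of idempotent methods follows the HTTP specification (RFC 7231), while `RetryPolicySketch`, `isIdempotent`, `shouldRetry` and the `maxRetries` parameter are assumptions made for the example.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.net.ssl.SSLException;

class RetryPolicySketch {
    // Methods that RFC 7231 defines as idempotent.
    private static final Set<String> IDEMPOTENT = new HashSet<>(
            Arrays.asList("GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"));

    static boolean isIdempotent(String method) {
        return IDEMPOTENT.contains(method.toUpperCase(Locale.ROOT));
    }

    // Decide whether a failed request should be sent again.
    static boolean shouldRetry(IOException failure, int executionCount, String method, int maxRetries) {
        if (executionCount > maxRetries) {
            return false; // retry budget used up
        }
        if (failure instanceof InterruptedIOException
                || failure instanceof UnknownHostException
                || failure instanceof SSLException) {
            return false; // timeouts, unknown hosts and SSL errors are treated as final here
        }
        return isIdempotent(method); // only repeat requests that are safe to repeat
    }

    public static void main(String[] args) {
        System.out.println(shouldRetry(new IOException("connection reset"), 1, "GET", 3));  // true
        System.out.println(shouldRetry(new IOException("connection reset"), 1, "POST", 3)); // false
        System.out.println(shouldRetry(new UnknownHostException("no-such-host"), 1, "GET", 3)); // false
    }
}
```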
Request retry handler
- HttpRequestRetryHandler myRetryHandler = new HttpRequestRetryHandler() {
- public boolean retryRequest(
- IOException exception,
- int executionCount,
- HttpContext context) {
- if (executionCount >= 5) {
- // Do not retry if over max retry count
- return false;
- }
- if (exception instanceof InterruptedIOException) {
- // Timeout
- return false;
- }
- if (exception instanceof UnknownHostException) {
- // Unknown host
- return false;
- }
- if (exception instanceof ConnectTimeoutException) {
- // Connection refused
- return false;
- }
- if (exception instanceof SSLException) {
- // SSL handshake exception
- return false;
- }
- HttpClientContext clientContext = HttpClientContext.adapt(context);
- HttpRequest request = clientContext.getRequest();
- boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
- if (idempotent) {
- // Retry if the request is considered idempotent
- return true;
- }
- return false;
- }
- };
- CloseableHttpClient httpclient = HttpClients.custom()
- .setRetryHandler(myRetryHandler)
- .build();
Aborting requests
Redirect handling
- LaxRedirectStrategy redirectStrategy = new LaxRedirectStrategy();
- CloseableHttpClient httpclient = HttpClients.custom()
- .setRedirectStrategy(redirectStrategy)
- .build();
- CloseableHttpClient httpclient = HttpClients.createDefault();
- HttpClientContext context = HttpClientContext.create();
- HttpGet httpget = new HttpGet("http://localhost:8080/");
- CloseableHttpResponse response = httpclient.execute(httpget, context);
- try {
- HttpHost target = context.getTargetHost();
- List<URI> redirectLocations = context.getRedirectLocations();
- URI location = URIUtils.resolve(httpget.getURI(), target, redirectLocations);
- System.out.println("Final HTTP location: " + location.toASCIIString());
- // Expected to be an absolute URI
- } finally {
- response.close();
- }
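PS:
1. Crawlers should play by the rules too: to avoid burdening the target server (and getting banned), sleep for a random interval between requests.
2. When crawling non-English sites, mind the encoding. Chinese sites mostly use utf-8, though some use gb2312. Decode accordingly when fetching.
3. Collect a few reliable backup IPs (proxies). Once your own IP gets banned, switch to a backup right away. Here is a simple way to check whether a site needs a proxy: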
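- // Check whether the target site has to be accessed through a proxy
- private boolean isNeedProxy() {
-     boolean result = true;
-     try {
-         URL url = new URL("http://apkpure.com/");
-         HttpURLConnection connection = (HttpURLConnection) url.openConnection();
-         connection.setConnectTimeout(6000);
-         connection.setReadTimeout(6000); // do not hang forever on a stalled response
-         // Alternative check: int code = connection.getResponseCode();
-         // getContentLength() returns -1 when the server sends no Content-Length header
-         // (e.g. chunked responses), so such sites are reported as needing a proxy.
-         int length = connection.getContentLength();
-         if (length > 0) {
-             result = false;
-         }
-         connection.disconnect();
-     } catch (IOException e) {
-         e.printStackTrace();
-     }
-     return result;
- }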
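For point 1 above, a random pause between requests could look roughly like the following JDK-only sketch. The class name `PoliteDelay`, the helper `randomDelayMillis` and the 200 to 500 ms range are arbitrary choices for illustration; pick bounds that suit the target site.

```java
import java.util.concurrent.ThreadLocalRandom;

class PoliteDelay {
    // Pick a random delay between minMillis and maxMillis, both inclusive.
    static long randomDelayMillis(long minMillis, long maxMillis) {
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 3; i++) {
            long delay = randomDelayMillis(200, 500);
            System.out.println("sleeping " + delay + " ms before request " + i);
            Thread.sleep(delay);
            // the actual request (for example getDemo(url)) would go here
        }
    }
}
```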