A Simple Crawler with URLConnection (Repost)

Reposted from: http://bbs.itcast.cn/thread-6269-1-1.html
Saved here for future reference.

I. Getting Started with URLConnection

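URLConnection represents a communication link between the application and a URL.
Creating a connection to a URL takes several steps:
1. Create the connection object by calling openConnection on a URL.
      URL url = new URL("http://localhost:8080/day04/1.html");
      URLConnection conn = url.openConnection();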
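2. Set the setup parameters and general request properties.
      Declare that the application intends to write data to the URL connection, i.e. send data:
      conn.setDoOutput(true); // default: false
      Declare that the application intends to read data from the URL connection, i.e. receive data:
      conn.setDoInput(true); // default: true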
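3. Call connect to establish the actual connection to the remote object.
      conn.connect();
      This only opens a TCP connection to the server; the HTTP request itself has not been sent yet.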
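4. The remote object becomes available. Its header fields and content can now be accessed.
      conn.getOutputStream();
               Note:
                        If setDoOutput(true) has not been called, a java.net.ProtocolException is thrown.
                        getOutputStream implicitly calls connect.
      conn.getInputStream();
               Note: until this method is called, the data prepared above is only buffered in local memory. Calling it sends the complete HTTP request assembled in that buffer to the server.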
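
The ProtocolException note in step 4 can be checked with a small sketch. The class name DoOutputDemo and the helper tryGetOutputStream are made up for this illustration; the URL is the same placeholder used above. On OpenJDK the doOutput check appears to run before any network I/O, so no server should be needed for the false case, but treat that ordering as an assumption.

```java
import java.io.IOException;
import java.net.ProtocolException;
import java.net.URL;
import java.net.URLConnection;

public class DoOutputDemo {

    /**
     * Hypothetical helper: opens a connection object, sets doOutput as requested
     * and reports what getOutputStream() does.
     */
    static String tryGetOutputStream(boolean doOutput) {
        try {
            // Placeholder address taken from the tutorial text
            URL url = new URL("http://localhost:8080/day04/1.html");
            URLConnection conn = url.openConnection();
            conn.setDoOutput(doOutput);
            conn.getOutputStream().close();
            return "ok";
        } catch (ProtocolException e) {
            // Thrown when doOutput is false
            return "ProtocolException";
        } catch (IOException e) {
            // Any other I/O problem, e.g. no server listening when doOutput is true
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("doOutput=false -> " + tryGetOutputStream(false));
    }
}
```

Calling the helper with true only returns "ok" if something is actually listening on localhost:8080.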

package cn.itcast.url;

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.Scanner;

/**
* A simple URLConnection example
* @author <a href="mailto:liangtong@itcast.cn">梁桐</a>
*/
public class UrlConnectionTest {
        
        public static void main(String[] args) throws Exception{
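                // Describe the web resource to access
                URL url = new URL("http://localhost:8080/day04/1.html");
                // Obtain the connection object; nothing has been sent over the network yet
                URLConnection conn = url.openConnection();
               
                // Set parameters
                conn.setDoOutput(true);  // default false: whether data can be sent
                conn.setDoInput(true);   // default true: whether data can be received
                // Connect
                conn.connect();
               
                // Send data; without setDoOutput(true) this throws java.net.ProtocolException
                OutputStream out = conn.getOutputStream();
                out.write("username=rose".getBytes("UTF-8"));
                out.close();
               
                // Fetch the resource and print it to the console
                InputStream is = conn.getInputStream();
                Scanner scanner = new Scanner(is);
               
                // hasNextLine() is the check that pairs with nextLine()
                while(scanner.hasNextLine()){
                        System.out.println(scanner.nextLine());
                }
                is.close();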
        }

}
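
The example above needs a server at localhost:8080. As a self-contained variant, the sketch below starts a throwaway echo server with the JDK's built-in com.sun.net.httpserver.HttpServer on a free loopback port, POSTs a body to it through URLConnection and returns what comes back. The class name EchoPostDemo, the /echo path and the helper names are assumptions made for this illustration; they are not part of the original tutorial.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class EchoPostDemo {

    /** Reads a stream to the end. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int len;
        while ((len = in.read(chunk)) > -1) {
            buf.write(chunk, 0, len);
        }
        return buf.toByteArray();
    }

    /** Starts a local echo server, POSTs the body to it and returns the response text. */
    static String postAndEcho(String body) {
        HttpServer server = null;
        try {
            // Port 0 asks the OS for any free port
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/echo", exchange -> {
                byte[] received = readAll(exchange.getRequestBody());
                exchange.sendResponseHeaders(200, received.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(received);
                }
            });
            server.start();

            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/echo");
            URLConnection conn = url.openConnection();
            conn.setDoOutput(true); // required before getOutputStream()
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            // Asking for the input stream is what triggers the response to be read
            try (InputStream in = conn.getInputStream()) {
                return new String(readAll(in), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(postAndEcho("username=rose"));
    }
}
```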

II. URLConnection: A Simple Crawler

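The previous section gave a brief introduction to the URLConnection class. Building on that, this section implements a small feature: a crawler.
Principle: locate the downloadable resource, then have the program save it to the local disk.
Steps: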
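1. Analyze the download site

Listing page: http://sc.*.com/tubiao/index.html

Detail page: /tubiao/120602202460.htm

Download URL: http://*.com/Files/download/icon3/4620.rar

2. Reproduce the analysis above in code and obtain the download URLs

Use a loop to build the links to all listing pages.
Fetch each listing page and use a regular expression to extract every detail-page link on it.
Fetch each detail page and, again with a regular expression, extract the download URL.

3. Save the content locally
Using the download URL, save the content to the local disk.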
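
Step 2 can be sketched in isolation, without touching the network. The class name LinkExtractDemo and the method detailLinks are invented for this illustration, and example.com stands in for the site whose host is masked in the text above.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractDemo {

    // Matches detail-page paths such as /tubiao/120602202460.htm
    private static final Pattern DETAIL = Pattern.compile("/tubiao/\\d+\\.htm");

    /** Returns the absolute detail-page URLs found in html, without duplicates. */
    static List<String> detailLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        try {
            URL base = new URL(baseUrl);
            Matcher m = DETAIL.matcher(html);
            while (m.find()) {
                // Resolve the relative path against the listing page URL
                String absolute = new URL(base, m.group()).toString();
                if (!links.contains(absolute)) {
                    links.add(absolute);
                }
            }
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up markup: the first two anchors point to the same detail page
        String html = "<a href=\"/tubiao/120602202460.htm\"><img/></a>"
                + "<a href=\"/tubiao/120602202460.htm\">icon</a>"
                + "<a href=\"/tubiao/120524539010.htm\">icon</a>";
        for (String link : detailLinks("http://example.com/tubiao/index.html", html)) {
            System.out.println(link);
        }
    }
}
```

Unlike the line-by-line matching in the full program, this sketch calls find() in a loop, so several links on one line are all picked up.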

package cn.itcast.download;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * A simple crawler
 * @author <a href="mailto:liangtong@itcast.cn">梁桐</a>
 */
public class ICOLoad {

        public static void main(String[] args) throws Exception {
                // Walk through every listing page of the test site. The first listing page has a special URL, so it is handled separately
                for (int i = 1; i <= 266; i++) {
                        String url;
                        if (i == 1) {
                                url = "http://sc.*.com/tubiao/index.html";
                        } else {
                                url = "http://sc.*.com/tubiao/index_" + i + ".html";
                        }
                        System.out.println("################# " + url);
                        // Collect every usable detail-page link on the current listing page
                        List<String> paths = ICOLoad.getPaths(url, 1);
                        for (String path : paths) {
                                System.out.println("Fetching " + path);
                                // Collect the download links on the current detail page
                                List<String> downURLs = ICOLoad.getPaths(path, 2);
                                for (String downURL : downURLs) {
                                        System.out.println("Downloading ... " + downURL);
                                        // Download
                                        ICOLoad.downIcon(downURL);
                                        System.out.println(downURL + " downloaded");

                                }

                        }

                }

        }

        /**
         * Fetches a page and extracts the required paths from it.
         * @param url   the page to fetch
         * @param count 1 to parse a listing page, 2 to parse a detail page
         * @return the extracted URLs; empty if nothing matched or the fetch failed
         */
        public static List<String> getPaths(String url, int count) {
                List<String> paths = new ArrayList<String>();
                try {
                        URL icoUrl = new URL(url);
                        URLConnection conn = icoUrl.openConnection();
                        // The test site is encoded in gb2312.
                        // try-with-resources closes the reader even if reading fails.
                        try (BufferedReader reader = new BufferedReader(
                                        new InputStreamReader(conn.getInputStream(), "gb2312"))) {
                                String line;
                                while ((line = reader.readLine()) != null) {
                                        // A plain if/else keeps the example simple
                                        if (count == 1) {
                                                // Listing page
                                                String path = findPath(line);
                                                if (path != null) {
                                                        // Detail-page links on the listing page are relative; build the full URL here
                                                        String completeUrl = new URL(icoUrl, path).toString();
                                                        // Skip duplicates: on the test site the icon and the text link each carry the same link
                                                        if (!paths.contains(completeUrl)) {
                                                                paths.add(completeUrl);
                                                        }
                                                }
                                        } else if (count == 2) {
                                                // Detail page: look for the download URL
                                                String path = findDownURL(line);
                                                if (path != null) {
                                                        paths.add(path);
                                                }
                                        }
                                }
                        }
                } catch (MalformedURLException e) {
                        e.printStackTrace();
                } catch (UnsupportedEncodingException e) {
                        e.printStackTrace();
                } catch (IOException e) {
                        e.printStackTrace();
                }
                return paths;
        }

        /**
         * Handles a listing page: extracts a detail-page path such as
         * href="/tubiao/120602202460.htm"
         * The dot before "htm" is escaped so that it only matches a literal '.'.
         */
        private static Pattern pathPattern = Pattern.compile(".*(/tubiao/[\\d]+\\.htm).*");
        public static String findPath(String str) {
                // String str = "<span><a target=\"_blank\" href=\"/tubiao/120524539010.htm\" alt=\"Website icon download\">Website icon download</a></span>";

                Matcher matcher = pathPattern.matcher(str);
                if (matcher.find()) {
                        return matcher.group(1);
                }
                return null;

        }

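        /**
         * Handles a detail page; only icon downloads are matched.
         * <a href="http://*.com/Files/DownLoad/icon/rw_11.rar">ICO China Netcom download</a>
         * <a href="http://*.com/Files/download/icon3/4620.rar">ICO China Netcom download</a>
         */
        // NOTE: the host is masked with '*' in the original post. Replace it with the real host
        // (dots escaped) before use; as written, "/*" is read by the regex engine as "zero or more slashes".
        private static Pattern downPattern = Pattern.compile(".*(http://*.com/.*icon.*\\.rar).*");
        public static String findDownURL(String str) {
                Matcher matcher = downPattern.matcher(str);
                if (matcher.find()) {
                        return matcher.group(1);
                }
                return null;
        }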

        /**
         * Directory where downloaded files are stored
         */
        private static File downFile = new File("D:\\Image\\icoDown\\");
        static {
                if (!downFile.exists()) {
                        downFile.mkdirs();
                }
        }

        /**
         * Downloads the given resource.
         * @param path the download URL
         */
        public static void downIcon(String path) {
                try {
                        URL url = new URL(path);
                        String icoFileUrl = url.getFile();
                        String fileName = getFileName(icoFileUrl.substring(1 + icoFileUrl
                                        .lastIndexOf("/")));
                        File icoFile = new File(downFile, fileName);
                        if (!icoFile.exists()) {
                                URLConnection conn = url.openConnection();
                                // try-with-resources closes both streams even if the copy fails;
                                // FileOutputStream creates the file, so createNewFile() is not needed
                                try (InputStream is = conn.getInputStream();
                                     OutputStream out = new FileOutputStream(icoFile)) {
                                        byte[] buf = new byte[1024];
                                        int len;
                                        while ((len = is.read(buf)) > -1) {
                                                out.write(buf, 0, len);
                                        }
                                }
                        }
                } catch (MalformedURLException e) {
                        e.printStackTrace();
                } catch (FileNotFoundException e) {
                        e.printStackTrace();
                } catch (IOException e) {
                        e.printStackTrace();
                }
        }

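        /**
         * Returns the local name for a downloaded file. If the name is already taken,
         * a numeric name is incremented by 1; any other name gets "_1" appended.
         * @param fileName the file name taken from the download URL
         * @return a file name that does not exist yet in the download directory
         */
        public static String getFileName(String fileName) {
                File file = new File(downFile, fileName);
                if (!file.exists()) {
                        return fileName;
                }
                int dot = fileName.lastIndexOf('.');
                String base = dot < 0 ? fileName : fileName.substring(0, dot);
                String ext = dot < 0 ? "" : fileName.substring(dot);
                try {
                        // Numeric names such as 4620.rar: try the next number
                        return getFileName((Long.parseLong(base) + 1) + ext);
                } catch (NumberFormatException e) {
                        // Non-numeric names such as rw_11.rar would otherwise crash here
                        return getFileName(base + "_1" + ext);
                }
        }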
}
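
As a possible alternative to the manual copy loop in downIcon, java.nio.file.Files.copy can stream a URL's content straight into a file. The sketch below is not part of the original program; the class name CopyDownloadDemo and both helper methods are made up, and the round trip goes through a temporary file: URL so that it runs without a web server.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CopyDownloadDemo {

    /** Copies whatever the URL serves into target and returns the number of bytes written. */
    static long download(URL source, Path target) {
        try (InputStream in = source.openConnection().getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes content to a temp file, "downloads" it via a file: URL and returns the copy's text. */
    static String roundTrip(String content) {
        try {
            Path src = Files.createTempFile("demo-src", ".rar");
            Path dst = Files.createTempFile("demo-dst", ".rar");
            Files.write(src, content.getBytes(StandardCharsets.UTF_8));
            download(src.toUri().toURL(), dst);
            String copied = new String(Files.readAllBytes(dst), StandardCharsets.UTF_8);
            Files.delete(src);
            Files.delete(dst);
            return copied;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("icon bytes"));
    }
}
```

The same download method should work for an http URL, since it only relies on URLConnection.getInputStream().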

Original post: https://www.cnblogs.com/qinxike/p/2855019.html