• Java Web Scraper


    WikiScraper.java

    package master.haku.scrape;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.net.*;
    import java.io.*;
    
    public class WikiScraper {
        public static void main(String[] args) {
            scrapeTopic("/wiki/Python");
        }
    
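        public static void scrapeTopic(String url) {
            String html = getUrl("https://en.wikipedia.org" + url);
            Document doc = Jsoup.parse(html);
            // first() returns null when nothing matches, e.g. after a failed
            // fetch or if the page markup differs from what the selector expects.
            org.jsoup.nodes.Element firstParagraph = doc.select("#mw-content-text > p").first();
            if (firstParagraph == null) {
                System.out.println("No matching paragraph found");
                return;
            }
            System.out.println(firstParagraph.text());
        }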
    
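        public static String getUrl(String url) {
            URL urlObj;
            try {
                urlObj = new URL(url);
            } catch (MalformedURLException e) {
                System.out.println("The url was malformed!");
                return "";
            }
    
            // StringBuilder avoids re-copying the whole page on every appended line.
            StringBuilder outputText = new StringBuilder();
    
            // try-with-resources closes the reader even if reading fails midway.
            // Decode explicitly as UTF-8 instead of relying on the platform default.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    urlObj.openConnection().getInputStream(),
                    java.nio.charset.StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Keep the line break so text on adjacent lines does not run together.
                    outputText.append(line).append('\n');
                }
            } catch (IOException e) {
                System.out.println("There was an error connecting to the URL");
                return "";
            }
    
            return outputText.toString();
        }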
    }

    Output:

    A python is a constricting snake belonging to the Python (genus), or, more generally, any snake in the family Pythonidae (containing the Python genus).
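The read loop in getUrl can be tried without a network connection by feeding it an in-memory stream. The following is a minimal sketch under that assumption; the class name StreamReaderSketch and the method readAll are hypothetical helpers introduced here for illustration and are not part of the original WikiScraper class. It uses only the standard library, so jsoup is not required to run it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original WikiScraper class.
class StreamReaderSketch {

    // Reads the whole stream as UTF-8 text, ending every line with '\n'.
    static String readAll(InputStream stream) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream) on exit.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a downloaded page; no network access is needed.
        byte[] page = "<p>first</p>\n<p>second</p>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page)));
    }
}
```

In a scraper, the stream passed to readAll would come from URLConnection.getInputStream(), and the returned string could then be handed to Jsoup.parse as in scrapeTopic.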

  • Original article: https://www.cnblogs.com/davidgu/p/4836305.html