• Parsing web pages in Java with Jsoup: a code walkthrough


    This article is reposted from https://www.cnblogs.com/boy1025/p/5040495.html, with minor changes.

    1. Jsoup official website: http://jsoup.org/

    Put simply, Jsoup is a tool for parsing web pages. The official description:

    [Image: screenshot of the official description]

    2. Basic Jsoup usage: http://www.open-open.com/jsoup/parsing-a-document.htm

    [Image: screenshot of the basic usage documentation]
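
    To make the basic usage concrete, here is a minimal sketch that parses an HTML string instead of a live page. The class name `ParseBasics`, its helper methods, and the sample markup are assumptions made for illustration; only `Jsoup.parse`, `Document.title`, `select`, `first`, and `text` come from the Jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseBasics {
    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
            "<html><head><title>Demo page</title></head>"
          + "<body><p class=\"intro\">Hello, <b>Jsoup</b>!</p></body></html>";

    /** Returns the text of the document's <title> element. */
    static String pageTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    /** Returns the text of the first element matching the CSS query, or "" if nothing matches. */
    static String firstText(String html, String cssQuery) {
        Element match = Jsoup.parse(html).select(cssQuery).first();
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        System.out.println(pageTitle(HTML));
        System.out.println(firstText(HTML, "p.intro"));
    }
}
```

    Parsing from a string like this is handy for experimenting with selectors before pointing the code at a real URL with `Jsoup.connect(url).get()`.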

    3. Demo. URL to parse: http://sex.guokr.com/

         1. Parsing a ul -> li list

    [Image: screenshot of the list on the page]

    Let's look at the HTML source of this section:

    [Image: screenshot of the HTML source]

    Now that we know the rough structure, let's write the code:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;

    /**
     * Parses a URL with Jsoup.
     * Target URL: http://sex.guokr.com/
     * Created by monster on 2015/12/11.
     */
    public class JsoupZX {
        public static void main(String[] args) {
            final String url = "http://sex.guokr.com/";

            try {
                // Fetch the page and parse it into a DOM tree
                Document doc = Jsoup.connect(url).get();

                // Narrow down step by step: .container -> .module-list -> .clearfix.
                // select() searches inside the elements already matched,
                // so the intermediate results do not need to be re-parsed.
                Elements container = doc.getElementsByClass("container");
                Elements module = container.select(".module-list");

                // Selector style; the DOM-style equivalent on a single element
                // is element.getElementsByClass("clearfix")
                Elements clearfix = module.select(".clearfix");

                for (Element clearfixli : clearfix) {
                    Elements kind = clearfixli.select(".board-tag");
                    Elements title = clearfixli.select(".tit-post");
                    Elements author = clearfixli.select("span a");

                    System.out.println("Category: " + kind.text());
                    System.out.println("Title: " + title.text());
                    System.out.println("Author: " + author.text());
                    System.out.println("Detail link: " + title.attr("href"));  // link on the title
                    System.out.println("=====================");
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    Result:

    [Image: screenshot of the console output]
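
    The list-parsing approach above can also be tried without network access. The following is a minimal sketch, assuming hypothetical markup that only mimics the class names used above (`container`, `module-list`, `clearfix`, `board-tag`, `tit-post`); the real page structure may differ. The class `PostListSketch`, the method `summarize`, and the sample data are illustrative and not part of the original code. Passing a base URI to `Jsoup.parse` lets `absUrl("href")` resolve relative links into absolute ones.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class PostListSketch {
    // Hypothetical markup that only mimics the class names used in the demo.
    static final String HTML =
            "<div class=\"container\"><ul class=\"module-list\">"
          + "<li class=\"clearfix\">"
          + "<a class=\"board-tag\" href=\"/board/1/\">Health</a>"
          + "<a class=\"tit-post\" href=\"/post/1/\">First post</a>"
          + "<span><a href=\"/user/a/\">alice</a></span>"
          + "</li>"
          + "<li class=\"clearfix\">"
          + "<a class=\"board-tag\" href=\"/board/2/\">Science</a>"
          + "<a class=\"tit-post\" href=\"/post/2/\">Second post</a>"
          + "<span><a href=\"/user/b/\">bob</a></span>"
          + "</li>"
          + "</ul></div>";

    /** Builds one "category | title | author | link" row per list item. */
    static List<String> summarize(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> rows = new ArrayList<>();
        // A descendant selector scopes the search in a single query
        for (Element li : doc.select(".container .module-list li.clearfix")) {
            String kind = li.select(".board-tag").text();
            String title = li.select(".tit-post").text();
            String author = li.select("span a").text();
            // absUrl resolves a relative href against the base URI
            Element titleLink = li.select(".tit-post").first();
            String link = titleLink == null ? "" : titleLink.absUrl("href");
            rows.add(kind + " | " + title + " | " + author + " | " + link);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : summarize(HTML, "http://sex.guokr.com/")) {
            System.out.println(row);
        }
    }
}
```

    When the document comes from `Jsoup.connect(url).get()`, the base URI is already set to the fetched URL, so `absUrl` works there without extra setup.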

    =================================================================================================

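    2. Parsing a post's detail page and its comments. Link: http://sex.guokr.com/post/1100992/

      [Image: screenshot of the post detail page]

    That is the content of the page. Now let's look at the source:

    Content:

    [Image: screenshot of the HTML source of the post content]

    Comments:

    [Image: screenshot of the HTML source of the comments]

    Having looked at the source, let's write the code: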

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;

    /**
     * Parses a post's details and comments with Jsoup.
     * Target URL: http://sex.guokr.com/post/1100992/
     * Created by monster on 2015/12/11.
     */
    public class JSoupDetail {

        public static void main(String[] args) {
            final String url = "http://sex.guokr.com/post/1100992/";

            try {
                Document doc = Jsoup.connect(url).get();

                // Everything of interest sits inside .container; select() searches
                // within the matched elements, so no re-parsing is needed.
                Elements container = doc.getElementsByClass("container");

                String articleTitle = container.select("#articleTitle").text();
                String authorName = container.select("#authorName").text();
                // Positional lookups: these indexes depend on the page's markup
                String time = container.select("span").first().text();
                String imgphotoUrl = container.select("img").get(1).attr("src");
                System.out.println("Title: " + articleTitle);
                System.out.println("Author: " + authorName);
                System.out.println("Published: " + time);
                System.out.println("Author avatar URL: " + imgphotoUrl);

                // The post body is a sequence of <p> elements inside #articleContent
                Elements paragraphs = container.select("#articleContent").select("p");
                System.out.println("Paragraph count: " + paragraphs.size());

                System.out.println("Post content:");
                for (Element paragraph : paragraphs) {
                    System.out.println(paragraph.text());
                }

                System.out.println("================================================");
                System.out.println("Comment section (by floor)");

                Elements cmts = container.select(".cmts");
                System.out.println("Comment floors: " + cmts.select("span").first().text());

                Elements cmtslist = cmts.select(".cmts-list");

                for (Element clearfix : cmtslist) {
                    // Positional lookups: these indexes depend on the page's markup
                    String user = clearfix.select("a").get(1).text();
                    String userPhotoUrl = clearfix.select("img").get(0).attr("src");
                    String replyTime = clearfix.select("a").get(3).text();
                    String floor = clearfix.select("span").text();

                    System.out.println("Commenter: " + user + "\n"
                            + "Commenter avatar URL: " + userPhotoUrl + "\n"
                            + "Reply time: " + replyTime + "\n"
                            + "Floor: " + floor);

                    System.out.println("Comment content:");
                    for (Element paragraph : clearfix.getElementsByClass("cmt-content").select("p")) {
                        System.out.println(paragraph.text());
                    }
                    System.out.println("================================================");
                }

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    Output:

    [Image: screenshot of the console output]
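
    The detail-page code relies on calls such as `first()` and `get(1)`, which return `null` or throw `IndexOutOfBoundsException` when the page layout changes and nothing matches. A minimal sketch of a more defensive variant follows; the class `SafeExtract`, its helpers `textOr` and `attrAt`, the fallback strings, and the sample fragment are all assumptions for illustration, not part of the original code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SafeExtract {
    // Hypothetical fragment: it has a title and one image, but no #authorName.
    static final String SAMPLE =
            "<div class=\"container\">"
          + "<h1 id=\"articleTitle\">Hello</h1>"
          + "<img src=\"logo.png\">"
          + "</div>";

    /** Text of the first match under root, or the fallback when nothing matches. */
    static String textOr(Element root, String cssQuery, String fallback) {
        Element match = root.select(cssQuery).first();
        return match == null ? fallback : match.text();
    }

    /** Attribute of the index-th match, or the fallback when there are too few matches. */
    static String attrAt(Element root, String cssQuery, int index, String attr, String fallback) {
        Elements matches = root.select(cssQuery);
        return (index >= 0 && index < matches.size()) ? matches.get(index).attr(attr) : fallback;
    }

    /** Summarizes the fields the detail-page demo reads, without throwing on missing ones. */
    static String describe(String html) {
        Document doc = Jsoup.parse(html);
        return "title=" + textOr(doc, "#articleTitle", "(unknown)")
             + "; author=" + textOr(doc, "#authorName", "(unknown)")
             + "; avatar=" + attrAt(doc, "img", 1, "src", "(none)");
    }

    public static void main(String[] args) {
        System.out.println(describe(SAMPLE));
    }
}
```

    Whether to fall back silently or fail fast is a design choice: for a one-off scraper, an exception that points at the broken selector may be more useful than a placeholder value.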

  • Original article: https://www.cnblogs.com/nayitian/p/16269070.html