• Java | Technical Application | Processing Pages with Jsoup


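    Given the URL of a WeChat official account article, this example crawls the article content, parses the page source with jsoup, and extracts the article fields using selectors based on XPath expressions.

    A regular expression is then used to read the article's publication date.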
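
The listing further down keeps XPath expressions such as //*[@id="js_name"] in comments but does the actual lookup with jsoup CSS selectors. For this simple id-only form the translation is mechanical. The sketch below is a hypothetical helper (the class XPathIdToCss and the method toCss are not part of the original post) that handles only that form and returns empty for anything else. It assumes ids are plain identifiers that need no CSS escaping.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XPathIdToCss {
    // Accepts only //*[@id="x"] or //tag[@id="x"], optionally followed by plain /child steps.
    private static final Pattern ID_XPATH = Pattern.compile(
            "//(\\*|[A-Za-z][A-Za-z0-9]*)\\[@id=\"([A-Za-z_][A-Za-z0-9_-]*)\"\\]((?:/[A-Za-z][A-Za-z0-9]*)*)");

    /** Returns the equivalent CSS selector, or empty if the XPath is outside the supported form. */
    public static Optional<String> toCss(String xpath) {
        Matcher m = ID_XPATH.matcher(xpath.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String tag = m.group(1);
        StringBuilder css = new StringBuilder();
        if (!"*".equals(tag)) {
            css.append(tag);
        }
        css.append('#').append(m.group(2));
        // An XPath "/name" step selects children, which is the CSS ">" combinator.
        for (String step : m.group(3).split("/")) {
            if (!step.isEmpty()) {
                css.append(" > ").append(step);
            }
        }
        return Optional.of(css.toString());
    }

    public static void main(String[] args) {
        System.out.println(toCss("//*[@id=\"js_name\"]").orElse("unsupported"));
        System.out.println(toCss("//*[@id=\"js_content\"]/section").orElse("unsupported"));
        System.out.println(toCss("//div[2]/p").orElse("unsupported"));
    }
}
```

The post's own selectors (a#js_name, div#js_content) also pin the tag name, which is slightly stricter than the bare #id form this helper produces from a //* expression.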

    Jsoup

    <!-- jsoup HTML parser library @ http://jsoup.org/ -->
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.10.2</version>
    </dependency>
    package search;
    
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;
    
    public class Files_process {
        
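        /**
         * Fetches the article at the given URL and returns four fields:
         * [0] title, [1] official account name, [2] body text, [3] publish date (null if none is found).
         */
        public String[] get_content(String path) throws IOException {
            String[] content = new String[4];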
            Document document = Jsoup.connect(path).get();
            
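            // XPath of the publish time element: //*[@id="publish_time"]
            // The date is read from the raw HTML with a regex further down, so this selection is not used.
            Elements em = document.select("script");
            // XPath of the article body: //*[@id="img-content"] and //*[@id="js_content"]/section
            // Get the main content
            Elements page_content = document.select("div#js_content");
            // XPath of the account name: //*[@id="js_name"]
            // Get the official account name
            Elements cname = document.select("a#js_name");
            content[0] = document.title();     // article title
            content[1] = cname.text();         // official account name
            content[2] = page_content.text();  // article body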
            
            String code = document.html();
            // Matches the first yyyy-MM-dd date in the page source; year 0000 and February 29 are not matched.
            String str = "([0-9]{3}[1-9]|[0-9]{2}[1-9][0-9]{1}|[0-9]{1}[1-9][0-9]{2}|[1-9][0-9]{3})-(((0[13578]|1[02])-(0[1-9]|[12][0-9]|3[01]))|((0[469]|11)-(0[1-9]|[12][0-9]|30))|(02-(0[1-9]|[1][0-9]|2[0-8])))";
            Pattern pattern = Pattern.compile(str);
            Matcher matcher = pattern.matcher(code);
            // content[3] stays null when no date is found
            if (matcher.find()) {
                content[3] = matcher.group();
            }
            return content; 
        }
        
    }
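
The date pattern in get_content enumerates month lengths by hand and therefore never matches February 29. As a possible alternative, the sketch below matches only the yyyy-MM-dd shape and lets java.time.LocalDate decide whether the candidate is a real calendar date. The class PublishDateExtractor, the method firstDate, and the sample input strings are illustrative assumptions, not taken from the original post or from an actual WeChat page. Unlike the original pattern, it requires word boundaries around the date and does not reject year 0000.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PublishDateExtractor {
    // Shape check only: four digits, month 01-12, day 01-31.
    private static final Pattern DATE_SHAPE =
            Pattern.compile("\\b\\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])\\b");

    /** Returns the first yyyy-MM-dd substring that is a valid calendar date, if any. */
    public static Optional<String> firstDate(String text) {
        Matcher m = DATE_SHAPE.matcher(text);
        while (m.find()) {
            try {
                // LocalDate.parse is strict, so 2018-02-30 or 2018-02-29 is rejected here.
                LocalDate.parse(m.group());
                return Optional.of(m.group());
            } catch (DateTimeParseException e) {
                // Right shape but not a real date; keep scanning.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical page fragments, used only to exercise the method.
        System.out.println(firstDate("var publish_time = \"2018-10-10\";").orElse("not found"));
        System.out.println(firstDate("2018-02-30 then 2020-02-29").orElse("not found"));
        System.out.println(firstDate("no date here").orElse("not found"));
    }
}
```

Returning an Optional also makes the "no date found" case explicit, where the original leaves content[3] as null.
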
    package search;
    
    
    public class processed {
        public static void main(String[] args) throws Exception {
            Files_process fp = new Files_process();

            String[] content = fp.get_content("http://mp.weixin.qq.com/s?__biz=MjM5NTc5ODM4Ng==&mid=2650901488&idx=1&sn=2a9924f776bc9683ff8e1a1e66fa4214&chksm=bd0627ed8a71aefb07a81e3df3444bb20011ecaaab3050d9f11ccba6f4a66239943dc2784cc4#rd");
            System.out.println("msg_title: " + content[0]);
            System.out.println("nickname: " + content[1]);
            System.out.println("msg_content: " + content[2]);
            System.out.println("msg_time: " + content[3]);
            System.out.println("msg_link: " + "");
            System.out.println();
        }
    }
  • Original post: https://www.cnblogs.com/jj81/p/9769399.html