htmlcleaner download: htmlcleaner2_1.jar; source download: htmlcleaner2_1-all.zip
First, write an HTML file to test with: html-clean-demo.html
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" dir="ltr">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=GBK" />
        <meta http-equiv="Content-Language" content="zh-CN" />
        <title>html clean demo</title>
    </head>
    <body>
        <div class="d_1">
            <ul>
                <li>bar</li>
                <li>foo</li>
                <li>gzz</li>
            </ul>
        </div>
        <div>
            <ul>
                <li><a name="my_href" href="1.html">text-1</a></li>
                <li><a name="my_href" href="2.html">text-2</a></li>
                <li><a name="my_href" href="3.html">text-3</a></li>
                <li><a name="my_href" href="4.html">text-4</a></li>
            </ul>
        </div>
    </body>
    </html>
A mock requirement: extract the title, the links with name="my_href", and the content of every li under the div with class="d_1". The code below does this with htmlcleaner: HtmlCleanerDemo.java
    package com.chenlb;

    import java.io.File;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    /**
     * htmlcleaner usage example.
     *
     * @author chenlb 2008-11-26 02:12:02 PM
     */
    public class HtmlCleanerDemo {

        public static void main(String[] args) throws Exception {
            HtmlCleaner cleaner = new HtmlCleaner();
            // Parse the test file, reading it as GBK to match its charset declaration.
            TagNode node = cleaner.clean(new File("html/html-clean-demo.html"), "GBK");

            // Select by tag name (true = search the whole subtree).
            Object[] ns = node.getElementsByName("title", true); // the title
            if (ns.length > 0) {
                System.out.println("title=" + ((TagNode) ns[0]).getText());
            }

            System.out.println("ul/li:");
            // Select by XPath: every li under the div with class="d_1".
            ns = node.evaluateXPath("//div[@class='d_1']//li");
            for (Object on : ns) {
                TagNode n = (TagNode) on;
                System.out.println(" text=" + n.getText());
            }

            System.out.println("a:");
            // Select by attribute value: every element with name="my_href".
            ns = node.getElementsByAttValue("name", "my_href", true, true);
            for (Object on : ns) {
                TagNode n = (TagNode) on;
                System.out.println(" href=" + n.getAttributeByName("href") + ", text=" + n.getText());
            }
        }
    }
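The argument to cleaner.clean() can be a file, a URL, or a string of HTML content. In my opinion, the methods you will use most often are evaluateXPath, getElementsByAttValue, and getElementsByName. One more note: htmlcleaner copes well with non-standard (malformed) HTML.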
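To make it easier to see what the XPath expression //div[@class='d_1']//li selects, here is a minimal sketch that runs the same expression with the JDK's standard javax.xml.xpath API instead of htmlcleaner. This is a swapped-in technique for illustration only, not htmlcleaner's API: the class name XPathSketch, the helper select, and the HTML constant are all made up for this example, and the input is a trimmed copy of the demo page without the DOCTYPE and namespace. Unlike htmlcleaner, the JDK parser only accepts well-formed markup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical, trimmed copy of the demo page (well-formed, no DOCTYPE, no namespace).
    static final String HTML =
        "<html><body>"
        + "<div class=\"d_1\"><ul><li>bar</li><li>foo</li><li>gzz</li></ul></div>"
        + "<div><ul><li><a name=\"my_href\" href=\"1.html\">text-1</a></li></ul></div>"
        + "</body></html>";

    /** Returns the trimmed text content of every node matched by the XPath expression. */
    static List<String> select(String xml, String expression) throws Exception {
        // Parse the string into a DOM tree with the JDK's built-in XML parser.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Evaluate the expression and collect the matched nodes.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Every li under the div whose class attribute is "d_1".
        System.out.println(select(HTML, "//div[@class='d_1']//li"));
        // The href attribute of every a element with name="my_href".
        System.out.println(select(HTML, "//a[@name='my_href']/@href"));
    }
}
```

The first query matches only the li elements inside the first div, because the predicate [@class='d_1'] filters out the second div before //li descends into it. For real-world pages that are not well-formed, htmlcleaner remains the better fit, since this parser rejects them outright.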