lmth1: a handy tool for extracting information from web pages

0, Why lmth1?

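Almost everyone who plays with Python has used urllib, and almost everyone who scrapes data has used BeautifulSoup. I am no exception: I do nearly all of my scraping with BeautifulSoup.
BeautifulSoup is capable, but its API is clumsy and never feels smooth to use. Compared with such a conventional API, I much prefer jQuery's fluent API. So I spent two evenings building two libraries on top of BeautifulSoup, lmth and lmth1: lmth provides the basic functionality and handles Hpath parsing, while lmth1 provides the fluent API used for data extraction.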

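lmth1's interface is very simple, and its implementation is simpler still: fewer than 300 lines of code. Yet it is powerful. You will soon see how lmth1 does in one line what takes ten lines with BeautifulSoup, and more readably too.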

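1, Introduction

As the title says.

Before use, place lmth.py, lmth1.py and beautifulsoup.py in a directory where Python can find them (one on its module search path).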

2, Hpath

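Hpath is an HTML path query expression language that I defined, similar to XPath. Its syntax is very simple; a few examples are enough to explain it. If you need a strict definition, see the BNF definition in 2.2.

2.1 Examples

Note: whenever an example here speaks of getting elements, it means the elements found under the target node.

Sample HTML used in the examples: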

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
    <title>Untitled Page</title>
</head>
<body>
<h1 id="title">Page list</h1>
<div id="content" class="sites">
    <a href="http://www.google.com/" class="good">Google</a>
    <a href="http://www.yahoo.com/" class="good">Yahoo</a>
    <a href="http://www.baidu.com/" class="asshole">Baidu</a>
    <a href="http://www.bing.com/" class="excellent">Bing</a>
</div>
<div id="tbl">
    <ul>
    <li class="odd">1</li>
    <li class="even">2</li>
    <li class="odd">3</li>
    <li class="even">4</li>
    <li class="odd">5</li>
    <li class="even">6</li>
    </ul>
</div>
</body>
</html>


2.1.1 Basic expressions

li

Purpose: get all li elements
Result:

[
     <li class="odd">1</li>,
     <li class="even">2</li>,
     <li class="odd">3</li>,
     <li class="even">4</li>,
     <li class="odd">5</li>,
     <li class="even">6</li>
]
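
To give a rough idea of what a bare tag expression such as li amounts to, here is a minimal sketch that uses only the standard library's html.parser. It is not how lmth does it (lmth builds on BeautifulSoup); TagCollector and SAMPLE are names made up for this illustration, and the sketch does not handle an li nested inside another li.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the attributes and text of every element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.found = []    # finished matches, in document order
        self._open = None  # the match currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._open = {"attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        # Text only counts while we are inside a matching element.
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == self.tag and self._open is not None:
            self.found.append(self._open)
            self._open = None


# A shortened copy of the sample page's list (hypothetical excerpt).
SAMPLE = """
<div id="tbl">
    <ul>
    <li class="odd">1</li>
    <li class="even">2</li>
    <li class="odd">3</li>
    </ul>
</div>
"""

collector = TagCollector("li")
collector.feed(SAMPLE)
items = [(m["attrs"].get("class"), m["text"]) for m in collector.found]
print(items)  # [('odd', '1'), ('even', '2'), ('odd', '3')]
```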


div[id=tbl]

Purpose: get all div elements whose id attribute is tbl
Tip: use attribute filters for a more precise search
Result:

<div id="tbl">
<ul>
<li class="odd">1</li>
<li class="even">2</li>
<li class="odd">3</li>
<li class="even">4</li>
<li class="odd">5</li>
<li class="even">6</li>
</ul>
</div>


div[id=content, class=sites]

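Purpose: get all div elements whose id attribute is content and whose class attribute is sites
Tip: you can set several attribute values at once; separate the attribute pairs with commas
Result: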

<div id="content" class="sites">
<a href="http://www.google.com/" class="good">Google</a>
<a href="http://www.yahoo.com/" class="good">Yahoo</a>
<a href="http://www.baidu.com/" class="asshole">Baidu</a>
<a href="http://www.bing.com/" class="excellent">Bing</a>
</div>


div[@id]

Purpose: get the id attribute value of every div element
Tip: put an @ sign in front of the attribute whose value you want
Result:

[
'content',
'tbl'
]
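
The expressions so far all have the same shape: a tag name, optionally followed by square brackets holding attribute filters (name=value) or an @name marking the attribute to return. As a sketch of how one such step could be taken apart, the hypothetical helper parse_step below uses a regular expression. This is an assumption for illustration, not lmth's actual parser; in particular, whether lmth lets filters and @name share one pair of brackets is not stated here.

```python
import re

# tag name, then an optional [...] body
STEP_RE = re.compile(r"^\s*(\w+)\s*(?:\[(.*)\])?\s*$")


def parse_step(step):
    """Split one step into (tag, filters, wanted_attr)."""
    match = STEP_RE.match(step)
    if match is None:
        raise ValueError("not a valid step: %r" % step)
    tag, body = match.group(1), match.group(2)
    filters, wanted = {}, None
    for part in (body or "").split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("@"):
            # "@id" asks for the value of the id attribute
            wanted = part[1:]
        else:
            # "id=content" keeps only elements whose id is "content"
            name, sep, value = part.partition("=")
            if not sep:
                raise ValueError("expected name=value, got %r" % part)
            filters[name.strip()] = value.strip()
    return tag, filters, wanted


print(parse_step("div[id=content, class=sites]"))
# ('div', {'id': 'content', 'class': 'sites'}, None)
print(parse_step("div[@id]"))
# ('div', {}, 'id')
```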


div[id=content]/a[@href]

Purpose: get the href attribute values of all a elements under the div elements whose id attribute is content
Result:

[
     'http://www.google.com/',
     'http://www.yahoo.com/',
     'http://www.baidu.com/',
     'http://www.bing.com/'
]
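
To show how a path with / could be evaluated step by step, here is a self-contained sketch built on the standard library's html.parser instead of BeautifulSoup. Node, TreeBuilder, parse_step and select are all hypothetical names, not lmth's API. The sketch assumes that each step searches all descendants of the nodes matched by the previous step, and it splits the path naively on /, so a filter value containing a slash would break it. Unclosed void elements such as a bare br are not handled either.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag, self.attrs, self.children = tag, dict(attrs), []
        self.parent = parent
        if parent is not None:
            parent.children.append(self)

    def descendants(self):
        for child in self.children:
            yield child
            yield from child.descendants()


class TreeBuilder(HTMLParser):
    """Build a bare element tree; text nodes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root", {})
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        self._current = Node(tag, attrs, self._current)

    def handle_endtag(self, tag):
        # Close the innermost open element with this name; ignore stray end tags.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent


def parse_step(step):
    """Compact step parser: returns (tag, filters, wanted_attr)."""
    tag, _, body = step.strip().partition("[")
    filters, wanted = {}, None
    for part in body.rstrip("]").split(","):
        part = part.strip()
        if part.startswith("@"):
            wanted = part[1:]
        elif part:
            name, _, value = part.partition("=")
            filters[name.strip()] = value.strip()
    return tag.strip(), filters, wanted


def select(root, path):
    nodes, wanted = [root], None
    for step in path.split("/"):
        tag, filters, wanted = parse_step(step)
        matches = []
        for node in nodes:
            for cand in node.descendants():
                ok = cand.tag == tag and all(
                    cand.attrs.get(k) == v for k, v in filters.items())
                if ok and cand not in matches:
                    matches.append(cand)
        nodes = matches
    # Only an @attr on the last step changes what is returned.
    if wanted is not None:
        return [n.attrs[wanted] for n in nodes if wanted in n.attrs]
    return nodes


# A shortened copy of the sample page (hypothetical excerpt).
SAMPLE = """
<div id="content" class="sites">
    <a href="http://www.google.com/" class="good">Google</a>
    <a href="http://www.yahoo.com/" class="good">Yahoo</a>
</div>
<div id="tbl"><ul><li class="odd">1</li></ul></div>
"""

builder = TreeBuilder()
builder.feed(SAMPLE)
hrefs = select(builder.root, "div[id=content]/a[@href]")
ids = select(builder.root, "div[@id]")
print(hrefs)  # ['http://www.google.com/', 'http://www.yahoo.com/']
print(ids)    # ['content', 'tbl']
```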


lmth1: a handy tool written in Python for extracting information from web pages, by _Luc_, 博客园 (cnblogs)
