Python lxml

1.  Parsing HTML and building a DOM
 
>>> import lxml.etree as etree
 
>>> html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
>>> dom = etree.fromstring(html)
>>> etree.tostring(dom)
'<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
 
To use the BeautifulSoup parser instead:
 
>>> import lxml.html.soupparser as soupparser
>>> dom = soupparser.fromstring(html)
>>> etree.tostring(dom)
'<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
 
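I strongly recommend soupparser, though, because it copes with malformed HTML far better than etree.fromstring, which is a strict XML parser and raises an error on markup that is not well-formed.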
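A small sketch of that difference on malformed markup. Since soupparser needs the separate BeautifulSoup package, the tolerant side here uses lxml.html.fromstring as a stand-in; that substitution and the sample string `broken` are assumptions made for illustration, not part of the walkthrough above.

```python
import lxml.etree as etree
import lxml.html

# Hypothetical sample: unclosed <p> and <br> tags, common in real-world HTML.
broken = '<html><body><p>one<p>two<br></body></html>'

# The strict XML parser rejects markup that is not well-formed.
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# A tolerant HTML parser repairs the structure instead of failing.
doc = lxml.html.fromstring(broken)
paragraphs = [p.text for p in doc.iter('p')]

print(strict_ok)
print(paragraphs)
```

soupparser.fromstring should be similarly forgiving, though the exact tree it builds for broken input may differ from this one.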
 
2.  Accessing Elements through the DOM
 
Number of child elements:
 
>>> len(dom)
1
 
Access a child element:
 
>>> dom[0].tag
'body'
 
Iterate over the children:
 
>>> for child in dom:
...     print(child.tag)
... 
body
 
Find a node's index within its parent:
 
>>> body = dom[0]
 
>>> dom.index(body)
0
 
Get the parent node from a child node:
 
>>> body.getparent().tag
'html'
 
Iterate over the node itself and all of its descendants:
 
>>> for ele in dom.iter():
...     print(ele.tag)
... 
html
body
div
div
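The navigation calls in this section (iter, getparent, index) can be combined. The sketch below builds a readable location string for every element; the helper name `path_of` is hypothetical, and the bracketed numbers are 0-based child positions as returned by index(), not XPath positions.

```python
import lxml.etree as etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)

def path_of(element):
    """Hypothetical helper: walk up with getparent() and record each child index."""
    parts = []
    node = element
    while node is not None:
        parent = node.getparent()
        if parent is None:
            # The root has no parent, so it has no index.
            parts.append(node.tag)
        else:
            parts.append('%s[%d]' % (node.tag, parent.index(node)))
        node = parent
    return '/'.join(reversed(parts))

paths = [path_of(ele) for ele in dom.iter()]
for p in paths:
    print(p)
```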
 
3.  Accessing node attributes
 
>>> body.get('id')
'1'
 
Alternatively, through the attrib mapping:
 
>>> attrs = body.attrib
>>> attrs.get('id')
'1'
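A short sketch of a few more attribute operations, assuming the same sample document. The attribute name `class`, its value `main`, and the fallback string are made up for the example.

```python
import lxml.etree as etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)
body = dom[0]

# get() accepts a default that is returned when the attribute is absent.
missing = body.get('lang', 'n/a')

# set() adds or overwrites an attribute; attrib reflects the change.
body.set('class', 'main')
all_attrs = sorted(body.attrib.items())

print(missing)
print(all_attrs)
```

Because attrib behaves like a dictionary, the usual keys(), items() and membership tests work on it as well.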
 
4.  Accessing an Element's content
 
>>> body.text
'abc'
>>> body.tail
 
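text covers only the text from the start of this node up to its first child node. tail is the text that follows this node's closing tag, up to the next sibling or the end of the parent. Nothing follows body's closing tag, so body.tail is None and the interpreter prints nothing; 'def' and 'ghi' are the tails of the two div elements.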
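To make the text/tail split concrete, this sketch lists both values for every element of the sample document. The variable name `pieces` is just for illustration.

```python
import lxml.etree as etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)

# For each element: its tag, the text before its first child,
# and the text that follows its closing tag.
pieces = [(ele.tag, ele.text, ele.tail) for ele in dom.iter()]
for tag, text, tail in pieces:
    print(tag, repr(text), repr(tail))
```

The strings 'def' and 'ghi' sit inside body in the markup, yet lxml stores them as the tail of the div that precedes each of them.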
 
Get all text nodes directly under this node:
 
>>> body.xpath('text()')
['abc', 'def', 'ghi']
 
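Get all text in this node and its descendants:
 
>>> body.xpath('.//text()')
['abc', '123', 'def', '456', 'ghi']
 
Note the leading dot. A path that begins with // is evaluated from the document root, so body.xpath('//text()') returns every text node in the whole document, whichever element it is called on. Here body holds all of the document's text, so both forms give the same list.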
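The effect of the leading // is easier to see on a smaller element. A minimal sketch, assuming the same sample document and picking the first div as the context node:

```python
import lxml.etree as etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)
first_div = dom.xpath('//div')[0]

# '//' starts at the document root, regardless of the context node.
whole_document = first_div.xpath('//text()')

# './/' starts at the context node and stays inside it.
inside_div = first_div.xpath('.//text()')

print(whole_document)
print(inside_div)
```

The tail 'def' is not part of the second result: it belongs to body's content, not to the div's.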
 
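body.text_content() returns all the text in this node and its descendants as a single string. It is only available on elements created by lxml.html (which includes trees built by soupparser); plain etree elements do not have this method.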
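A hedged sketch of both routes to the full text of a node. It assumes lxml.html.fromstring as the HTML-aware parser (soupparser would need BeautifulSoup installed) and uses itertext() as an equivalent for plain etree elements.

```python
import lxml.etree as etree
import lxml.html

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'

# Elements from lxml.html carry the text_content() method.
html_body = lxml.html.fromstring(html).find('body')
from_html = html_body.text_content()

# Plain etree elements do not, but joining itertext() gives the same string.
etree_body = etree.fromstring(html)[0]
has_method = hasattr(etree_body, 'text_content')
from_etree = ''.join(etree_body.itertext())

print(from_html)
print(has_method)
print(from_etree)
```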
 
5.  XPath support
 
All div elements:
 
>>> for ele in dom.xpath('//div'):
...     print(ele.tag)
... 
div
div
 
The element with id="1":
 
>>> dom.xpath('//*[@id="1"]')[0].tag
'body'
 
The first div under body (XPath positions start at 1):
 
>>> dom.xpath('body/div[1]')[0].tag
'div'
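A few more query shapes, sketched against the same sample document: selecting attribute values, filtering on text, calling an XPath function, and passing a variable into the expression. The variable name `wanted` is an arbitrary choice for the example.

```python
import lxml.etree as etree

html = '<html><body id="1">abc<div>123</div>def<div>456</div>ghi</body></html>'
dom = etree.fromstring(html)

# Selecting an attribute returns its values as strings.
ids = dom.xpath('//body/@id')

# XPath positions are 1-based, unlike Python indexes.
second_div_text = dom.xpath('body/div[2]')[0].text

# Predicates can test an element's text.
matches = dom.xpath('//div[text()="456"]')

# XPath functions return plain values; count() yields a float.
div_count = dom.xpath('count(//div)')

# Keyword arguments become XPath variables, which avoids string formatting.
by_id = dom.xpath('//*[@id=$wanted]', wanted='1')

print(ids, second_div_text, len(matches), div_count, by_id[0].tag)
```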
Original article: https://www.cnblogs.com/arhatlohan/p/4217055.html