Using XPath to extract the text inside every tag, even when the tag names differ
# -*- coding: utf8 -*-
from lxml import etree

html = '''
<!DOCTYPE html>
<html>
<head lang="en">
    <meta charset="UTF-8">
    <title>测试-常规用法</title>
</head>
<body>
    <div id="content">
        <ul id="useful">
            <li>我</li>
            <ml>是</ml>
            <li>谁</li>
        </ul>
        <ul id="useless">
            <li>who </li>
            <li>am </li>
            <li>i!</li>
        </ul>
    </div>
    <div id="content">
        <ul id="useful"><li>你</li><ml>是</ml><li>谁!</li>
        </ul>
        <ul id="useless"><li>who </li><li>you </li><li>are!</li>
        </ul>
    </div>

</body>
</html>
'''

selector = etree.HTML(html)

# XPath positions are 1-based: visit the first and second div[@id="content"].
for k in range(1, 3):
    # //text() returns every text node under the <ul>, whatever the child
    # tag is called, so both <li> and <ml> contribute to the result.
    chinese = selector.xpath('//div[@id="content"][%s]/ul[@id="useful"]//text()' % k)
    chinese_text = "".join(chinese)
    english = selector.xpath('//div[@id="content"][%s]/ul[@id="useless"]//text()' % k)
    english_text = "".join(english)
    # The joined strings still contain the whitespace between the tags.
    print(chinese_text)
    print(english_text)
Result:
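As a variation on the script above, the following is a minimal sketch, not the original author's method, that uses the XPath `string(.)` function to get the concatenated text of a node and then collapses the whitespace between tags. The helper name `text_of`, its `sep` parameter, and the shortened `HTML_DOC` sample are assumptions introduced here for illustration only.

```python
from lxml import etree

# Shortened sample (an assumption for this sketch), modelled on the first
# <div> in the script above. Note the non-standard <ml> tag.
HTML_DOC = '''
<div id="content">
    <ul id="useful">
        <li>我</li>
        <ml>是</ml>
        <li>谁</li>
    </ul>
    <ul id="useless">
        <li>who </li>
        <li>am </li>
        <li>i!</li>
    </ul>
</div>
'''


def text_of(node, sep=""):
    """Hypothetical helper: return all text under `node`, rejoined with `sep`.

    string(.) concatenates every descendant text node regardless of tag
    name; split() then drops the newlines and indentation between tags.
    """
    raw = node.xpath('string(.)')
    return sep.join(raw.split())


selector = etree.HTML(HTML_DOC)
useful = selector.xpath('//ul[@id="useful"]')[0]
useless = selector.xpath('//ul[@id="useless"]')[0]

chinese_text = text_of(useful)             # no separator for Chinese text
english_text = text_of(useless, sep=" ")   # single spaces between words

print(chinese_text)
print(english_text)
```

Here `string(.)` replaces the `//text()` plus `"".join(...)` step, and the `split()` call is what removes the layout whitespace. Pass `sep=" "` for space-delimited languages so that words are not run together.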