Python HTMLParser

1. Overview

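If we want to write a search engine, the first step is to use a crawler to fetch the pages of the target site,
and the second step is to parse each HTML page to find out whether its content is news, images, or video.

Assuming the first step is already done, how should we parse the HTML in the second step?

HTML looks much like XML, but it is not a subset of XML and its syntax is far less strict, so a standard XML DOM or SAX parser will often fail on real-world HTML.
Besides the HTMLParser class covered in the next section, you can also use regular expressions (the re module)
or the CSS selectors and XPath expressions available in the Scrapy framework.

Which one to choose is a matter of preference.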
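
To make the strictness point above concrete, here is a minimal sketch that feeds the same snippet to a strict XML parser and to HTMLParser. The sample string and the helper names (parses_as_xml, TagCollector, collect_tags) are made up for illustration; only xml.etree.ElementTree and html.parser come from the standard library.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Ordinary HTML, but not well-formed XML: <br> is never closed.
SAMPLE = "<p>first line<br>second line</p>"


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(text):
    """Run TagCollector over text and return the tag names it saw."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    print(parses_as_xml(SAMPLE))   # the strict XML parser rejects the snippet
    print(collect_tags(SAMPLE))    # HTMLParser still reports both tags
```

The XML parser gives up at the unclosed br tag, while HTMLParser simply reports the p and br start tags and carries on.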

2. HTMLParser

# To use it, define a class that inherits from HTMLParser and override
# the handler methods you need:
#   handle_starttag(tag, attrs)
#   handle_startendtag(tag, attrs)
#   handle_endtag(tag)
#
# tag is the HTML tag name; attrs is a list of (name, value) tuples.
# HTMLParser converts tag names and attribute names to lowercase;
# attribute values are left unchanged.

from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are of interest here.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href":
                self.links.append(value)


if __name__ == "__main__":
    html_code = """ <a href="www.google.com"> google.com</a> <A Href="www.pythonclub.org"> PythonClub </a> <A HREF = "www.sina.com.cn"> Sina </a> """
    hp = MyHTMLParser()
    hp.feed(html_code)
    hp.close()
    print(hp.links)

# Output:
# ['www.google.com', 'www.pythonclub.org', 'www.sina.com.cn']
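
The link collector above only overrides handle_starttag. As a rough sketch of how the three handlers listed in the comments relate to each other, the class below records every callback in order. EventLogger, log_events, and the sample markup are hypothetical names chosen for this illustration, not part of the original post.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record each handler call as a tuple so the call order is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style self-closing tags such as <img ... />.
        # Without this override, the base class would call
        # handle_starttag followed by handle_endtag instead.
        self.events.append(("startend", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_events(html):
    """Parse html and return the list of recorded handler calls."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in log_events('<DIV Class="Box"><IMG SRC="Logo.PNG"/></DIV>'):
        print(event)
```

For this input the recorded events are a start for div with [('class', 'Box')], a startend for img with [('src', 'Logo.PNG')], and an end for div. The tag and attribute names come back in lowercase, while the values Box and Logo.PNG keep their original case.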

3. Summary

Personal opinion: if you are building a search engine, I would still recommend using the Scrapy framework.

Reference: https://www.cnblogs.com/mfryf/p/3691563.html

Original post: https://www.cnblogs.com/jum-bolg/p/11094743.html