Part 1: First Look at Web Scrapers
All of the examples use Python 3.
A simple example:
from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())
Python 2.x has the urllib2 library; in Python 3.x, urllib2 was renamed urllib and split into submodules: urllib.request, urllib.parse, and urllib.error.
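A quick sketch of where the pieces landed (all three are standard-library modules; nothing here touches the network):

```python
from urllib.request import urlopen   # fetching URLs (was urllib2.urlopen)
from urllib.parse import urlparse    # splitting / joining URLs
from urllib.error import HTTPError   # exceptions raised by urlopen

# urlparse breaks a URL into components without making a request
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.netloc)  # pythonscraping.com
print(parts.path)    # /pages/page1.html
```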
Part 2: BeautifulSoup
1. Using BeautifulSoup
Note: 1. Install the module with pip install beautifulsoup4
2. Build a reliable network connection that handles the exceptions the program may raise
For example:
from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bsobj = BeautifulSoup(html.read(), "html.parser")
        title = bsobj.body.h1
    except AttributeError as e:
        return None
    return title

title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("title was not found")
else:
    print(title)
2. A scraper can select specific content by the value of a tag's class attribute
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/warandpeace.html")
bsobj = BeautifulSoup(html, "html.parser")
# use findAll on the bsobj object to extract every span tag whose class attribute is "red"
contentList = bsobj.findAll("span", {"class": "red"})
for content in contentList:
    print(content.get_text())
    print(' ')
3. Using the navigation tree
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")

# find the child tags of the gift table
for child in bsobj.find("table", {"id": "giftList"}).children:
    print(child)

# find the sibling tags that follow the table's first row
for sibling in bsobj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)

for h2title in bsobj.findAll("h2"):
    print(h2title.get_text())

# walk up to an img tag's parent, then to that parent's previous sibling
print(bsobj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
4. Regular expressions and BeautifulSoup
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")
# findAll returns a list of the img tags whose src matches the pattern
# (the dots are escaped so they match literal "." characters)
images = bsobj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])