BeautifulSoup 爬虫

一安装BeautifulSoup

安装Python的包管理器pip 然后运行

$pip3 install beautifulsoup

在终端里导入它测试下是否安装成功

>>>from bs import BeautifulSoup

如果没有错误，说明导入成功了

简单例子 http://sc.chinaz.com/biaoqing/baozou.html 爬取图片

代码如下

from urllib.request import urlopen
from urllib.error import HTTPError,URLError
from bs4 import BeautifulSoup
import re
import warnings
warnings.filterwarnings("ignore")
def getTitle(url):
    list =[];
    try:
       html=urlopen(url);
    except (HTTPError,URLError) as e:
        return None;
    try:
        bsObj = BeautifulSoup(html)
        a=bsObj.findAll("img",{"src":re.compile("http://.*jpg|png|jpeg|tiff|raw|bmp|gig")});
        for i in a:
            if i['src']!="":
               list.append(i['src']);
    except AttributeError as e:
        return None;

    return list;
# a=getTitle(url)
# print(a)

def getHread(is_urls):
    list=[];
    try:
        html = urlopen(is_urls);
    except (HTTPError, URLError) as e:
        return None;
    try:
        bsObj = BeautifulSoup(html)
        tables=bsObj.findAll("a")

        for i in tables:
            if "href" in i.attrs:
               list.append(i.attrs['href']);

             #print(getTitle(i.attrs['href']));
        temp=set(list);
        for d in temp:
            print(getTitle(d));
    except AttributeError as e:
        return None;
    #return list;
is_ulrs="http://sc.chinaz.com/biaoqing/baozou.html";
a=getHread(is_ulrs)
print(a)
##################运行结果****************************** 
没有具体需求 只是简单的例子 只是处理了重复返回的图片用到set集合 运行的速度有点慢 没有时间优化 等有时间一定好好写写。

相关阅读:
【Linux】ZeroMQ 在 centos下的安装
ZeroMQ下载、编译和使用
在Linux系统上安装Git
Linux下python2.7安装pip
[Tyvj1474]打鼹鼠
[BZOJ2908]又是nand
[SPOJ375]Qtree
浅谈算法——树链剖分
[BZOJ5368/Pkusc2018]真实排名
[FJOI2007]轮状病毒

原文地址：https://www.cnblogs.com/wxc1/p/6130079.html