• A BeautifulSoup Web Scraper


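    1. Installing BeautifulSoup

    Install pip, Python's package manager, then run:

    $ pip3 install beautifulsoup4

    Import it in a Python shell to check that the installation worked:

    >>> from bs4 import BeautifulSoup

    If no error is raised, the import succeeded.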
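
    Beyond the bare import, a slightly fuller check is to parse a small document. The sketch below is only an illustration: the HTML string and the example.com address are made up, and it assumes the built-in "html.parser" backend is good enough for this test.

```python
from bs4 import BeautifulSoup

# A tiny inline document (hypothetical), so no network access is needed.
html = "<html><body><h1>Hello</h1><img src='http://example.com/a.jpg'></body></html>"

# "html.parser" ships with Python, so no extra parser package is required.
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1.get_text()         # text inside the first <h1>
image_src = soup.find("img")["src"]  # src attribute of the first <img>

print(heading)
print(image_src)
```

    If both lines print without an exception, bs4 is installed and able to parse.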

    A simple example: scraping images from http://sc.chinaz.com/biaoqing/baozou.html

    The code is as follows:

    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError
    from urllib.parse import urljoin
    from bs4 import BeautifulSoup
    import re

    # Image URLs ending in a known extension. The alternatives are grouped so
    # that the scheme and the dot apply to every extension.
    IMG_PATTERN = re.compile(r"^https?://.*\.(jpg|jpeg|png|tiff|raw|bmp|gif)$")


    def getTitle(url):
        """Return the image URLs found on the page at url, or None on failure."""
        images = []
        try:
            html = urlopen(url)
        except (HTTPError, URLError, ValueError):
            # ValueError is raised for strings that have no URL scheme.
            return None
        try:
            # Naming the parser avoids bs4's "no parser specified" warning.
            bsObj = BeautifulSoup(html, "html.parser")
            for img in bsObj.find_all("img", {"src": IMG_PATTERN}):
                if img["src"] != "":
                    images.append(img["src"])
        except AttributeError:
            return None
        return images


    def getHread(is_urls):
        """Print the image URLs of every page linked from is_urls."""
        links = []
        try:
            html = urlopen(is_urls)
        except (HTTPError, URLError):
            return None
        try:
            bsObj = BeautifulSoup(html, "html.parser")
            for a in bsObj.find_all("a"):
                if "href" in a.attrs:
                    # Resolve relative links against the start page.
                    links.append(urljoin(is_urls, a.attrs["href"]))
            # A set removes duplicate links so each page is fetched only once.
            for link in set(links):
                print(getTitle(link))
        except AttributeError:
            return None


    is_urls = "http://sc.chinaz.com/biaoqing/baozou.html"
    getHread(is_urls)
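
    A note on the image filter: in Python's re module, "|" has the lowest precedence, so a pattern written as a flat list of extensions treats each extension as a complete alternative on its own. BeautifulSoup, as far as I know, tests attribute values with a search rather than a full match, so such a pattern accepts any src that merely contains "png" or "raw". The sketch below compares an ungrouped and a grouped pattern. The candidate URLs are made up for illustration.

```python
import re

# Ungrouped: "|" splits the whole pattern, so "png" alone is one alternative.
loose = re.compile(r"http://.*jpg|png|jpeg|tiff|raw|bmp|gif")

# Grouped and anchored: scheme and dot apply to every extension.
strict = re.compile(r"^https?://.*\.(jpg|jpeg|png|tiff|raw|bmp|gif)$")

# Hypothetical candidate values for an <img src="..."> attribute.
candidates = [
    "http://example.com/a.jpg",
    "/static/png-sprites/index.html",   # contains "png" but is not an image
    "http://example.com/drawing.html",  # "drawing" contains "raw"
]

loose_hits = [u for u in candidates if loose.search(u)]
strict_hits = [u for u in candidates if strict.search(u)]

print(len(loose_hits), "matched by the ungrouped pattern")
print(len(strict_hits), "matched by the grouped pattern")
```

    The ungrouped pattern accepts all three candidates, while the grouped one keeps only the real image URL.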
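    ################## Run results ##################
    There was no specific requirement here; this is only a simple example. The only extra handling is a set to remove the duplicates that would otherwise be returned. It runs rather slowly, and I have not had time to optimize it. I will write it up properly when I do.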
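
    The de-duplication step can be sketched on its own, without any network access. The version below is a minimal illustration and not the code used above: unique_links is a hypothetical helper, the href values are invented, and it assumes relative links should be resolved against the start page before comparing. Unlike a plain set(), it also keeps the first-seen order.

```python
from urllib.parse import urljoin

base_url = "http://sc.chinaz.com/biaoqing/baozou.html"

# Hypothetical href values as they might appear in <a> tags on the page.
hrefs = [
    "/biaoqing/a.html",
    "b.html",
    "/biaoqing/a.html",                      # exact duplicate
    "http://sc.chinaz.com/biaoqing/b.html",  # same page as "b.html"
]


def unique_links(base, hrefs):
    """Resolve each href against base and drop duplicates, keeping order."""
    seen = set()
    ordered = []
    for href in hrefs:
        absolute = urljoin(base, href)
        if absolute not in seen:
            seen.add(absolute)
            ordered.append(absolute)
    return ordered


links = unique_links(base_url, hrefs)
for link in links:
    print(link)
```

    Resolving before comparing matters here: "b.html" and its absolute form name the same page, so four hrefs collapse into two requests, which also helps a little with the slowness mentioned above.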

  • Original article: https://www.cnblogs.com/wxc1/p/6130079.html