• python爬虫入门---第一篇:获取某一网页所有超链接


    这是一个通过使用requests和BeautifulSoup库,简单爬取网站的所有超链接的小爬虫。有任何问题欢迎留言讨论。

    import requests
    from bs4 import BeautifulSoup
    
    def getHTMLText(url):
        '''
        此函数用于获取网页的html文档
        '''
        try:
            #获取服务器的响应内容,并设置最大请求时间为6秒
            res = requests.get(url, timeout = 6)
            #判断返回状态码是否为200
            res.raise_for_status()
            #设置该html文档可能的编码
            res.encoding = res.apparent_encoding
            #返回网页HTML代码
            return res.text
        except:
            return '产生异常'
    
    def main():
        '''
        主函数
        '''
        #目标网页,这个可以换成一个你喜欢的网站
        url = 'https://www.cnblogs.com/huwt/'
    
        demo = getHTMLText(url)
    
        #解析HTML代码
        soup = BeautifulSoup(demo, 'html.parser')
    
        #模糊搜索HTML代码的所有包含href属性的<a>标签
        a_labels = soup.find_all('a', attrs={'href': True})
    
        #获取所有<a>标签中的href对应的值,即超链接
        for a in a_labels:
            print(a.get('href'))
    
    main()

     测试结果:

    https://www.cnblogs.com/huwt/
    https://www.cnblogs.com/huwt/
    https://www.cnblogs.com/
    https://www.cnblogs.com/huwt/
    https://i.cnblogs.com/EditPosts.aspx?opt=1
    https://msg.cnblogs.com/send/%E8%B7%AF%E6%BC%AB%E6%BC%AB%E6%88%91%E4%B8%8D%E7%95%8F
    https://www.cnblogs.com/huwt/rss
    https://i.cnblogs.com/
    https://www.cnblogs.com/huwt/archive/2019/04/10.html
    https://www.cnblogs.com/huwt/p/10680209.html
    https://www.cnblogs.com/huwt/p/10680209.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10680209
    https://www.cnblogs.com/huwt/p/10685968.html
    https://www.cnblogs.com/huwt/p/10685968.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10685968
    https://www.cnblogs.com/huwt/archive/2019/04/08.html
    https://www.cnblogs.com/huwt/p/10673470.html
    https://www.cnblogs.com/huwt/p/10673470.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10673470
    https://www.cnblogs.com/huwt/archive/2019/03/31.html
    https://www.cnblogs.com/huwt/p/10633896.html
    https://www.cnblogs.com/huwt/p/10633896.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10633896
    https://www.cnblogs.com/huwt/p/10632084.html
    https://www.cnblogs.com/huwt/p/10632084.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10632084
    https://www.cnblogs.com/huwt/archive/2019/03/30.html
    https://www.cnblogs.com/huwt/p/10629625.html
    https://www.cnblogs.com/huwt/p/10629625.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10629625
    https://www.cnblogs.com/huwt/archive/2019/03/25.html
    https://www.cnblogs.com/huwt/p/10597502.html
    https://www.cnblogs.com/huwt/p/10597502.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10597502
    https://www.cnblogs.com/huwt/archive/2019/03/24.html
    https://www.cnblogs.com/huwt/p/10591353.html
    https://www.cnblogs.com/huwt/p/10591353.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10591353
    https://www.cnblogs.com/huwt/archive/2019/03/16.html
    https://www.cnblogs.com/huwt/p/10540942.html
    https://www.cnblogs.com/huwt/p/10540942.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10540942
    https://www.cnblogs.com/huwt/p/10541675.html
    https://www.cnblogs.com/huwt/p/10541675.html
    https://i.cnblogs.com/EditPosts.aspx?postid=10541675
    https://www.cnblogs.com/huwt/default.html?page=2
    [Finished in 1.1s]
    蒹葭苍苍,白露为霜; 所谓伊人,在水一方。
  • 相关阅读:
    102. 教程:重装谷歌浏览器的教程
    IGBT知识普及
    [刷机资源] 荣耀8 E5 B391 V2 ROM集合 Xposed DPI调整等 N多自定义功能 Kangvip@HRT( 2017-9-28)
    ITPUB附件下载免输验证码 (实际下载地址的规则)
    花生壳内网穿透不再支持国外IP!
    golang 如何开发windows窗口界面
    golang 热重启
    强化go get命令
    go mod get go-git timeout
    golang单一职责原则接口设计例子
  • 原文地址:https://www.cnblogs.com/huwt/p/10355405.html
Copyright © 2020-2023  润新知