• beautifulsoup4: a powerful tool for web scraping


    Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/latest/

     

    Getting the text of an <a> tag

    Use contents[0] to take the first element of the tag's children

    <meta charset="UTF-8"> <!-- for HTML5 -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <html><head><title>yoyo ketang</title></head>
    <body>
    <b><!--Hey, this in comment!--></b>
    <p class="title"><b>yoyoketang</b></p>
    <p class="yoyo">This is my WeChat official account: yoyoketang
    <a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler</a>,
    <a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python</a>,
    <a href="http://www.cnblogs.com/yoyoketang/tag/selenium/" class="sister" id="link3">selenium</a>;
    Follow it now!</p>
    <p class="story">...</p>
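
    As a minimal sketch of the contents[0] approach, the snippet below parses a shortened copy of the sample markup from an inline string (an assumption, so it runs without the html123.html file) and reads the text of the first <a> tag in three ways. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page, inlined so no file is needed (assumption).
html = """
<p class="yoyo">
<a href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" class="sister" id="link1">fiddler</a>,
<a href="http://www.cnblogs.com/yoyoketang/tag/python/" class="sister" id="link2">python</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")
first_link = soup.a  # first <a> tag in the document

text_via_contents = first_link.contents[0]  # first child node of the tag
text_via_string = first_link.string         # the tag has one child, so same node
text_via_get_text = first_link.get_text()   # plain str built from all nested text

print(text_via_contents)
print(text_via_string)
print(text_via_get_text)
```

    contents[0] and .string return a NavigableString, while get_text() returns an ordinary str, which is usually the more convenient choice when a tag has nested children.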
    

      

    Use Python's open function to read this HTML file. If the content prints correctly, the file was read successfully.
    aa = open("html123.html", encoding='UTF-8')
    print(aa.read())

      

     If the "html.parser" argument is omitted, BeautifulSoup issues a warning, so pass it explicitly.
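
     To see the warning for yourself, a small sketch such as the following can record it with the standard warnings module. The inline HTML string is an assumption used to keep the example self-contained, and the check only relies on the warning being a UserWarning subclass.

```python
import warnings
from bs4 import BeautifulSoup

html = "<html><head><title>yoyo ketang</title></head></html>"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")     # record every warning, even repeats
    soup_guessed = BeautifulSoup(html)  # no parser named: bs4 has to guess one

# Naming the parser explicitly avoids the guess.
soup_explicit = BeautifulSoup(html, "html.parser")

for w in caught:
    print(w.category.__name__, "-", str(w.message).splitlines()[0])
print(soup_explicit.title.string)
```

     Naming the parser also keeps results consistent across machines, since the parser bs4 guesses depends on which ones are installed.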

      

        from bs4 import BeautifulSoup

        aa = open("html123.html", encoding='UTF-8')
        soup = BeautifulSoup(aa, "html.parser")
        print(type(soup))  # <class 'bs4.BeautifulSoup'>
        tag = soup.title
        print(type(tag))  # <class 'bs4.element.Tag'>
        print(tag)  # <title>yoyo ketang</title>
        string = tag.string
        print(type(string))  # <class 'bs4.element.NavigableString'>
        print(string)  # yoyo ketang
        comment = soup.b.string
        print(type(comment))  # <class 'bs4.element.Comment'>
        print(comment)  # Hey, this in comment!
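
    Because a Comment prints like ordinary text, it is easy to mistake one for real content. The sketch below shows one possible way to tell them apart with isinstance. The helper name visible_text is hypothetical, and the markup is a trimmed inline copy of the sample.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = '<b><!--Hey, this in comment!--></b><p class="title"><b>yoyoketang</b></p>'
soup = BeautifulSoup(html, "html.parser")


def visible_text(node):
    """Return the node's string, or None if it is missing or a comment."""
    s = node.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


first_b, second_b = soup.find_all("b")
# Comment is a subclass of NavigableString, so check for Comment first.
print(isinstance(first_b.string, NavigableString))
print(visible_text(first_b))
print(visible_text(second_b))
```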
    

      

        from bs4 import BeautifulSoup

        aa = open("html123.html", encoding='UTF-8')
        soup = BeautifulSoup(aa, "html.parser")
        tag1 = soup.head
        print(tag1)  # <head><title>yoyo ketang</title></head>
        print(tag1.name)  # head
        tag2 = soup.title
        print(tag2)  # <title>yoyo ketang</title>
        print(tag2.name)  # title
        tag3 = soup.a
        print(tag3)  # <a class="sister" href="http://www.cnblogs.com/yoyoketang/tag/fiddler/" id="link1">fiddler</a>
        print(tag3.name)  # a
        print(soup.name)  # [document]
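
    The .name attribute can also be read in bulk. As a small sketch, assuming a trimmed inline copy of the sample page, find_all(True) matches every tag so that the names can be listed in document order:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>yoyo ketang</title></head>"
        '<body><p class="title"><b>yoyoketang</b></p></body></html>')
soup = BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag; .name gives each tag's element name.
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)  # ['html', 'head', 'title', 'body', 'p', 'b']
print(soup.name)  # [document]
```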
    

      

        import requests
        from bs4 import BeautifulSoup

        url = "http://699pic.com/sousuo-218808-13-1-0-0-0.html"
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        # Find all tags with class "lazy"
        images = soup.find_all(class_="lazy")  # returns a list-like ResultSet
        for i in images:
            jpg_rl = i["data-original"]  # get the image URL
            title = i["title"]  # get the title
            print(jpg_rl)
            print(title)
            print("")
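
    Indexing a tag with i["title"] raises KeyError when the attribute is missing. The sketch below shows a more defensive variant using Tag.get. The markup is made up to imitate lazily loaded images (an assumption, not the real page), so it runs without network access.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating lazily loaded images (assumption, not the real page).
html = """
<img class="lazy" data-original="http://example.com/a.jpg" title="first">
<img class="lazy" data-original="http://example.com/b.jpg">
<img class="other" src="http://example.com/c.jpg" title="ignored">
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for img in soup.find_all(class_="lazy"):
    url = img.get("data-original")        # None instead of KeyError if missing
    title = img.get("title", "untitled")  # fall back to a default title
    if url:
        results.append((title, url))

print(results)
```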
    

      

    Use the response's .content attribute to get the image bytes

        import os
        import requests
        from bs4 import BeautifulSoup

        url = "http://699pic.com/sousuo-218808-13-1-0-0-0.html"
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "html.parser")
        # Find all tags with class "lazy"
        images = soup.find_all(class_="lazy")  # returns a list-like ResultSet
        for i in images:
            jpg_rl = i["data-original"]  # get the image URL
            title = i["title"]  # get the title
            print(jpg_rl)
            print(title)
            print("")
            # Save into the jpg folder under the current directory (it must already exist)
            with open(os.path.join(os.getcwd(), "jpg", title + ".jpg"), "wb") as f:
                f.write(requests.get(jpg_rl).content)
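
    Titles scraped from a page may contain characters that are not allowed in file names, and the target folder may not exist yet. A rough sketch of one way to handle both is shown below. The helpers safe_filename and save_bytes are hypothetical names, and fake bytes stand in for a downloaded image so the example runs offline.

```python
import os
import re
import tempfile


def safe_filename(title, default="image"):
    """Replace characters that are invalid in Windows or Unix file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or default


def save_bytes(data, folder, title):
    """Write data to <folder>/<title>.jpg, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, safe_filename(title) + ".jpg")
    with open(path, "wb") as f:
        f.write(data)
    return path


# Demo with fake bytes in a temporary directory (no network access needed).
demo_dir = os.path.join(tempfile.mkdtemp(), "jpg")
saved = save_bytes(b"fake image bytes", demo_dir, "cat/dog: photo?")
print(os.path.basename(saved))  # cat_dog_ photo_.jpg
```

    In the scraping loop, the data argument would be requests.get(jpg_rl).content.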

      

  • Original article: https://www.cnblogs.com/zhangbao003/p/8917269.html