• Simple hands-on practice


    0x00 Fetching a seven-day weather forecast for a city


    Open the China Weather site (weather.com.cn), look up the weather for any location, and view the source of the returned page.

    All the information we need lives in the <li> tags under a <ul> tag, so the basic approach is to iterate over each <li> under that <ul>, pulling out three pieces of data each time: the date, the weather, and the temperature.
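    As a minimal sketch, the markup is assumed to look roughly like the fragment below (hypothetical; the real page may differ in detail), and the selectors used later pick the three fields out of each <li> like this:

    from bs4 import BeautifulSoup

    # Hypothetical fragment mimicking the forecast page's structure
    html = """
    <ul class="t clearfix">
      <li>
        <h1>23日（今天）</h1>
        <p class="wea">多云</p>
        <p class="tem"><span>8</span><i>2℃</i></p>
      </li>
    </ul>
    """

    soup = BeautifulSoup(html, "html.parser")
    for li in soup.select("ul[class='t clearfix'] li"):
        date = li.select("h1")[0].text                  # e.g. 23日（今天）
        wea = li.select("p[class='wea']")[0].text       # e.g. 多云
        temp = li.select("p[class='tem'] span")[0].text + "/" + li.select("p[class='tem'] i")[0].text
        print(date, wea, temp)                          # 23日（今天） 多云 8/2℃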

    First, define four functions that together do all the work: the first fetches the page, the second extracts the information into a list, the third prints the data, and last comes a main function.

    import requests
    from bs4 import BeautifulSoup
    
    def getHTMLText(url):                                           # fetch the page and return its text
        try:
            header={"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0"}  
            r = requests.get(url, headers=header, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:                           # return empty text on any network/HTTP error
            return ''
    
    def weather_info(html, lilist):
        soup = BeautifulSoup(html, "html.parser")
        li = soup.select("ul[class='t clearfix'] li")
        for i in li:                                                # walk the li tags, storing one row per forecast day
            try:
                date = i.select("h1")[0].text
                wea = i.select("p[class='wea']")[0].text
                temp = i.select('p[class="tem"] span')[0].text + "/" + i.select('p[class="tem"] i')[0].text
                lilist.append([date, wea, temp])
            except IndexError:                                      # skip any li missing one of the fields
                continue
    
    def printList(ulist, num):                                      # print the rows as an aligned table
        tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
        print(tplt.format("日期", "天气", "温度", chr(12288)))
        for i in range(num):
            u = ulist[i]
            print(tplt.format(u[0], u[1], u[2], chr(12288)))
    
    
    def main():
        lilist = []
        url = 'http://www.weather.com.cn/weather/101190501.shtml'
        html = getHTMLText(url)
        weather_info(html, lilist)
        printList(lilist, 7)
    
    if __name__ == '__main__':
        main()
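
    A note on the format string: chr(12288) is the full-width CJK space (U+3000). Passing it as the fill character for the middle column keeps the table aligned when a field contains Chinese characters, which occupy a double-width cell in most terminals. A quick illustration with a hypothetical row:

    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("日期", "天气", "温度", chr(12288)))   # header row
    print(tplt.format("23日", "多云", "8/2℃", chr(12288)))  # hypothetical data row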

     

    0x01 Single-threaded download of all images on a page


    Target site: http://www.weather.com.cn/weather/101190501.shtml

    Inspect the source to work out the approach: select every img tag; each image URL is the value of that tag's src attribute. Collect the URLs into a list, then iterate over the list and download each image. (A more robust variant that also handles relative src values is sketched after the script.)

    import os

    from bs4 import BeautifulSoup
    import requests
    
    def getHTMLText(url):
        try:
            header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0"}
            r = requests.get(url, headers=header, timeout=30)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:                           # return empty text on any network/HTTP error
            return ''
    
    def PictureUrlList(html, urls):
        soup = BeautifulSoup(html, "html.parser")
        img = soup.select('img')
        for i in img:                                               # collect the src of every img tag
            try:
                src = i['src']
                urls.append(src)
            except KeyError:                                        # skip img tags without a src attribute
                continue
    
    
    def downloader(urls):
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0"}
        os.makedirs('D:/PIC', exist_ok=True)                        # make sure the target directory exists
        for u in range(len(urls)):                                  # download each image in turn
            try:
                r = requests.get(urls[u], headers=header, timeout=30)
                path = 'D:/PIC/' + str(u) + '.png'
                with open(path, 'wb') as f:
                    f.write(r.content)
            except Exception:                                       # skip images that fail to download
                continue
    
    def main():
        urls = []
        url = "http://www.weather.com.cn/weather/101190501.shtml"
        html = getHTMLText(url)
        PictureUrlList(html, urls)
        downloader(urls)
    
    if __name__ == '__main__':
        main()
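
    One caveat: downloader assumes every src is an absolute URL. If the page uses relative paths or protocol-relative //... URLs, requests.get will fail on them. Below is a minimal variant (illustrative, not from the original post; the helper name is hypothetical) that normalizes each src against the page URL with urllib.parse.urljoin:

    import os
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def download_images(page_url, save_dir='pics'):     # hypothetical helper, not from the original post
        header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0"}
        html = requests.get(page_url, headers=header, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        os.makedirs(save_dir, exist_ok=True)            # create the target directory if needed
        for n, img in enumerate(soup.select("img[src]")):
            src = urljoin(page_url, img["src"])         # resolve relative and protocol-relative URLs
            try:
                r = requests.get(src, headers=header, timeout=30)
                r.raise_for_status()
            except requests.RequestException:
                continue                                # skip images that fail to download
            with open(os.path.join(save_dir, str(n) + '.png'), 'wb') as f:
                f.write(r.content)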