• University Rankings


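    A Python crawler that extracts content from a page's source

    The requests library fetches the source:

      r = requests.get(url) returns a Response object; r.text holds the source of the page at that URL

      r.raise_for_status() checks whether the request succeeded: it does nothing for a successful status such as 200, and raises requests.HTTPError for 4xx and 5xx codes, which the except clause then catches

      r.encoding = r.apparent_encoding switches the decoding to the encoding guessed from the page content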
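    To see which status codes raise_for_status() actually rejects, here is a minimal sketch that needs no network access. It builds a requests.Response by hand purely for illustration (real code gets one from requests.get), and check_status is a hypothetical helper name, not part of requests.

```python
import requests


def check_status(status_code):
    """Return True if raise_for_status() accepts the code, False if it raises."""
    # Hand-built Response, for illustration only; normally requests.get() returns it.
    resp = requests.Response()
    resp.status_code = status_code
    try:
        resp.raise_for_status()
        return True
    except requests.HTTPError:
        return False


# 2xx codes pass; 4xx and 5xx codes raise requests.HTTPError.
for code in (200, 204, 404, 500):
    print(code, check_status(code))
```

    Because requests.HTTPError is a subclass of requests.RequestException, catching the latter also covers timeouts and connection failures.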

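    BeautifulSoup parses the HTML source:

      Song Tian of Beijing Institute of Technology explains this well in his course on China University MOOC (icourse163):

      http://www.icourse163.org/learn/BIT-1001870001?tid=1001962001#/learn/content?type=detail&id=1002702161&cid=1003064638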
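    As a rough sketch of the parsing step, the snippet below walks the rows of a table body and collects the cell text. SAMPLE_HTML is a made-up two-row table with placeholder values, and extract_rows is a hypothetical helper name; the real ranking page may be structured differently.

```python
import bs4
from bs4 import BeautifulSoup

# Made-up table with placeholder values, for illustration only.
SAMPLE_HTML = """
<table><tbody>
<tr><td>1</td><td>University A</td><td>Province X</td></tr>
<tr><td>2</td><td>University B</td><td>Province Y</td></tr>
</tbody></table>
"""


def extract_rows(html):
    """Return one list of cell strings per <tr> inside the first <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("tbody").children:
        # .children also yields the whitespace text nodes between rows; skip them.
        if isinstance(tr, bs4.element.Tag):
            # Calling a tag, tr("td"), is shorthand for tr.find_all("td").
            rows.append([td.string for td in tr("td")])
    return rows


print(extract_rows(SAMPLE_HTML))
```

    The isinstance check matters because the newlines between the <tr> elements show up as NavigableString children, which have no <td> cells to read.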

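    The program relies heavily on formatted output with .format(), and pads Chinese text with chr(12288), the full-width (ideographic) space, so that the columns line up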
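    The effect of the fill character can be sketched in isolation. pad_center is a hypothetical helper written for this illustration; it passes the fill character and width as nested replacement fields in the format spec.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, the ideographic (full-width) space


def pad_center(text, width, fill=" "):
    """Center text in a field of `width` characters, padded with `fill`."""
    # Nested fields: {1} supplies the fill character, {2} the field width.
    return "{0:{1}^{2}}".format(text, fill, width)


ascii_padded = pad_center("学校名称", 10)
wide_padded = pad_center("学校名称", 10, FULLWIDTH_SPACE)

# Both strings are 10 characters long, but only the second is padded with
# characters as wide as the Chinese glyphs, so it keeps columns aligned.
print("[" + ascii_padded + "]")
print("[" + wide_padded + "]")
```

    With the default ASCII space, each padding character is roughly half as wide as a Chinese character in most monospaced fonts, which is why the name column drifts unless the full-width space is used.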

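    import requests
    import bs4
    from bs4 import BeautifulSoup

    def Gethtml(url):
      """Fetch url and return the page text, or "" if the request fails."""
      try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
        r.encoding = r.apparent_encoding  # decode with the encoding guessed from the content
        return r.text
      except requests.RequestException:
        return ""

    def BaocunList(ulist, html):
      """Append the first five cells of every table row to ulist."""
      soup = BeautifulSoup(html, "html.parser")
      tbody = soup.find('tbody')
      if tbody is None:  # empty or unexpected page: nothing to parse
        return
      for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):  # only process tags; skip whitespace text nodes
          tds = tr('td')
          if len(tds) >= 5:
            # .string is None for cells with nested markup, so fall back to ""
            ulist.append([td.string or "" for td in tds[:5]])

    def PrintList(ulist, num):
      """Print the first num rows as an aligned table."""
      # {5} is the fill character for the name column: chr(12288), the full-width space
      tplt = "{0:^10} {1:{5}^10} {2:^10} {3:^10} {4:^10}"
      # Headers: rank, university name, province/city, total score, intake quality (gaokao score)
      print(tplt.format("排名","学校名称","省市","总分","生源质量(高考成绩)",chr(12288)))
      for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows were parsed
        print(tplt.format(u[0],u[1],u[2],u[3],u[4],chr(12288)))

    def main():
      url = 'http://zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
      html = Gethtml(url)
      ulist = []
      BaocunList(ulist, html)
      PrintList(ulist, 50)

    if __name__ == "__main__":
      main()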

  • Original article: https://www.cnblogs.com/tianxxl/p/7655558.html