• Getting a web page title with Python 2


    Getting a web page title with Python

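    This uses urllib2 and lxml on Python 2.x, which should also be faster than BeautifulSoup4 (come to think of it, why does everyone reach for BS4 when a single XPath expression does the job?).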

    If you have not installed lxml yet, install it with pip:

    pip install lxml
    

    Interactive shell demo:

    >>> from lxml import etree
    >>> import urllib2
    >>> page = etree.HTML(urllib2.urlopen('https://blog.csdn.net/z690798364/article/details/79960358').read().decode('utf-8'))
    >>> print page.xpath(u"/html/head/title")[0].text
    Lxml 解析网页用法笔记 - z690798364的专栏 - CSDN博客
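
The XPath lookup from the demo can also be tried offline on an inline HTML string. The following is a minimal sketch, assuming Python 3 with lxml installed; the helper name `extract_title` and the sample markup are made up for illustration and do not come from the original post.

```python
from lxml import etree


def extract_title(html_text):
    """Return the <title> text of an HTML string, or None if there is none.

    extract_title is a hypothetical helper name used only for this sketch.
    """
    # etree.HTML parses the string with lxml's lenient HTML parser
    # and returns the root <html> element.
    root = etree.HTML(html_text)
    if root is None:
        # Defensive check in case no element tree could be built.
        return None
    # xpath() returns a list of matching elements; it is empty
    # when the document has no /html/head/title node.
    matches = root.xpath("/html/head/title")
    if not matches:
        return None
    return matches[0].text


sample = "<html><head><title>Example page</title></head><body><p>hi</p></body></html>"
print(extract_title(sample))  # prints: Example page
```

Checking for an empty match list before indexing avoids the IndexError that `[0]` would raise on a page without a title.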
    

    The same logic wrapped in a function:

    from lxml import etree
    import urllib2
    #...
    def get_site_title(link):
        send_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Connection': 'keep-alive'
        }  # spoof browser-like headers
        try:  # on any failure, fall back to returning the link itself
            html = urllib2.urlopen(urllib2.Request(link, headers=send_headers)).read().decode('utf-8')
            title = etree.HTML(html).xpath("/html/head/title")[0].text
        except Exception:
            return link
        return title  # title is already the text, so no further indexing is needed
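
On Python 3, `urllib2` no longer exists; its functionality lives in `urllib.request`. A rough port of the function above might look like the sketch below. The name `get_site_title_py3`, the `timeout` parameter, and the `data:` URL used for the offline check are assumptions added for illustration, not part of the original post.

```python
import urllib.request
from urllib.parse import quote

from lxml import etree

# Browser-like headers, copied from the Python 2 version.
SEND_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}


def get_site_title_py3(link, timeout=10):
    """Return the page title for link, or link itself if anything goes wrong.

    Hypothetical Python 3 port; the timeout default is an arbitrary choice.
    """
    try:
        request = urllib.request.Request(link, headers=SEND_HEADERS)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Like the original, this assumes the page is UTF-8 encoded.
            html = response.read().decode("utf-8")
        title = etree.HTML(html).xpath("/html/head/title")[0].text
    except Exception:
        # Bad URLs, network errors, decoding errors, and a missing <title>
        # (IndexError) all fall back to returning the link unchanged.
        return link
    return title


# Offline check: urllib.request can open data: URLs, so no network is needed.
page = "<html><head><title>Demo title</title></head><body></body></html>"
data_url = "data:text/html;charset=utf-8," + quote(page)
print(get_site_title_py3(data_url))
print(get_site_title_py3("not-a-valid-url"))
```

Run as-is, this should print `Demo title` followed by `not-a-valid-url`, the second line showing the fallback path when the URL cannot be opened.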
    
  • Original article: https://www.cnblogs.com/santiego/p/10328428.html