• Python Study Notes (41): Built-in Modules (10) urllib


    Excerpted from: https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/001432688314740a0aed473a39f47b09c8c7274c9ab6aee000

    Get

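    The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.

    For example, fetch the Douban URL https://api.douban.com/v2/book/2129650 and print the response: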

    from urllib import request
    
    with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
        data = f.read()
        print('Status:', f.status, f.reason)
        for k, v in f.getheaders():
            print('%s: %s' % (k, v))
        print('Data:', data.decode('utf-8'))

    The response headers and the JSON data are printed:

    Status: 200 OK
    Date: Sun, 03 Sep 2017 08:41:22 GMT
    Content-Type: application/json; charset=utf-8
    Content-Length: 2058
    Connection: close
    Vary: Accept-Encoding
    X-Ratelimit-Remaining2: 97
    X-Ratelimit-Limit2: 100
    Expires: Sun, 1 Jan 2006 01:00:00 GMT
    Pragma: no-cache
    Cache-Control: must-revalidate, no-cache, private
    Set-Cookie: bid=dDzHhyeuVQ0; Expires=Mon, 03-Sep-18 08:41:22 GMT; Domain=.douban.com; Path=/
    X-DOUBAN-NEWBID: dDzHhyeuVQ0
    X-DAE-Node: sindar15a
    X-DAE-App: book
    Server: dae
    Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰"],"pubdate":"2007","tags":[{"count":21,"name":"spring","title":"spring"},{"count":13,"name":"Java","title":"Java"},{"count":6,"name":"javaee","title":"javaee"},{"count":5,"name":"j2ee","title":"j2ee"},{"count":4,"name":"计算机","title":"计算机"},{"count":3,"name":"藏书","title":"藏书"},{"count":3,"name":"编程","title":"编程"},{"count":3,"name":"POJO","title":"POJO"}],"origin_title":"","image":"https://img3.doubanio.com/mpic/s2552283.jpg","binding":"平装","translator":[],"catalog":"","pages":"509","images":{"small":"https://img3.doubanio.com/spic/s2552283.jpg","large":"https://img3.doubanio.com/lpic/s2552283.jpg","medium":"https://img3.doubanio.com/mpic/s2552283.jpg"},"alt":"https://book.douban.com/subject/2129650/","id":"2129650","publisher":"电子工业出版社","isbn10":"7121042622","isbn13":"9787121042621","title":"Spring 2.0核心技术与最佳实践","url":"https://api.douban.com/v2/book/2129650","alt_title":"","author_intro":"","summary":"本书注重实践而又深入理论,由浅入深且详细介绍了Spring 2.0框架的几乎全部的内容,并重点突出2.0版本的新特性。本书将为读者展示如何应用Spring 2.0框架创建灵活高效的JavaEE应用,并提供了一个真正可直接部署的完整的Web应用程序——Live在线书店(http://www.livebookstore.net)。
    在介绍Spring框架的同时,本书还介绍了与Spring相关的大量第三方框架,涉及领域全面,实用性强。本书另一大特色是实用性强,易于上手,以实际项目为出发点,介绍项目开发中应遵循的最佳开发模式。
    本书还介绍了大量实践性极强的例子,并给出了完整的配置步骤,几乎覆盖了Spring 2.0版本的新特性。
    本书适合有一定Java基础的读者,对JavaEE开发人员特别有帮助。本书既可以作为Spring 2.0的学习指南,也可以作为实际项目开发的参考手册。","price":"59.8"}
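
    Since the body printed above is JSON, it can be parsed into a Python dict with the standard json module. The following is a minimal sketch, assuming the server declares its charset in the Content-Type header (UTF-8 is used as a fallback). fetch_json is a hypothetical helper name, not part of urllib:

```python
import json
from urllib import request


def fetch_json(url, timeout=10):
    """Fetch a URL and parse the body as JSON (hypothetical helper)."""
    with request.urlopen(url, timeout=timeout) as f:
        # f.headers is an email.message.Message; use its charset if declared
        charset = f.headers.get_content_charset() or 'utf-8'
        return json.loads(f.read().decode(charset))
```

    Assuming the endpoint still answers as shown above, fetch_json('https://api.douban.com/v2/book/2129650')['title'] would return the book title.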

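    To simulate a browser sending a GET request, use a Request object. By adding HTTP headers to the Request object, we can disguise the request as coming from a browser. For example, simulate an iPhone 6 requesting the Douban home page: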

    from urllib import request
    
    # To simulate a browser sending a GET request, use a Request object
    req = request.Request('http://www.douban.com/')
    req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
    with request.urlopen(req) as f:
        print('Status:', f.status, f.reason)
        for k, v in f.getheaders():
            print('%s: %s' % (k, v))
        print('Data:', f.read().decode('utf-8'))

    Douban then returns the mobile version of the page (excerpt):

    .......
    <meta name="viewport" content="width=device-width, height=device-height, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0"> <meta name="format-detection" content="telephone=no"> <link rel="canonical" href="https://m.douban.com/"> <link href="https://img3.doubanio.com/f/talion/3c45a4b3705e30953879f6078082cbd1b9f88858/css/card/base.css" rel="stylesheet">
    .......
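
    Headers can also be passed to the Request constructor and inspected before anything is sent. A small sketch, where build_mobile_request is a hypothetical helper name and no network access takes place:

```python
from urllib import request

# User-Agent string taken from the example above
IPHONE_UA = ('Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) '
             'AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 '
             'Mobile/10A5376e Safari/8536.25')


def build_mobile_request(url):
    """Return a Request carrying the iPhone User-Agent (hypothetical helper)."""
    return request.Request(url, headers={'User-Agent': IPHONE_UA})


req = build_mobile_request('http://www.douban.com/')
# Request stores header names capitalized, so the lookup key is 'User-agent'
print(req.get_method())              # GET
print(req.get_header('User-agent'))
```

    Checking a request this way is a cheap sanity test before handing it to request.urlopen(req).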

    Post

    To send a request with POST, just pass the data parameter as bytes.
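
    What "as bytes" means in practice can be checked without contacting any server. In this sketch the form fields and the example.com URL are placeholders chosen for illustration:

```python
from urllib import request, parse

# Hypothetical form fields, for illustration only
fields = [('username', 'user@example.com'), ('password', 'p w')]

# urlencode() returns a str; urlopen() and Request expect bytes for data
body = parse.urlencode(fields).encode('utf-8')
print(body)  # b'username=user%40example.com&password=p+w'

# example.com is a placeholder; nothing is sent until urlopen() is called
req = request.Request('https://example.com/login', data=body)
print(req.get_method())  # POST
```

    A Request that carries data reports POST as its method, which is why no explicit method argument is needed.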

    We simulate a Weibo login: first read the login email and password, then pass them in, encoded as username=xxx&password=xxx, following the format of the weibo.cn login page:

    from urllib import request, parse
    
    print('Login to weibo.cn......')
    email = input('Email: ')
    passwd = input('Password: ')
    login_data = parse.urlencode([
        ('username', email),
        ('password', passwd),
        ('entry', 'mweibo'),
        ('client_id', ''),
        ('savestate', '1'),
        ('ec', ''),
        ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
    ])
    
    req = request.Request('https://passport.weibo.cn/sso/login')
    req.add_header('Origin', 'http://passport.weibo.cn')
    req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
    req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')
    
    with request.urlopen(req, data=login_data.encode('utf-8')) as f:
        print('State:', f.status, f.reason)
        for k, v in f.getheaders():
            print('%s: %s' % (k, v))
        print('Data:', f.read().decode('utf-8'))

    If the login succeeds, the response looks like this:

    State: 200 OK
    Server: nginx/1.6.1
    Date: Sun, 03 Sep 2017 11:31:56 GMT
    Content-Type: text/html
    Transfer-Encoding: chunked
    Connection: close
    Vary: Accept-Encoding
    Cache-Control: no-cache, must-revalidate
    Expires: Sat, 26 Jul 1997 05:00:00 GMT
    Pragma: no-cache
    Access-Control-Allow-Origin: http://passport.weibo.cn
    Access-Control-Allow-Credentials: true
    Set-Cookie: SUB=_2A250r5h8DeThGeBN7lUY9yrOzT2IHXVUUzg0rDV6PUJbkdBeLXPnkW08RRwH9G8I4bQbO4O9n3iyqeIP8g..; Path=/; Domain=.weibo.cn; Expires=Mon, 03 Sep 2018 11:31:56 GMT; HttpOnly
    Set-Cookie: SUHB=0M2Veoz3CDVYaB; expires=Monday, 03-Sep-2018 11:31:56 GMT; path=/; domain=.weibo.cn
    Set-Cookie: SCF=Ah1KXnqURq1Vwg0pcnz1J2hopmgB_WeMnJp9lOca0OIZ5xbPll3pP4EXHcrcZF3U5QuKuhvMlNKw9Vr8u3coL14.; expires=Wednesday, 01-Sep-2027 11:31:56 GMT; path=/; domain=.weibo.cn; httponly
    Set-Cookie: SSOLoginState=1504438316; path=/; domain=weibo.cn
    Set-Cookie: ALF=1507030316; expires=Tuesday, 03-Oct-2017 11:31:56 GMT; path=/; domain=.sina.cn
    DPOOL_HEADER: dryad62
    SINA-LB: aGEuMTI3LmcxLm5mamQubGIuc2luYW5vZGUuY29t
    SINA-TS: Y2ZjYTk0Y2UgMCAwIDAgOSAzODYK
    Data: {"retcode":20000000,"msg":"","data":{"loginresulturl":"https://passport.weibo.com/sso/crossdomain?entry=mweibo&action=login&proj=1&ticket=ST-NjM1Nzk3NDI2MQ%3D%3D-1504438316-gz-509C6EAFA74DA5C86B1AEB13AEB7D6B8-1&display=0&cdurl=https%3A%2F%2Flogin.sina.com.cn%2Fsso%2Fcrossdomain%3Fentry%3Dmweibo%26action%3Dlogin%26proj%3D1%26ticket%3DST-NjM1Nzk3NDI2MQ%253D%253D-1504438316-gz-46B914F433231C881EA55B8D2E8FBE98-1%26display%3D0%26cdurl%3Dhttps%253A%252F%252Fpassport.sina.cn%252Fsso%252Fcrossdomain%253Fentry%253Dmweibo%2526action%253Dlogin%2526display%253D0%2526ticket%253DST-NjM1Nzk3NDI2MQ%25253D%25253D-1504438316-gz-94CAAA0133A8B28346F7993B8357F442-1","uid":"6357974261"}}

    If the login fails, the response is:

    ...
    Data: {"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"example@python.org","errline":536}}
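
    JSON writes non-ASCII characters as \uXXXX escapes, so the msg field is easier to read after parsing. A short sketch, assuming the failure response carries such escapes in msg; the literal below is copied from that response, with the escapes written out:

```python
import json

# Raw string, so the \uXXXX sequences reach json.loads() unchanged
raw = (r'{"retcode":50011015,'
       r'"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef",'
       r'"data":{"username":"example@python.org","errline":536}}')

result = json.loads(raw)
print(result['retcode'])  # 50011015
print(result['msg'])      # 用户名或密码错误 ("username or password is incorrect")
```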

    Handler

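    For more complex control, such as accessing a website through a proxy, we need ProxyHandler. Sample code follows (still to be understood):

    import urllib.request

    # Route http requests through the proxy
    proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
    # Supply credentials when the proxy asks for Basic authentication
    proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
    proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
    # An opener chains the handlers; opener.open() is used in place of urlopen()
    opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
    with opener.open('http://www.example.com/login.html') as f:
        pass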
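
    To see how the pieces fit together, the opener can be built and inspected without opening any connection. This is a sketch under assumptions, not the original author's method: make_proxy_opener is a hypothetical helper, the proxy address and credentials are placeholders, and HTTPPasswordMgrWithDefaultRealm is swapped in so that no realm name has to be known in advance:

```python
import urllib.request


def make_proxy_opener(proxy_url, user=None, password=None):
    """Build an opener that routes http/https traffic through proxy_url."""
    handlers = [urllib.request.ProxyHandler({'http': proxy_url,
                                             'https': proxy_url})]
    if user is not None:
        # realm=None acts as a default that is tried for any realm
        mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
        mgr.add_password(None, proxy_url, user, password)
        handlers.append(urllib.request.ProxyBasicAuthHandler(mgr))
    return urllib.request.build_opener(*handlers)


# Placeholder proxy address and credentials; no connection is made here
opener = make_proxy_opener('http://proxy.example.com:3128/',
                           'username', 'password')
print([type(h).__name__ for h in opener.handlers])
```

    The printed list shows that build_opener keeps its default handlers (HTTP, redirects, errors) and adds the two proxy handlers to the chain. Each request made with opener.open() passes through that chain.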

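    Summary

    urllib lets a program perform all kinds of HTTP requests. To simulate a browser for a particular task, the request must be disguised as a browser request: first observe the requests the browser sends, then copy its request headers. The User-Agent header is what identifies the browser.

    Exercise

    Use urllib to read XML: change the data in the XML section from hard-coded to fetched via urllib: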

    # Query the weather
    # -*- coding: utf-8 -*-
    
    import urllib.request, urllib.parse
    from xml.parsers.expat import ParserCreate
    
    class weatherSaxHandler(object):
        def __init__(self):
            self._location = {}
            self._forcast = []
    
        def start_element(self, name, attrs):
            if name == 'yweather:location':
                self._location = attrs
                # Drop the namespace declaration if present (no KeyError if absent)
                attrs.pop('xmlns:yweather', None)
            if name == 'yweather:forecast':
                self._forcast.append(attrs)
        def end_element(self, name):
            pass
        def char_data(self, text):
            pass
    
    def parse_weather(xml):   # input: XML string; output: dict of weather information
        parser = ParserCreate()
        handler = weatherSaxHandler()
        parser.StartElementHandler = handler.start_element
        parser.EndElementHandler = handler.end_element
        parser.CharacterDataHandler = handler.char_data
        parser.Parse(xml)
        today = {
            'text': handler._forcast[0]['text'],
            'low': int(handler._forcast[0]['low']),
            'high': int(handler._forcast[0]['high'])
        }
        tomorrow = {
            'text': handler._forcast[1]['text'],
            'low':  int(handler._forcast[1]['low']),
            'high': int(handler._forcast[1]['high'])
        }
        d = {
            'today' : today,
            'tomorrow': tomorrow
        }
        weather = handler._location
        weather.update(d)
        return weather
    
    def get_weather(city):  # input: city name (pinyin) string; output: weather dict
        baseurl = "https://query.yahooapis.com/v1/public/yql?"
        yql_query = 'select * from weather.forecast where woeid in (select woeid from geo.places(1) where text="%s")' % city
        yql_url = baseurl + urllib.parse.urlencode({'q':yql_query})
        with urllib.request.urlopen(yql_url) as f:
            city_xml = f.read().decode('utf-8')
        city_weather = parse_weather(city_xml)
        return city_weather
    
    if __name__ == '__main__':
        city = input('Weather Forecast in City:')
        print(get_weather(city))

    Sample run:

    Weather Forecast in City:Beijing
    {'city': 'Beijing', 'region': ' Beijing', 'tomorrow': {'text': 'Mostly Cloudy', 'low': 66, 'high': 84}, 'today': {'text': 'Partly Cloudy', 'low': 64, 'high': 84}, 'country': 'China'}
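
    The parsing half of the exercise can be tried without any network access. The XML below is a hypothetical sample, shaped after the element and attribute names that the code above reads and filled with the values from the sample run. It is not an actual API response, and collect_forecasts is an illustrative helper name:

```python
from xml.parsers.expat import ParserCreate

# Hypothetical sample document, for illustration only
SAMPLE_XML = '''<rss xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
  <yweather:location city="Beijing" region=" Beijing" country="China"/>
  <yweather:forecast day="Sun" low="64" high="84" text="Partly Cloudy"/>
  <yweather:forecast day="Mon" low="66" high="84" text="Mostly Cloudy"/>
</rss>'''


def collect_forecasts(xml):
    """Return one dict per yweather:forecast element, in document order."""
    forecasts = []

    def start_element(name, attrs):
        # expat calls this for every opening tag; attrs is a dict of strings
        if name == 'yweather:forecast':
            forecasts.append({'text': attrs['text'],
                              'low': int(attrs['low']),
                              'high': int(attrs['high'])})

    parser = ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml, True)  # True marks this as the final chunk of input
    return forecasts


print(collect_forecasts(SAMPLE_XML))
```

    Separating the download from the parsing in this way makes the SAX handler testable on its own, whatever the data source turns out to be.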
  • Original URL: https://www.cnblogs.com/douzujun/p/7469979.html