Python Study Notes (41): Built-in Modules (10) urllib

Excerpted from: https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000/001432688314740a0aed473a39f47b09c8c7274c9ab6aee000

Get

The request module of urllib makes it easy to fetch the content of a URL: it sends a GET request to the specified page and returns the HTTP response.

For example, fetch the Douban URL https://api.douban.com/v2/book/2129650 and print the response:

from urllib import request

with request.urlopen('https://api.douban.com/v2/book/2129650') as f:
    data = f.read()
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', data.decode('utf-8'))

Status: 200 OK
Date: Sun, 03 Sep 2017 08:41:22 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 2058
Connection: close
Vary: Accept-Encoding
X-Ratelimit-Remaining2: 97
X-Ratelimit-Limit2: 100
Expires: Sun, 1 Jan 2006 01:00:00 GMT
Pragma: no-cache
Cache-Control: must-revalidate, no-cache, private
Set-Cookie: bid=dDzHhyeuVQ0; Expires=Mon, 03-Sep-18 08:41:22 GMT; Domain=.douban.com; Path=/
X-DOUBAN-NEWBID: dDzHhyeuVQ0
X-DAE-Node: sindar15a
X-DAE-App: book
Server: dae
Data: {"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},"subtitle":"","author":["廖雪峰"],"pubdate":"2007","tags":[{"count":21,"name":"spring","title":"spring"},{"count":13,"name":"Java","title":"Java"},{"count":6,"name":"javaee","title":"javaee"},{"count":5,"name":"j2ee","title":"j2ee"},{"count":4,"name":"计算机","title":"计算机"},{"count":3,"name":"藏书","title":"藏书"},{"count":3,"name":"编程","title":"编程"},{"count":3,"name":"POJO","title":"POJO"}],"origin_title":"","image":"https://img3.doubanio.com/mpic/s2552283.jpg","binding":"平装","translator":[],"catalog":"","pages":"509","images":{"small":"https://img3.doubanio.com/spic/s2552283.jpg","large":"https://img3.doubanio.com/lpic/s2552283.jpg","medium":"https://img3.doubanio.com/mpic/s2552283.jpg"},"alt":"https://book.douban.com/subject/2129650/","id":"2129650","publisher":"电子工业出版社","isbn10":"7121042622","isbn13":"9787121042621","title":"Spring 2.0核心技术与最佳实践","url":"https://api.douban.com/v2/book/2129650","alt_title":"","author_intro":"","summary":"本书注重实践而又深入理论，由浅入深且详细介绍了Spring 2.0框架的几乎全部的内容，并重点突出2.0版本的新特性。本书将为读者展示如何应用Spring 2.0框架创建灵活高效的JavaEE应用，并提供了一个真正可直接部署的完整的Web应用程序——Live在线书店(http://www.livebookstore.net)。
在介绍Spring框架的同时，本书还介绍了与Spring相关的大量第三方框架，涉及领域全面，实用性强。本书另一大特色是实用性强，易于上手，以实际项目为出发点，介绍项目开发中应遵循的最佳开发模式。
本书还介绍了大量实践性极强的例子，并给出了完整的配置步骤，几乎覆盖了Spring 2.0版本的新特性。
本书适合有一定Java基础的读者，对JavaEE开发人员特别有帮助。本书既可以作为Spring 2.0的学习指南，也可以作为实际项目开发的参考手册。","price":"59.8"}
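
Since the body above is JSON, it can be turned into Python objects with the standard json module. The following is only a minimal sketch: sample_body is a hand-abbreviated copy of the response (an assumption, so that the example runs without network access); in the real program it would be the value returned by f.read().

```python
import json

# Abbreviated copy of the response body shown above, as bytes
# (the same type that f.read() returns).
sample_body = (
    '{"rating":{"max":10,"numRaters":16,"average":"7.4","min":0},'
    '"author":["廖雪峰"],"pubdate":"2007",'
    '"title":"Spring 2.0核心技术与最佳实践","price":"59.8"}'
).encode('utf-8')

# Decode the bytes to str, then parse the JSON text into a dict.
book = json.loads(sample_body.decode('utf-8'))

print(book['title'])
print(book['author'][0], book['rating']['average'])
```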

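To simulate a browser sending a GET request, we need to use a Request object. By adding HTTP headers to the Request object, we can disguise the request as one coming from a browser. For example, to simulate an iPhone 6 requesting the Douban home page: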

from urllib import request

# To simulate a browser sending a GET request, use a Request object
req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

.......
<meta name="viewport" content="width=device-width, height=device-height, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
        <meta name="format-detection" content="telephone=no">
        <link rel="canonical" href="https://m.douban.com/">
        <link href="https://img3.doubanio.com/f/talion/3c45a4b3705e30953879f6078082cbd1b9f88858/css/card/base.css" rel="stylesheet">
.......
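
Before sending anything, the Request object can be inspected to check what would go out. A small sketch, reusing the URL and User-Agent from the example above; nothing is sent over the network here.

```python
from urllib import request

req = request.Request('http://www.douban.com/')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')

# No data is attached, so urlopen() would send this request as a GET.
method = req.get_method()

# Request stores header names capitalized, so the key is 'User-agent'.
user_agent = req.get_header('User-agent')

print(method, req.full_url)
print(user_agent)
```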

Post

To send a request with POST, just pass the data parameter as bytes.
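
The bytes are usually produced by URL-encoding the form fields and then encoding the resulting str. A minimal sketch of that step; the credentials and the https://example.com/login URL are placeholders of my own, and no request is actually sent.

```python
from urllib import parse, request

# Hypothetical credentials, for illustration only.
form = parse.urlencode([
    ('username', 'user@example.com'),
    ('password', 'p w'),
])

# urlencode() returns str; urlopen() and Request expect bytes for data.
body = form.encode('utf-8')

# Placeholder URL. Attaching data switches the method from GET to POST.
req = request.Request('https://example.com/login', data=body)

print(form)
print(req.get_method())
```

Special characters are escaped along the way: '@' becomes %40 and a space becomes '+'.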

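Here we simulate a Weibo login: first read the login email and password, then pass them in encoded as username=xxx&password=xxx, following the format of the weibo.cn login page: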

from urllib import request, parse

print('Login to weibo.cn......')
email = input('Email: ')
passwd = input('Password: ')
login_data = parse.urlencode([
    ('username', email),
    ('password', passwd),
    ('entry', 'mweibo'),
    ('client_id', ''),
    ('savestate', '1'),
    ('ec', ''),
    ('pagerefer', 'https://passport.weibo.cn/signin/welcome?entry=mweibo&r=http%3A%2F%2Fm.weibo.cn%2F')
])

req = request.Request('https://passport.weibo.cn/sso/login')
req.add_header('Origin', 'http://passport.weibo.cn')
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
req.add_header('Referer', 'https://passport.weibo.cn/signin/login?entry=mweibo&res=wel&wm=3349&r=http%3A%2F%2Fm.weibo.cn%2F')

with request.urlopen(req, data=login_data.encode('utf-8')) as f:
    print('State:', f.status, f.reason)
    for k, v in f.getheaders():
        print('%s: %s' % (k, v))
    print('Data:', f.read().decode('utf-8'))

State: 200 OK
Server: nginx/1.6.1
Date: Sun, 03 Sep 2017 11:31:56 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: no-cache, must-revalidate
Expires: Sat, 26 Jul 1997 05:00:00 GMT
Pragma: no-cache
Access-Control-Allow-Origin: http://passport.weibo.cn
Access-Control-Allow-Credentials: true
Set-Cookie: SUB=_2A250r5h8DeThGeBN7lUY9yrOzT2IHXVUUzg0rDV6PUJbkdBeLXPnkW08RRwH9G8I4bQbO4O9n3iyqeIP8g..; Path=/; Domain=.weibo.cn; Expires=Mon, 03 Sep 2018 11:31:56 GMT; HttpOnly
Set-Cookie: SUHB=0M2Veoz3CDVYaB; expires=Monday, 03-Sep-2018 11:31:56 GMT; path=/; domain=.weibo.cn
Set-Cookie: SCF=Ah1KXnqURq1Vwg0pcnz1J2hopmgB_WeMnJp9lOca0OIZ5xbPll3pP4EXHcrcZF3U5QuKuhvMlNKw9Vr8u3coL14.; expires=Wednesday, 01-Sep-2027 11:31:56 GMT; path=/; domain=.weibo.cn; httponly
Set-Cookie: SSOLoginState=1504438316; path=/; domain=weibo.cn
Set-Cookie: ALF=1507030316; expires=Tuesday, 03-Oct-2017 11:31:56 GMT; path=/; domain=.sina.cn
DPOOL_HEADER: dryad62
SINA-LB: aGEuMTI3LmcxLm5mamQubGIuc2luYW5vZGUuY29t
SINA-TS: Y2ZjYTk0Y2UgMCAwIDAgOSAzODYK
Data: {"retcode":20000000,"msg":"","data":{"loginresulturl":"https://passport.weibo.com/sso/crossdomain?entry=mweibo&action=login&proj=1&ticket=ST-NjM1Nzk3NDI2MQ%3D%3D-1504438316-gz-509C6EAFA74DA5C86B1AEB13AEB7D6B8-1&display=0&cdurl=https%3A%2F%2Flogin.sina.com.cn%2Fsso%2Fcrossdomain%3Fentry%3Dmweibo%26action%3Dlogin%26proj%3D1%26ticket%3DST-NjM1Nzk3NDI2MQ%253D%253D-1504438316-gz-46B914F433231C881EA55B8D2E8FBE98-1%26display%3D0%26cdurl%3Dhttps%253A%252F%252Fpassport.sina.cn%252Fsso%252Fcrossdomain%253Fentry%253Dmweibo%2526action%253Dlogin%2526display%253D0%2526ticket%253DST-NjM1Nzk3NDI2MQ%25253D%25253D-1504438316-gz-94CAAA0133A8B28346F7993B8357F442-1","uid":"6357974261"}}

If the login fails, the response looks like this:

...
Data: {"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"example@python.org","errline":536}}
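
The msg field is a JSON string in which the Chinese characters are written as \uXXXX escapes, so parsing the body with json makes it readable again. A rough sketch, using the failure body with the escapes written out in full; treating retcode 20000000 as success is an assumption inferred from the successful sample response, not from any Weibo documentation.

```python
import json

# Failure response body; the raw string keeps the \uXXXX escapes literal
# so that json.loads() is the one that decodes them.
raw = r'{"retcode":50011015,"msg":"\u7528\u6237\u540d\u6216\u5bc6\u7801\u9519\u8bef","data":{"username":"example@python.org","errline":536}}'

result = json.loads(raw)

# Assumption: 20000000 is the retcode seen in the successful sample response.
login_ok = result['retcode'] == 20000000

print(login_ok)
print(result['msg'])
```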

Handler

If finer control is needed, such as accessing a site through a proxy, we need to use ProxyHandler. Sample code follows (I have not fully understood this part yet):

import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open('http://www.example.com/login.html') as f:
    pass
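
To see what build_opener actually assembles, the opener can be inspected without opening any URL. This is only a sketch built on the same placeholder proxy address and credentials as above: the opener is a chain of handlers, and the two passed in are combined with urllib's default ones.

```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

# build_opener() chains the given handlers with the default ones.
opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)

handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)

# Optional: make every later urlopen() call go through this opener.
# urllib.request.install_opener(opener)
```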

Summary

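What urllib provides is the ability to perform all kinds of HTTP requests from a program. To simulate a browser carrying out a specific task, the request has to be disguised as a browser request. The way to do this is to first observe the requests the browser sends, then imitate the browser's request headers; the User-Agent header is the one that identifies the browser.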
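
One reason the disguise is needed: when no User-Agent is set, urllib announces itself as a Python client. A quick check, as a sketch that sends nothing:

```python
import urllib.request

# A freshly built opener carries the headers urllib adds by default.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# Prints something like 'Python-urllib/3.x', depending on the Python version.
print(default_headers['User-agent'])
```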

Exercise

Use urllib to read XML: change the data in the XML section from hard-coded to fetched with urllib:

# Query the weather
# -*- coding: utf-8 -*-

import urllib.request, urllib.parse
from xml.parsers.expat import ParserCreate

class weatherSaxHandler(object):
    def __init__(self):
        self._location = {}
        self._forcast = []

    def start_element(self, name, attrs):
        if name == 'yweather:location':
            self._location = attrs
            attrs.pop('xmlns:yweather')
        if name == 'yweather:forecast':
            self._forcast.append(attrs)
    def end_element(self, name):
        pass
    def char_data(self, text):
        pass

def parse_weather(xml):   # input: XML string; output: dict of weather information
    parser = ParserCreate()
    handler = weatherSaxHandler()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.char_data
    parser.Parse(xml)
    today = {
        'text': handler._forcast[0]['text'],
        'low': int(handler._forcast[0]['low']),
        'high': int(handler._forcast[0]['high'])
    }
    tomorrow = {
        'text': handler._forcast[1]['text'],
        'low':  int(handler._forcast[1]['low']),
        'high': int(handler._forcast[1]['high'])
    }
    d = {
        'today' : today,
        'tomorrow': tomorrow
    }
    weather = handler._location
    weather.update(d)
    return weather

def get_weather(city):  # input: city name (pinyin) string; output: weather dict
    baseurl = "https://query.yahooapis.com/v1/public/yql?"
    yql_query = 'select * from weather.forecast where woeid in (select woeid from geo.places(1) where text="%s")' % city
    yql_url = baseurl + urllib.parse.urlencode({'q':yql_query})
    with urllib.request.urlopen(yql_url) as f:
        city_xml = f.read().decode('utf-8')
    city_weather = parse_weather(city_xml)
    return city_weather

if __name__ == '__main__':
    city = input('Weather Forecast in City:')
    print(get_weather(city))

Weather Forecast in City:Beijing
{'city': 'Beijing', 'region': ' Beijing', 'tomorrow': {'text': 'Mostly Cloudy', 'low': 66, 'high': 84}, 'today': {'text': 'Partly Cloudy', 'low': 64, 'high': 84}, 'country': 'China'}
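
The parsing half can be tried without any network access by feeding expat a small XML string. The sketch below uses a hand-written, hypothetical SAMPLE_XML in the shape the handler above expects (it is not real Yahoo output) and a simplified start-element callback instead of the full class.

```python
from xml.parsers.expat import ParserCreate

# Hypothetical sample, written by hand to mimic the expected structure.
SAMPLE_XML = '''<rss xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
<channel>
<yweather:location city="Beijing" region=" Beijing" country="China"/>
<item>
<yweather:forecast day="Sun" low="64" high="84" text="Partly Cloudy"/>
<yweather:forecast day="Mon" low="66" high="84" text="Mostly Cloudy"/>
</item>
</channel>
</rss>'''

location = {}
forecasts = []

def start_element(name, attrs):
    # Without namespace processing, expat reports prefixed names verbatim.
    if name == 'yweather:location':
        location.update(attrs)
    elif name == 'yweather:forecast':
        forecasts.append(attrs)

parser = ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(SAMPLE_XML, True)

# The first forecast element is treated as today, as in parse_weather above.
today = {
    'text': forecasts[0]['text'],
    'low': int(forecasts[0]['low']),
    'high': int(forecasts[0]['high']),
}
print(location['city'], today)
```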

Original article: https://www.cnblogs.com/douzujun/p/7469979.html