urllib module

Fetching a web page,
    import urllib.request

    html = urllib.request.urlopen("http://www.zzyzz.top/")     # sends the request, returns the response
    html2 = urllib.request.Request("http://www.zzyzz.top/")    # only builds a request object, sends nothing
    print("html", html)
    print("html2", html2)

    output,
    html <http.client.HTTPResponse object at 0x0395DFF0>
    html2 <urllib.request.Request object at 0x03613930>

Methods of HTTPResponse object,
    geturl(): return the URL of the resource retrieved, commonly used to
        determine if a redirect was followed.
        This is the URL of the page finally served to the user, which is not
        necessarily the URL passed as the argument, because the request may
        have been redirected.

    info(): return the meta-information of the page, such as headers, in the
        form of an email.message_from_string() instance (see Quick Reference
        to HTTP Headers).

    getcode(): return the HTTP status code of the response.

Attributes and methods of Request object,
    Request.full_url
        The original URL passed to the constructor.
        Request.full_url is a property with a setter, a getter and a deleter.
        Getting full_url returns the original request URL with the fragment,
        if it was present.
        That is, the 'URL' argument (unlike the geturl() method of the
        HTTPResponse object, which reports the URL after any redirect).

    Request.type
        The URI scheme.
        A string such as 'http' or 'https'.

    Request.host
        The URI authority, typically a host, but may also contain a port
        separated by a colon.
        That is, the host name or IP address (possibly followed by the port
        number).

    Request.origin_req_host
        The original host for the request, without port.
        That is, the host name or IP address, without the port.

    Request.selector
        The URI path. If the Request uses a proxy, then selector will be the
        full URL that is passed to the proxy.
        That is, the path requested from the server (relative to the server
        root); for example, '/' is the server root directory.

    Request.data
        The entity body for the request, or None if not specified.
        For example, the form data of a POST:
            urllib.request.Request("http://www.zzyzz.top/", data)
            # data must be bytes, e.g.
            # data = urllib.parse.urlencode({"Hi": "Hello"}).encode("ascii")

    Request.unverifiable
        Boolean, indicates whether the request is unverifiable as defined by
        RFC 2965.

    Request.method
        The HTTP request method to use.
        By default its value is None, which means that get_method() will do
        its normal computation of the method to be used. Its value can be set
        (thus overriding the default computation in get_method()) either by
        providing a default value by setting it at the class level in a
        Request subclass, or by passing a value in to the Request constructor
        via the method argument.

    Request.get_method()
        Return a string indicating the HTTP request method. If Request.method
        is not None, return its value, otherwise return 'GET' if Request.data
        is None, or 'POST' if it's not. This is only meaningful for HTTP
        requests.
        In the default case the result is 'POST' or 'GET'.

    Request.add_header(key, val)
        Add another header to the request. Headers are currently ignored by
        all handlers except HTTP handlers, where they are added to the list
        of headers sent to the server. Note that there cannot be more than
        one header with the same name, and later calls will overwrite previous
        calls in case the key collides. Currently, this is no loss of HTTP
        functionality, since all headers which have meaning when used more
        than once have a (header-specific) way of gaining the same
        functionality using only one header.

    Request.add_unredirected_header(key, header)
        Add a header that will not be added to a redirected request.

    Request.has_header(header)
        Return whether the instance has the named header (checks both
        regular and unredirected).

    Request.remove_header(header)
        Remove named header from the request instance (both from regular
        and unredirected headers).

    Request.get_full_url()
        Return the URL given in the constructor.
        What it returns is in fact Request.full_url.

    Request.set_proxy(host, type)
        Prepare the request by connecting to a proxy server. The host and
        type will replace those of the instance, and the instance's selector
        will be the original URL given in the constructor.
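
The attributes and methods listed above can be tried without any network access, because building a Request sends nothing. The following is only a minimal sketch: the helper name describe_request, the host example.invalid and the header value demo-client/0.1 are made up for illustration and are not part of the original notes.

```python
import urllib.parse
import urllib.request


def describe_request(url, form=None):
    """Build a Request (nothing is sent) and report some of its fields."""
    data = None
    if form is not None:
        # Request.data must be bytes, so the form dict is URL-encoded first.
        data = urllib.parse.urlencode(form).encode("ascii")
    req = urllib.request.Request(url, data=data)
    req.add_header("User-Agent", "demo-client/0.1")
    return {
        "full_url": req.full_url,
        "type": req.type,
        "host": req.host,
        "origin_req_host": req.origin_req_host,
        "selector": req.selector,
        "method": req.get_method(),
        # add_header() stores the key as key.capitalize(), i.e. "User-agent",
        # so that spelling is used for the lookups below.
        "has_user_agent": req.has_header("User-agent"),
        "user_agent": req.get_header("User-agent"),
    }


if __name__ == "__main__":
    url = "http://example.invalid:8080/path?q=1"
    for key, value in describe_request(url).items():
        print(key, "=", value)
    print(describe_request(url, {"Hi": "Hello"})["method"])
```

Note that has_header() and get_header() compare against the stored key, so looking up "User-Agent" with a capital A would not find the header added here.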

    Request.get_header(header_name, default=None)
        Return the value of the given header. If the header is not present,
        return the default value.

    Request.header_items()
        Return a list of tuples (header_name, header_value) of the Request
        headers.

Example, fetching the HTML source,
    urlobj = urllib.request.Request("http://www.zzyzz.top/")
    with urllib.request.urlopen(urlobj) as FH:    # file-like object
        print(FH.read().decode('utf8'))

Authentication,
    Accessing a URL that requires authentication produces an HTTP 401 error,
    which means the URL cannot be served without credentials.
    Authentication usually takes one of two forms,
    1, The browser shows a pop-up dialog asking the user for a user name and
       password. This is Basic HTTP Authentication; the credentials travel in
       the Authorization request header, not in cookies.
    2, Form-based authentication: the web page asks the user for a user name
       and password, then sends them to the server with the POST method.

Pop-up dialog authentication - Basic HTTP Authentication
    import urllib.request
    # Create an OpenerDirector with support for Basic HTTP Authentication...
    # A password manager with a default realm is needed for realm=None to
    # match whatever realm the server announces.
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(realm=None,
                              uri="http://www.zzyzz.top/",
                              user='userid',
                              passwd='password')
    auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    opener = urllib.request.build_opener(auth_handler)
    # ...and install it globally so it can be used with urlopen.
    urllib.request.install_opener(opener)
    html = urllib.request.urlopen("http://www.zzyzz.top/")
    print(html.read().decode('utf8'))

Form-based authentication,
    The server usually handles this by validating the data of the form that
    the user submits (POST). If validation passes, it redirects to the
    authorized page; otherwise it redirects to the login page and asks the
    user to POST the credentials again.
    So this kind of authentication can be treated as an ordinary form POST.
    import urllib.parse, urllib.request
    # The request body must be bytes, so the form dict is URL-encoded first.
    form = urllib.parse.urlencode({"id": "userid", "pw": "password"}).encode("ascii")
    urlobj = urllib.request.Request("http://www.zzyzz.top/", form)
    with urllib.request.urlopen(urlobj) as FH:    # file-like object
        print(FH.read().decode('utf8'))

Error handling
    urllib exceptions fall into two main groups, connection errors and data
    errors.

    Connection errors (a wrong URL, a URL that uses an unsupported protocol,
    a host name that does not exist, etc.),
        An HTTP error status such as 404 Page Not Found is reported in the
        same way.
        Exceptions raised while connecting are instances of
        urllib.error.URLError (also reachable as urllib.request.URLError) or
        of one of its subclasses.
        For example, urllib.error.HTTPError, which is also a file-like object.
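
The relationship between the two exception classes can be checked offline. This is a minimal sketch rather than a recipe: the helpers classify and make_http_error, the example.invalid URL and the body text are invented for illustration, and the HTTPError is constructed by hand instead of coming from a real server.

```python
import io
import urllib.error
from email.message import Message


def make_http_error():
    """Build an HTTPError by hand, with a small in-memory error document."""
    return urllib.error.HTTPError(
        "http://example.invalid/HELLO",   # hypothetical URL
        404,
        "Not Found",
        Message(),                        # empty header set
        io.BytesIO(b"no such page"),      # body, readable like a file
    )


def classify(exc):
    """Raise exc and report which except branch catches it."""
    try:
        raise exc
    except urllib.error.HTTPError as e:
        # This branch must come first: HTTPError is a subclass of URLError,
        # so the URLError branch would otherwise swallow it.
        return "HTTPError %d, body=%r" % (e.code, e.read())
    except urllib.error.URLError as e:
        return "URLError: %s" % e.reason


if __name__ == "__main__":
    print(issubclass(urllib.error.HTTPError, urllib.error.URLError))
    print(classify(make_http_error()))
    print(classify(urllib.error.URLError("host not found")))
```

Because HTTPError carries the response body, the error page sent by the server can be read from the exception object itself.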

    Example,
        import sys
        import urllib.error, urllib.request

        urlobj = urllib.request.Request("http://10.240.26.249/HELLO")
        try:
            with urllib.request.urlopen(urlobj) as FH:
                print(FH.read().decode('utf8'))
        except urllib.error.HTTPError as e:
            # The server answered, but with an error status (e.g. 404).
            print("HTTPError has been detected : ", e)
            print("Error document : ")
            print(e.read().decode('utf8'))
            sys.exit(1)
        except urllib.error.URLError as e:
            # No usable answer at all (bad host, unsupported protocol, ...).
            print("URLError has been detected : ", e)
            sys.exit(2)

    Data errors,
        For example, a communication failure can make the socket object raise
        socket.error (an alias of OSError in Python 3) when read() is called,
        or the connection may be interrupted while data is being transferred,
        and so on.

Reference,
    https://docs.python.org/3/library/urllib.request.html#module-urllib.request