Python爬虫学习（一）使用requests库和robots协议

（一）爬虫需要的库和框架：

（二）爬虫的限制：

1，Robots协议概述：

　　　　网站拥有者可以在网站根目录下建立robots.txt文件，User-agent：定义不能访问者；Disallow定义不可以爬取的目录

　　　　例如：http://www.baidu.com/robots.txt的部分内容：　　　　　

 //不允许Baiduspider访问如下目录
User-agent: Baiduspider 
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh
//不允许Googlebot访问如下目录
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/
Disallow: /bh

　　2，Robots协议的使用：爬虫要求，类人行为爬虫可以不用遵守robots协议

（三）使用Requests库：

　　1，安装命令：pip install requests

　　2，测试

import requests;
r=requests.get("http://www.baidu.com")
print(r.status_code)
#如果requests库安装成功，则打印200
r.encoding='utf-8'
r.text    #打印网页内容

　　常见方法：　

　　　　1，requests.request（）方法：

　　　　　　1，第三个参数使用params=xx，将数据添加到url后面

　　　　　　 2，第三个参数使用data=xxx，将数据赋值到内容

　　　　　　 3，第三个参数使用json=xx，将内容赋值到json

　　　　　　4，第三个参数使用headers=字典类型数据；可以修改网页headers的属性值

　　　　　　5，第三个参数使用timeout=n，设置访问时间，当访问超时则报异常

　　　　　　6，第三个参数使用proxies=xx，设置访问的代理服务器（xx为字典数据，可以存放http或https代理），可以有效隐藏爬虫原IP，可以有效防止爬虫逆追踪

　　　　　　例一：

　　　　例二：第三参数使用data=xxx，将数据赋值到内容　　　

import requests;
from _socket import timeout
#from aifc import data               
def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)  
        r.raise_for_status()           #如果连接状态不是200，则引发HTTPError异常
        r.encoding=r.apparent_encoding #使返回的编码正常
        print("连接成功")
        return r.status_code
    except:
        print("连接异常")
        return r.status_code

url="http://httpbin.org/"
if getHTMLText(url)==200:
    payData={'name;':'zs',"sex":"man"};
    r=requests.request("POST",url+"post",data=payData)  #第一个参数赋什么值，第二个参数在url后就添加什么参数
    print(r.text)

　　　　　　运行截图：

　　　　2，requests.get（）方法：**kwargs是request方法中除了params外的其他12个参数（最常用的方法）

　　　　　　r.encoding：当连接的网页的<head>标签没有charset属性，则认为编码为ISO-8859-1

　　　　　　当连接失败时：

　　　　　　抛出异常方法：

#判断url是否可以正常连接的通用方法：
def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)  
        r.raise_for_status()           #如果连接状态不是200，则引发HTTPError异常
        r.encoding=r.apparent_encoding #使返回的编码正常
        return r.text
    except:
        return "连接异常"

　　　 3，requests.head（）方法：

　　　4，requests.post（）方法：

　　 5，requests.put（）方法：

　　 6，requests.patch（）方法：

　　7，requests.delete（）方法：

　　8，HTTP协议：

（四）爬取网站数据案例；

　　例一：爬取起点某小说某章节

import requests;
from _socket import timeout
#from aifc import data
def getHTMLText(url):
    try:
        r=requests.get(url,timeout=30)  
        r.raise_for_status()           #如果连接状态不是200，则引发HTTPError异常
        r.encoding=r.apparent_encoding #使返回的编码正常
        print("连接成功")
        return r.status_code
    except:
        print("连接异常")
        return r.status_code

url="https://read.qidian.com/chapter/z4HnwebpkxdqqtWmhQLkJA2/pYlct39tiiVOBDFlr9quQA2"
access={"user-agent":"Mozilla/5.0"}              #设置访问网站为浏览器Mozilla5.0
if getHTMLText(url)==200:
    #pst={"http":"http://user:pass@10.10.10.1:1234","https":"https://10.10.10.1:4321"} #设置代理
    r=requests.get(url,headers=access)         #访问网站设置代理服务器
    #r.encoding="utf-8"
    print(r.request.headers)            #获取头部信息
　　print(r.text[2050:3000])

　　截图：

　　例二：保存网上各种格式文件到本地：

import os
import requests
url="http://a3.att.hudong.com/53/88/01200000023787136831885695277.jpg"  #可以是图片，音乐，视频等的网上地址
access={"user-agent":"Mozilla/5.0"}
file="D://picture//"      #定义图片保存的本地文件
path=file+url.split("/")[-1]
print(path)
try:
    if not os.path.exists(file):          #如果不存在目录D://picture就新建
        os.mkdir(file)
    if not os.path.exists(path):
        r=requests.get(url,headers=access)
        with open(path,"wb") as f:         #打开文件path作为f；
            f.write(r.content)              #r.content：爬取的文件的二进制形式
            f.close()
            print("图片保存成功")
    else:
        print("文件已保存")
except:
    print("爬取失败")

　　例三：

相关阅读:
Aizu 0525 Osenbei 搜索 A
PAT 1088 三人行模拟,坑 C
POJ1862 Stripies 贪心 B
ZOJ 4109 Welcome Party 并查集+优先队列+bfs
POJ 3685 Matrix
POJ 3579 Median 二分加判断
 Educational Codeforces Round 63 D. Beautiful Array
Codeforces Round #553 (Div. 2) C
HDU 5289
Codeforces 552 E. Two Teams
原文地址：https://www.cnblogs.com/lq13035130506/p/12243806.html