1. What is a crawler?
Description: a crawler is, in essence, an automated program that simulates a browser sending requests to a server and retrieving the resources in the response.
The basic crawler workflow: send a request, receive the response, parse the content, and store the data.
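A minimal sketch of that workflow with requests (http://httpbin.org/html is just a convenient test page):

import requests

# 1) send a request and 2) receive the response
res = requests.get("http://httpbin.org/html")
# 3) parse the content -- here we simply keep the raw HTML text
html = res.text
# 4) store the data
with open("page.html", "w", encoding="utf-8") as f:
    f.write(html)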
The robots.txt protocol
A site publishes a robots.txt protocol file to tell crawler programs which parts of the site they may and may not fetch.
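Python's standard library can read this file for you. A minimal sketch using urllib.robotparser (the JD URL is only an example; any site's robots.txt works):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.jd.com/robots.txt")  # example target
rp.read()
# Ask whether a given user agent is allowed to fetch a given URL
print(rp.can_fetch("*", "https://www.jd.com/"))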
2. The HTTP protocol
import requests

# 1. GET is by far the most common method. It simply asks the server for
#    a resource, which is returned to the client as a set of HTTP headers
#    plus an entity body (HTML text, an image, a video, etc.). A GET
#    request itself never carries a body.
res = requests.get("http://httpbin.org/get")
print("GET response >>>> res:", res.text)

# 2. POST is similar to PUT in that both send data to the server, but
#    POST creates new content, much like a database INSERT. Almost all
#    submit operations today use POST. It carries a request body.
res1 = requests.post("http://httpbin.org/post")
print("POST response >>>> res1:", res1.text)

# 3. PUT sends data to the server to modify existing content, like a
#    database UPDATE: it does not add new kinds of data, so no matter how
#    many times the same PUT is performed, the result is unchanged.
res2 = requests.put("http://httpbin.org/put")
print("PUT response >>>> res2:", res2.text)

# 4. DELETE removes a resource.
res3 = requests.delete("http://httpbin.org/delete")
print("DELETE response >>>> res3:", res3.text)

# 5. HEAD is often overlooked but provides a lot of useful information,
#    especially when speed and bandwidth are limited:
#    1) it requests only the resource's headers;
#    2) it can check whether a hyperlink is valid;
#    3) it can check whether a page has been modified;
#    4) robots often use it to fetch page metadata, RSS feed info, or to
#       pass authentication information.
res4 = requests.head("http://httpbin.org/get")
print("HEAD response >>>> res4:", res4.text)

# 6. OPTIONS asks which methods and headers the server supports for a
#    resource; it is for inspection only. For example:
res5 = requests.options("http://httpbin.org/get")
print("OPTIONS response >>>> res5:", res5.text)
D:\python3.6\python.exe F:/爬虫/request请求方式.py
GET response >>>> res: {
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "origin": "61.144.173.127, 61.144.173.127",
  "url": "https://httpbin.org/get"
}
POST response >>>> res1: {
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "0",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  "origin": "61.144.173.127, 61.144.173.127",
  "url": "https://httpbin.org/post"
}
PUT response >>>> res2: {
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Content-Length": "0",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  "origin": "61.144.173.127, 61.144.173.127",
  "url": "https://httpbin.org/put"
}
DELETE response >>>> res3: {
  "args": {},
  "data": "",
  "files": {},
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.21.0"
  },
  "json": null,
  "origin": "61.144.173.127, 61.144.173.127",
  "url": "https://httpbin.org/delete"
}
HEAD response >>>> res4:
OPTIONS response >>>> res5:
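Note that res4.text and res5.text print empty above: HEAD and OPTIONS carry their information in the response headers, not the body. A small sketch of where to look instead (the exact header values are examples of what httpbin returns):

import requests

res4 = requests.head("http://httpbin.org/get")
print(res4.headers.get("Content-Type"))    # e.g. application/json
print(res4.headers.get("Content-Length"))  # size of the body a GET would return

res5 = requests.options("http://httpbin.org/get")
print(res5.headers.get("Allow"))           # e.g. HEAD, OPTIONS, GET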
3. Attributes and methods of requests
Installing requests: pip install requests
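A quick way to confirm the install worked (requests exposes its version string):

import requests
print(requests.__version__)  # e.g. 2.21.0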
import requests

######## 1. Basic GET requests
# Example 1: fetch the JD homepage
# url = "https://www.jd.com/"
# res1 = requests.get(url)
# with open("jd.html", "w", encoding="utf-8") as f:
#     f.write(res1.text)

######## 2. GET with query parameters and request headers
# Example 1: fetch a Baidu search page
# url2 = "https://image.baidu.com/"
# res2 = requests.get(url2,
#                     params={"wd": "刘传盛"},
#                     headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"})
# with open("baidu.html", "wb") as f:
#     f.write(res2.content)

# Example 2: fetch chouti.com
# url = "https://dig.chouti.com/"
# # The site filters on the User-Agent header, so a browser UA must be sent
# res = requests.get(url,
#                    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"})
# with open("chouti.html", "wb") as f:
#     f.write(res.content)

######## 3. Cookies
# import uuid  # generates a random uuid string
# url = 'http://httpbin.org/cookies'
# cookies = {"sbid": str(uuid.uuid4()), "a": "1"}
# res = requests.get(url, cookies=cookies)
# print(res.text)

######## 4. Session objects
# Without a session, cookies must be carried by hand
# (the /login/ and /index/ paths are only placeholders):
# res = requests.post("/login/")
# dic = {}
# requests.get("/index/", cookies=dic)
#
# A session stores cookies and sends them on later requests automatically:
# session = requests.session()
# session.post("/login/")
# session.get("/index/")

######## 5. POST requests
# res1 = requests.post(url="http://httpbin.org/post?a=1", data={"name": "yuan"})
# print(res1.text)
#
# res2 = requests.post(url="http://httpbin.org/post?a=1", data={"name": "alex"})
# print(res2.text)

######## 6. IP proxies
res = requests.get('http://httpbin.org/ip',
                   proxies={'http': 'http://111.177.177.87:9999'}).json()
print(res)
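The point of the session object above is that it stores cookies between requests automatically. A runnable sketch against httpbin (the sbid cookie name and value are arbitrary):

import requests

session = requests.session()
# httpbin sets the cookie, and the session remembers it...
session.get("http://httpbin.org/cookies/set/sbid/123")
# ...so the next request carries it without any manual cookie handling
print(session.get("http://httpbin.org/cookies").json())  # {'cookies': {'sbid': '123'}}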
4. requests methods for working with fetched data
import requests

######## 1. content and text
# response = requests.get("http://www.autohome.com.cn/beijing/")
# print(response.content)   # the raw bytes of the body
# print(response.encoding)  # the encoding requests guessed for the body
# print(response.text)      # the body decoded to str with that encoding
# Way 1: override the encoding, then write the decoded text
# response.encoding = "gbk"
# with open("autohome.html", "w", encoding="gbk") as f:
#     f.write(response.text)
# Way 2: write the raw bytes unchanged
# with open("autohome.html", "wb") as f:
#     f.write(response.content)

######## 2. Downloading images, audio, and video
# res = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1551598670426&di=cd0b0fe51a124afed16efad2269215ae&imgtype=0&src=http%3A%2F%2Fn.sinaimg.cn%2Fsinacn22%2F23%2Fw504h319%2F20180819%2Fb69e-hhxaafy7949630.jpg")
# with open("鞠.jpg", "wb") as f:
#     f.write(res.content)
#
# res1 = requests.get("http://y.syasn.com/p/p95.mp4")
# with open("xiao.mp4", "wb") as f:
#     for line in res1.iter_content():  # iterate over the body in chunks
#         f.write(line)

######## 3. JSON responses
# res = requests.get("http://httpbin.org/get")
# print(res.text)
# print(type(res.text))   # <class 'str'>
# import json
# print(json.loads(res.text))
# '''Prints:
# {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate',
#  'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'},
#  'origin': '61.144.173.127, 61.144.173.127', 'url': 'https://httpbin.org/get'}'''
# print(type(json.loads(res.text)))  # <class 'dict'>
# print("-----------")
# print(res.json())        # shortcut for json.loads(res.text)
# print(type(res.json()))

######## 4. Redirects
# res = requests.get("http://www.jd.com/")
# print(res.history)       # [<Response [302]>]
# print(res.text)
# print(res.status_code)   # 200
# res = requests.get("http://www.jd.com/", allow_redirects=False)  # do not follow redirects
# print(res.history)       # []
# print(res.status_code)   # 302
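One caveat on iter_content above: without stream=True, requests downloads the whole body into memory first, so chunked iteration saves nothing. A sketch of the streaming form (httpbin's /bytes endpoint stands in for a large file):

import requests

# stream=True defers the download so iter_content reads the body in chunks
res = requests.get("http://httpbin.org/bytes/102400", stream=True)
with open("data.bin", "wb") as f:
    for chunk in res.iter_content(chunk_size=1024):
        f.write(chunk)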
5. Case study: crawling the GitHub homepage
# The site has anti-crawling protection, so the page cannot be fetched
# directly: first fetch the /login page and extract the dynamic token value.
import requests
import re

session = requests.session()

# Login request: its purpose is to obtain the dynamic token value
res1 = session.get("https://github.com/login")
token = re.findall('<input type="hidden" name="authenticity_token" value="(.*?)" />', res1.text, re.S)[0]
print(token)

res2 = session.post("https://github.com/session", data={
    "commit": "Sign in",
    "utf8": "✓",
    "authenticity_token": token,
    "login": "yuanchenqi0316@163.com",
    "password": "yuanchenqi0316",
})

# res = requests.get("https://github.com/settings/emails")
with open("github.html", "wb") as f:
    f.write(res2.content)
print(res2.history)
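A quick way to confirm the login worked, continuing from the script above (and assuming the credentials are valid): a logged-in session can fetch pages that anonymous requests cannot, such as the commented-out settings URL.

res3 = session.get("https://github.com/settings/emails")
print(res3.status_code)  # 200 only if the session is actually logged in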
6. Key point: the request format
Request protocol format:
    request line
    request headers
    blank line
    request body

Key request header: Content-Type (sent browser ------------> server).

1. Only POST requests carry a request body.
2. A POST request can send data to the server in several forms:
   - an HTML form submission (urlencoded format), e.g. user=yuan, pwd=123
   - Ajax (urlencoded format), e.g. a=1, b=2

Sending urlencoded data, written out as a request protocol string:

    request line
    request headers
    Content-Type: application/x-www-form-urlencoded
    blank line
    request body: user=yuan&pwd=123   (urlencoded format)

Sending JSON data:

    request line
    request headers
    Content-Type: application/json
    blank line
    request body: {"user": "yuan", "pwd": 123}   (JSON format)
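requests picks these Content-Type values for you depending on whether the body is passed as data= or json=. A runnable sketch against httpbin, which echoes back what it received:

import requests

# data= encodes the dict as a urlencoded form body: user=yuan&pwd=123
res_form = requests.post("http://httpbin.org/post", data={"user": "yuan", "pwd": 123})
print(res_form.json()["headers"]["Content-Type"])  # application/x-www-form-urlencoded
print(res_form.json()["form"])                     # {'pwd': '123', 'user': 'yuan'}

# json= serializes the dict as a JSON body: {"user": "yuan", "pwd": 123}
res_json = requests.post("http://httpbin.org/post", json={"user": "yuan", "pwd": 123})
print(res_json.json()["headers"]["Content-Type"])  # application/json
print(res_json.json()["json"])                     # {'user': 'yuan', 'pwd': 123}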