Web Crawlers Explained

Python Crawler Basics

I. Crawler overview

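1 Introduction to crawlers
2 urllib (built in, awkward to use) and requests: simulating HTTP requests
3 Parsing with BeautifulSoup and XPath
4 The selenium module: driving a browser
5 MongoDB
6 Anti-crawling countermeasures: proxy pools, cookie pools, request headers (user-agent, referer), JS reverse engineering
7 The scrapy crawler framework, the Django of the crawler world
8 Distributed crawling with scrapy-redis
9 Simulating website login
10 Crawling videos, KFC store addresses, and the text of Dream of the Red Chamber
11 Cracking CAPTCHAs (CAPTCHA-solving platform: 超级鹰, Chaojiying)
12 Cracking 12306
13 Using packet-capture tools (Fiddler, Charles)
14 Crawling and reverse engineering Android apps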

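1.1 Introduction to crawlers

1 Crawler: also called a web spider
2 What a crawler does: simulate a browser sending a request (requests, selenium) -> download the page source -> extract only the useful data (bs4, XPath, re) -> store it in a database or file (plain file, Excel, MySQL, Redis, MongoDB)
3 Sending the request: request URL (found with browser dev tools or a packet-capture tool), request headers (hard), request body (hard), request method
4 Receiving the response: read the response body (JSON, XML, HTML (bs4, XPath), or an encrypted unknown format that must be decrypted first)
5 Storage: MongoDB (suits JSON-formatted data)
6 Higher performance: multithreading, multiprocessing, coroutines
7 The scrapy framework takes care of performance for you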
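The request -> download -> extract -> store flow from item 2 can be sketched offline. This is only a minimal sketch, not a full crawler: the HTML string stands in for a page that requests would normally download, and the names SAMPLE_HTML, extract_links and to_json are made up for illustration.

```python
import json
import re

# Stand-in for a downloaded page body (hypothetical content).
SAMPLE_HTML = """
<ul>
  <li><a href="/video_1" class="item">First</a></li>
  <li><a href="/video_2" class="item">Second</a></li>
</ul>
"""


def extract_links(html):
    """Extract href/title pairs from the anchor tags with a regular expression."""
    pattern = r'<a href="(.*?)" class="item">(.*?)</a>'
    return [{"href": href, "title": title} for href, title in re.findall(pattern, html)]


def to_json(records):
    """Serialize the records; a real crawler might insert them into MongoDB instead."""
    return json.dumps(records, ensure_ascii=False)


records = extract_links(SAMPLE_HTML)
print(to_json(records))
```

In a real crawler the first step would be something like requests.get(url).text, and the last step a database insert; the extract step in the middle stays the same shape.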

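1.2 Python's built-in modules for crawlers

For simple crawlers, Python (Python 3 here) has good third-party libraries that are easy to pick up. The standard-library modules below are still useful alongside them.

Python standard library: the logging module
The logging module can replace print: it writes output to a log file for later inspection, and it can partly substitute for a debugger.
The re module
Regular expressions
The sys module
System-related functionality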
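As a rough illustration of logging replacing print, here is a minimal sketch. The logger name "crawler" and the helper make_logger are assumptions for the example; it writes to an in-memory buffer so it can run anywhere.

```python
import io
import logging


def make_logger(name, stream):
    """Build a logger that writes 'LEVEL name: message' records to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records out of the root logger's handlers
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger


# In-memory stand-in; use logging.FileHandler('crawler.log') instead of
# StreamHandler to keep the records in a log file.
buffer = io.StringIO()
log = make_logger("crawler", buffer)
log.info("fetching %s", "http://127.0.0.1:8000/index/")
log.warning("retrying after status %d", 503)
print(buffer.getvalue())
```

Unlike print, each record carries a level, so noisy debug output can be switched off later by raising the logger's level instead of deleting lines.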

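II. Using the requests module

2.1 Sending GET/POST requests

Request headers:

When simulating a GET request to a website, you usually add a few fields to the request headers so the request looks more like one from a real browser.

'user-agent': identifies the operating system and browser that sent the request. Most sites check it. For example:

'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',

'referer': the URL of the page the request came from. Sites check it mainly to prevent image hotlinking, where every image on another site's pages is a link pointing at this site's server. The value is normally a URL on the same site.

For example: 'https://www.mzitu.com/225078/2'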

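cookies: because this header is used so often, requests accepts it directly as cookies={'key': 'value'}.

import requests
header = {'user-agent': 'Mozilla/5.0'}  # shortened here; a fuller header dict is shown further down
res = requests.get('http://127.0.0.1:8000/index/', headers=header, cookies={'key': 'asdfasdf'})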
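The referer check described above can be pictured from the server's side. This is a hedged sketch of one plausible rule, not how any particular site implements it; the function name is_allowed_referer and the example.com hosts are hypothetical.

```python
from urllib.parse import urlparse


def is_allowed_referer(referer, allowed_host):
    """Return True when the referer's host is the allowed host or one of its subdomains."""
    if not referer:
        return False  # this sketch rejects requests that send no referer at all
    host = urlparse(referer).hostname or ""
    return host == allowed_host or host.endswith("." + allowed_host)


print(is_allowed_referer("https://www.example.com/gallery/2", "example.com"))
print(is_allowed_referer("https://other-site.test/page", "example.com"))
print(is_allowed_referer("", "example.com"))
```

This is why a crawler usually sets referer to a page on the target site itself: the request then passes a check of this kind.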


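Carrying cookies automatically
import requests
session = requests.session()    # cookies set by any response are sent automatically on later requests
res = session.post('http://127.0.0.1:8000/index/')  # assume this request logs in
res1 = session.get('http://127.0.0.1:8000/order/')  # no need to pass cookies by hand; the session handles it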
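To see what the session attaches without needing a running server, the request can be prepared but not sent. A minimal sketch, assuming a cookie named sessionid (a made-up name) has already been stored by an earlier login response:

```python
import requests

session = requests.Session()
# Simulate what a login response would have stored in the session's cookie jar.
session.cookies.set("sessionid", "abc123")

# Build the next request through the session, but do not send it.
request = requests.Request("GET", "http://127.0.0.1:8000/order/")
prepared = session.prepare_request(request)

# The session merged its stored cookie into the outgoing headers.
print(prepared.headers["Cookie"])
```

session.get() does the same preparation internally before sending, which is why the second request in the example above needs no cookies argument.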

These fields are usually put into the request headers as key-value pairs:

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'referer': 'https://www.mzitu.com/225078/2'
}
res = requests.get('https://www.mzitu.com/', headers=header)

print(res.text)     # body as text
print(res.content)  # body as raw bytes

# Write the downloaded content to a file
with open('a.jpg', 'wb') as f:
    for chunk in res.iter_content(chunk_size=1024):
        f.write(chunk)

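Passing query parameters:

There are two ways to pass query parameters: 1. percent-encode them yourself and append them to the URL; 2. pass them through params={}.

1. Appending to the URL
from urllib.parse import quote, unquote
print(quote('美女'))  # encode
print(unquote('%E7%BE%8E%E5%A5%B3'))  # decode
'https://www.baidu.com/s?wd=%E7%BE%8E%E5%A5%B3'

2. Passing params={} (wd is the short name of the search term)
res = requests.get('https://www.baidu.com/s', headers=header, params={'wd': '美女'})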
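For the first approach, urllib.parse.urlencode takes a dict and produces the whole query string in one step, which is less error-prone than encoding each value by hand. A small sketch; the helper name build_url is made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_url(base, params):
    """Append a percent-encoded query string built from a dict of parameters."""
    return base + "?" + urlencode(params)


url = build_url("https://www.baidu.com/s", {"wd": "美女"})
print(url)

# Round trip: parse the query string back into a dict of lists.
print(parse_qs(urlsplit(url).query))
```

Note that urlencode expects a mapping or a sequence of pairs; a bare string has to go through quote instead.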

POST request body parameters:

# Send a POST request with a body: data= sends it urlencoded, json= sends it as JSON
res = requests.post('http://127.0.0.1:8000/index/', data={'name': 'lqz'})
print(res.text)

res = requests.post('http://127.0.0.1:8000/index/', json={'age': 1})
print(res.text)
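The difference between data= and json= shows up in the Content-Type header and the encoded body. The sketch below prepares both requests without sending them, so no server is needed; the variable names are only for illustration.

```python
import json

import requests

url = "http://127.0.0.1:8000/index/"

# Prepare (but do not send) one form-encoded and one JSON request.
form_request = requests.Request("POST", url, data={"name": "lqz"}).prepare()
json_request = requests.Request("POST", url, json={"age": 1}).prepare()

print(form_request.headers["Content-Type"], form_request.body)
print(json_request.headers["Content-Type"], json.loads(json_request.body))
```

A backend that reads form fields will not see values sent with json=, and the reverse, so the choice has to match what the server expects.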

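2.2 The Response object

response = requests.post('http://127.0.0.1:8000/index/', data={'name': 'lqz'})


print(response.text)     # response body as text
print(response.content)  # response body as raw bytes

print(response.status_code)  # HTTP status code
print(response.headers)      # response headers
print(response.cookies)      # cookies set by the response
print(response.cookies.get_dict())  # cookies as a dict
print(response.cookies.items())     # cookies as (name, value) pairs

print(response.url)      # final URL of the response, after any redirects
print(response.history)  # list of the redirect responses that led here; [] if there were none

print(response.encoding)  # encoding used to decode response.text

# Images, videos and other large files: read the body piece by piece
with open('a.jpg', 'wb') as f:
    for chunk in response.iter_content(chunk_size=1024):
        f.write(chunk)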

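2.3 Important methods

Fixing garbled text

res = requests.get('http://www.autohome.com/news')
# If the printed text comes out garbled:
# Option 1: set the encoding by hand
res.encoding = 'gb2312'
# Option 2: let requests guess the encoding from the content
res.encoding = res.apparent_encoding
print(res.text)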
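Why the text comes out garbled can be reproduced with plain bytes. A minimal sketch, assuming the server sent GB2312 bytes while the client decoded them as ISO-8859-1 (which, as far as I know, is the fallback requests uses for text responses that declare no charset); the sample string is made up.

```python
text = "汽车之家"

raw = text.encode("gb2312")        # bytes as a GB2312 server would send them
wrong = raw.decode("iso-8859-1")   # decoding with the wrong charset gives garbage
right = raw.decode("gb2312")       # decoding with the right charset restores the text

print(repr(wrong))
print(right)
```

Setting res.encoding has the same effect as the last line: the raw bytes in res.content stay the same, and only the charset used to produce res.text changes.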

Parsing JSON

response = requests.post('http://127.0.0.1:8000/index/', data={'name': 'lqz'})

print(type(response.json()))  # parses the JSON body into a Python object (a dict for a JSON object)
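The type that comes back depends on the top-level JSON value, which is easy to check with the standard json module. A small sketch with made-up response bodies:

```python
import json

# Hypothetical response bodies: one JSON object and one JSON array.
body_object = '{"name": "lqz", "age": 1}'
body_array = '[{"id": 1}, {"id": 2}]'

parsed_object = json.loads(body_object)
parsed_array = json.loads(body_array)

print(type(parsed_object).__name__, parsed_object["name"])
print(type(parsed_array).__name__, len(parsed_array))
```

So code that indexes the result by key should first be sure the endpoint returns an object and not an array.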

Advanced usage: SSL

import requests
response = requests.get('https://www.12306.cn', verify=False)  # skip certificate verification (an InsecureRequestWarning is emitted)
print(response.status_code)

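Using a proxy

# Proxy pool: keep a list of proxy IPs and pick one at random for each request, so a single IP is less likely to get banned
# High-anonymity vs transparent proxies: behind a high-anonymity proxy the server does not see your IP; behind a transparent proxy it can
# How the server gets your IP behind a transparent proxy: it reads the X-Forwarded-For header that the proxy adds
# The proxies key must match the scheme of the target URL, so an https URL needs an 'https' entry
response = requests.get('https://www.baidu.com/', proxies={'http': 'http://27.46.20.226:8888', 'https': 'http://27.46.20.226:8888'})
print(response.text)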
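The proxy pool idea from the comments can be sketched as a random pick from a list. The addresses below are placeholders, not working proxies, and the helper name pick_proxies is an assumption for the example.

```python
import random

# Hypothetical pool; a real one would be filled from a proxy provider and health-checked.
PROXY_POOL = [
    "http://10.0.0.1:8888",
    "http://10.0.0.2:8888",
    "http://10.0.0.3:8888",
]


def pick_proxies(pool):
    """Pick one proxy at random and return it in the dict shape requests expects."""
    address = random.choice(pool)
    # Cover both schemes so the proxy is used for http and https targets alike.
    return {"http": address, "https": address}


proxies = pick_proxies(PROXY_POOL)
print(proxies)
```

The result would be passed as requests.get(url, proxies=proxies), calling pick_proxies again for each new request.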

Timeout settings

import requests
response = requests.get('https://www.baidu.com',
                        timeout=0.0001)  # in seconds; deliberately tiny here so the request times out

Exception handling

import requests
from requests.exceptions import ReadTimeout, RequestException  # see requests.exceptions for all exception types

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:
    print('read timed out')
except RequestException as e:
    print(e)
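A common next step after catching a timeout is to retry a few times before giving up. The sketch below is one possible shape, not a requests feature: fetch_with_retry and flaky_fetch are made-up names, and the fake fetch function simulates two timeouts so the example runs offline.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.0, retry_on=(TimeoutError,)):
    """Call fetch(url), retrying on the given exception types up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except retry_on as exc:
            last_error = exc
            time.sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error


calls = []


def flaky_fetch(url):
    """Stand-in for a network call: fails twice, then succeeds."""
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"


result = fetch_with_retry(flaky_fetch, "http://127.0.0.1:8000/index/")
print(result, len(calls))
```

With requests, fetch could be a small function wrapping requests.get, and retry_on could be (requests.exceptions.Timeout,).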

Uploading files

# Crawlers rarely need this feature; it is more commonly used by SDKs

with open('a.jpg', 'rb') as f:
    res = requests.post('http://127.0.0.1:8000/index/', files={'myfile': f})
print(res.text)

2.4 Hands-on: crawling videos

import requests
import re

header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    'referer': 'https://www.pearvideo.com/',
}

# Fetch one page of the category listing
response1 = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=8&start=0', headers=header)
# print(response1.text)
# Each listing entry links to a video detail page
re_video = '<a href="(.*?)" class="vervideo-lilink actplay">'
video_url = re.findall(re_video, response1.text)
# print(video_url)

for i in video_url:
    url = 'https://www.pearvideo.com/' + i
    response2 = requests.get(url)
    # The detail page carries the mp4 address in the srcUrl field
    re_mp4 = 'hdflvUrl="",sdflvUrl="",hdUrl="",sdUrl="",ldUrl="",srcUrl="(.*?)",'
    mp4_url = re.findall(re_mp4, response2.text)[0]

    name = mp4_url.rsplit('/', 1)[-1]  # use the last path segment as the file name
    print(name)
    video_content = requests.get(mp4_url, stream=True)
    print(video_content)
    # Open the file once; reopening it with 'wb' for every chunk would overwrite the earlier chunks
    with open(name, 'wb') as f1:
        for chunk in video_content.iter_content(chunk_size=1024 * 1024):
            f1.write(chunk)
    print('ok')
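The file-writing step of the download loop can be tried out without a network connection. This is a minimal sketch: save_chunks is a made-up helper, and the list of byte strings stands in for what iter_content() would yield.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to one file and return the number of bytes written."""
    written = 0
    with open(path, "wb") as f:  # opened once, so every chunk is appended in order
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Stand-in for video_content.iter_content(...); the file name is hypothetical.
fake_chunks = [b"abc", b"", b"defg"]
target = os.path.join(tempfile.mkdtemp(), "demo.mp4")
size = save_chunks(fake_chunks, target)
print(size, os.path.getsize(target))
```

Comparing the returned byte count with the Content-Length response header, when the server sends one, is a cheap way to detect a truncated download.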

Original article: https://www.cnblogs.com/Franciszw/p/13424458.html