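Crawling an Entire Website [Advanced Web Scraping Notes]

From scraping one page to scraping all the data

First, a quick look at the general workflow of a static-page crawler.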

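How the data is loaded
Clicking through to page 2 shows that ?start=25 has been appended to the URL.
This part is called the query string. It carries search parameters or data to be processed to the server, in the form ?key1=value1&key2=value2.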
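To make the key=value structure concrete, here is a minimal sketch that pulls the query string out of the page-2 URL using Python's standard urllib.parse module. The variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://book.douban.com/top250?start=25"

# Split the URL into its components; .query holds everything after the "?".
parts = urlsplit(url)
print(parts.query)  # start=25

# parse_qs turns "key1=value1&key2=value2" into a dict of lists,
# because the same key may legally appear more than once.
params = parse_qs(parts.query)
print(params)  # {'start': ['25']}

# Values arrive as strings, so convert before doing arithmetic.
start = int(params["start"][0])
```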
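Let's page through Douban Books a few more times and watch how the URL changes:
The pattern is easy to spot: page 2 has start=25, page 3 has start=50, page 10 has start=225, and each page lists 25 books.
So start = 25 * (page number - 1), where 25 is the number of items shown per page.
A few lines of code can generate every page URL we need: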

url = 'https://book.douban.com/top250?start={}'
# num starts at 0, so there is no need to subtract 1
urls = [url.format(num * 25) for num in range(10)]
print(urls)
# Output (wrapped here for readability):
# [
#   'https://book.douban.com/top250?start=0',
#   'https://book.douban.com/top250?start=25',
#   'https://book.douban.com/top250?start=50',
#   'https://book.douban.com/top250?start=75',
#   'https://book.douban.com/top250?start=100',
#   'https://book.douban.com/top250?start=125',
#   'https://book.douban.com/top250?start=150',
#   'https://book.douban.com/top250?start=175',
#   'https://book.douban.com/top250?start=200',
#   'https://book.douban.com/top250?start=225'
# ]

Generating all the required page URLs
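The formula start = 25 * (page number - 1) can also be applied directly to a 1-based page number. The sketch below is one possible way to do that; the names BASE_URL, PAGE_SIZE, and page_url are made up for this example, and the query string is built with the standard library's urlencode rather than str.format.

```python
from urllib.parse import urlencode

BASE_URL = "https://book.douban.com/top250"
PAGE_SIZE = 25  # items shown per page, as observed while paging through the site


def page_url(page, page_size=PAGE_SIZE):
    """Return the URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    start = page_size * (page - 1)
    return BASE_URL + "?" + urlencode({"start": start})


print(page_url(1))   # https://book.douban.com/top250?start=0
print(page_url(10))  # https://book.douban.com/top250?start=225
```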

With every page URL in hand, we can crawl the data for the whole site:

import time

import requests
from bs4 import BeautifulSoup

# Wrap the code that fetches Douban Books data in a function
def get_douban_books(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Each book's title link sits inside a <div class="pl2">
    items = soup.find_all('div', class_='pl2')
    for item in items:
        tag = item.find('a')
        name = tag['title']
        link = tag['href']
        print(name, link)

url = 'https://book.douban.com/top250?start={}'
urls = [url.format(num * 25) for num in range(10)]
for page in urls:
    get_douban_books(page)
    # Pause for 1 second so we are not blocked for requesting too fast
    time.sleep(1)

Crawling the whole site

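Anti-crawling: limiting frequent, abnormal browsing

Whether the client is a browser or a crawler, every visit to a website carries some identifying information. That information lives in the request headers.
The server uses the request headers to work out who is visiting. There are many header fields, but for now we only need user-agent. It contains the operating system, browser type, version, and similar details, so by changing it we can pass the crawler off as a browser.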
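To see why an unmodified script is easy to spot, the sketch below inspects the User-Agent that Python's standard urllib would announce by default, then builds a request carrying a browser-style value instead. It uses urllib only so that the headers can be examined without sending anything; requests behaves similarly and identifies itself as python-requests unless the header is overridden.

```python
import urllib.request

# An opener that has not been customised announces itself as Python.
default_opener = urllib.request.build_opener()
default_headers = dict(default_opener.addheaders)
print(default_headers["User-agent"])  # something like "Python-urllib/3.x"

# Build (but do not send) a request with a browser-style User-Agent.
browser_ua = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
)
req = urllib.request.Request(
    "https://book.douban.com/top250",
    headers={"User-Agent": browser_ua},
)

# urllib stores header names capitalised as "User-agent" internally.
print(req.get_header("User-agent"))
```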

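Official requests documentation: http://cn.python-requests.org/zh_CN/latest/

Checking the visitor's identity is the simplest anti-crawling measure, and a single line of code that disguises the crawler as a browser gets past it easily. For that reason, many sites also apply IP limits to stop overly frequent access.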

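An IP address (Internet Protocol address) is a numerical label assigned to a device that uses the Internet Protocol to communicate on a network. Think of it as a house number: once I know your house number, I can find your home.
When crawling large amounts of data, hitting the target site without restraint can overload it, and small personal sites with few anti-crawling measures may go down as a result. Larger sites generally limit how often you can access them, since no ordinary visitor loads pages dozens or hundreds of times within one second.
time.sleep() is commonly used to lower the request rate.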
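As an illustration of rate limiting with time.sleep(), the sketch below pauses a random interval between requests instead of a fixed one. The function name crawl_politely and its parameters are hypothetical, and the fetch and sleep functions are passed in so the demo runs instantly without touching the network.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function and a fake sleep that only records
# the requested pauses, so nothing is downloaded and nothing blocks.
pauses = []
pages = crawl_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: url.upper(),
    sleep=pauses.append,
)
print(pages)        # ['PAGE-1', 'PAGE-2', 'PAGE-3']
print(len(pauses))  # 2
```

In a real crawl, fetch would be the function that downloads and parses one page, and the default sleep=time.sleep would be left in place.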
Proxies are another way to deal with IP limits: the site is accessed through a different IP.
Official documentation: https://cn.python-requests.org/zh_CN/latest/user/advanced.html#proxies

import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}

requests.get("http://example.org", proxies=proxies)

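Crawling large amounts of data requires many IPs to switch between. So we build an IP proxy pool (a list) and pick one entry at random each time to pass as the proxies argument.

Here is how that looks: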

import random

import requests
from bs4 import BeautifulSoup

def get_douban_books(url, proxies):
    headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
    }
    # Fetch the page through the given proxy
    res = requests.get(url, proxies=proxies, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    items = soup.find_all('div', class_='pl2')
    for item in items:
        tag = item.find('a')
        name = tag['title']
        link = tag['href']
        print(name, link)

url = 'https://book.douban.com/top250?start={}'
urls = [url.format(num * 25) for num in range(10)]
# IP proxy pool (made-up addresses that will not actually work)
proxies_list = [
    {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    },
    {
        "http": "http://10.10.1.11:3128",
        "https": "http://10.10.1.11:1080",
    },
    {
        "http": "http://10.10.1.12:3128",
        "https": "http://10.10.1.12:1080",
    }
]
for page in urls:
    # Pick one entry at random from the IP proxy pool
    proxies = random.choice(proxies_list)
    get_douban_books(page, proxies)

Implementing the proxy pool
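Proxies in a pool often stop working, so one possible refinement is to fall back to another proxy when a request fails. The sketch below is an assumption about how that could be done, not part of the original notes: fetch_with_proxy_pool and fake_fetch are hypothetical names, the addresses are the same placeholder values as above, and the network call is replaced by a stand-in so the example runs offline.

```python
import random


def fetch_with_proxy_pool(url, proxy_pool, fetch, max_attempts=3):
    """Try up to max_attempts different proxies; return the first success."""
    candidates = list(proxy_pool)
    random.shuffle(candidates)  # random order, and no proxy is tried twice
    last_error = None
    for proxies in candidates[:max_attempts]:
        try:
            return fetch(url, proxies)
        except Exception as error:  # a real crawler would catch narrower errors
            last_error = error
    raise RuntimeError(f"all proxies failed for {url}") from last_error


proxy_pool = [
    {"http": "http://10.10.1.10:3128", "https": "http://10.10.1.10:1080"},
    {"http": "http://10.10.1.11:3128", "https": "http://10.10.1.11:1080"},
    {"http": "http://10.10.1.12:3128", "https": "http://10.10.1.12:1080"},
]


def fake_fetch(url, proxies):
    """Stand-in for a real download; pretends the first proxy is dead."""
    if proxies["http"] == "http://10.10.1.10:3128":
        raise ConnectionError("proxy unreachable")
    return f"fetched {url} via {proxies['http']}"


result = fetch_with_proxy_pool(
    "https://book.douban.com/top250?start=0", proxy_pool, fake_fetch
)
print(result)
```

With requests, the fetch argument could be a small function that calls requests.get(url, proxies=proxies, timeout=10) and returns the response.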

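The crawler's gentlemen's agreement: robots.txt

robots.txt is a text file placed in a site's root directory. It tells crawlers which content on the site should not be crawled and which may be.

To view it, just append /robots.txt to the site's domain name.

For example, the robots.txt for Douban Books is at https://book.douban.com/robots.txt. Its contents look like this: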

User-agent: *
Disallow: /subject_search
Disallow: /search
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/

User-agent: Wandoujia Spider
Disallow: /

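User-agent: * applies to all crawlers (* is a wildcard), and the lines that follow are the rules that any crawler matching that user-agent should obey. For example, Disallow: /search means that paths beginning with /search must not be crawled, and the other rules work the same way.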
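These rules can also be checked programmatically. The sketch below feeds the rules shown above to Python's standard urllib.robotparser and asks whether a given crawler may fetch a given URL. The crawler name my-crawler is made up for the example, and the rules are parsed from a string so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /subject_search
Disallow: /search
Disallow: /new_subject
Disallow: /service/iframe
Disallow: /j/

User-agent: Wandoujia Spider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://book.douban.com"
# /top250 is not covered by any Disallow rule for ordinary crawlers.
print(parser.can_fetch("my-crawler", base + "/top250"))           # True
# /search is disallowed for every crawler.
print(parser.can_fetch("my-crawler", base + "/search?q=python"))  # False
# Wandoujia Spider is barred from the whole site.
print(parser.can_fetch("Wandoujia Spider", base + "/top250"))     # False
```

To check a live site instead, RobotFileParser also offers set_url() followed by read(), which downloads the file before can_fetch() is called.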

Original article: https://www.cnblogs.com/Vowzhou/p/15971237.html
