Luffy Academy (路飞学城) - Python Web Scraping Intensive Training - Chapter 1

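I have formally started learning web scraping. Python is a language you fall in love with on first contact. Luffy's courses are really good: instead of just handing you conclusions, the lectures show you the method and process of thinking.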

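Chapter 1 covered how to scrape Autohome (汽车之家), and how to log in to Chouti (抽屉) and upvote posts. The script below logs in to GitHub and prints the account's display name and avatar URL.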

import requests
from bs4 import BeautifulSoup


# Log in to GitHub and print the account's display name and avatar URL
def get_info(user_name, user_password):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'}
    try:
        # Fetch the login page to obtain the initial cookies and the authenticity token
        ret = requests.get(
            url='https://github.com/login',
            headers=headers
        )
        cookie_dict1 = ret.cookies.get_dict()
        res = BeautifulSoup(ret.text, 'html.parser')
        token = res.find(name='input', attrs={'type': 'hidden', 'name': 'authenticity_token'}).get('value')
        # Submit the login form, sending the token and the initial cookies back
        ret2 = requests.post(
            url='https://github.com/session',
            data={
                'commit': 'Sign in',
                'utf8': '✓',
                'authenticity_token': token,
                'login': user_name,
                'password': user_password
            },
            headers=headers,
            cookies=cookie_dict1
        )
        cookie_dict2 = ret2.cookies.get_dict()
        # Request the profile page with the cookies returned after login
        ret3 = requests.get(
            url='https://github.com/'+user_name,
            headers=headers,
            cookies=cookie_dict2
        )
        res3 = BeautifulSoup(ret3.text, 'html.parser')
        div = res3.find(name='div', attrs={'id': 'js-pjax-container'})
        h1 = div.find(name='h1', attrs={'class': 'vcard-names'})
        name = h1.find(name='span', attrs={'class': 'p-name vcard-fullname d-block overflow-hidden'}).get_text()
        print('The user name is:%s' % name)
        a = div.find(name='a', attrs={'class': 'u-photo d-block tooltipped tooltipped-s'})
        img = a.find(name='img', attrs={'class': 'avatar width-full rounded-2'}).get('src')
        print('The user photo address is:%s' % img)
    except Exception as exc:
        # Network errors and missing page elements (find() returning None) both end up here
        print('Sorry, cannot get the information: %s' % exc)


if __name__ == '__main__':
    user_name = input('Please input your login name:').strip()
    user_password = input('Please input your password:').strip()
    get_info(user_name, user_password)

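1. At its core, a web crawler is code that forges the requests a browser would send.

2. How convincing a forged HTTP request looks depends on:

-- Request headers:

   -- user-agent: indicates what device (browser) the user is visiting with

   -- cookie: a marker stored in the user's browser

-- Request body, in one of two formats:

   -- name=xxxx&age=18

   -- {'name': 'xxxx', 'age': 18}

If the data does not come back, the problem is in either the request headers or the request body.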
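
To make the headers and the two body formats concrete, here is a minimal sketch using only the standard library. The endpoint `https://example.com/login`, the payload, and the cookie value are placeholders invented for illustration. The request object is built but never sent, so the pieces can be inspected offline. With `requests`, passing a dict as `data=` produces the first body format and passing it as `json=` produces the second.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical payload and endpoint, used only for illustration.
payload = {'name': 'xxxx', 'age': 18}
url = 'https://example.com/login'

# Body format 1: URL-encoded form data, as in name=xxxx&age=18.
form_body = urlencode(payload)

# Body format 2: the same fields serialized as a JSON document.
json_body = json.dumps(payload)

# Headers a browser would normally send; the cookie value is a placeholder.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36',
    'Cookie': 'sessionid=placeholder',
    'Content-Type': 'application/x-www-form-urlencoded',
}

# Build the request object without sending it, so nothing touches the network.
req = Request(url, data=form_body.encode('utf-8'), headers=headers, method='POST')

print(form_body)
print(json_body)
print(req.get_method(), req.full_url)
```

When a server rejects a forged request, comparing these three pieces (headers, body format, body fields) against what the browser's developer tools show is usually the first thing to check.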

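3. Analyzing HTTP requests

-- Check the XHR (Ajax) requests first

-- See whether each request is a GET or a POST

-- For a POST, look at the format of the request data, and watch for tokens and cookies

-- Use regular expressions or BS4 (BeautifulSoup) selectors to parse out the data you want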
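
As a sketch of the regular-expression option, the snippet below pulls a hidden token out of a login form. The HTML fragment and the helper name `extract_token` are made up for illustration, and the pattern assumes the `name` attribute appears before `value` inside the tag, which a real page may not guarantee. With BeautifulSoup, the counterpart would be a CSS selector such as `input[name="authenticity_token"]` passed to `select_one`.

```python
import re

# Hypothetical HTML fragment standing in for a real login page.
html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123==" />
  <input type="text" name="login" />
</form>
'''

# Capture the value attribute of the hidden token input.
# Assumption: name="..." comes before value="..." within the same tag.
TOKEN_RE = re.compile(r'<input[^>]*name="authenticity_token"[^>]*value="([^"]*)"')


def extract_token(page):
    """Return the token value, or None when the field is missing."""
    match = TOKEN_RE.search(page)
    return match.group(1) if match else None


token = extract_token(html)
print(token)
```

Returning `None` for a missing field lets the caller decide how to react, rather than failing with an attribute error deep inside the parsing code.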


Original post: https://www.cnblogs.com/shajing/p/9265846.html