Douban has always been a fairly crawler-friendly site, and simulating a login used to work without any trouble. But recently, when I wanted to scrape some data from Douban, I found my old code no longer worked: printing the page source showed that a captcha is now required.
So this is a sequel to the previous post. Compared with that version, the main changes are fetching the captcha image and a random string (the captcha-id), then adding both to the login payload. See the code below for details.
import re
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36',
    'Referer': 'https://www.douban.com/accounts/login?source=movie',
}
s = requests.Session()

# Fetch the login page; its source contains the captcha image URL
imgdata = s.get("https://www.douban.com/accounts/login?source=movie",
                headers=headers, verify=False).text
print(imgdata)
pa = re.compile(r'<img id="captcha_image" src="(.*?)" alt="captcha" class="captcha_image"/>')
img_url = re.findall(pa, imgdata)[0]
# print(img_url)

# Download the captcha image so it can be read by eye
picdata = s.get(img_url).content
with open("douban.jpg", 'wb') as f:
    f.write(picdata)

# The hidden captcha-id field must be posted back together with the solution
pa_id = re.compile(r'<input type="hidden" name="captcha-id" value="(.*?)"/>')
capid = re.findall(pa_id, imgdata)[0]
print(capid)

capimg = input("Enter the captcha: ")
payload = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "your email",
    "form_password": "your password",
    "captcha-solution": capimg,
    "captcha-id": capid,
    "login": "登录",
}

log_url = "https://accounts.douban.com/login"
data1 = s.post(log_url, data=payload, verify=False)  # verify=False skips SSL certificate checks
print(data1.status_code)

# Request a personal page through the same session to confirm the login stuck
data2 = s.get('https://www.douban.com/people/146448257/')
print(data2.status_code)
print(data2.text)
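One caveat: Douban only shows the captcha some of the time (e.g. after repeated login attempts), so `re.findall(...)[0]` will raise an `IndexError` on a captcha-free page. Below is a minimal sketch of a more defensive extraction; the sample HTML strings are made up to mirror the markup matched by the patterns above, not real Douban responses.

```python
import re

# Patterns matching the captcha markup used in the login script above
PA_IMG = re.compile(r'<img id="captcha_image" src="(.*?)" alt="captcha" class="captcha_image"/>')
PA_ID = re.compile(r'<input type="hidden" name="captcha-id" value="(.*?)"/>')

def extract_captcha(html):
    """Return (img_url, captcha_id), or (None, None) when the page has no captcha."""
    urls = PA_IMG.findall(html)
    ids = PA_ID.findall(html)
    if urls and ids:
        return urls[0], ids[0]
    return None, None

# Made-up snippet mimicking the login page structure (assumption, for illustration)
sample = ('<img id="captcha_image" src="https://www.douban.com/misc/captcha?id=abc123" '
          'alt="captcha" class="captcha_image"/>'
          '<input type="hidden" name="captcha-id" value="abc123"/>')
print(extract_captcha(sample))
# → ('https://www.douban.com/misc/captcha?id=abc123', 'abc123')
print(extract_captcha('<html>no captcha this time</html>'))
# → (None, None)
```

With a helper like this, the script can branch: post the captcha fields only when they are present, and otherwise submit email and password alone.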
That's about it. I'll stop here for today, it's almost daybreak... 2333