• An introduction to Python web crawlers


    1. What is a web crawler?

      A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

    2. Web crawling in Python

      Required third-party packages: requests and BeautifulSoup4

      pip install requests

      pip install BeautifulSoup4 

      Summary of commonly used methods:

    response = requests.get('URL')      # fetch a page
    response.text                       # body as text (str)
    response.content                    # body as raw bytes, e.g. an image
    response.encoding                   # encoding used to decode .text (settable)
    response.apparent_encoding          # encoding detected from the downloaded body
    response.status_code                # HTTP status code
    response.cookies.get_dict()         # cookies set by the server, as a dict
    requests.get('http://www.autohome.com.cn/news/', cookies={'xx': 'xxx'})  # send cookies along
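
      A minimal, runnable sketch tying these methods together (the URL is only an example; any page that returns HTML will do):

    import requests

    response = requests.get('https://www.autohome.com.cn/news/')
    response.encoding = response.apparent_encoding  # decode .text with the detected charset
    if response.status_code == 200:
        print(response.text[:200])          # first 200 characters of the page
        print(response.cookies.get_dict())  # cookies the server set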
      

      The beautifulsoup4 module

    soup = BeautifulSoup('htmlstr', features='html.parser')  # parse an HTML string
    v1 = soup.find('div')               # first <div>
    v1 = soup.find(id='i1')             # first tag with id="i1"
    v1 = soup.find('div', id='i1')      # first <div> with id="i1"
    
    v2 = soup.find_all('div')           # every <div>
    v2 = soup.find_all(id='i1')         # every tag with id="i1"
    v2 = soup.find_all('div', id='i1')  # every <div> with id="i1"
    v1.text    # text content (str)
    v1.attrs   # tag attributes as a dict
    # v2 is a list
    v2[0].attrs
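
      A small self-contained sketch of these lookups on an inline HTML string:

    from bs4 import BeautifulSoup

    html = '<div id="i1"><a href="/news">news</a></div><div>other</div>'
    soup = BeautifulSoup(html, features='html.parser')

    v1 = soup.find('div', id='i1')
    print(v1.text)   # -> news
    print(v1.attrs)  # -> {'id': 'i1'}

    v2 = soup.find_all('div')
    print(len(v2))      # -> 2
    print(v2[0].attrs)  # -> {'id': 'i1'}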
    

    3. A first demo

    import uuid

    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url='https://www.autohome.com.cn/news/')  # download the page
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, features='html.parser')  # create a BeautifulSoup object
    target = soup.find(id='auto-channel-lazyload-article')  # locate the news column
    # print(target)
    li_list = target.find_all('li')
    for i in li_list:
        a = i.find('a')
        if a:
            print(a.attrs.get('href'))
            title = a.find('h3').text  # headline text
            print(title)
            img_url = a.find('img').attrs.get('src')
            print(img_url)
    
            # the src is scheme-relative, so prepend https:
            img_response = requests.get(url='https:' + img_url)
            file_name = str(uuid.uuid4()) + '.jpg'  # random file name
            with open(file_name, 'wb') as f:
                f.write(img_response.content)

                               

    4. Logging in to Chouti and upvoting a post

    '''
    Chouti quirk: the cookie that gets authorized is not the one returned by
    the username/password login response, but the one returned by the very
    first GET; you send that first cookie along with the login request, and
    the server authorizes it.
    '''
    import requests
    
    
    # pretend to be a regular browser
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    # login form fields
    post_data = {
        'phone': '8615191481351',
        'password': '11111111',
        'oneMonth': 1
    }
    # step 1: plain GET; the cookie returned here is the one the server will authorize
    ret1 = requests.get(
        url='https://dig.chouti.com',
        headers=headers
    )
    cookie1 = ret1.cookies.get_dict()
    print(cookie1)
    
    # step 2: log in, sending the step-1 cookie so the server authorizes it
    ret2 = requests.post(
        url='https://dig.chouti.com/login',
        data=post_data,
        headers=headers,
        cookies=cookie1
    )
    cookie2 = ret2.cookies.get_dict()
    print(cookie2)
    
    # step 3: upvote, authenticated by the now-authorized gpsd cookie from step 1
    ret3 = requests.post(
        url='https://dig.chouti.com/link/vote?linksId=21910661',
        cookies={
            'gpsd': cookie1['gpsd']
            # 'gpsd': 'f59363bb59b30fe7126b38756c6e5680'
        },
        headers=headers
    )
    print(ret3.text)
    
    # step 4: cancel the upvote
    ret = requests.post(
        url='https://dig.chouti.com/vote/cancel/vote.do',
        cookies={
            'gpsd': cookie1['gpsd']
        },
        data={'linksId': 21910661},
        headers=headers
    )
    print(ret.text)
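
      For comparison, a minimal sketch of the same flow using requests.Session, which stores and resends cookies automatically (endpoints and form fields are copied from above; whether the site still accepts them is an assumption):

    import requests

    session = requests.Session()
    session.headers['user-agent'] = 'Mozilla/5.0'  # any browser-like UA
    session.get('https://dig.chouti.com')          # step 1: anonymous cookie lands in the session jar
    session.post(                                  # step 2: log in; the stored cookie gets authorized
        'https://dig.chouti.com/login',
        data={'phone': '86xxxxxxxxxxx', 'password': 'xxxx', 'oneMonth': 1}  # placeholder credentials
    )
    r = session.post('https://dig.chouti.com/link/vote?linksId=21910661')   # step 3: upvote
    print(r.text)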
    

      

    More on requests parameters: http://www.cnblogs.com/wupeiqi/articles/6283017.html

                     
