• An introduction to Python web crawlers


    1. What is a web crawler?

      A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

    2. Web crawling in Python

      Required third-party packages: requests and BeautifulSoup4

      pip install requests

      pip install BeautifulSoup4 

      Summary of commonly used methods:

    response = requests.get('URL')      # fetch a page
    response.text                       # body as text (str)
    response.content                    # body as raw bytes, e.g. an image
    response.encoding                   # encoding used to decode .text (settable)
    response.apparent_encoding          # encoding detected from the downloaded body
    response.status_code                # HTTP status code
    response.cookies.get_dict()         # cookies set by the server, as a dict
    requests.get('http://www.autohome.com.cn/news/', cookies={'xx': 'xxx'})  # send cookies along
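
      A minimal, runnable sketch tying these methods together (the URL is only an example; any page that returns HTML will do):

    import requests

    response = requests.get('https://www.autohome.com.cn/news/')
    response.encoding = response.apparent_encoding  # decode .text with the detected charset
    if response.status_code == 200:
        print(response.text[:200])          # first 200 characters of the page
        print(response.cookies.get_dict())  # cookies the server set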
      

      The beautifulsoup4 module

    soup = BeautifulSoup('htmlstr', features='html.parser')  # parse an HTML string
    v1 = soup.find('div')               # first <div>
    v1 = soup.find(id='i1')             # first tag with id="i1"
    v1 = soup.find('div', id='i1')      # first <div> with id="i1"
    
    v2 = soup.find_all('div')           # every <div>
    v2 = soup.find_all(id='i1')         # every tag with id="i1"
    v2 = soup.find_all('div', id='i1')  # every <div> with id="i1"
    v1.text    # text content (str)
    v1.attrs   # tag attributes as a dict
    # v2 is a list
    v2[0].attrs
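
      A small self-contained sketch of these lookups on an inline HTML string:

    from bs4 import BeautifulSoup

    html = '<div id="i1"><a href="/news">news</a></div><div>other</div>'
    soup = BeautifulSoup(html, features='html.parser')

    v1 = soup.find('div', id='i1')
    print(v1.text)   # -> news
    print(v1.attrs)  # -> {'id': 'i1'}

    v2 = soup.find_all('div')
    print(len(v2))      # -> 2
    print(v2[0].attrs)  # -> {'id': 'i1'}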
    

    3. A first demo

    import uuid

    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url='https://www.autohome.com.cn/news/')  # download the page
    response.encoding = response.apparent_encoding
    soup = BeautifulSoup(response.text, features='html.parser')  # create a BeautifulSoup object
    target = soup.find(id='auto-channel-lazyload-article')  # locate the news column
    # print(target)
    li_list = target.find_all('li')
    for i in li_list:
        a = i.find('a')
        if a:
            print(a.attrs.get('href'))
            title = a.find('h3').text  # headline text
            print(title)
            img_url = a.find('img').attrs.get('src')
            print(img_url)
    
            # the src is scheme-relative, so prepend https:
            img_response = requests.get(url='https:' + img_url)
            file_name = str(uuid.uuid4()) + '.jpg'  # random file name
            with open(file_name, 'wb') as f:
                f.write(img_response.content)

                               

    4. Logging in to Chouti and upvoting a post

    '''
    Chouti quirk: the cookie that gets authorized is not the one returned by
    the username/password login response, but the one returned by the very
    first GET; you send that first cookie along with the login request, and
    the server authorizes it.
    '''
    import requests
    
    
    # pretend to be a regular browser
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    # login form fields
    post_data = {
        'phone': '8615191481351',
        'password': '11111111',
        'oneMonth': 1
    }
    # step 1: plain GET; the cookie returned here is the one the server will authorize
    ret1 = requests.get(
        url='https://dig.chouti.com',
        headers=headers
    )
    cookie1 = ret1.cookies.get_dict()
    print(cookie1)
    
    # step 2: log in, sending the step-1 cookie so the server authorizes it
    ret2 = requests.post(
        url='https://dig.chouti.com/login',
        data=post_data,
        headers=headers,
        cookies=cookie1
    )
    cookie2 = ret2.cookies.get_dict()
    print(cookie2)
    
    # step 3: upvote, authenticated by the now-authorized gpsd cookie from step 1
    ret3 = requests.post(
        url='https://dig.chouti.com/link/vote?linksId=21910661',
        cookies={
            'gpsd': cookie1['gpsd']
            # 'gpsd': 'f59363bb59b30fe7126b38756c6e5680'
        },
        headers=headers
    )
    print(ret3.text)
    
    # step 4: cancel the upvote
    ret = requests.post(
        url='https://dig.chouti.com/vote/cancel/vote.do',
        cookies={
            'gpsd': cookie1['gpsd']
        },
        data={'linksId': 21910661},
        headers=headers
    )
    print(ret.text)
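
      For comparison, a minimal sketch of the same flow using requests.Session, which stores and resends cookies automatically (endpoints and form fields are copied from above; whether the site still accepts them is an assumption):

    import requests

    session = requests.Session()
    session.headers['user-agent'] = 'Mozilla/5.0'  # any browser-like UA
    session.get('https://dig.chouti.com')          # step 1: anonymous cookie lands in the session jar
    session.post(                                  # step 2: log in; the stored cookie gets authorized
        'https://dig.chouti.com/login',
        data={'phone': '86xxxxxxxxxxx', 'password': 'xxxx', 'oneMonth': 1}  # placeholder credentials
    )
    r = session.post('https://dig.chouti.com/link/vote?linksId=21910661')   # step 3: upvote
    print(r.text)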
    

      

    More on requests parameters: http://www.cnblogs.com/wupeiqi/articles/6283017.html

                     
