• 1: requests crawler basics


1. What is a crawler?

    Description: a crawler is essentially an automated program that simulates a browser, sends requests to a server, and retrieves the response resources.

      The basic crawler workflow: send a request, receive the response, parse the content, store the data.

      The robots.txt protocol

    A site publishes a robots.txt file to tell crawler programs which paths they may and may not fetch; a well-behaved crawler checks it before scraping.
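A minimal sketch of checking robots.txt before crawling, using only the standard library's `urllib.robotparser`. The rules and URLs below are made-up examples; a real crawler would fetch the live `https://<site>/robots.txt` with `rp.set_url(...)` and `rp.read()`.

    ```python
    # Parse an in-memory robots.txt and ask whether a crawler may fetch a path.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /private/",
    ]

    rp = RobotFileParser()
    rp.parse(rules)  # normally: rp.set_url("https://example.com/robots.txt"); rp.read()

    print(rp.can_fetch("my-crawler", "https://example.com/index.html"))   # True
    print(rp.can_fetch("my-crawler", "https://example.com/private/data")) # False
    ```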

2. The HTTP protocol

import requests

    '''1. GET: the most common method. It simply asks the server for a
    resource, which is returned to the client as a set of HTTP headers
    plus an entity body (HTML text, an image, a video, etc.). A GET
    request never carries a request body.'''
    res = requests.get("http://httpbin.org/get")
    print("GET request >>>> res:", res.text)

    '''2. POST sends data to the server to create a new resource, much
    like a database INSERT. Nearly all form submissions today use POST.
    It carries a request body.'''
    res1 = requests.post("http://httpbin.org/post")
    print("POST request >>>> res1:", res1.text)

    '''3. PUT also sends data to the server, but to modify an existing
    resource, like a database UPDATE; it does not create new kinds of
    resources. PUT is idempotent: no matter how many times the same PUT
    is repeated, the result is the same.'''
    res2 = requests.put("http://httpbin.org/put")
    print("PUT request >>>> res2:", res2.text)

    '''4. DELETE removes the specified resource.'''
    res3 = requests.delete("http://httpbin.org/delete")
    print("DELETE request >>>> res3:", res3.text)

    '''5. HEAD is often overlooked, but it provides a lot of useful
    information, especially over slow or limited bandwidth. Main uses:
    1. request only the headers of a resource;
    2. check whether a hyperlink is valid;
    3. check whether a page has been modified;
    4. commonly used by robots to fetch page metadata or RSS feed info,
       or to pass authentication information.'''
    res4 = requests.head("http://httpbin.org/get")
    print("HEAD request >>>> res4:", res4.text)

    '''6. OPTIONS asks the server which methods and headers a given
    resource supports; it is purely informational.'''
    res5 = requests.options("http://httpbin.org/get")
    print("OPTIONS request >>>> res5:", res5.text)
HTTP request methods and their differences
    D:\python3.6\python.exe F:/爬虫/request请求方式.py
GET request >>>> res: {
      "args": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.21.0"
      }, 
      "origin": "61.144.173.127, 61.144.173.127", 
      "url": "https://httpbin.org/get"
    }
    
POST request >>>> res1: {
      "args": {}, 
      "data": "", 
      "files": {}, 
      "form": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Content-Length": "0", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.21.0"
      }, 
      "json": null, 
      "origin": "61.144.173.127, 61.144.173.127", 
      "url": "https://httpbin.org/post"
    }
    
PUT request >>>> res2: {
      "args": {}, 
      "data": "", 
      "files": {}, 
      "form": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Content-Length": "0", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.21.0"
      }, 
      "json": null, 
      "origin": "61.144.173.127, 61.144.173.127", 
      "url": "https://httpbin.org/put"
    }
    
DELETE request >>>> res3: {
      "args": {}, 
      "data": "", 
      "files": {}, 
      "form": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.21.0"
      }, 
      "json": null, 
      "origin": "61.144.173.127, 61.144.173.127", 
      "url": "https://httpbin.org/delete"
    }
    
HEAD request >>>> res4: 
    OPTIONS request >>>> res5:
    Printed output of the script above

3. requests attributes and methods

    requests official documentation

      Installing requests:   pip install requests

import requests
    ######## 1. Basic GET request
    # Example: fetch the JD homepage
    # url = "https://www.jd.com/"
    # res1 = requests.get(url)
    # with open("jd.html", "w", encoding="utf-8") as f:
    #     f.write(res1.text)
    ######## 2. GET with query parameters and request headers
    # Example 1: fetch a Baidu search page (params is appended to the URL as the query string)
    # url2 = "https://image.baidu.com/"
    # res2 = requests.get(url2,
    #                     params={"wd": "刘传盛"},
    #                     headers={
    #                         "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    #                     })
    # with open("baidu.html", "wb") as f:
    #     f.write(res2.content)
    # Example 2: fetch chouti.com
    # url = "https://dig.chouti.com/"  # the site blocks non-browser User-Agents, so the header must be supplied
    # res = requests.get(url,
    #                    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"})
    # with open("chouti.html", "wb") as f:
    #     f.write(res.content)
    ######## 3. The cookies parameter
    # import uuid  # generates a random UUID string
    # import requests
    # url = 'http://httpbin.org/cookies'
    # cookies = {"sbid": str(uuid.uuid4()), "a": "1"}
    # res = requests.get(url, cookies=cookies)
    # print(res.text)
    ######## 4. The session object
    # Without a session, cookies must be carried by hand (illustrative URLs only):
    # res = requests.post("http://example.com/login/")
    # dic = {}  # cookies copied out of the login response
    # requests.get("http://example.com/index/", cookies=dic)
    #
    # A session stores cookies and sends them on every later request automatically:
    # session = requests.session()
    # session.post("http://example.com/login/")
    # session.get("http://example.com/index/")
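The cookie persistence a Session provides can be seen without any network traffic: headers and cookies stored on the session are merged into every request it prepares. The URL and cookie values below are placeholders, not from the original post.

    ```python
    # Offline sketch: prepare (but do not send) a request through a Session
    # and inspect what the session merged into it.
    import requests

    s = requests.Session()
    s.headers.update({"User-Agent": "my-crawler/1.0"})
    s.cookies.set("sid", "abc123")  # e.g. a cookie a login response would set

    prepared = s.prepare_request(requests.Request("GET", "http://example.com/index/"))

    print(prepared.headers["User-Agent"])   # my-crawler/1.0
    print(prepared.headers.get("Cookie"))   # sid=abc123
    ```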
    
######## 5. POST request (data goes in the request body; ?a=1 stays in the query string)
    # res1 = requests.post(url="http://httpbin.org/post?a=1", data={
    #     "name": "yuan"
    # })
    # print(res1.text)
    #
    # res2 = requests.post(url="http://httpbin.org/post?a=1", data={
    #     "name": "alex"
    # })
    # print(res2.text)
    ######## 6. Proxies (the proxy IP below is a sample from the original post and may no longer be live)
    res = requests.get('http://httpbin.org/ip', proxies={'http': '111.177.177.87:9999'}).json()
    print(res)
    requests attributes

4. requests methods for fetching data

import requests
    ######## 1. content and text
    # response = requests.get("http://www.autohome.com.cn/beijing/")
    # Saving the page, method 1: decoded text
    # print(response.content)   # raw bytes
    # print(response.encoding)  # the encoding requests guessed for the response
    # response.encoding = "gbk" # this site is GBK-encoded; set this before reading .text
    # print(response.text)      # body decoded to str
    # with open("autohome.html", "w", encoding="gbk") as f:
    #     f.write(response.text)
    # Saving the page, method 2: raw bytes (no decoding needed)
    # with open("autohome.html", "wb") as f:
    #     f.write(response.content)
    ######## 2. Downloading images, audio and video
    # res = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1551598670426&di=cd0b0fe51a124afed16efad2269215ae&imgtype=0&src=http%3A%2F%2Fn.sinaimg.cn%2Fsinacn22%2F23%2Fw504h319%2F20180819%2Fb69e-hhxaafy7949630.jpg")
    # with open("鞠.jpg", "wb") as f:
    #     f.write(res.content)
    #
    # res1 = requests.get("http://y.syasn.com/p/p95.mp4")
    # with open("xiao.mp4", "wb") as f:
    #     for chunk in res1.iter_content(chunk_size=1024):  # stream in chunks instead of holding the whole file in memory
    #         f.write(chunk)
    ######## 3. JSON responses
    # res = requests.get("http://httpbin.org/get")
    # print(res.text)
    # print(type(res.text))  # <class 'str'>
    # import json
    # print(json.loads(res.text))
    # '''prints {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding':
    #  'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.21.0'},
    #   'origin': '61.144.173.127, 61.144.173.127', 'url': 'https://httpbin.org/get'}'''
    # print(type(json.loads(res.text)))   # <class 'dict'>
    # print("-----------")
    # print(res.json())       # shortcut: equivalent to json.loads(res.text)
    # print(type(res.json()))
    ######## 4. Redirects
    # res = requests.get("http://www.jd.com/")
    # print(res.history)      # [<Response [302]>] -- the redirect hop
    # print(res.text)
    # print(res.status_code)  # 200
    # res = requests.get("http://www.jd.com/", allow_redirects=False)  # do not follow redirects
    # print(res.history)      # []
    # print(res.status_code)  # 302
    requests methods for fetching data

5. Case study: logging in to GitHub and fetching the landing page

# GitHub protects its login form with a CSRF token, so you cannot post
    # credentials directly: first GET the /login page and extract the token.
    import requests
    import re

    session = requests.session()

    # Step 1: GET the login page to obtain the dynamic authenticity_token
    res1 = session.get("https://github.com/login")
    token = re.findall('<input type="hidden" name="authenticity_token" value="(.*?)" />', res1.text, re.S)[0]
    print(token)

    # Step 2: POST the credentials together with the token
    res2 = session.post("https://github.com/session", data={
            "commit": "Sign in",
            "utf8": "",
            "authenticity_token": token,
            "login": "yuanchenqi0316@163.com",
            "password": "yuanchenqi0316"
    })

    # res = requests.get("https://github.com/settings/emails")

    with open("github.html", "wb") as f:
        f.write(res2.content)

    print(res2.history)
    github案例.py

 6. Key point: request format

    HTTP request format:
              request line
              request headers
              blank line
              request body

        Request headers:
              Content-Type

              browser ------------------> server
              1. Applies to POST requests (only POST-style requests carry a request body).
              2. Forms in which data can be POSTed to the server:
                    HTML form (urlencoded format)
                        user=yuan
                        pwd=123
                    Ajax (urlencoded format)
                         a=1
                         b=2

                  Request format as a string
                  Sending urlencoded data:
                  '''
                      request line
                      request headers
                      Content-Type: application/x-www-form-urlencoded
                      blank line
                      request body  #  user=yuan&pwd=123   (urlencoded format)
                  '''
                  Sending JSON data:
                   '''
                      request line
                      request headers
                      Content-Type: application/json
                      blank line
                      request body  #  {"user": "yuan", "pwd": 123}   (JSON format)
                   '''
Content-Type
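The two body formats above correspond to the `data=` and `json=` arguments of requests; the encodings themselves can be sketched offline with the standard library:

    ```python
    # Offline sketch of the two request-body encodings described above.
    import json
    from urllib.parse import urlencode

    payload = {"user": "yuan", "pwd": 123}

    # Form/Ajax style -- what goes in the body with
    # Content-Type: application/x-www-form-urlencoded
    print(urlencode(payload))   # user=yuan&pwd=123

    # JSON style -- what goes in the body with
    # Content-Type: application/json
    print(json.dumps(payload))  # {"user": "yuan", "pwd": 123}
    ```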
  • Original post: https://www.cnblogs.com/liucsxiaoxiaobai/p/10465585.html