• Crawler Basics


    Web crawlers

    A web crawler (also known as a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to a set of rules.

    HTTP methods for interacting with a server:

    GET    retrieves a resource's data without adding or modifying anything

    POST   submits data to a resource on the server, usually via a form

    PUT    creates or replaces a resource

    DELETE removes a resource
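With the requests library (installed below with pip install requests), each verb maps to a call of the same name. A minimal sketch, using httpbin.org only as a placeholder URL; the requests are built with PreparedRequest but never sent, so it runs without network access:

```python
import requests

# Build (without sending) one request per verb to show how the
# library maps HTTP methods to calls. httpbin.org is a placeholder.
put_req = requests.Request('PUT', 'http://httpbin.org/put',
                           data={'name': 'demo'}).prepare()
del_req = requests.Request('DELETE', 'http://httpbin.org/delete').prepare()

print(put_req.method, put_req.url)   # PUT http://httpbin.org/put
print(del_req.method, del_req.url)   # DELETE http://httpbin.org/delete
print(put_req.body)                  # name=demo (form-encoded, like a POST)
```

In real code you would simply call requests.put(...) or requests.delete(...), which build and send the request in one step.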

    The Requests module. Install with: pip install requests

    import requests

    1. GET requests

    import requests
    params = {'key1':'hello','key2':'world'}
    url = 'http://www.baidu.com'
    r = requests.get(url=url,params=params)
    print(r.url)

    Output:

    http://www.baidu.com/?key1=hello&key2=world

    2. POST requests

    import requests
    params = {'key1':'hello','key2':'world'}
    r = requests.post("http://httpbin.org/post", data=params)
    print(r.text)

    Output:

    {
      "args": {}, 
      "data": "", 
      "files": {}, 
      "form": {
        "key1": "hello", 
        "key2": "world"
      }, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Connection": "close", 
        "Content-Length": "21", 
        "Content-Type": "application/x-www-form-urlencoded", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.18.4"
      }, 
      "json": null, 
      "origin": "113.116.146.147", 
      "url": "http://httpbin.org/post"
    }

    3. The HTTP response

    import requests
    url = "http://qiushibaike.com/"
    r = requests.get(url=url)
    print(r.encoding)
    print(type(r.text))
    print(type(r.content))

    Output:

    UTF-8
    <class 'str'>
    <class 'bytes'>

    What is the difference between text and content in Requests?

    r.text returns str data; use it for textual content.

    r.content returns bytes; use it for images and other binary files.
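The relationship between the two can be shown without any network traffic by building a Response by hand (an illustrative trick only; in real code this object comes back from requests.get):

```python
import requests

# Hand-built Response standing in for a real server reply.
resp = requests.models.Response()
resp._content = '你好, crawler'.encode('utf-8')  # raw bytes "from the wire"
resp.encoding = 'utf-8'                          # normally set from headers

print(type(resp.content))   # <class 'bytes'>: write these to a binary file
print(type(resp.text))      # <class 'str'>: content decoded with resp.encoding
print(resp.text)            # 你好, crawler
```

In other words, r.text is just r.content decoded using r.encoding, which is why getting the encoding right matters for text but is irrelevant when saving images.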

    4. Other commonly used attributes

    import requests
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36'}
    r = requests.get('https://www.qiushibaike.com/', headers=header)
    # print(r.text)
    print(r.request)
    print(r.headers)
    print(r.cookies)
    print(r.url)
    print(r.status_code)

    Output:

    <PreparedRequest [GET]>
    {'Server': 'openresty', 'Date': 'Sun, 21 Jan 2018 01:08:11 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '18094', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Set-Cookie': '_xsrf=2|acc1cc58|fb495aec5628f018bc13a85be6a76a81|1516496891; Path=/', 'Vary': 'User-Agent, Accept-Encoding', 'Etag': '"560c073021ccc9e765bb6f4e4b4182594d4664ec"'}
    <RequestsCookieJar[<Cookie _xsrf=2|acc1cc58|fb495aec5628f018bc13a85be6a76a81|1516496891 for www.qiushibaike.com/>]>
    https://www.qiushibaike.com/
    200
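status_code is usually the first thing a crawler checks before parsing. A common pattern is raise_for_status(), sketched here on a hand-built Response so it runs offline; real code checks the object returned by requests.get:

```python
import requests

# Hand-built Response standing in for a failed server reply.
resp = requests.models.Response()
resp.status_code = 404
resp.url = 'http://example.com/missing'  # hypothetical URL for the demo

try:
    resp.raise_for_status()          # raises requests.HTTPError for 4xx/5xx
except requests.HTTPError as exc:
    print('request failed:', exc)
```

For a successful response (like the 200 above), raise_for_status() returns quietly and parsing can proceed.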

    The Requests session object

    s = requests.Session()

    (requests.session() is a lowercase alias that returns the same object; both spellings work in Python 2 and Python 3.)

    All state for the session (cookies, headers, and so on) is kept on s; from then on you simply make requests through s:

    s.get(url)
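A sketch of what the session carries along. prepare_request shows the outgoing request without any network traffic; the User-Agent string and cookie values below are made up for the demo:

```python
import requests

s = requests.Session()
# Anything set on the session is reused by every request it sends.
s.headers.update({'User-Agent': 'my-crawler/0.1'})   # made-up UA string
s.cookies.set('token', 'abc123')                     # made-up cookie

# prepare_request merges session state into a request without sending it.
req = s.prepare_request(requests.Request('GET', 'http://httpbin.org/cookies'))
print(req.headers['User-Agent'])   # my-crawler/0.1
print(req.headers['Cookie'])       # token=abc123
```

This is why sessions are convenient for crawling: cookies set by a login response are automatically sent on every later s.get(...), with no manual header handling.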

  • Original article: https://www.cnblogs.com/pythonlx/p/8323461.html