• The Website is the API (1)


    Requests: automatically fetches HTML pages and submits network requests

    robots: the Robots Exclusion Standard for web crawlers

    Beautiful Soup: parses HTML pages

    Hands-on practice

    Re: regular expressions in detail, for extracting key information from pages

    The Scrapy* framework

    Week 1: Rules

    Unit 1: Getting Started with the Requests Library

    1. Installation

    Run the command prompt as administrator

    and enter: pip install requests

    Verify the installation:

    >>> import requests
    >>> r = requests.get("http://www.baidu.com")
    >>> r.status_code
    200

    requests.request(): constructs a request; the base method that all the others are built on

    requests.get(): the main method for fetching an HTML page; corresponds to HTTP GET

    requests.get(url, params=None, **kwargs)

    url: the URL of the page to fetch

    params: extra parameters appended to the url, as a dictionary or byte stream; optional

    **kwargs: 12 optional parameters that control access
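As a sketch of what `params` does: Requests URL-encodes the dictionary and appends it to the query string. The same encoding can be reproduced with the standard library (the URL and keys below are placeholders for illustration):

```python
from urllib.parse import urlencode

# A params dictionary, as accepted by requests.get(url, params=...)
params = {'key1': 'value1', 'key2': 'value2'}
base_url = 'http://httpbin.org/get'  # placeholder URL

# Requests performs essentially this encoding before sending the request
full_url = base_url + '?' + urlencode(params)
print(full_url)  # http://httpbin.org/get?key1=value1&key2=value2
```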

    Attributes of the Response object

    r.status_code: the HTTP status of the request; 200 means the connection succeeded, 404 (or any other code) means failure

    r.text: the response body as a string, i.e. the page content at the url

    r.encoding: the response encoding guessed from the HTTP headers

    r.apparent_encoding: the response encoding inferred from the content itself

    r.content: the response body in binary form
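The relationship between these attributes is that `r.text` is essentially `r.content` decoded with `r.encoding`. A minimal offline sketch, using a bytes literal in place of a real HTTP response body:

```python
# Stand-ins for a real Response: content is the binary body (r.content),
# encoding is what r.encoding / r.apparent_encoding would report.
content = '百度一下'.encode('utf-8')
encoding = 'utf-8'

# This is essentially how r.text is produced from r.content
text = content.decode(encoding)
print(text)
```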

    A general-purpose code framework:

    >>> import requests
    >>> def getHTMLText(url):
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # raise HTTPError if the status is not 200
            r.encoding = r.apparent_encoding
            return r.text
        except:
            return "An exception occurred"

    >>> if __name__ == "__main__":
            url = "www.baidu.com"
            print(getHTMLText(url))

    
    


    An exception occurred
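The run above falls into the exception branch because "www.baidu.com" has no "http://" scheme, so Requests raises an exception (MissingSchema) before any network traffic happens. A runnable restatement of the framework that demonstrates this without needing a connection:

```python
import requests

def getHTMLText(url):
    """The general-purpose fetch framework from this section."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise HTTPError if status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        return "An exception occurred"

# No http:// scheme, so requests raises MissingSchema immediately --
# which is why the original run printed the exception message, not the page.
print(getHTMLText("www.baidu.com"))
```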

     

    requests.head(): fetches the page headers; HTTP HEAD

    requests.post(): submits a POST request to an HTML page; HTTP POST

    requests.put(): submits a PUT request; HTTP PUT

    requests.patch(): submits a partial-modification request; HTTP PATCH

    requests.delete(): submits a delete request; HTTP DELETE

    requests.request(method, url, **kwargs)

    method: the request method, one of the seven verbs GET/HEAD/POST/PUT/PATCH/DELETE/OPTIONS

    r = requests.request('GET',url,**kwargs)

    r = requests.request('HEAD',url,**kwargs)

    r = requests.request('POST',url,**kwargs)

    r = requests.request('PUT',url,**kwargs)

    r = requests.request('PATCH',url,**kwargs)

    r = requests.request('DELETE',url,**kwargs)

    r = requests.request('OPTIONS',url,**kwargs)

    **kwargs: optional parameters that control access

    params: dictionary or byte sequence, appended to the url as query parameters

    data: dictionary, byte sequence, or file object, sent as the body of the Request

    json: data in JSON format, sent as the body of the Request

    headers: dictionary of custom HTTP headers
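The `headers` keyword installs custom HTTP headers on the outgoing request. A sketch that builds (but does not send) a request so the installed headers can be inspected; the URL is a placeholder:

```python
import requests

# Prepare a request without sending it, to inspect what `headers` installs
kv = {'user-agent': 'Mozilla/5.0'}
req = requests.Request('GET', 'http://httpbin.org/get', headers=kv).prepare()
print(req.headers['user-agent'])  # Mozilla/5.0
```

Overriding `user-agent` this way is what lets the Amazon example later in this section get past a block on the default Requests identifier.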

    Example robots.txt file: https://www.baidu.com/robots.txt

    Crawling examples with the Requests library

    >>> import requests
    >>> url = "https://item.jd.com/2967929.html"
    >>> try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[:1000])
    except:
        print("Crawl failed")
    
        
    <!DOCTYPE HTML>
    <html lang="zh-CN">
    <head>
        <!-- shouji -->
        <meta http-equiv="Content-Type" content="text/html; charset=gbk" />
        <title>【华为荣耀8】荣耀8 4GB+64GB 全网通4G手机 魅海蓝【行情 报价 价格 评测】-京东</title>
        <meta name="keywords" content="HUAWEI荣耀8,华为荣耀8,华为荣耀8报价,HUAWEI荣耀8报价"/>
        <meta name="description" content="【华为荣耀8】京东JD.COM提供华为荣耀8正品行货,并包括HUAWEI荣耀8网购指南,以及华为荣耀8图片、荣耀8参数、荣耀8评论、荣耀8心得、荣耀8技巧等信息,网购华为荣耀8上京东,放心又轻松" />
        <meta name="format-detection" content="telephone=no">
        <meta http-equiv="mobile-agent" content="format=xhtml; url=//item.m.jd.com/product/2967929.html">
        <meta http-equiv="mobile-agent" content="format=html5; url=//item.m.jd.com/product/2967929.html">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
        <link rel="canonical" href="//item.jd.com/2967929.html"/>
            <link rel="dns-prefetch" href="//misc.360buyimg.com"/>
        <link rel="dns-prefetch" href="//static.360buyimg.com"/>
        <link rel="dns-prefetch" href="//img10.360buyimg.com"/>
        <link rel="dns
    >>> import requests
    >>> url = "https://www.amazon.cn/gp/product/B01MBL5Z3Y"
    >>> try:
        kv = {'user-agent':'Mozilla/5.0'}
        r = requests.get(url,headers = kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[1000:2000])
    except:
        print("Fail")
    
        
           ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
            ue_sn = "opfcaptcha.amazon.cn",
            ue_id = 'HB12BAYVB85FMA4VRS38';
    }
    </script>
    </head>
    <body>
    
    <!--
            To discuss automated access to Amazon data please contact api-services-support@amazon.com.
            For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com.cn/index.html/ref=rm_c_sv, or our Product Advertising API at https://associates.amazon.cn/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
    -->
    
    <!--
    Correios.DoNotSend
    -->
    
    <div class="a-container a-padding-double-large" style="min-350px;padding:44px 0 !important">
    
        <div class="a-row a-spacing-double-large" style=" 350px; margin: 0 auto">
    
            <div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>
    
            <div class="a-box a-alert a-alert-info a-spacing-base">
                <div class="a-box-inner">

     Submitting a search keyword to Baidu/360 search

    import requests
    keyword = 'Python'
    try:
        kv = {'q':keyword}
        r = requests.get("http://www.so.com/s",params = kv)
        print(r.request.url)
        r.raise_for_status()
        print(len(r.text))
    except:
        print("Crawl failed")
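The `r.request.url` printed above shows the query string that `params` produced. The same URL can be checked without sending anything, by preparing the request:

```python
import requests

# Build the same request as above without sending it, to see the final URL
req = requests.Request('GET', "http://www.so.com/s",
                       params={'q': 'Python'}).prepare()
print(req.url)  # http://www.so.com/s?q=Python
```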

    Image download

    import requests
    import os
    url = "http://wx1.sinaimg.cn/mw600/0076BSS5ly1g6hmmj82tpj30u018wdos.jpg"
    root = "E://pics//"
    path = root + url.split('/')[-1]
    try:
        if not os.path.exists(root):
            os.mkdir(root)
        if not os.path.exists(path):
            r = requests.get(url)
            with open(path, 'wb') as f:
                f.write(r.content)
            print("File saved successfully")
        else:
            print("File already exists")
    except:
        print("Crawl failed")
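The save path above comes from `url.split('/')[-1]`, which takes everything after the last slash as the filename. An offline sketch of that step, with `os.path.join` as a more portable alternative to string concatenation (the directory name is arbitrary):

```python
import os

url = "http://wx1.sinaimg.cn/mw600/0076BSS5ly1g6hmmj82tpj30u018wdos.jpg"
root = "E:/pics"  # any download directory

# Everything after the last '/' becomes the local filename
filename = url.split('/')[-1]
path = os.path.join(root, filename)  # portable path building
print(filename)
```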

    IP address lookup

    import requests
    url = "http://m.ip138.com/ip.asp?ip="
    try:
        r = requests.get(url+'202.204.80.112')
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        print(r.text[-300:])
    except:
        print("Crawl failed")
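Here the query is built by concatenating the IP address onto the URL by hand. Passing it through `params` produces the same request URL, which can be verified by preparing the request without sending it:

```python
import requests

# Same URL as the hand-concatenated version, built via `params`
req = requests.Request('GET', "http://m.ip138.com/ip.asp",
                       params={'ip': '202.204.80.112'}).prepare()
print(req.url)  # http://m.ip138.com/ip.asp?ip=202.204.80.112
```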
• Original article: https://www.cnblogs.com/kmxojer/p/11260085.html