• Python Study Notes (49): The Self-Cultivation of a Crawler (1)


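    On the Self-Cultivation of a Crawler

    • General format of a URL (parts in square brackets [] are optional):

      protocol://hostname[:port]/path/[;parameters][?query][#fragment]

    • A URL consists of three parts:

      • The first part is the protocol: http, https, ftp, file, ed2k, ...

      • The second part is the domain name or IP address of the server that hosts the resource (sometimes with a port number; every transfer protocol has a default port, e.g. 80 for http)

      • The third part is the specific location of the resource, such as a directory or file name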
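
    The components of the URL format above can be inspected with urllib.parse.urlparse from the standard library. A minimal sketch follows; the example URL is made up purely to exercise every component and does not come from the original notes.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen only so that every optional part is present
example_url = 'http://www.example.com:8080/docs/index.html;v=1?lang=en&page=2#intro'

parts = urlparse(example_url)

print(parts.scheme)    # protocol
print(parts.hostname)  # host name without the port
print(parts.port)      # port as an int, or None when absent
print(parts.path)      # path to the resource
print(parts.params)    # ;parameters attached to the last path segment
print(parts.query)     # ?query
print(parts.fragment)  # #fragment
```

    When the port is omitted, parts.port is None and the protocol's default port applies.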

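    import urllib.request

    # Open the URL; the returned object behaves like a file
    response = urllib.request.urlopen("http://www.fishc.com")
    # read() returns the page as raw bytes
    html = response.read()
    # Decode the bytes into text (this assumes the page is UTF-8 encoded)
    html = html.decode('utf-8')
    print(html)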
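
    Hard-coding 'utf-8' fails on pages served in another encoding. One possible refinement, sketched below, is to read the charset from the Content-Type header and fall back to UTF-8. The helper name decode_body is hypothetical, and the hand-built Message object only stands in for response.headers so the example runs offline.

```python
from email.message import Message


def decode_body(raw, headers, default='utf-8'):
    """Decode raw bytes using the charset declared in the headers, if any."""
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors='replace')


# Offline demonstration: fake headers standing in for response.headers
fake_headers = Message()
fake_headers['Content-Type'] = 'text/html; charset=gbk'

text = decode_body('你好'.encode('gbk'), fake_headers)
print(text)
```

    With a live response, the call would look like decode_body(response.read(), response.headers).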

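    2. Downloading an image from a website

    import urllib.request

    # Alternative: build a Request object first, then open it
    # req = urllib.request.Request('http://placekitten.com/g/600/600')
    # response = urllib.request.urlopen(req)
    response = urllib.request.urlopen('http://placekitten.com/g/500/600')
    cat_img = response.read()

    # Write the raw bytes to disk in binary mode
    with open('cat_500_600.jpg', 'wb') as f:
        f.write(cat_img)

    print(response.geturl())   # URL actually retrieved (after any redirects)
    print(response.info())     # response headers
    print(response.getcode())  # HTTP status code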
    • GET: request data from the server

    • POST: submit data to the server
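
    In urllib.request the choice between GET and POST follows from whether a data argument is supplied. The sketch below only builds Request objects and never sends them; the URL is a placeholder assumed for illustration.

```python
import urllib.parse
import urllib.request

url = 'http://www.example.com/form'  # placeholder URL; no request is sent

# Without data, the request defaults to GET
get_req = urllib.request.Request(url)

# With data (which must be bytes), the request defaults to POST
form = urllib.parse.urlencode({'q': 'crawler'}).encode('utf-8')
post_req = urllib.request.Request(url, data=form)

print(get_req.get_method())
print(post_req.get_method())
```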

    import json
    import urllib.parse
    import urllib.request

    content = input('Enter the text to translate: ')

    url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc'
    data = {}
    data['i'] = content
    data['type'] = 'AUTO'
    data['doctype'] = 'json'
    data['xmlVersion'] = '1.8'
    data['keyfrom'] = 'fanyi.web'
    data['ue'] = 'UTF-8'
    data['action'] = 'FY_BY_ENTER'
    data['typoResult'] = 'true'
    # POST data must be bytes: URL-encode the form fields, then encode as UTF-8
    data = urllib.parse.urlencode(data).encode('utf-8')

    # Supplying data makes urlopen send a POST request
    response = urllib.request.urlopen(url, data)
    html = response.read().decode('utf-8')

    print(html)

    """
    target = json.loads(html)
    print('Translation: %s' % (target['translateResult'][0][0]['tgt']))
    """
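
    To make the two data-handling steps concrete without contacting the server, the sketch below shows what urlencode produces for a small form and how a JSON reply is indexed. The sample reply string is hypothetical: it is shaped only to match the target['translateResult'][0][0]['tgt'] lookup used in these notes, and the real service's response may differ.

```python
import json
import urllib.parse

# A reduced form; the field names are taken from the example above
form = {'i': 'hello world', 'doctype': 'json', 'ue': 'UTF-8'}
body = urllib.parse.urlencode(form).encode('utf-8')
print(body)  # b'i=hello+world&doctype=json&ue=UTF-8'

# Hypothetical reply text, not captured from the real service
sample_reply = '{"translateResult": [[{"tgt": "你好世界"}]]}'
target = json.loads(sample_reply)
print(target['translateResult'][0][0]['tgt'])
```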
    • Hiding

    # Hide the crawler by disguising it as a browser
    import json
    import urllib.parse
    import urllib.request

    content = input('Enter the text to translate: ')
    url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc'

    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'
    }

    data = {}
    data['i'] = content
    data['type'] = 'AUTO'
    data['doctype'] = 'json'
    data['xmlVersion'] = '1.8'
    data['keyfrom'] = 'fanyi.web'
    data['ue'] = 'UTF-8'
    data['typoResult'] = 'true'
    data = urllib.parse.urlencode(data).encode('utf-8')

    req = urllib.request.Request(url, data, head)
    # The line above can be replaced by these two lines:
    # req = urllib.request.Request(url, data)
    # req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36')
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    print(html)

    target = json.loads(html)
    target = target['translateResult'][0][0]['tgt']
    print('Translation: %s' % (target))
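
    The two ways of setting the User-Agent (a headers dict passed to Request, or add_header afterwards) can be compared offline. This is a minimal sketch with a placeholder URL and no network access. One detail worth knowing: Request stores header names in capitalized form, so the lookup key is 'User-agent'.

```python
import urllib.request

url = 'http://www.example.com/'  # placeholder URL; no request is sent
ua = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36')

# Option 1: pass the headers when constructing the Request
req1 = urllib.request.Request(url, headers={'User-Agent': ua})

# Option 2: add the header after construction
req2 = urllib.request.Request(url)
req2.add_header('User-Agent', ua)

# Header names are normalized with str.capitalize(), hence 'User-agent'
print(req1.get_header('User-agent'))
print(req1.header_items() == req2.header_items())
```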
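    • Proxies

    Steps:
    1. Create a ProxyHandler. Its argument is a dict {'scheme': 'proxy_ip:port'} (scheme: http, ftp, etc.)

      proxy_support = urllib.request.ProxyHandler({})

    2. Build a customized opener

      opener = urllib.request.build_opener(proxy_support)

    3. Install the opener (from then on, urlopen automatically uses the customized opener)

      urllib.request.install_opener(opener)

    4. Call the opener directly (this works whether or not the opener has been installed)

      opener.open(url)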
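
    Steps 1 and 2 can be wrapped in a small helper and checked without any network traffic. In the sketch below, the helper name make_proxy_opener and the proxy address 127.0.0.1:8080 are assumptions made for illustration; nothing is installed globally and no URL is opened.

```python
import urllib.request


def make_proxy_opener(proxy_addr, user_agent=None):
    """Build an opener that routes http requests through proxy_addr."""
    handler = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(handler)
    if user_agent:
        # Replace the default 'Python-urllib/x.y' User-Agent
        opener.addheaders = [('User-Agent', user_agent)]
    return opener


opener = make_proxy_opener('127.0.0.1:8080')  # placeholder proxy address

# Our handler replaces the default ProxyHandler, so exactly one is present
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

    Calling opener.open(url) directly keeps the proxy local to this opener, while install_opener(opener) changes the behavior of every later urlopen call in the process.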

    import random
    import urllib.request

    url = 'http://www.whatismyip.com.tw'

    iplist = ['121.40.199.105:80', '121.40.213.161:80', '121.196.226.246:84', '182.89.185.242:80']

    # The argument is a dict {'scheme': 'proxy_ip:port'} (scheme: http, ftp, etc.)
    # Pick a proxy address at random
    proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})

    # Build a customized opener
    opener = urllib.request.build_opener(proxy_support)
    # Pretend to be a browser
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36')]

    # Install the opener (from then on, urlopen automatically uses it)
    urllib.request.install_opener(opener)

    # Open the URL
    response = urllib.request.urlopen(url)

    # Decode the HTML
    html = response.read().decode('utf-8')

    print(html)

    Result:

    <!DOCTYPE HTML>
    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
        <meta name="viewport" content="width=device-width,initial-scale=1.0">
        <meta name="description" content="查我的IP,查IP國家,查代理IP及真實IP"/>
        <meta name="keywords" content="查ip,ip查詢,查我的ip,我的ip位址,我的ip位置,我的ip國家,偵測我的ip,查詢我的ip,查看我的ip,顯示我的ip,what is my IP,whatismyip,my IP address,my IP proxy"/>
        <link rel="icon" href="data:;base64,iVBORw0KGgo=">
        <title>我的IP位址查詢</title>
      </head>
      <body>
    <h1>IP位址</h1> <span data-ip='121.196.226.246'><b style='font-size: 1.5em;'>121.196.226.246</b></span> <span data-ip-country='CN'><i>CN</i></span>
    
    <script type="application/json" id="ip-json">
    {
        "ip": "121.196.226.246",
        "ip-country": "CN",
        "ip-real": "",
        "ip-real-country": ""
    }
    </script>
    
    <script type="text/javascript">
    var sc_project=6392240;
    var sc_invisible=1;
    var sc_security="65d86b9d";
    var sc_https=1;
    var sc_remove_link=1;
    var scJsHost = (("https:" == document.location.protocol) ? "https://secure." : "http://www.");
    
    var _scjs = document.createElement("script");
    _scjs.async = true;
    _scjs.type = "text/javascript";
    _scjs.src = scJsHost + "statcounter.com/counter/counter.js";
    var _scnode = document.getElementsByTagName("script")[0];
    _scnode.parentNode.insertBefore(_scjs, _scnode);
    </script>
    <noscript><div class="statcounter"><img class="statcounter" src="http://c.statcounter.com/6392240/0/65d86b9d/1/" alt="statcounter"></div></noscript>
    
      </body>
    </html>
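
    To check programmatically which proxy was used, the reported address can be pulled out of the page. The sketch below relies on the data-ip attribute visible in the output above; the helper name extract_ip is hypothetical, and the pattern would need adjusting if the site changes its markup.

```python
import re


def extract_ip(html):
    """Return the address in the first data-ip='...' attribute, or None."""
    match = re.search(r"data-ip='([0-9.]+)'", html)
    return match.group(1) if match else None


# Fragment copied from the output shown above
sample = ("<h1>IP位址</h1> <span data-ip='121.196.226.246'>"
          "<b style='font-size: 1.5em;'>121.196.226.246</b></span>")
print(extract_ip(sample))
```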

     

  • Original article: https://www.cnblogs.com/douzujun/p/7535808.html