• Python Web Scraping Knowledge Points


    一. Basics

    1. HTML analysis

    2. Scraping with urllib

    Import the urllib package (Python 3.5.2).
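
    A minimal fetch with that import (a small sketch; the URL is one used later in this post):

    import urllib.request
    html = urllib.request.urlopen("http://www.cnblogs.com/").read()  # raw bytes of the page
    print(len(html))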

    3. Saving a web page with urllib

    import urllib.request
    url = "http://www.cnblogs.com/wj204/p/6151070.html"
    html = urllib.request.urlopen(url).read()  # read() returns the raw page bytes
    fh = open("F:/20_Python/3000_Data/2.html", "wb")  # bytes, so write in binary mode
    fh.write(html)
    fh.close()

    4. Simulating a browser


    import urllib.request
    url = "http://www.cnblogs.com/"
    # A browser-like User-Agent header keeps the site from rejecting the script
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    data = opener.open(url).read()
    fh = open("F:/20_Python/3000_Data/1.html", "wb")
    fh.write(data)
    fh.close()

    5. Saving images with urllib

    Use http://www.bejson.com/ to inspect the JSON data g_page_config that is stored in the page's JavaScript.

    import re
    import urllib.request
    import urllib.error
    keyWord = "Python机器学习"
    keyWord2 = urllib.request.quote(keyWord)  # URL-encode the search keyword
    headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.1708.400 QQBrowser/9.5.9635.400")
    opener = urllib.request.build_opener()
    opener.addheaders = [headers]
    urllib.request.install_opener(opener)
    url = "https://s.taobao.com/search?q=" + keyWord2 + "&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20161214"
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    pat = 'pic_url":"//(.*?)"'  # note: this data is not in the HTML markup itself but in the global script variable g_page_config
    imageList = re.compile(pat).findall(data)
    for j in range(0, len(imageList)):
        try:
            curImage = imageList[j]
            curImageUrl = "http://" + curImage
            file = "F:/20_Python/3000_Data/" + str(j) + ".jpg"
            print(file)
            urllib.request.urlretrieve(curImageUrl, filename=file)
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        except Exception as e:
            print(e)

    6. Regular expressions

    A summary of commonly used regular expressions for scraping page content and parsing HTML tags: http://blog.csdn.net/eastmount/article/details/51082253

    For example, the regex analysis for the "Python机器学习" image search above:

    pat = 'pic_url":"//(.*?)"'
    re.compile(pat).findall(data)

    This extracts the group (.*?), which sits between pic_url":"// and the closing ".
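
    A self-contained check of what the pattern captures (the sample string below is made up to mimic g_page_config):

    import re
    data = 'g_page_config = {"pic_url":"//img.example.com/a.jpg"},{"pic_url":"//img.example.com/b.jpg"}'
    pat = 'pic_url":"//(.*?)"'
    print(re.compile(pat).findall(data))  # ['img.example.com/a.jpg', 'img.example.com/b.jpg']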

    
    

    And the regex analysis for Qiushibaike (糗事百科):

    # pagedata holds the page HTML, fetched as in the earlier examples
    pat = '<div class="content">.*?<span>(.*?)</span>.*?</div>'
    datalist = re.compile(pat, re.S).findall(pagedata)
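
    Here re.S (DOTALL) lets . match newline characters, which matters because the post content spans several lines. A self-contained check with made-up HTML:

    import re
    pagedata = '<div class="content">\n<span>line one\nline two</span>\n</div>'
    pat = '<div class="content">.*?<span>(.*?)</span>.*?</div>'
    print(re.compile(pat, re.S).findall(pagedata))  # ['line one\nline two']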

    
    

    7. IP proxies

    This needs reliable, stable IP addresses; find suitable proxies and substitute them into proxy_addr.

    import urllib.request
    import random
    def use_proxy(url, proxy_addr):
        # Route the request through a proxy picked at random from the pool
        proxy = urllib.request.ProxyHandler({"http": random.choice(proxy_addr)})
        headers = ("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
        opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
        opener.addheaders = [headers]
        urllib.request.install_opener(opener)  # installs globally: later urlopen calls also go through the proxy
        data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        return data
    proxy_addr = ["45.64.166.142:8080", "80.1.116.80:80", "196.15.141.27:8080", "47.88.6.158:8118", "125.209.97.190:8080"]
    url = "http://cuiqingcai.com/1319.html"  # proxy lists: http://proxy.com.ru
    data = use_proxy(url, proxy_addr)
    print(len(data))

    8. Packet capture analysis

    9. Multithreaded scraping

    import threading

    class DownPage(threading.Thread):
        def __init__(self):
            threading.Thread.__init__(self)
        def run(self):
            print("handling the download task")  # real download logic would go here

    downTask = DownPage()
    downTask.start()
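
    A slightly fuller sketch of the same idea, where each thread downloads one page (the URLs and file paths below are placeholders):

    import threading
    import urllib.request

    class DownPage(threading.Thread):
        def __init__(self, url, path):
            threading.Thread.__init__(self)
            self.url = url    # page to download
            self.path = path  # local file to save it to
        def run(self):
            data = urllib.request.urlopen(self.url).read()
            with open(self.path, "wb") as fh:
                fh.write(data)

    tasks = [DownPage("http://www.cnblogs.com/", "F:/20_Python/3000_Data/t0.html"),
             DownPage("http://www.cnblogs.com/wj204/p/6151070.html", "F:/20_Python/3000_Data/t1.html")]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()  # wait until every download finishes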


    10. Exception handling

    See "Saving images with urllib" above, which uses try/except to catch exceptions.
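
    A minimal standalone version of that pattern:

    import urllib.request
    import urllib.error
    try:
        data = urllib.request.urlopen("http://www.cnblogs.com/").read()
    except urllib.error.URLError as e:
        # HTTPError (a subclass) carries .code; a plain URLError only has .reason
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)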

    11. XPath

     http://www.cnblogs.com/defineconst/p/6181333.html
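
    A quick illustration (a minimal sketch assuming lxml is installed; the HTML fragment and XPath expression are made up for the example):

    from lxml import etree
    html = etree.HTML('<div class="content"><span>hello</span></div>')
    print(html.xpath('//div[@class="content"]/span/text()'))  # ['hello']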

    二. Installing Scrapy's related packages

    PyCharm -> File -> Settings -> Project..........
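
    Equivalently, from the command line (a hedged sketch; the exact dependency set varies by platform, but pip resolves Scrapy's requirements such as Twisted and lxml automatically):

    pip install wheel
    pip install scrapy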

  • Original article: https://www.cnblogs.com/defineconst/p/6158538.html