• Image lazy loading, UA pools, and proxy pools


    1. Handling dynamically loaded data

    • The concept of image lazy loading:
      • Image lazy loading is a page-optimization technique. An image is a network resource: like any static resource, it costs a request when fetched, and loading every image on a page up front greatly increases the time to render the first screen. To avoid this, the front end and back end cooperate so that an image is only loaded once it enters the browser's current viewport. This technique of cutting the number of image requests for the first screen is called "image lazy loading".
    • How do websites usually implement image lazy loading?
      • In the page source, the img tag first stores the real image URL in a "pseudo-attribute" (commonly src2, original, and the like) instead of putting it directly in the src attribute. When the image scrolls into the visible area of the page, JavaScript swaps the pseudo-attribute into src and the image is actually loaded. A scraper therefore has to read the pseudo-attribute; see the sketch right after this list.
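    • A minimal sketch of that, assuming a hypothetical listing page whose img tags use a "src2" pseudo-attribute (both the URL and the attribute name here are placeholder assumptions):
    import requests
    from lxml import etree

    # Hypothetical listing page; swap in the real target and its pseudo-attribute name
    url = "http://www.example.com/pic_list"
    headers = {"User-Agent": "Mozilla/5.0"}

    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)

    for img in tree.xpath("//img"):
        # Until the image scrolls into view, the real link lives in the
        # pseudo-attribute (src2 here), so prefer it over src
        img_url = img.get("src2") or img.get("src")
        print(img_url)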

    2. Selenium

    • What Selenium is: a browser-automation tool. Its Python bindings are installed as a third-party package and expose an API for driving a real browser, so the browser can carry out automated actions on your behalf.
    • Environment setup:
      • Install Selenium: pip install selenium
      • Download the driver for the browser you want to automate (Chrome is used here)
      • Chrome driver downloads: http://chromedriver.storage.googleapis.com/index.html
      • The driver version must match your browser version; this mapping helps you pick the right one:
      • http://blog.csdn.net/huilan_same/article/details/51896672
    • Demo:
    from selenium import webdriver
    from time import sleep

    # Pass your driver's path below; the r'' prefix keeps backslashes from being escaped
    driver = webdriver.Chrome(r'path_to_your_chromedriver')
    # Open the Baidu home page with get()
    driver.get("http://www.baidu.com")
    # Find the "设置" (settings) link on the page and click it
    driver.find_elements_by_link_text('设置')[0].click()
    sleep(2)
    # In the settings menu, click the "搜索设置" (search settings) option
    driver.find_elements_by_link_text('搜索设置')[0].click()
    sleep(2)

    # Select "show 50 results per page"
    m = driver.find_element_by_id('nr')
    sleep(2)
    m.find_element_by_xpath('.//option[3]').click()
    sleep(2)

    # Click "save settings"
    driver.find_elements_by_class_name("prefpanelgo")[0].click()
    sleep(2)

    # Handle the alert that pops up: accept() confirms, dismiss() cancels
    driver.switch_to.alert.accept()
    sleep(2)
    # Find the Baidu search box and type 美女
    driver.find_element_by_id('kw').send_keys('美女')
    sleep(2)
    # Click the search button
    driver.find_element_by_id('su').click()
    sleep(2)
    # In the results page, find the link "美女_百度图片" and open it
    driver.find_elements_by_link_text('美女_百度图片')[0].click()
    sleep(3)

    # Close the browser
    driver.quit()

    Code walkthrough:

    1. Import: from selenium import webdriver
    2. Create a browser object; it is what drives the browser: browser = webdriver.Chrome("driver path")
    3. Send a request with the browser: browser.get(url)
    4. Use the methods below to locate elements and operate on them:
      1. find_element_by_id                  locate by id
      2. find_elements_by_name               locate by name
      3. find_elements_by_xpath              locate by xpath
      4. find_elements_by_tag_name           locate by tag name
      5. find_elements_by_class_name         locate by class name
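
    Note: the find_element_by_* helpers above are the Selenium 3 API; Selenium 4 removed them in favor of find_element(By..., ...). A minimal sketch of the equivalent By-based lookups (the locator values are illustrative and match the Baidu demo above):
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes chromedriver is discoverable on PATH
    driver.get("http://www.baidu.com")

    driver.find_element(By.ID, "kw")                      # by id
    driver.find_element(By.NAME, "wd")                    # by name
    driver.find_element(By.XPATH, '//input[@id="kw"]')    # by xpath
    driver.find_element(By.TAG_NAME, "input")             # by tag name
    driver.find_element(By.CLASS_NAME, "s_ipt")           # by class name
    driver.quit()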

    3. PhantomJS

    • PhantomJS is a headless (windowless) browser; the automation workflow is exactly the same as driving Chrome above. Because there is no window to watch, PhantomJS provides a screenshot feature, save_screenshot(), so you can still see what each automation step produced.
    from selenium import webdriver
    import time

    bro = webdriver.PhantomJS(executable_path=r"D:\PhantomJS\phantomjs-2.1.1-windows\bin\phantomjs.exe")

    # Send the request
    bro.get(url="https://www.baidu.com")
    # Take a screenshot
    bro.save_screenshot("./1.jpg")
    # Locate the target tag (the search box) with one of the find_* methods
    my_input = bro.find_element_by_id("kw")
    # Type the search term into the tag
    my_input.send_keys("美女")
    # Locate the "百度一下" (search) button
    my_button = bro.find_element_by_id("su")
    my_button.click()

    # Grab the page source the browser is currently rendering
    page_text = bro.page_source
    bro.save_screenshot("./2.png")  # another screenshot

    print(page_text)
    bro.quit()
    from selenium import webdriver
    import time

    # Instantiate a Chrome browser object (this launches a real Chrome window)
    bro = webdriver.Chrome(executable_path=r"D:\chrome\chromedriver.exe")

    # Send the request
    bro.get(url="https://www.baidu.com")
    time.sleep(3)
    # Locate the target tag (the search input box) with one of the find_* methods
    my_input = bro.find_element_by_id("kw")

    # Type the search term into the tag
    my_input.send_keys("美女")
    time.sleep(3)
    # Get the "百度一下" (search) button
    my_button = bro.find_element_by_id("su")
    # Click to search
    my_button.click()
    time.sleep(3)

    # Grab the page source the browser is currently displaying
    page_text = bro.page_source
    print(page_text)
    # Quit the browser
    bro.quit()
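
    The fixed time.sleep() calls above are the simplest way to wait for the page; Selenium also provides explicit waits. A minimal sketch of the same search using WebDriverWait (standard Selenium API; the driver path is the same assumption as above):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    bro = webdriver.Chrome(executable_path=r"D:\chrome\chromedriver.exe")
    bro.get("https://www.baidu.com")

    # Wait up to 10 seconds for the search box to appear instead of sleeping a fixed time
    my_input = WebDriverWait(bro, 10).until(
        EC.presence_of_element_located((By.ID, "kw"))
    )
    my_input.send_keys("美女")
    bro.find_element_by_id("su").click()
    bro.quit()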

    QQ空间 (Qzone) login code:

    from selenium import webdriver
    from lxml import etree
    import time

    bro = webdriver.Chrome(executable_path=r"D:\chrome\chromedriver.exe")
    url = "https://qzone.qq.com/"

    # Send the request
    bro.get(url=url)
    time.sleep(1)
    # Switch into the login iframe (the login form lives inside it)
    bro.switch_to.frame("login_frame")
    # Click the "account/password login" tab
    bro.find_element_by_id("switcher_plogin").click()
    time.sleep(1)

    # Find the username input box and type the account
    username = bro.find_element_by_id("u")
    username.send_keys("937371049")

    # Find the password box and type the password
    password = bro.find_element_by_id("p")
    password.send_keys("13633233754")

    # Find the login button and click it
    bro.find_element_by_id("login_button").click()
    time.sleep(1)

    # JS snippet that scrolls the window to the bottom of the page
    js = "window.scrollTo(0, document.body.scrollHeight)"

    # Scroll several times so more feed items get loaded
    bro.execute_script(js)
    time.sleep(2)
    bro.execute_script(js)
    time.sleep(2)
    bro.execute_script(js)
    time.sleep(2)
    bro.execute_script(js)
    time.sleep(2)
    page_text = bro.page_source
    time.sleep(3)

    # Parsing:
    # turn the captured page source into an lxml HTML tree
    tree = etree.HTML(page_text)

    div_list = tree.xpath('//div[@class="f-info qz_info_cut"] | //div[@class="f-info"]')
    for div in div_list:
        text = div.xpath('.//text()')
        text = "".join(text)
        print(text)
    bro.quit()

    4. Headless Chrome

    • PhantomJS is no longer updated or maintained, so use Chrome in headless mode instead.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")

    # Headless Chrome: same API as before, just without a visible window
    bro = webdriver.Chrome(executable_path=r"D:\chrome\chromedriver.exe", chrome_options=chrome_options)

    # Send the request
    bro.get(url="https://www.baidu.com")

    # Locate the target tag (the search box) with one of the find_* methods
    my_input = bro.find_element_by_id("kw")

    # Type the search term into the tag
    my_input.send_keys("美女")

    my_button = bro.find_element_by_id("su")
    # Click the search button
    my_button.click()

    # Grab the page source the browser is currently rendering
    page_text = bro.page_source
    print(page_text)
    bro.quit()
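
    In newer Selenium (4.x) the chrome_options= keyword has been replaced by options=, and Selenium Manager can locate a matching chromedriver on its own, so executable_path is no longer needed. A minimal sketch of the same headless run under those assumptions:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--disable-gpu")

    # Selenium 4 style: pass options= and let Selenium Manager find the driver
    bro = webdriver.Chrome(options=chrome_options)
    bro.get("https://www.baidu.com")
    print(bro.title)
    bro.quit()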

    5. UA pools and proxy pools

    • First, let's pull the Scrapy architecture diagram off the official site and take a look.

    • Downloader middlewares sit between the engine and the downloader.
    • What they do:
      1. While the engine passes a request to the downloader, a downloader middleware can process that request, e.g. set its User-Agent or attach a proxy.
      2. While the downloader passes the response back to the engine, a downloader middleware can process that response, e.g. decompress gzip content.
    • We mainly use downloader middlewares on the request side, typically to give every request a random User-Agent and a random proxy, so as to get around the target site's anti-scraping measures.

    UA pool: a pool of User-Agent strings

    • Purpose: make the requests issued by a Scrapy project impersonate as many different browser identities as possible.
    • Workflow:
      1. Intercept the request in a downloader middleware
      2. Overwrite the UA in the intercepted request's headers with a spoofed value
      3. Enable the downloader middleware in the settings file (see the settings.py sketch right after the code below)
    • Code:
    from scrapy import signals
    import random
    
    
    class CrawlproSpiderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the spider middleware does not modify the
        # passed objects.
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_spider_input(self, response, spider):
            # Called for each response that goes through the spider
            # middleware and into the spider.
    
            # Should return None or raise an exception.
            return None
    
        def process_spider_output(self, response, result, spider):
            # Called with the results returned from the Spider, after
            # it has processed the response.
    
            # Must return an iterable of Request, dict or Item objects.
            for i in result:
                yield i
    
        def process_spider_exception(self, response, exception, spider):
            # Called when a spider or process_spider_input() method
            # (from other spider middleware) raises an exception.
    
            # Should return either None or an iterable of Response, dict
            # or Item objects.
            pass
    
        def process_start_requests(self, start_requests, spider):
            # Called with the start requests of the spider, and works
            # similarly to the process_spider_output() method, except
            # that it doesn’t have a response associated.
    
            # Must return only requests (not items).
            for r in start_requests:
                yield r
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
    
    
    class CrawlproDownloaderMiddleware(object):
        # Not all methods need to be defined. If a method is not defined,
        # scrapy acts as if the downloader middleware does not modify the
        # passed objects.
    
        proxy_http = [
            "http://113.128.10.121", "http://49.86.181.235", "http://121.225.52.143", "http://180.118.134.29",
            "http://111.177.186.27", "http://175.155.77.189", "http://110.52.235.120", "http://113.128.24.189",
        ]
        proxy_https = [
            "https://93.190.143.59", "https://106.104.168.15", "https://167.249.181.237", "https://124.250.70.76",
            "https://119.101.115.2", "https://58.55.133.48", "https://49.86.177.193", "https://58.55.132.231",
            "https://58.55.133.77", "https://119.101.117.189", "https://27.54.248.42", "https://221.239.86.26",
        ]
        # Intercept requests: the request argument of process_request() below is the intercepted request
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
            "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
            "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
            "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
            "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
            "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
            "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]
    
        @classmethod
        def from_crawler(cls, crawler):
            # This method is used by Scrapy to create your spiders.
            s = cls()
            crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
            return s
    
        def process_request(self, request, spider):
            # Called for each request that goes through the downloader
            # middleware.
    
            # Must either:
            # - return None: continue processing this request
            # - or return a Response object
            # - or return a Request object
            # - or raise IgnoreRequest: process_exception() methods of
            #   installed downloader middleware will be called
            print("中间件开始下载", request)
            if request.url.split(":")[0] == "http":
                request.meta["proxy"] = random.choice(self.proxy_http)
            else:
                request.meta["proxy"] = random.choice(self.proxy_https)
    
            request.header["User-Agent"] = random.choice(self.user_agent_list)
    
            print(request.meta["proxy"], request.heaser["User-Agent"])
            return None
    
        def process_response(self, request, response, spider):
            # Called with the response returned from the downloader.
    
            # Must either;
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
    
    
            return response
    
        def process_exception(self, request, exception, spider):
            # Called when a download handler or a process_request()
            # (from other downloader middleware) raises an exception.
    
            # Must either:
            # - return None: continue processing this exception
            # - return a Response object: stops process_exception() chain
            # - return a Request object: stops process_exception() chain
            pass
    
        def spider_opened(self, spider):
            spider.logger.info('Spider opened: %s' % spider.name)
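
    Step 3 of the UA-pool workflow above, enabling the middleware, happens in settings.py. A minimal sketch, assuming the project module is named crawlPro and the class keeps its generated name; adjust both to your own project:

    # settings.py -- without this entry, process_request() above is never called
    DOWNLOADER_MIDDLEWARES = {
        'crawlPro.middlewares.CrawlproDownloaderMiddleware': 543,
    }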

    Proxy pool

    • Purpose: give the requests issued by a Scrapy project as many different source IPs as possible.
    • Workflow:
      1. Intercept the request in a downloader middleware
      2. Point the intercepted request at a proxy IP
      3. Enable the downloader middleware in the settings file
    • Code: already included in the middleware above (process_request() sets request.meta["proxy"]).
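    A minimal sketch of just the proxy half, extracted from the process_request() shown earlier (the class name is only for illustration, the sample IPs come from the lists above and may no longer be live, and the middleware still needs enabling in settings.py):
    import random


    class ProxyPoolDownloaderMiddleware(object):
        # A few sample proxies, split by scheme so the chosen proxy matches the request
        proxy_http = ["http://113.128.10.121", "http://49.86.181.235"]
        proxy_https = ["https://93.190.143.59", "https://106.104.168.15"]

        def process_request(self, request, spider):
            # Pick a proxy whose scheme matches the outgoing request URL
            if request.url.split(":")[0] == "http":
                request.meta["proxy"] = random.choice(self.proxy_http)
            else:
                request.meta["proxy"] = random.choice(self.proxy_https)
            return None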