• Crawler: Setting a Proxy with HttpClientDownloader


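    Starting with version 0.7.1, WebMagic uses a new proxy API, ProxyProvider. Whereas Site holds "configuration", ProxyProvider is positioned more as a "component", so the proxy is no longer set on Site but on HttpClientDownloader.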

    API reference
    HttpClientDownloader.setProxyProvider(ProxyProvider proxyProvider): sets the proxy provider used by the downloader

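    ProxyProvider has a default implementation, SimpleProxyProvider. It is a simple round-robin ProxyProvider with no failure checking. You can configure any number of candidate proxies, and each time it picks one of them in order. It is best suited to relatively stable proxies that you host yourself.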
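
    The round-robin selection described above can be sketched in a few lines. This is only an illustration of the technique, not WebMagic's actual source: the class name RoundRobinSketch, the next() method, and the "host:port" string representation are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of round-robin selection without failure checks.
// It does not depend on WebMagic and is not the library's implementation.
class RoundRobinSketch {
    private final List<String> proxies;                        // candidates as "host:port"
    private final AtomicInteger pointer = new AtomicInteger(-1);

    RoundRobinSketch(List<String> proxies) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = proxies;
    }

    // Returns the candidates in order and wraps around at the end of the list.
    // A proxy that has stopped working is still handed out on its turn.
    String next() {
        int index = Math.floorMod(pointer.incrementAndGet(), proxies.size());
        return proxies.get(index);
    }

    public static void main(String[] args) {
        RoundRobinSketch pool = new RoundRobinSketch(
                Arrays.asList("101.101.101.101:8888", "102.102.102.102:8888"));
        for (int i = 0; i < 3; i++) {
            System.out.println(pool.next());
        }
    }
}
```

    Because nothing removes a dead proxy from the rotation, this strategy only works well when every candidate is reliable, which is why it fits self-hosted proxies.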

    Proxy examples:

    1. Use a single plain HTTP proxy at 101.101.101.101, port 8888, with username "username" and password "password"
        HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
        httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("101.101.101.101",8888,"username","password")));
        spider.setDownloader(httpClientDownloader);

    2. Use a proxy pool containing the two IPs 101.101.101.101 and 102.102.102.102, without a password
        HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
        httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(
                new Proxy("101.101.101.101",8888),
                new Proxy("102.102.102.102",8888)));
        // As in the first example, attach the downloader to an existing Spider instance
        spider.setDownloader(httpClientDownloader);
    

    If you have suggestions about the proxy component, you are welcome to join discussion #579, "More ProxyProvider implementations".
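
    As one idea of what a further implementation might add, here is a minimal sketch of a pool that skips a proxy after repeated failures. It is a standalone illustration, not a WebMagic class: FailureAwarePoolSketch, next(), report(), and the maxFailures threshold are hypothetical names. A real implementation would put this logic behind the ProxyProvider interface, whose exact method signatures should be checked against the WebMagic version in use.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: round-robin selection that skips proxies with too many
// consecutive failures. It does not depend on WebMagic.
class FailureAwarePoolSketch {
    private final List<String> proxies;                          // candidates as "host:port"
    private final Map<String, Integer> failures = new HashMap<>();
    private final int maxFailures;                               // assumed exclusion threshold
    private int pointer = -1;

    FailureAwarePoolSketch(List<String> proxies, int maxFailures) {
        this.proxies = proxies;
        this.maxFailures = maxFailures;
    }

    // Returns the next proxy below the failure threshold, or null if none is left.
    synchronized String next() {
        for (int i = 0; i < proxies.size(); i++) {
            pointer = (pointer + 1) % proxies.size();
            String candidate = proxies.get(pointer);
            if (failures.getOrDefault(candidate, 0) < maxFailures) {
                return candidate;
            }
        }
        return null;
    }

    // Records the outcome of a request; a success resets the failure count.
    synchronized void report(String proxy, boolean success) {
        if (success) {
            failures.remove(proxy);
        } else {
            failures.merge(proxy, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        FailureAwarePoolSketch pool = new FailureAwarePoolSketch(
                Arrays.asList("101.101.101.101:8888", "102.102.102.102:8888"), 2);
        String first = pool.next();
        pool.report(first, false);
        pool.report(first, false);          // first proxy is now excluded
        System.out.println(pool.next());
        System.out.println(pool.next());
    }
}
```

    A proxy excluded this way never returns to the rotation, because it is never handed out again and so can never report a success. A longer-lived pool would also need some way to retry excluded proxies after a cool-down.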

    package com.mwq.job.task;

    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;
    import us.codecraft.webmagic.Page;
    import us.codecraft.webmagic.Site;
    import us.codecraft.webmagic.Spider;
    import us.codecraft.webmagic.downloader.HttpClientDownloader;
    import us.codecraft.webmagic.processor.PageProcessor;
    import us.codecraft.webmagic.proxy.Proxy;
    import us.codecraft.webmagic.proxy.SimpleProxyProvider;

    // Note: @Scheduled only fires when scheduling is enabled in the Spring
    // application (for example with @EnableScheduling on a configuration class).
    @Component
    public class ProxyTest implements PageProcessor {

        private final Site site = Site.me();

        // Starts a crawl through the proxy; the next run begins 1 second
        // after the previous run has finished.
        @Scheduled(fixedDelay = 1000)
        public void process() {
            // Create the downloader
            HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
            // Configure the proxy server on the downloader
            httpClientDownloader.setProxyProvider(SimpleProxyProvider.from(new Proxy("150.109.32.166", 80)));
            Spider.create(new ProxyTest())
                    .addUrl("http://ip.chinaz.com/getip.aspx")
                    .setDownloader(httpClientDownloader)
                    .run();
        }

        // Print the content of the downloaded page
        @Override
        public void process(Page page) {
            System.out.println(page.getHtml().toString());
        }

        @Override
        public Site getSite() {
            return site;
        }
    }

    Two websites that offer free proxies:

    Mimvp Proxy (米扑代理): https://proxy.mimvp.com/free.php

    Xici Free Proxy (西刺免费代理): http://www.xicidaili.com/

  • Original article: https://www.cnblogs.com/mwq1992/p/14219596.html