• Building a web crawler with Spring Boot + selenium-java


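    1. Introduction to Selenium

    Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, and Opera. Its main uses are compatibility testing, which checks that your application works well across different browsers and operating systems, and functional testing, which builds regression tests that verify software functionality and user requirements. It also supports recording actions and automatically generating test scripts in languages such as .NET, Java, and Perl.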

    2. selenium-java

    selenium-java is the Java version of Selenium. Depending on the driver, it can drive different browsers, for example selenium-chrome-driver, selenium-edge-driver, selenium-firefox-driver, selenium-ie-driver, selenium-opera-driver, and phantomjsdriver. I used chromedriver and phantomjsdriver. Because it can fully simulate what a real user does, it is a good testing framework.

    3. chromedriver example

    3.1 Download

    The table below lists the Chrome versions each chromedriver release supports:

    chromedriver version    Chrome version
    2.37                    v64-66
    2.36                    v63-65
    2.35                    v62-64
    2.34                    v61-63
    2.33                    v60-62
    2.32                    v59-61
    2.31                    v58-60
    2.30                    v58-60
    2.29                    v56-58

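    The drivers can be downloaded from:
    http://chromedriver.storage.googleapis.com/index.html
    Note: 64-bit Windows can run the 32-bit build, so just download the 32-bit version. I have tested it and it works.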
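    The version table above can also be kept in code, so that a crawler can check which driver release matches the installed Chrome before starting. This is only a minimal sketch: `DriverVersions` and `driversFor` are names made up for illustration, and the data is copied from the table and nothing else.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the names are illustrative, the data is copied from the table above.
final class DriverVersions {
    // chromedriver version -> {lowest, highest} supported Chrome major version
    private static final Map<String, int[]> TABLE = new LinkedHashMap<>();
    static {
        TABLE.put("2.37", new int[]{64, 66});
        TABLE.put("2.36", new int[]{63, 65});
        TABLE.put("2.35", new int[]{62, 64});
        TABLE.put("2.34", new int[]{61, 63});
        TABLE.put("2.33", new int[]{60, 62});
        TABLE.put("2.32", new int[]{59, 61});
        TABLE.put("2.31", new int[]{58, 60});
        TABLE.put("2.30", new int[]{58, 60});
        TABLE.put("2.29", new int[]{56, 58});
    }

    /** Returns every chromedriver version in the table that supports the given Chrome major version, newest first. */
    static List<String> driversFor(int chromeMajor) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : TABLE.entrySet()) {
            int[] range = entry.getValue();
            if (chromeMajor >= range[0] && chromeMajor <= range[1]) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```

    For Chrome 65, `DriverVersions.driversFor(65)` returns 2.37 and 2.36. An empty list means the table has no matching release.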

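    3.2 Usage

    import org.openqa.selenium.By;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;
    import org.openqa.selenium.remote.RemoteWebDriver;

    ChromeOptions options = new ChromeOptions();
    // Hide the info bar and disable web security (the same-origin policy)
    options.addArguments("disable-infobars", "disable-web-security");
    // Run without a GUI; leave this out during development so you can watch the browser
    options.addArguments("--headless");
    String driverPath = "D:\\crawler-plugin\\chromedriver.exe";
    System.setProperty("webdriver.chrome.driver", driverPath);
    RemoteWebDriver driver = new ChromeDriver(options);
    driver.get("http://www.baidu.com");
    System.out.println(driver.findElement(By.tagName("body")).getText());
    driver.quit(); // close the browser and stop the chromedriver process when finished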

    4. phantomjsdriver example

    4.1 Download

    Download from http://phantomjs.org/download.html

    4.2 Usage

    String driverPath = "D:\\crawler-plugin\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe";
    // Tell the driver where the PhantomJS binary is
    System.setProperty("phantomjs.binary.path", driverPath);
    DesiredCapabilities desiredCapabilities = DesiredCapabilities.phantomjs();
    // Present a desktop Firefox user agent, both in the page settings and in the request header
    String userAgent = "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0";
    desiredCapabilities.setCapability("phantomjs.page.settings.userAgent", userAgent);
    desiredCapabilities.setCapability("phantomjs.page.customHeaders.User-Agent", userAgent);
    RemoteWebDriver driver = new PhantomJSDriver(desiredCapabilities);
    driver.get("http://www.baidu.com");
    System.out.println(driver.findElement(By.tagName("body")).getText());
    driver.quit(); // shut down the PhantomJS process when finished

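    5. Common problems when crawling pages

    5.1 CAPTCHAs

    Some sites require a login, and a CAPTCHA on the login form is a real headache. tess4j can recognize simple CAPTCHAs; for complex ones, don't count on it.
    Maven dependency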

    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.4.0</version>
        <exclusions>
            <exclusion>
                <groupId>com.sun.jna</groupId>
                <artifactId>jna</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    tess4j configuration

    ##tess4j config
    tess4j.language=chi_sim
    tess4j.language.path=D:\\crawler-plugin\\tessdata
    tess4j.data.path=D:\\crawler-plugin\\
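    The doubled backslashes in these paths are there because a backslash is the escape character in the .properties format. The sketch below shows this with plain `java.util.Properties`; it assumes that the loader Spring Boot uses for application.properties unescapes the same way, which I have not checked here. `BackslashDemo` and `parse` are illustrative names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo class: shows how one properties line is unescaped on load.
final class BackslashDemo {
    /** Parses a single properties line and returns the value stored under the given key. */
    static String parse(String line, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(line));
        } catch (IOException e) {
            // A StringReader should not fail, but load() declares IOException
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Java source doubles each backslash again, so this literal is the file text
        // tess4j.language.path=D:\\crawler-plugin\\tessdata
        String line = "tess4j.language.path=D:\\\\crawler-plugin\\\\tessdata";
        System.out.println(parse(line, "tess4j.language.path")); // prints D:\crawler-plugin\tessdata
    }
}
```

    Forward slashes (D:/crawler-plugin/tessdata) would avoid the escaping altogether, since Java accepts them in Windows paths.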

    Reading the configuration

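    import lombok.Getter;
    import lombok.Setter;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Configuration;

    // Binds the tess4j.* properties; the class-level Lombok annotations generate the accessors
    @Configuration
    @Getter
    @Setter
    public class Tess4jConfig {
        @Value("${tess4j.data.path}")
        private String tess4jDataPath;

        @Value("${tess4j.language.path}")
        private String tess4jLanguagePath;

        @Value("${tess4j.language}")
        private String tess4jLanguage;
    }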

    Utility class

    import com.cdchen.crawler.config.Tess4jConfig;
    import lombok.extern.slf4j.Slf4j;
    import net.sourceforge.tess4j.ITesseract;
    import net.sourceforge.tess4j.Tesseract;

    import java.io.File;

    @Slf4j
    public class Tess4jUtil {

        private static ITesseract instance;

        // Defaults, used only when no Tess4jConfig bean is available
        private static String datapath = "D:\\crawler-plugin\\";
        private static String testResourcesLanguagePath = "D:\\crawler-plugin\\tessdata";
        private static String language = "chi_sim";

        private static synchronized ITesseract getInstance() {

            Tess4jConfig config = SpringBeanUtil.getBean(Tess4jConfig.class);
            if (config != null) {
                datapath = config.getTess4jDataPath();
                language = config.getTess4jLanguage();
                testResourcesLanguagePath = config.getTess4jLanguagePath();
            }

            if (datapath == null) {
                log.error("tess4j.data.path must be set in the properties file, otherwise CAPTCHAs cannot be recognized");
                return null;
            }
            if (testResourcesLanguagePath == null) {
                log.error("tess4j.language.path must be set in the properties file, otherwise CAPTCHAs cannot be recognized");
                return null;
            }
            if (language == null) {
                log.error("tess4j.language must be set in the properties file, otherwise CAPTCHAs cannot be recognized");
                return null;
            }

            if (instance == null) {
                instance = new Tesseract();
                // The original called setDatapath twice and the second call overwrote the first,
                // so only the tessdata directory ever took effect; a single call keeps that behavior.
                instance.setDatapath(new File(testResourcesLanguagePath).getPath());
                instance.setLanguage(language);
            }
            return instance;
        }

        public static String doOcr(File file) throws Exception {
            ITesseract tesseract = getInstance();
            if (tesseract == null) {
                // Fail with a clear message instead of a NullPointerException
                throw new IllegalStateException("tess4j is not configured");
            }
            return tesseract.doOCR(file);
        }

    }

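    5.2 Pagination

    Pagination is relatively simple, and there are many ways to handle it. Two examples:
    1 Find the pattern in the page URLs and substitute the page number
    2 Find the pagination button and simulate a click
    I used the second. The following code is what I used to page through Qiushibaike, for reference: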

    private void jumpPageNum(int pageNum) {
        // Nothing to do if the page has no pagination bar
        if (!WebElementUtils.doesWebElementExist(driver, By.className("pagination"))) {
            return;
        }
        WebElement pagination = driver.findElement(By.className("pagination"));
        int currentPageNum = Integer.parseInt(pagination.findElement(By.className("current")).getText());
        while (currentPageNum != pageNum) {
            List<WebElement> pageNums = pagination.findElements(By.className("page-numbers"));
            for (int i = 0; i < pageNums.size(); i++) {
                String pageNumText = pageNums.get(i).getText();
                if (pageNumText.equals(String.valueOf(pageNum))) {
                    // The target page is visible in the bar: click it and scroll to the bottom
                    pageNums.get(i).click();
                    scrollBar.toPageEnd();
                    break;
                } else if (i == pageNums.size() - 1) {
                    // Target not visible yet: click the last link to move the bar forward
                    pageNums.get(i).click();
                }
            }
            // The page has changed, so look the elements up again before reading the current page
            pagination = driver.findElement(By.className("pagination"));
            currentPageNum = Integer.parseInt(pagination.findElement(By.className("current")).getText());
        }
    }
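    For comparison, the first approach (substituting the page number into a URL pattern) could look roughly like the sketch below. The `{page}` placeholder, the example.com URL, and the `PageUrls` class are assumptions made up for illustration; the real pattern has to be read off the target site.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds page URLs from a template containing a {page} placeholder.
final class PageUrls {
    /** Returns the URLs for pages first..last (inclusive), in order. */
    static List<String> build(String template, int first, int last) {
        List<String> urls = new ArrayList<>();
        for (int page = first; page <= last; page++) {
            // Plain text replacement, so percent-encoded characters in the URL are left alone
            urls.add(template.replace("{page}", String.valueOf(page)));
        }
        return urls;
    }
}
```

    Each resulting URL can then be passed to driver.get(...), for example for every entry of `PageUrls.build("https://example.com/list/page/{page}/", 1, 5)`.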

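    5.3 Scrolling

    Scrolling is more troublesome, because the driver has no dedicated API for operating the scroll bar (or perhaps I just did not find it). I took an indirect route, and the same idea solves many similar problems: use JavaScript to operate the scroll bar.
    The steps are:

    1. Append a script tag to the page body and put the JS functions you want to run inside it
    2. Use the driver's JS execution engine to run them

    Here is the utility class I wrote: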

    import com.cdchen.crawler.util.SleepUtil;
    import lombok.Data;
    import lombok.extern.slf4j.Slf4j;
    import org.openqa.selenium.remote.RemoteWebDriver;
    
        /**
         *
         * @description: tb
         *
         * @author: cdchen
         *
         * @create: 2019-04-30 17:08
         **/
        @Data
        @Slf4j
        public class ScrollBar {
    
            RemoteWebDriver driver = null;
    
    
            private static String getScrollTopJs = "function getScrollTop(){"
                                                   + "  var scrollTop = 0, bodyScrollTop = 0, documentScrollTop = 0;"
                                                   + "  if(document.body){"
                                                   + "    bodyScrollTop = document.body.scrollTop;"
                                                   + "  }"
                                                   + "  if(document.documentElement){"
                                                   + "    documentScrollTop = document.documentElement.scrollTop;"
                                                   + "  }"
                                                   + "  scrollTop = (bodyScrollTop - documentScrollTop > 0) ? bodyScrollTop : documentScrollTop;"
                                                   + "  return scrollTop;"
                                                   + "};";
    
            private static String getScrollHeightJs = "function getScrollHeight(){"
                                                      + "  var scrollHeight = 0, bodyScrollHeight = 0, documentScrollHeight = 0;"
                                                      + "  if(document.body){"
                                                      + "    bodyScrollHeight = document.body.scrollHeight;"
                                                      + "  }"
                                                      + "  if(document.documentElement){"
                                                      + "    documentScrollHeight = document.documentElement.scrollHeight;"
                                                      + "  }"
                                                      + "  scrollHeight = (bodyScrollHeight - documentScrollHeight > 0) ? bodyScrollHeight : documentScrollHeight;"
                                                      + "  return scrollHeight;"
                                                      + "};";
    
            private static String getWindowHeightJs = "function getWindowHeight(){"
                                                      + "  var windowHeight = 0;"
                                                      + "  if(document.compatMode == \"CSS1Compat\"){"
                                                      + "    windowHeight = document.documentElement.clientHeight;"
                                                      + "  }else{"
                                                      + "    windowHeight = document.body.clientHeight;"
                                                      + "  }"
                                                      + "  return windowHeight;"
                                                      + "};";
    
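            // Compare with >= and a 1px tolerance: scroll offsets can be fractional, so == may never match
            private static String scroollIsOverJs = "function scroollIsOver(){"
                                                    + "  if(getScrollTop() + getWindowHeight() >= getScrollHeight() - 1){"
                                                    + "    return true;"
                                                    + "  }else{"
                                                    + "    return false;"
                                                    + "  }"
                                                    + "};";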
    
            private static String insertScriptJs = "var body = document.getElementsByTagName('body')[0];"
                                                   + "var newScript = document.createElement('script');"
                                                   + "newScript.type = 'text/javascript';"
                                                   + "newScript.innerHTML = '"+getScrollTopJs+getScrollHeightJs+getWindowHeightJs+scroollIsOverJs+"';"
                                                   + "body.appendChild(newScript);";
            public ScrollBar(RemoteWebDriver dr){
                driver = dr;
            }
            public void toPageEnd() {
                // Inject the helper functions into the page
                getDriver().executeScript(insertScriptJs);
                int start = 0;
                boolean scroollIsOver = false;
                while (!scroollIsOver){
                    // scrollTo takes (x, y): keep x at 0 and move down 500px per step
                    getDriver().executeScript("window.scrollTo(0,"+(start+500)+")");
                    Boolean res = (Boolean)getDriver().executeScript("return  scroollIsOver();");
                    if(res != null && res){
                        scroollIsOver = true;
                    }
                    start = start+500;
                    // Pause so content loaded on scroll has time to appear
                    SleepUtil.sleep(1000);
                }
            }
    
        }
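    The scroll loop itself does not depend on the injected script tag. Below is a minimal sketch of the same loop with two changes that are mine, not the original's: the end-of-page check is a single inline expression built from standard DOM properties, and there is a step limit so the loop cannot run forever. `ScrollSketch`, `scrollToEnd`, and the `js` parameter are hypothetical names; `js` stands in for the driver's script executor so the loop can be tried without a browser.

```java
import java.util.function.Function;

// Hypothetical sketch: the script executor is passed in as a function (script -> result).
final class ScrollSketch {
    // True once the bottom of the viewport has reached the bottom of the document (1px tolerance)
    static final String AT_END_JS =
            "return window.innerHeight + window.pageYOffset >= document.documentElement.scrollHeight - 1;";

    /** Scrolls down stepPx at a time until the page reports its end or maxSteps is reached; returns the steps taken. */
    static int scrollToEnd(Function<String, Object> js, int stepPx, int maxSteps) {
        int y = 0;
        for (int step = 1; step <= maxSteps; step++) {
            y += stepPx;
            js.apply("window.scrollTo(0," + y + ");");
            if (Boolean.TRUE.equals(js.apply(AT_END_JS))) {
                return step;
            }
        }
        return maxSteps;
    }
}
```

    With a real driver the call would be along the lines of `ScrollSketch.scrollToEnd(script -> driver.executeScript(script), 500, 200)`. The one-second pause of the original is left out here and would go inside the loop.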

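    5.4 Elements inside an iframe cannot be found

    Sometimes a page embeds an iframe, and the elements inside it cannot be found from the top-level document. In that case, switch the driver into the iframe first, as in the following code:

    WebElement iframe = driver.findElement(By.tagName("iframe"));
    driver.switchTo().frame(iframe);
    // Find elements or do other work here, then switch back when done
    driver.switchTo().parentFrame();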

    5.5 Switching tabs

    How do you work with several open tabs when using chromedriver?

    List<String> tabs = new ArrayList<String>(driver.getWindowHandles());
    // Switch to the first tab
    driver.switchTo().window(tabs.get(0));
    // Switch to the tab at index n
    driver.switchTo().window(tabs.get(n));
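    Indexing into the list assumes the handles come back in the order the tabs were opened. getWindowHandles() returns a Set, and I am not sure that order is guaranteed. A sketch that avoids the assumption: record the handles before opening a tab, read them again afterwards, and take the one that is new. `TabHandles` and `findNewHandle` are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: finds the handle that appeared between two snapshots.
final class TabHandles {
    /** Returns the single handle present in 'after' but not in 'before'. */
    static String findNewHandle(Set<String> before, Set<String> after) {
        Set<String> added = new LinkedHashSet<>(after);
        added.removeAll(before);
        if (added.size() != 1) {
            throw new IllegalStateException("expected exactly one new tab, found " + added.size());
        }
        return added.iterator().next();
    }
}
```

    In a crawler, `before` would be driver.getWindowHandles() taken just before the click that opens the tab, `after` the same call taken just after it, and the result would go to driver.switchTo().window(...).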
  • Original article: https://www.cnblogs.com/yuarvin/p/16269056.html