• Heritrix 3.1.0 Source Code Analysis (Part 3)


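    Starting only from the static logical structure of Heritrix 3.1.0 tends to hide how the system's objects interact; analyzing only the dynamic object structure, on the other hand, hides the system's overall logical outline.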

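    Source code analysis therefore needs to cover both the static and the dynamic view, which makes the logic and the interactions easier to follow. This article takes that approach.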

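    This article looks at how Spring manages the beans of the Heritrix 3.1.0 system. The previous article already gave a first look at the Spring container's configuration file.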

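    First, consider how the Spring container loads its configuration file and how it is initialized. When we run the build action on a crawl job, the void validateConfiguration() method of the CrawlJob object is called: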

    /**
         * Does the assembled ApplicationContext self-validate? Any failures
         * are reported as WARNING log events in the job log. 
         * 
         * TODO: make these severe? 
         */
        public synchronized void validateConfiguration() {
            instantiateContainer();
            if(ac==null) {
                // fatal errors already encountered and reported
                return; 
            }
            ac.validate();
            HashMap<String,Errors> allErrors = ac.getAllErrors();
            for(String name : allErrors.keySet()) {
                for(Object err : allErrors.get(name).getAllErrors()) {
                   LOGGER.log(Level.WARNING,err.toString());
                }
            }
        }

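    The method first loads the Spring configuration file and initializes the Spring container (instantiateContainer()), and then validates the container (ac.validate()). The instantiateContainer() method is shown further below.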
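
    The "validate, then log every problem as a warning" flow can be sketched in plain Java. This is only a simplified model: the ValidationSketch class, the SelfValidating interface, and the bean names are hypothetical and are not part of Heritrix or Spring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Simplified model of "validate every bean, log each problem as a WARNING". */
class ValidationSketch {

    /** Hypothetical marker for beans that can check their own settings. */
    interface SelfValidating {
        List<String> validate();
    }

    /** Collects the non-empty error lists, keyed by bean name. */
    static Map<String, List<String>> collectErrors(Map<String, Object> beans) {
        Map<String, List<String>> allErrors = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : beans.entrySet()) {
            if (e.getValue() instanceof SelfValidating) {
                List<String> errors = ((SelfValidating) e.getValue()).validate();
                if (!errors.isEmpty()) {
                    allErrors.put(e.getKey(), new ArrayList<>(errors));
                }
            }
        }
        return allErrors;
    }

    /** Builds a tiny bean map: one valid bean, one invalid bean, one plain bean. */
    static Map<String, List<String>> demo() {
        Map<String, Object> beans = new LinkedHashMap<>();
        SelfValidating good = () -> Collections.emptyList();
        SelfValidating bad = () -> Arrays.asList("contact URL is missing");
        beans.put("goodBean", good);
        beans.put("badBean", bad);
        beans.put("plainBean", "not validated at all");
        return collectErrors(beans);
    }

    public static void main(String[] args) {
        // Failures are reported but do not abort, as in validateConfiguration().
        Logger logger = Logger.getLogger(ValidationSketch.class.getName());
        for (Map.Entry<String, List<String>> e : demo().entrySet()) {
            for (String err : e.getValue()) {
                logger.log(Level.WARNING, e.getKey() + ": " + err);
            }
        }
    }
}
```

    As in the method above, a validation failure only produces a log entry; it is up to the caller to decide whether the job may proceed.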

    /**
         * Can the configuration yield an assembled ApplicationContext? 
         */
        public synchronized void instantiateContainer() {
            checkXML(); 
            if(ac==null) {
                try {
                    ac = new PathSharingContext(new String[] {"file:"+primaryConfig.getAbsolutePath()},false,null);
                    ac.addApplicationListener(this);
                    ac.refresh();
                    getCrawlController(); // trigger NoSuchBeanDefinitionException if no CC
                    getJobLogger().log(Level.INFO,"Job instantiated");
                } catch (BeansException be) {
                    // Calling doTeardown() and therefore ac.close() here sometimes
                    // triggers an IllegalStateException and logs stack trace from
                    // within spring, even if ac.isActive(). So, just null it.
                    ac = null;
                    beansException(be);
                }
            }
        }

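    The method above loads the configuration file and registers the CrawlJob object as an application listener. The context is constructed with refresh set to false, so nothing is loaded until the explicit ac.refresh() call, by which point the listener is already registered.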
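
    The pattern in instantiateContainer() (construct without loading, register a listener, refresh explicitly, and drop the reference on failure) can be modeled without Spring. The following is a minimal sketch under that assumption; SketchContext and SketchJob are hypothetical stand-ins, not the real PathSharingContext or CrawlJob.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a Spring-style context. */
class SketchContext {
    private final String configLocation;
    private final List<Runnable> listeners = new ArrayList<>();
    private boolean active;

    SketchContext(String configLocation) {
        // Comparable to refresh=false: remember the location, load nothing yet.
        this.configLocation = configLocation;
    }

    void addListener(Runnable listener) {
        listeners.add(listener);
    }

    /** Loads the configuration; fails if the location is unusable. */
    void refresh() {
        if (configLocation == null || configLocation.isEmpty()) {
            throw new IllegalStateException("cannot load configuration");
        }
        active = true;
        for (Runnable listener : listeners) {
            listener.run(); // listeners registered before refresh() see this event
        }
    }

    boolean isActive() {
        return active;
    }
}

/** Hypothetical stand-in for the job that owns the context. */
class SketchJob {
    SketchContext ac;   // stays null while the job is not assembled
    int eventsSeen;
    String lastError;
    private final String config;

    SketchJob(String config) {
        this.config = config;
    }

    synchronized void instantiateContainer() {
        if (ac != null) {
            return; // already assembled, nothing to do
        }
        try {
            ac = new SketchContext(config);
            ac.addListener(() -> eventsSeen++);
            ac.refresh();
        } catch (RuntimeException e) {
            ac = null; // a half-built context is discarded, not closed
            lastError = e.getMessage();
        }
    }

    public static void main(String[] args) {
        SketchJob job = new SketchJob("file:/tmp/crawler-beans.cxml");
        job.instantiateContainer();
        System.out.println("active=" + job.ac.isActive() + ", events=" + job.eventsSeen);
    }
}
```

    Keeping the field null after a failure lets later callers use a simple null check to tell whether the job is assembled.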

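    The Spring container in Heritrix 3.1.0 is a PathSharingContext object, a wrapper class provided by Heritrix that extends Spring's FileSystemXmlApplicationContext. The configuration file locations are passed in through its constructor: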

    public PathSharingContext(String[] configLocations, boolean refresh, ApplicationContext parent) throws BeansException {
            super(configLocations, refresh, parent);
        }

    When we run the launch action on a crawl job, the void launch() method of the CrawlJob object is called:

    /**
         * Launch a crawl into 'running' status, assembling if necessary. 
         * 
         * (Note the crawl may have been configured to start in a 'paused'
         * state.) 
         */
        public synchronized void launch() {
            if (isProfile()) {
                throw new IllegalArgumentException("Can't launch profile" + this);
            }
            
            if(isRunning()) {
                getJobLogger().log(Level.SEVERE,"Can't relaunch running job");
                return;
            } else {
                CrawlController cc = getCrawlController();
                if(cc!=null && cc.hasStarted()) {
                    getJobLogger().log(Level.SEVERE,"Can't relaunch previously-launched assembled job");
                    return;
                }
            }
            
            validateConfiguration();
            if(!hasValidApplicationContext()) {
                getJobLogger().log(Level.SEVERE,"Can't launch problem configuration");
                return;
            }
    
            //final String job = changeState(j, ACTIVE);
            
            // this temporary thread ensures all crawl-created threads
            // land in the AlertThreadGroup, to assist crawl-wide 
            // logging/alerting
            alertThreadGroup = new AlertThreadGroup(getShortName());
            alertThreadGroup.addLogger(getJobLogger());
            Thread launcher = new Thread(alertThreadGroup, getShortName()+" launchthread") {
                public void run() {
                    CrawlController cc = getCrawlController();
                    startContext();
                    if(cc!=null) {
                        cc.requestCrawlStart();
                    }
                }
            };
            getJobLogger().log(Level.INFO,"Job launched");
            scanJobLog();
            launcher.start();
            // look busy (and give startContext/crawlStart a chance)
            try {
                Thread.sleep(1500);
            } catch (InterruptedException e) {
                // do nothing
            }
        }
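
    The comment about the temporary launcher thread relies on a standard Java rule: a thread created without an explicit ThreadGroup joins the group of the thread that creates it (assuming no SecurityManager overrides this). A minimal sketch of that rule follows; the class name LaunchGroupSketch and the group name are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

class LaunchGroupSketch {

    /** Returns the group name of a thread that is created from inside the launcher. */
    static String groupOfChild(String groupName) {
        ThreadGroup group = new ThreadGroup(groupName);
        AtomicReference<String> childGroup = new AtomicReference<>();
        Thread launcher = new Thread(group, () -> {
            // No group is passed here, so the child inherits the launcher's group.
            Thread child = new Thread(() -> { });
            childGroup.set(child.getThreadGroup().getName());
        }, groupName + " launchthread");
        launcher.start();
        try {
            launcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for launcher", e);
        }
        return childGroup.get();
    }

    public static void main(String[] args) {
        System.out.println(groupOfChild("myjob"));
    }
}
```

    This is why launch() starts the context from a dedicated thread instead of calling startContext() directly: every thread the crawl spawns from there ends up in the same group.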

    The important call here is void startContext(), invoked from inside the launcher thread:

    /**
         * Start the context, catching and reporting any BeansExceptions.
         */
        protected synchronized void startContext() {
            try {
                ac.start(); 
                
                // job log file covering just this launch
                getJobLogger().removeHandler(currentLaunchJobLogHandler);
                File f = new File(ac.getCurrentLaunchDir(), "job.log");
                currentLaunchJobLogHandler = new FileHandler(f.getAbsolutePath(), true);
                currentLaunchJobLogHandler.setFormatter(new JobLogFormatter());
                getJobLogger().addHandler(currentLaunchJobLogHandler);
                
            } catch (BeansException be) {
                doTeardown();
                beansException(be);
            } catch (Exception e) {
                LOGGER.log(Level.SEVERE,e.getClass().getSimpleName()+": "+e.getMessage(),e);
                try {
                    doTeardown();
                } catch (Exception e2) {
                    e2.printStackTrace(System.err);
                }        
            }
        }

    This method calls the start method of the PathSharingContext object:

     @Override
        public void start() {
            initLaunchDir();
            super.start();
        }

    In the method above, super.start() invokes the start method of every bean in the Spring container that implements the Lifecycle interface.
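
    To make the propagation concrete, here is a minimal plain-Java sketch of a container whose subclass does some preparation and then lets the base class forward start() to every life-cycle bean. All names (SimpleLifecycle, BaseContainer, LaunchDirContainer, Worker) are hypothetical; the real Spring implementation is more involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical three-method life cycle, shaped like Spring's Lifecycle. */
interface SimpleLifecycle {
    void start();
    void stop();
    boolean isRunning();
}

/** Base container: start() forwards to every registered bean with a life cycle. */
class BaseContainer {
    private final Map<String, Object> beans = new LinkedHashMap<>();
    protected final List<String> trace = new ArrayList<>();

    void register(String name, Object bean) {
        beans.put(name, bean);
    }

    void start() {
        for (Map.Entry<String, Object> e : beans.entrySet()) {
            if (e.getValue() instanceof SimpleLifecycle) {
                SimpleLifecycle bean = (SimpleLifecycle) e.getValue();
                if (!bean.isRunning()) {
                    bean.start();
                    trace.add("start:" + e.getKey());
                }
            }
        }
    }

    List<String> trace() {
        return trace;
    }
}

/** Subclass that prepares something first, then delegates, like the override above. */
class LaunchDirContainer extends BaseContainer {
    @Override
    void start() {
        trace.add("initLaunchDir");
        super.start();
    }
}

class Worker implements SimpleLifecycle {
    private boolean running;
    public void start() { running = true; }
    public void stop() { running = false; }
    public boolean isRunning() { return running; }
}

class LifecycleStartSketch {
    static List<String> demo() {
        LaunchDirContainer c = new LaunchDirContainer();
        c.register("frontier", new Worker());
        c.register("description", "a plain String bean"); // no life cycle: skipped
        c.register("fetchHttp", new Worker());
        c.start();
        return c.trace();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

    Beans that do not implement the life-cycle interface are simply skipped, so only components that opt in receive the start signal.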

    The Lifecycle interface declares the following methods, which define the life cycle of a bean component:

    public interface Lifecycle {
    
        /**
         * Start this component.
         * Should not throw an exception if the component is already running.
         * <p>In the case of a container, this will propagate the start signal
         * to all components that apply.
         */
        void start();
    
        /**
         * Stop this component.
         * Should not throw an exception if the component isn't started yet.
         * <p>In the case of a container, this will propagate the stop signal
         * to all components that apply.
         */
        void stop();
    
        /**
         * Check whether this component is currently running.
         * <p>In the case of a container, this will return <code>true</code>
         * only if <i>all</i> components that apply are currently running.
         * @return whether the component is currently running
         */
        boolean isRunning();
    
    }

    From this we can see that Heritrix 3.1.0 relies on the Spring container to manage bean life cycles in one place (mainly initialization).

    By adding print statements, I recorded which beans in the system had their start method called:

    name:scope
    name:loggerModule||org.archive.crawler.reporting.CrawlerLoggerModule
    name:scope||org.archive.modules.deciderules.DecideRuleSequence
    
    name:candidateScoper
    name:candidateScoper||org.archive.crawler.prefetch.CandidateScoper
    
    name:preparer
    name:preparer||org.archive.crawler.prefetch.FrontierPreparer
    
    name:candidateProcessors
    name:candidateProcessors||org.archive.modules.CandidateChain
    
    name:preselector
    name:preselector||org.archive.crawler.prefetch.MyPreselector
    
    name:preconditions
    name:bdb||org.archive.bdb.BdbModule
    name:serverCache||org.archive.modules.net.BdbServerCache
    name:preconditions||org.archive.crawler.prefetch.PreconditionEnforcer
    
    name:fetchDns
    name:fetchDns||org.archive.modules.fetcher.FetchDNS
    
    name:fetchHttp
    name:cookieStorage||org.archive.modules.fetcher.BdbCookieStorage
    name:fetchHttp||org.archive.modules.fetcher.FetchHTTP
    
    name:extractorHttp
    name:statisticsTracker||org.archive.crawler.reporting.StatisticsTracker
    name:extractorHtml||org.archive.modules.extractor.ExtractorHTML
    name:extractorCss||org.archive.modules.extractor.ExtractorCSS
    name:extractorJs||org.archive.modules.extractor.ExtractorJS
    name:extractorSwf||org.archive.modules.extractor.ExtractorSWF
    name:fetchProcessors||org.archive.modules.FetchChain
    name:warcWriter||org.archive.modules.writer.MyWriterProcessor
    name:candidates||org.archive.crawler.postprocessor.CandidatesProcessor
    name:disposition||org.archive.crawler.postprocessor.DispositionProcessor
    name:dispositionProcessors||org.archive.modules.DispositionChain
    name:crawlController||org.archive.crawler.framework.CrawlController
    name:uriUniqFilter||org.archive.crawler.util.BdbUriUniqFilter
    name:frontier||org.archive.crawler.frontier.BdbFrontier
    
    name:actionDirectory
    name:actionDirectory||org.archive.crawler.framework.ActionDirectory
    
    name:checkpointService
    name:checkpointService||org.archive.crawler.framework.CheckpointService
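
    One possible reading of the output above (an interpretation, not something the source confirms): each bare name: line marks a bean the container is about to process, and each name:...||class line marks an actual start() call, with the beans that a given bean depends on started before the bean itself. A minimal plain-Java sketch of such dependency-first starting follows. The class DependencyFirstStart, the started: trace marker, and the dependency edges used in demo() are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DependencyFirstStart {

    /** Starts every bean once, always starting its dependencies first. */
    static List<String> startAll(Map<String, List<String>> dependsOn) {
        List<String> trace = new ArrayList<>();
        Set<String> started = new HashSet<>();
        for (String name : dependsOn.keySet()) {
            if (started.contains(name)) {
                continue; // already started as someone else's dependency
            }
            trace.add("name:" + name);
            doStart(name, dependsOn, started, trace);
        }
        return trace;
    }

    private static void doStart(String name, Map<String, List<String>> dependsOn,
                                Set<String> started, List<String> trace) {
        if (!started.add(name)) {
            return; // marking before recursing also guards against cycles
        }
        for (String dep : dependsOn.getOrDefault(name, Collections.<String>emptyList())) {
            doStart(dep, dependsOn, started, trace);
        }
        trace.add("started:" + name);
    }

    /** Assumed edges: preconditions -> serverCache -> bdb; fetchDns has none. */
    static List<String> demo() {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("fetchDns", Collections.<String>emptyList());
        dependsOn.put("preconditions", Collections.singletonList("serverCache"));
        dependsOn.put("serverCache", Collections.singletonList("bdb"));
        dependsOn.put("bdb", Collections.<String>emptyList());
        return startAll(dependsOn);
    }

    public static void main(String[] args) {
        for (String line : demo()) {
            System.out.println(line);
        }
    }
}
```

    With these assumed edges, bdb and serverCache are started before preconditions, which matches the order of the preconditions group in the output above.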

    --------------------------------------------------------------------------- 

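    This Heritrix 3.1.0 source code analysis series is my own original work.

    If you repost it, please credit the source: cnblogs (博客园), 刺猬的温驯.

    Article link: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025410.html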

  • Original article: https://www.cnblogs.com/chenying99/p/3025410.html