• Heritrix 3.1.0 Source Code Analysis (Part 11)


    The previous article analyzed how Heritrix 3.1.0 adds CrawlURI curi objects. So how are the seed CrawlURI objects loaded when the system initializes?

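    Recall from the earlier articles that when we issue the launch command for a crawl job, what actually gets called is the void requestCrawlStart() method of the CrawlController object: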

    /** 
         * Operator requested crawl begin
         */
        public void requestCrawlStart() {
            hasStarted = true; 
            sendCrawlStateChangeEvent(State.PREPARING, CrawlStatus.PREPARING);
            
            if(recoveryCheckpoint==null) {
                // only announce (trigger scheduling of) seeds
                // when doing a cold (non-recovery) start
                getSeeds().announceSeeds();
            }
            
            setupToePool();
    
            // A proper exit will change this value.
            this.sExit = CrawlStatus.FINISHED_ABNORMAL;
            
            if (getPauseAtStart()) {
                // frontier is already paused unless started, so just 
                // 'complete'/ack pause
                completePause();
            } else {
                getFrontier().run();
            }
        }

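    On a cold (non-recovery) start this goes on to call getSeeds().announceSeeds(). The object actually returned by getSeeds() here is a TextSeedModule (injected by Spring), and its void announceSeeds() method is invoked: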

    /**
         * Announce all seeds from configured source to SeedListeners 
         * (including nonseed lines mixed in). 
         * @see org.archive.modules.seeds.SeedModule#announceSeeds()
         */
        public void announceSeeds() {
            if(getBlockAwaitingSeedLines()>-1) {
                final CountDownLatch latch = new CountDownLatch(getBlockAwaitingSeedLines());
                new Thread(){
                    @Override
                    public void run() {
                        announceSeeds(latch); 
                        while(latch.getCount()>0) {
                            latch.countDown();
                        }
                    }
                }.start();
                try {
                    latch.await();
                } catch (InterruptedException e) {
                    // do nothing
                } 
            } else {
                announceSeeds(null); 
            }
        }
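
    The latch handoff in announceSeeds() can be tried out in isolation. The following is a minimal sketch, not Heritrix code: the class SeedLatchSketch, its announce and process methods, and the use of plain strings in place of seed lines are all assumptions made for illustration. It shows the same idea: a worker thread counts the latch down once per line and drains it at the end, so the caller waits for at most the configured number of lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the latch logic in TextSeedModule.announceSeeds().
class SeedLatchSketch {
    /** Lines handled so far; thread-safe because the worker thread appends to it. */
    static final List<String> processed = new CopyOnWriteArrayList<>();

    /**
     * Processes all lines, blocking the caller only until blockAwaiting lines
     * are done (or the input runs out). A negative value processes everything
     * on the calling thread. Returns the worker thread, or null if none was used.
     */
    static Thread announce(List<String> lines, int blockAwaiting) {
        if (blockAwaiting < 0) {
            process(lines, null);
            return null;
        }
        final CountDownLatch latch = new CountDownLatch(blockAwaiting);
        Thread worker = new Thread(() -> {
            process(lines, latch);
            // Fewer lines than the latch count: drain it so the caller never hangs.
            while (latch.getCount() > 0) {
                latch.countDown();
            }
        });
        worker.start();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return worker;
    }

    /** Handles each line and, if a latch was supplied, counts it down once per line. */
    static void process(List<String> lines, CountDownLatch latchOrNull) {
        for (String line : lines) {
            processed.add(line);
            if (latchOrNull != null) {
                latchOrNull.countDown();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = announce(Arrays.asList("a.example", "b.example", "c.example"), 2);
        System.out.println("caller resumed; lines processed so far: " + processed.size());
        worker.join();
        System.out.println("worker finished; total lines: " + processed.size());
    }
}
```

    Unlike the original, which silently swallows InterruptedException, this sketch restores the interrupt flag; that is a choice made for the example, not something the Heritrix source does.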

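     In the if branch of the method above, the CountDownLatch latch counts processed seed lines, not threads: seed loading runs on a new thread, and the calling thread blocks only until getBlockAwaitingSeedLines() lines have been handled (the while loop after announceSeeds(latch) drains the latch when there are fewer lines than that). In the else branch null is passed and loading runs synchronously on the calling thread. Either way, void announceSeeds(CountDownLatch latchOrNull) is called next: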

    protected void announceSeeds(CountDownLatch latchOrNull) {
            BufferedReader reader = new BufferedReader(textSource.obtainReader());       
            try {
                announceSeedsFromReader(reader,latchOrNull);    
            } finally {
                IOUtils.closeQuietly(reader);
            }
        }

     This first obtains a Reader (a StringReader) from the ReadSource textSource (here an org.archive.spring.ConfigString), then calls void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull):

    /**
         * Announce all seeds (and nonseed possible-directive lines) from
         * the given Reader
         * @param reader source of seed/directive lines
         * @param latchOrNull if non-null, sent countDown after each line, allowing 
         * another thread to proceed after a configurable number of lines processed
         */
        protected void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull) {
            String s;
            Iterator<String> iter = 
                new RegexLineIterator(
                        new LineReadingIterator(reader),
                        RegexLineIterator.COMMENT_LINE,
                        RegexLineIterator.NONWHITESPACE_ENTRY_TRAILING_COMMENT,
                        RegexLineIterator.ENTRY);
    
            int count = 0; 
            while (iter.hasNext()) {
                s = (String) iter.next();
                if(Character.isLetterOrDigit(s.charAt(0))) {
                    // consider a likely URI
                    seedLine(s);
                    count++;
                    if(count%20000==0) {
                        System.runFinalization();
                    }
                } else {
                    // report just in case it's a useful directive
                    nonseedLine(s);
                }
                if(latchOrNull!=null) {
                    latchOrNull.countDown(); 
                }
            }
            publishConcludedSeedBatch(); 
        }
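
    The split between seed lines and non-seed lines above can be illustrated without Heritrix on the classpath. The sketch below is an approximation under stated assumptions: the class SeedLineSketch and its classify method are hypothetical, and the comment handling (skip blank lines and lines starting with #, drop a trailing comment that follows whitespace) is a simplified stand-in for RegexLineIterator, whose exact patterns are not reproduced here. Only the first-character test with Character.isLetterOrDigit is taken directly from the code above.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of announceSeedsFromReader().
class SeedLineSketch {
    final List<String> seeds = new ArrayList<>();
    final List<String> nonSeeds = new ArrayList<>();

    /** Reads seed text line by line and sorts entries into seeds and non-seeds. */
    static SeedLineSketch classify(String seedText) {
        SeedLineSketch result = new SeedLineSketch();
        BufferedReader reader = new BufferedReader(new StringReader(seedText));
        reader.lines().forEach(line -> {
            // Assumed simplification: strip a trailing "# comment" preceded by whitespace.
            String entry = line.replaceFirst("\\s+#.*$", "").trim();
            if (entry.isEmpty() || entry.startsWith("#")) {
                return; // blank line or whole-line comment
            }
            if (Character.isLetterOrDigit(entry.charAt(0))) {
                result.seeds.add(entry);     // likely a URI
            } else {
                result.nonSeeds.add(entry);  // possibly a directive
            }
        });
        return result;
    }

    public static void main(String[] args) {
        String text = "# my seeds\nexample.com   # main site\n\n+directive\nhttp://archive.org/\n";
        SeedLineSketch r = classify(text);
        System.out.println("seeds: " + r.seeds);
        System.out.println("non-seeds: " + r.nonSeeds);
    }
}
```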

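     The loop iterates over the lines; each line that starts with a letter or digit is treated as a likely URL and passed to void seedLine(String uri), while any other line goes to nonseedLine(String):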

    /**
         * Handle a read line that is probably a seed.
         * 
         * @param uri String seed-containing line
         */
        protected void seedLine(String uri) {
            if (!uri.matches("[a-zA-Z][\\w+\\-]+:.*")) { // Rfc2396 s3.1 scheme,
                                                         // minus '.'
                // Does not begin with scheme, so try http://
                uri = "http://" + uri;
            }
            try {
                UURI uuri = UURIFactory.getInstance(uri);
                CrawlURI curi = new CrawlURI(uuri);
                curi.setSeed(true);
                curi.setSchedulingDirective(SchedulingConstants.MEDIUM);
                if (getSourceTagSeeds()) {
                    curi.setSourceTag(curi.toString());
                }
                publishAddedSeed(curi);
            } catch (URIException e) {
                // try as nonseed line as fallback
                nonseedLine(uri);
            }
        }
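
    The scheme check at the top of seedLine() is easy to experiment with on its own. The sketch below reuses the regular expression from the code above verbatim; the class SeedSchemeSketch and the method withDefaultScheme are hypothetical names introduced for the example.

```java
// Hypothetical helper isolating the scheme check from seedLine().
class SeedSchemeSketch {
    /** Prepends "http://" unless the line already starts with something that looks like a scheme. */
    static String withDefaultScheme(String uri) {
        if (!uri.matches("[a-zA-Z][\\w+\\-]+:.*")) {
            return "http://" + uri;
        }
        return uri;
    }

    public static void main(String[] args) {
        String[] samples = {
            "example.com",
            "https://example.com/",
            "example.com:8080/x",
            "localhost:8080"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + withDefaultScheme(s));
        }
    }
}
```

    Because the pattern excludes '.', a host with a port such as example.com:8080/x still gets http:// prepended. A dot-free host with a port, such as localhost:8080, matches the pattern and is left unchanged, so "localhost" is read as if it were the scheme.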

    Finally it calls void publishAddedSeed(CrawlURI curi) of the parent class SeedModule (the observer pattern):

    protected void publishAddedSeed(CrawlURI curi) {
            for (SeedListener l: seedListeners) {
                l.addedSeed(curi);
            }
        }

    The BdbFrontier class implements the SeedListener interface indirectly, through the void addedSeed(CrawlURI puri) method of the abstract class AbstractFrontier:

    /**
         * When notified of a seed via the SeedListener interface, 
         * schedule it.
         * 
         * @see org.archive.modules.seeds.SeedListener#addedSeed(org.archive.modules.CrawlURI)
         */
        public void addedSeed(CrawlURI puri) {
            schedule(puri);
        }
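
    The observer relationship between the seed module and the frontier can be reduced to a few lines. This is a minimal sketch under stated assumptions, not the Heritrix classes: SeedObserverSketch, Listener, Module and QueueFrontier are hypothetical names, seeds are plain strings instead of CrawlURI objects, and "scheduling" is modeled as appending to a list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the SeedModule / SeedListener relationship.
class SeedObserverSketch {
    /** Plays the role of SeedListener. */
    interface Listener {
        void addedSeed(String seed);
    }

    /** Plays the role of SeedModule: keeps listeners and notifies each one per seed. */
    static class Module {
        private final List<Listener> listeners = new ArrayList<>();

        void addListener(Listener l) {
            listeners.add(l);
        }

        void publishAddedSeed(String seed) {
            for (Listener l : listeners) {
                l.addedSeed(seed);
            }
        }
    }

    /** Plays the role of the frontier: every announced seed is scheduled. */
    static class QueueFrontier implements Listener {
        final List<String> scheduled = new ArrayList<>();

        @Override
        public void addedSeed(String seed) {
            scheduled.add(seed); // stands in for schedule(puri)
        }
    }

    public static void main(String[] args) {
        Module module = new Module();
        QueueFrontier frontier = new QueueFrontier();
        module.addListener(frontier);
        module.publishAddedSeed("http://example.com/");
        System.out.println("scheduled: " + frontier.scheduled);
    }
}
```

    The seed module never needs to know what a frontier is; anything that implements the listener interface receives the seeds.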

    ---------------------------------------------------------------------------

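    This Heritrix 3.1.0 source code analysis series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/20/3031924.html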

  • Original article: https://www.cnblogs.com/chenying99/p/3031924.html