• Heritrix 3.1.0 Source Code Analysis (Part 14)


    When I analyzed the void schedule(CrawlURI caURI), CrawlURI next(), and void finished(CrawlURI cURI) methods of BdbFrontier, I left some of the surrounding machinery unexamined; frankly, I was getting a little tired.

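    This post looks at how Heritrix 3.1.0 keeps object property values consistent in a multithreaded environment, and how those property values can be overridden through configuration.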

    The void schedule(CrawlURI curi) method of the WorkQueueFrontier class contains the following:

    @Override
        public void schedule(CrawlURI curi) {
            sheetOverlaysManager.applyOverlaysTo(curi);
            try {
                KeyedProperties.loadOverridesFrom(curi);
                //KeyedProperties.withOverridesDo(ocontext, todo)
                if(curi.getClassKey()==null) {
                    // remedial processing
                    preparer.prepare(curi);
                }
                processScheduleIfUnique(curi);
            } finally {
                KeyedProperties.clearOverridesFrom(curi); 
            }
        }

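    Two classes are involved here: SheetOverlaysManager and KeyedProperties. Start with KeyedProperties, which extends ConcurrentHashMap<String,Object>, a thread-safe concurrent map.

    The map itself holds an object's default property values as key-value pairs. Per-URI overrides are not copied into it; instead, the contexts that supply them are kept on a stack in a ThreadLocal (thread-local) variable, so each thread sees only the overrides for the URI it is currently processing: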

     /**
         * ThreadLocal (contextual) collection of pushed override maps
         */
        static ThreadLocal<ArrayList<OverlayContext>> threadOverrides = 
            new ThreadLocal<ArrayList<OverlayContext>>() {
            protected ArrayList<OverlayContext> initialValue() {
                return new ArrayList<OverlayContext>();
            }
        };
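
    To make the thread-local behaviour concrete, here is a minimal sketch. It is not Heritrix code: the class ThreadLocalStackDemo and its method are hypothetical, and plain strings stand in for OverlayContext objects. It only illustrates that an entry pushed by one thread is invisible to another.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Heritrix.
class ThreadLocalStackDemo {
    // Same shape as KeyedProperties.threadOverrides, but holding plain strings.
    static final ThreadLocal<List<String>> STACK =
            ThreadLocal.withInitial(ArrayList::new);

    /** Pushes a name on the calling thread, then reports the list size another thread sees. */
    static int sizeSeenByOtherThread(String name) {
        STACK.get().add(name);                 // visible only to the calling thread
        int[] seen = { -1 };
        Thread other = new Thread(() -> seen[0] = STACK.get().size());
        other.start();
        try {
            other.join();                      // join() makes seen[0] safely visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("other thread sees " + sizeSeenByOtherThread("veryPolite"));
        System.out.println("this thread sees " + STACK.get().size());
    }
}
```

    Because every worker thread gets its own list, the overrides loaded for one CrawlURI cannot leak into a lookup made by a thread that is processing a different URI.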

    The ThreadLocal holds an ArrayList<OverlayContext>. The methods below add contexts to that list and remove them from it:

    /**
         * Add an override map to the stack 
         * @param m Map to add
         */
        static public void pushOverrideContext(OverlayContext ocontext) {
            threadOverrides.get().add(ocontext);
        }
        
        /**
         * Remove last-added override map from the stack
         * @return Map removed
         */
        static public OverlayContext popOverridesContext() {
            // TODO maybe check that pop is as expected
            return threadOverrides.get().remove(threadOverrides.get().size()-1);
        }
        
        static public void clearAllOverrideContexts() {
            threadOverrides.get().clear(); 
        }
        
        static public void loadOverridesFrom(OverlayContext ocontext) {
            assert ocontext.haveOverlayNamesBeenSet();
            pushOverrideContext(ocontext);
        }
        
        static public boolean clearOverridesFrom(OverlayContext ocontext) {
            return threadOverrides.get().remove(ocontext);
        }
        
        static public void withOverridesDo(OverlayContext ocontext, Runnable todo) {
            try {
                loadOverridesFrom(ocontext);
                todo.run();
            } finally {
                clearOverridesFrom(ocontext); 
            }
        }
    
        public static boolean overridesActiveFrom(OverlayContext ocontext) {
            return threadOverrides.get().contains(ocontext);
        }
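
    The load/clear pairing above is what keeps the thread-local stack clean. The following is a simplified sketch of that pattern, assuming string contexts instead of OverlayContext objects; OverrideScopeDemo is a hypothetical name, and unlike the original it registers the context just before the try block rather than inside it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of Heritrix.
class OverrideScopeDemo {
    static final ThreadLocal<List<String>> ACTIVE =
            ThreadLocal.withInitial(ArrayList::new);

    /** Runs todo with the context active, removing it afterwards even if todo throws. */
    static void withOverridesDo(String context, Runnable todo) {
        ACTIVE.get().add(context);
        try {
            todo.run();
        } finally {
            ACTIVE.get().remove(context);   // List.remove(Object): first equal element
        }
    }

    static boolean isActive(String context) {
        return ACTIVE.get().contains(context);
    }

    public static void main(String[] args) {
        withOverridesDo("curi-1", () ->
                System.out.println("inside: " + isActive("curi-1")));
        try {
            withOverridesDo("curi-2", () -> {
                throw new IllegalStateException("boom");
            });
        } catch (IllegalStateException expected) {
            // the finally block has already removed "curi-2"
        }
        System.out.println("after: " + isActive("curi-1") + ", " + isActive("curi-2"));
    }
}
```

    One detail worth noticing in the original: popOverridesContext removes the last element of the list, while clearOverridesFrom uses ArrayList.remove(Object), which removes the first element equal to its argument.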

    The next group of methods looks up a property value by key, calling back into the OverlayContext interface:

    /** the alternate global property-paths leading to this map 
         * TODO: consider if deterministic ordered list is important */
        HashSet<String> externalPaths = new HashSet<String>(); 
        
        /**
         * Add a path by which the outside world can reach this map
         * @param path String path
         */
        public void addExternalPath(String path) {
            externalPaths.add(path);
        }
    
        /**
         * Get the given value, checking override maps if appropriate.
         * 
         * @param key
         * @return discovered override, or local value
         */
        public Object get(String key) {
            ArrayList<OverlayContext> overlays = threadOverrides.get();
            for(int i = overlays.size()-1; i>=0; i--) {
                OverlayContext ocontext = overlays.get(i); 
                for(int j = ocontext.getOverlayNames().size()-1; j>=0; j--) {
                    String name = ocontext.getOverlayNames().get(j);
                    Map<String,Object> m = ocontext.getOverlayMap(name);
                    for(String ok : getOverrideKeys(key)) {
                        Object val = m.get(ok);
                        if(val!=null) {
                            return val;
                        }
                    }
                }
            }
    
            return super.get(key);
        }
    
        /**
         * Compose the complete keys (externalPath + local key name) to use
         * for checking for contextual overrides. 
         * 
         * @param key local key to compose
         * @return List of full keys to check
         */
        protected List<String> getOverrideKeys(String key) {
            ArrayList<String> keys = new ArrayList<String>(externalPaths.size());
            for(String path : externalPaths) {
                keys.add(path+"."+key);
            }
            return keys;
        }
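
    The precedence rules in get can be hard to see in the nested loops, so here is a reduced sketch for a single override context. Everything in it is an assumption made for illustration: the class OverlayLookupDemo, the sheet named fast, and the sample values are hypothetical; only the key format (external path + "." + local key) and the last-to-first iteration follow the code above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for KeyedProperties.get; not Heritrix code.
class OverlayLookupDemo {
    static final String EXTERNAL_PATH = "disposition";          // as registered by prime()
    static final Map<String, Object> LOCAL = new HashMap<>();   // the bean's own values
    static final Map<String, Map<String, Object>> SHEETS = new HashMap<>();
    static {
        LOCAL.put("delayFactor", 5);
        LOCAL.put("minDelayMs", 3000);
        Map<String, Object> veryPolite = new HashMap<>();
        veryPolite.put("disposition.delayFactor", 10);
        SHEETS.put("veryPolite", veryPolite);
        Map<String, Object> fast = new HashMap<>();              // hypothetical sheet
        fast.put("disposition.delayFactor", 1);
        SHEETS.put("fast", fast);
    }

    /** Checks the named sheets from last to first, then falls back to LOCAL. */
    static Object get(String key, String... overlayNames) {
        List<String> names = Arrays.asList(overlayNames);
        for (int j = names.size() - 1; j >= 0; j--) {
            Object val = SHEETS.get(names.get(j)).get(EXTERNAL_PATH + "." + key);
            if (val != null) {
                return val;   // first hit wins, so later sheet names take precedence
            }
        }
        return LOCAL.get(key);
    }

    public static void main(String[] args) {
        System.out.println(get("delayFactor"));                        // local value
        System.out.println(get("delayFactor", "veryPolite"));          // overridden
        System.out.println(get("delayFactor", "veryPolite", "fast"));  // last sheet wins
        System.out.println(get("minDelayMs", "veryPolite"));           // key not in sheet
    }
}
```

    The practical consequence is that the order in which sheet names are added to a CrawlURI matters: when two sheets set the same property, the one added later wins.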

    The OverlayContext interface is defined as follows:

    /**
     * Interface for objects that can contribute 'overlays' to replace the
     * usual values in configured objects. 
     * @contributor gojomo
     */
    public interface OverlayContext {
        /** test if this context has actually been configured with overlays
         * (even if in fact no overlays were added) */
        public boolean haveOverlayNamesBeenSet();
        /** return a list of the names of overlay maps to consider */ 
        ArrayList<String> getOverlayNames();
        /** get the map corresponding to the overlay name */ 
        Map<String,Object> getOverlayMap(String name);
    }

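    CrawlURI is the only class that implements OverlayContext. Its implementation of the interface is shown below (it in turn calls back into the OverlayMapsSource interface):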

     //
        // OverridesSource implementation
        //
        transient protected ArrayList<String> overlayNames = null;
        transient protected OverlayMapsSource overlayMapsSource; 
        public boolean haveOverlayNamesBeenSet() {
            return overlayNames != null;
        }
        
        public ArrayList<String> getOverlayNames() {
            if(overlayNames == null) {
                overlayNames = new ArrayList<String>(); 
            }
            return overlayNames;
        }
    
        public Map<String, Object> getOverlayMap(String name) {
            return overlayMapsSource.getOverlayMap(name);
        }
    
        public void setOverlayMapsSource(OverlayMapsSource overrideMapsSource) {
            this.overlayMapsSource = overrideMapsSource;
        }

    The OverlayMapsSource interface is defined as follows:

    /**
     * Interface for a source of overlay maps by name. 
     * 
     * @contributor gojomo
     */
    public interface OverlayMapsSource {
        public Map<String,Object> getOverlayMap(String name); 
    }

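    OverlayMapsSource is implemented by SheetOverlaysManager, which also implements the BeanFactoryAware and ApplicationListener interfaces.

    SheetOverlaysManager has the following fields (in essence, the configured sheets plus the mappings that associate CrawlURI objects with sheets):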

    BeanFactory beanFactory; 
        /** all SheetAssociations by DecideRule evaluation */ 
        SortedSet<DecideRuledSheetAssociation> ruleAssociations = 
            new ConcurrentSkipListSet<DecideRuledSheetAssociation>();
        NavigableMap<String,List<String>> sheetNamesBySurt = new ConcurrentSkipListMap<String,List<String>>(); 
        
        /** all sheets by (bean)name*/
        Map<String,Sheet> sheetsByName = new ConcurrentHashMap<String, Sheet>();

    The OverlayMapsSource method is implemented as follows: it looks up the Sheet by key (the String name parameter) and returns that Sheet's map:

     /**
         * Retrieve the named overlay Map.
         * 
         * @see org.archive.spring.OverlayMapsSource#getOverlayMap(java.lang.String)
         */
        public Map<String, Object> getOverlayMap(String name) {
            return sheetsByName.get(name).getMap();
        }

    The ApplicationListener method is implemented as follows:

    /** 
         * Ensure all sheets are 'primed' after the entire ApplicationContext
         * is assembled. This ensures target HasKeyedProperties beans know
         * any long paths by which their properties are addressed, and 
         * handles (by either PropertyEditor-conversion or a fast-failure)
         * any type-mismatches between overlay values and their target
         * properties.
         * @see org.springframework.context.ApplicationListener#onApplicationEvent(org.springframework.context.ApplicationEvent)
         */
        public void onApplicationEvent(ApplicationEvent event) {
            if(event instanceof ContextRefreshedEvent) {
                for(Sheet s: sheetsByName.values()) {
                    s.prime(); // exception if Sheet can't target overridable properties
                }
                // log warning for any sheets named but not present
                HashSet<String> allSheetNames = new HashSet<String>();
                for(DecideRuledSheetAssociation assoc : ruleAssociations) {
                    allSheetNames.addAll(assoc.getTargetSheetNames());
                }
                for(List<String> names : sheetNamesBySurt.values()) {
                    allSheetNames.addAll(names);
                }
                for(String name : allSheetNames) {
                    if(!sheetsByName.containsKey(name)) {
                        logger.warning("sheet '"+name+"' referenced but absent");
                    }
                }
            }
        }

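    Called at initialization, this method first checks that each sheet's entries match their target beans and property types, then logs a warning for any sheet name that is referenced but not defined.

    The void applyOverlaysTo(CrawlURI curi) method: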

    /**
         * Apply the proper overlays (by Sheet beanName) to the given CrawlURI,
         * according to configured associations.  
         * 
         * TODO: add guard against redundant application more than once? 
         * TODO: add mechanism for reapplying overlays after settings change? 
         * @param curi
         */
        public void applyOverlaysTo(CrawlURI curi) {
            curi.setOverlayMapsSource(this); 
            // apply SURT-based overlays
            curi.getOverlayNames().clear(); // clear previous info
            String effectiveSurt = SurtPrefixSet.getCandidateSurt(curi.getPolicyBasisUURI());
            List<String> foundPrefixes = PrefixFinder.findKeys(sheetNamesBySurt, effectiveSurt);       
            for(String prefix : foundPrefixes) {
                for(String name : sheetNamesBySurt.get(prefix)) {
                    curi.getOverlayNames().add(name);
                }
            }
            // apply deciderule-based overlays
            for(DecideRuledSheetAssociation assoc : ruleAssociations) {
                if(assoc.getRules().accepts(curi)) {
                    curi.getOverlayNames().addAll(assoc.getTargetSheetNames());
                }
            }
            // even if no overlays set, let creation of empty list signal
            // step has occurred -- helps ensure overlays added once-only
            curi.getOverlayNames();
        }

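    The method first sets the curi's OverlayMapsSource overlayMapsSource field, then rebuilds the list of sheet names that apply to this CrawlURI: SURT-based matches first, DecideRule-based matches after them. Other objects rely on that list when they look up property values by key.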
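
    The SURT step of applyOverlaysTo can be sketched as follows. This is only an approximation under stated assumptions: SurtOverlayDemo is a hypothetical class, the sheet name extraSheet and the second prefix are made up, and the linear scan with startsWith stands in for PrefixFinder.findKeys, whose actual implementation is not shown in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of SURT-prefix matching; not the Heritrix implementation.
class SurtOverlayDemo {
    static final NavigableMap<String, List<String>> SHEET_NAMES_BY_SURT = new TreeMap<>();
    static {
        SHEET_NAMES_BY_SURT.put("http://(org,example,",
                Arrays.asList("veryPolite", "smallBudget"));
        SHEET_NAMES_BY_SURT.put("http://(org,example,www,)/private",
                Arrays.asList("extraSheet"));   // hypothetical sheet name
    }

    /** Collects the sheet names of every configured prefix that the SURT starts with. */
    static List<String> overlayNamesFor(String surt) {
        List<String> names = new ArrayList<>();
        // Sorted key order puts a shorter prefix before any longer prefix that extends it.
        for (Map.Entry<String, List<String>> e : SHEET_NAMES_BY_SURT.entrySet()) {
            if (surt.startsWith(e.getKey())) {
                names.addAll(e.getValue());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(overlayNamesFor("http://(org,example,www,)/private/a.html"));
        System.out.println(overlayNamesFor("http://(org,example,www,)/index.html"));
        System.out.println(overlayNamesFor("http://(com,other,)/"));
    }
}
```

    In this sketch, sheets attached to a longer (more specific) prefix are added later. Combined with the last-to-first lookup in KeyedProperties.get, that would let the more specific prefix override the more general one.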

    Next comes the Sheet class. First, a look at how a sheet is configured in crawler-beans.cxml (its entries override properties of the targeted beans):

    <!-- veryPolite: any URI to which this sheet's settings are applied 
         will cause its queue to take extra-long politeness snoozes -->
    <bean id='veryPolite' class='org.archive.spring.Sheet'>
     <property name='map'>
      <map>
       <entry key='disposition.delayFactor' value='10'/>
       <entry key='disposition.minDelayMs' value='10000'/>
       <entry key='disposition.maxDelayMs' value='1000000'/>
       <entry key='disposition.respectCrawlDelayUpToSeconds' value='3600'/>
      </map>
     </property>
    </bean>

    Sheet has the following fields:

    /**
         * unique name of this Sheet; if Sheet has a beanName from original
         * configuration, that is always the name -- but the name might 
         * also be another string, in the case of Sheets added after 
         * initial container wiring
         */
        String name;    
        /** map of full property-paths (from BeanFactory to individual 
         * property) and their changed value when this Sheet of overrides
         * is in effect
         */
        Map<String,Object> map = new ConcurrentHashMap<String, Object>(); 

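    The void prime() method verifies that each entry targets a bean with overridable properties and that the value is type-compatible with the target property: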

    /**
         * Ensure any properties targetted by this Sheet know to 
         * check the right property paths for overrides at lookup time,
         * and that the override values are compatible types for their 
         * destination properties. 
         * 
         * Should be done as soon as all possible targets are 
         * constructed (ApplicationListener ContextRefreshedEvent)
         * 
         * TODO: consider if  an 'un-priming' also needs to occur to 
         * prevent confusing side-effects. 
         * TODO: consider if priming should move to another class
         */
        public void prime() {
            for (String fullpath : map.keySet()) {
                int lastDot =  fullpath.lastIndexOf(".");
                String beanPath = fullpath.substring(0,lastDot);
                String terminalProp = fullpath.substring(lastDot+1);
                Object value = map.get(fullpath); 
                int i = beanPath.indexOf(".");
                Object bean; 
                HasKeyedProperties hkp;
                if (i < 0) {
                    bean = beanFactory.getBean(beanPath);
                } else {
                    String beanName = beanPath.substring(0,i);
                    String propPath = beanPath.substring(i+1);
                    BeanWrapperImpl wrapper = new BeanWrapperImpl(beanFactory.getBean(beanName));
                    bean = wrapper.getPropertyValue(propPath);  
                }
                try {
                    hkp = (HasKeyedProperties) bean;
                } catch (ClassCastException cce) {
                    // targetted bean has no overridable properties
                    throw new TypeMismatchException(bean,HasKeyedProperties.class,cce);
                }
                // install knowledge of this path 
                hkp.getKeyedProperties().addExternalPath(beanPath);
                // verify type-compatibility
                BeanWrapperImpl wrapper = new BeanWrapperImpl(hkp);
                Class<?> requiredType = wrapper.getPropertyType(terminalProp);
                try {
                    // convert for destination type
                    map.put(fullpath, wrapper.convertForProperty(value,terminalProp));
                } catch(TypeMismatchException tme) {
                    TypeMismatchException tme2 = 
                        new TypeMismatchException(
                                new PropertyChangeEvent(
                                        hkp,
                                        fullpath,
                                        wrapper.getPropertyValue(terminalProp),
                                        value), requiredType, tme);
                    throw tme2;
                }
            }
        }

    Note the call hkp.getKeyedProperties().addExternalPath(beanPath) above.

    The externalPaths field of KeyedProperties is a HashSet<String>, which holds no duplicates, so adding the same beanPath more than once has no effect.
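
    The string handling at the top of prime() decides which bean and which property a sheet entry targets. The sketch below isolates that step; SheetPathDemo and the key someBean.nested.prop are hypothetical, and the explicit check for a missing dot is my addition (the original would fail inside substring for such a key).

```java
// Hypothetical helper isolating the path splitting done in Sheet.prime(); not Heritrix code.
class SheetPathDemo {
    /** Returns {beanName, propPath, terminalProp}; propPath is "" when beanPath has no dot. */
    static String[] split(String fullpath) {
        int lastDot = fullpath.lastIndexOf('.');
        if (lastDot < 0) {
            // Added guard, not in the original code.
            throw new IllegalArgumentException("no property in path: " + fullpath);
        }
        String beanPath = fullpath.substring(0, lastDot);        // registered via addExternalPath
        String terminalProp = fullpath.substring(lastDot + 1);   // the property being overridden
        int i = beanPath.indexOf('.');
        String beanName = i < 0 ? beanPath : beanPath.substring(0, i);
        String propPath = i < 0 ? "" : beanPath.substring(i + 1);
        return new String[] { beanName, propPath, terminalProp };
    }

    public static void main(String[] args) {
        // Key taken from the veryPolite sheet.
        System.out.println(String.join(" | ", split("disposition.delayFactor")));
        // Hypothetical key that reaches the target through a nested property.
        System.out.println(String.join(" | ", split("someBean.nested.prop")));
    }
}
```

    For the veryPolite entry disposition.delayFactor, the bean named disposition is fetched from the BeanFactory, the path disposition is registered as an external path on its KeyedProperties, and delayFactor is the property whose type the value 10 is converted to.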

    The abstract class SheetAssociation holds the list of target Sheet names:

    /**
     * Represents target Sheets that should be associated with 
     * some grouping of URIs. 
     * 
     * Subclasses specify the kind of association (as by SURT prefix 
     * matching, or arbitrary decide-rules).
     * 
     * @contributor gojomo
     */
    public abstract class SheetAssociation {
        List<String> targetSheetNames = new LinkedList<String>();
        public List<String> getTargetSheetNames() {
            return targetSheetNames;
        }
        public void setTargetSheetNames(List<String> targetSheets) {
            this.targetSheetNames = targetSheets;
        }
    }

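    Its subclass SurtPrefixesSheetAssociation maps SURT prefixes to a list of sheets. As applyOverlaysTo shows, the prefixes are matched against the candidate SURT derived from the CrawlURI's policy-basis URI, not against its class key: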

    /**
     * SheetAssociation applied on the basis of matching SURT prefixes. 
     * 
     * @contributor gojomo
     */
    public class SurtPrefixesSheetAssociation extends SheetAssociation {
        List<String> surtPrefixes;
    
        public List<String> getSurtPrefixes() {
            return surtPrefixes;
        }
        @Required
        public void setSurtPrefixes(List<String> surtPrefixes) {
            this.surtPrefixes = surtPrefixes;
        }
    }

    An example configuration in crawler-beans.cxml:

    <bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
     <property name='surtPrefixes'>
      <list>
       <value>http://(org,example,</value>
       <value>http://(com,example,www,)/</value>
      </list>
     </property>
     <property name='targetSheetNames'>
      <list>
       <value>veryPolite</value>
       <value>smallBudget</value>
      </list>
     </property>
    </bean>

    The other subclass of SheetAssociation, DecideRuledSheetAssociation, maps a DecideRule to a list of sheets:

    /**
     * SheetAssociation applied on the basis of DecideRules. If the
     * final ruling is ACCEPT, the named sheets will be overlaid. 
     * 
     * @contributor gojomo
     */
    public class DecideRuledSheetAssociation extends SheetAssociation 
    implements Ordered, Comparable<DecideRuledSheetAssociation>, BeanNameAware {
        DecideRule rules;
        int order = 0; 
        
        public DecideRule getRules() {
            return rules;
        }
        @Required
        public void setRules(DecideRule rules) {
            this.rules = rules;
        }
    
        public int getOrder() {
            return order;
        }
        public void setOrder(int order) {
            this.order = order;
        }
    
        // compare on the basis of Ordered value
        public int compareTo(DecideRuledSheetAssociation o) {
            int cmp = order - ((Ordered)o).getOrder();
            if(cmp!=0) {
                return cmp;
            }
            return name.compareTo(o.name); 
        }
        
        String name;
        public void setBeanName(String name) {
            this.name = name;
        }
    }
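
    Since ruleAssociations is a sorted set, compareTo determines the order in which associations are evaluated. The sketch below reproduces that ordering with a hypothetical Assoc class (not Heritrix code). It swaps the subtraction order - o.getOrder() for Integer.compare, because a subtraction can overflow when the two values are far apart; that substitution is my own choice, not something the Heritrix source does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch of ordering by (order, name); not Heritrix code.
class AssociationOrderDemo {
    static final class Assoc implements Comparable<Assoc> {
        final int order;
        final String name;

        Assoc(int order, String name) {
            this.order = order;
            this.name = name;
        }

        @Override
        public int compareTo(Assoc o) {
            int cmp = Integer.compare(order, o.order);   // no overflow, unlike subtraction
            return cmp != 0 ? cmp : name.compareTo(o.name);
        }
    }

    /** Returns the names in the order a sorted set would iterate them. */
    static List<String> sortedNames(Assoc... assocs) {
        SortedSet<Assoc> set = new ConcurrentSkipListSet<>();
        for (Assoc a : assocs) {
            set.add(a);
        }
        List<String> names = new ArrayList<>();
        for (Assoc a : set) {
            names.add(a.name);
        }
        return names;
    }

    static String demo() {
        return sortedNames(
                new Assoc(5, "b"),
                new Assoc(0, "z"),
                new Assoc(5, "a"),
                new Assoc(Integer.MIN_VALUE, "first")).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

    Reading this together with applyOverlaysTo and KeyedProperties.get: associations are visited in ascending order and their sheet names appended to the list, while lookups scan the list from the end, so an association with a higher order value appears to take precedence over one with a lower value.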

    I have not yet found an example of this in crawler-beans.cxml, but with the above we can now write such a configuration ourselves.

    ---------------------------------------------------------------------------

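    This Heritrix 3.1.0 Source Code Analysis series is my own original work.

    If you repost it, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Article link: http://www.cnblogs.com/chenying99/archive/2013/04/20/3031937.html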

  • Original article: https://www.cnblogs.com/chenying99/p/3031937.html