• Heritrix 3.1.0 Source Code Analysis (26)


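    The previous article analyzed how Heritrix 3.1.0 wraps the request-handling classes of the HttpClient component. This article looks at how Heritrix 3.1.0 wraps request credentials.

    The classes in the package org.archive.modules.credential all deal with request credentials.

    Start with the CredentialStore class. It keeps all of the application's credentials (Credential) in a Map, so outside code only needs to call this class to obtain a credential.

    Its important methods are as follows: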

        KeyedProperties kp = new KeyedProperties();
        public KeyedProperties getKeyedProperties() {
            return kp;
        }
        
        /**
         * Credentials used by heritrix authenticating. See
         * http://crawler.archive.org/proposals/auth/ for background.
         * 
         * @see http://crawler.archive.org/proposals/auth/
         */
        {
            setCredentials(new HashMap<String, Credential>());
        }
        @SuppressWarnings("unchecked")
        public Map<String,Credential> getCredentials() {
            return (Map<String,Credential>) kp.get("credentials");
        }
        public void setCredentials(Map<String,Credential> map) {
            kp.put("credentials",map);
        }
        
        /**
         * List of possible credential types as a List.
         *
         * These types are inner classes of this credential type so they cannot
         * be created without their being associated with a credential list.
         */
        private static final List<Class<?>> credentialTypes;
        // Initialize the credentialType data member.
        static {
            // Array of all known credential types.
            Class<?> [] tmp = {HtmlFormCredential.class, HttpAuthenticationCredential.class};
            credentialTypes = Collections.unmodifiableList(Arrays.asList(tmp));
        }
    
        /**
         * Constructor.
         */
        public CredentialStore() {
        }
    
        /**
         * @return Unmodifiable list of credential types.
         */
        public static List<Class<?>> getCredentialTypes() {
            return CredentialStore.credentialTypes;
        }
    
    
        /**
         * @return All credentials in the store (the values of the
         * credentials map).
         */
        public Collection<Credential> getAll() {
            Map<String,Credential> map = getCredentials();
            return map.values();
        }
    
        /**
         * @param context  Used to set context.
         * @param name Name to give the manufactured credential.  Should be unique
         * else the add of the credential to the list of credentials will fail.
         * @return Returns <code>name</code>'d credential.
         * @throws AttributeNotFoundException
         * @throws MBeanException
         * @throws ReflectionException
         */
        public Credential get(/*StateProvider*/Object context, String name) {
            return getCredentials().get(name);
        }
        /**
         * Return set made up of all credentials of the passed
         * <code>type</code>.
         *
         * @param context  Used to set context.  
         * @param type Type of the list to return.  Type is some superclass of
         * credentials.
         * @param rootUri RootUri to match.  May be null.  In this case we return
         * all.  Currently we expect the CrawlServer name to equate to root Uri.
         * It's not.  Currently it doesn't distinguish between servers of same name
         * but different ports (e.g. http and https).
         * @return Set of all elements of passed type that match rootUri, or null
         * if there are none.
         */
        public Set<Credential> subset(CrawlURI context, Class<?> type, String rootUri) {
            Set<Credential> result = null;
            for (Credential c: getAll()) {
                if (!type.isInstance(c)) {
                    continue;
                }
                if (rootUri != null) {
                    String cd = c.getDomain();
                    if (cd == null) {
                        continue;
                    }
                    if (!rootUri.equalsIgnoreCase(cd)) {
                        continue;
                    }
                }
                if (result == null) {
                    result = new HashSet<Credential>();
                }
                result.add(c);
            }
            return result;
        }

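    The methods above return all credentials (the values of the Map), look up a single credential by name (the Map key), and list the supported credential types.

    (Note that the final subset method does not appear to use its CrawlURI context parameter. It returns only the credentials of the given type whose domain matches rootUri, or every credential of that type when rootUri is null, and it returns null rather than an empty set when nothing matches.)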
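
    The filtering rules of subset can be tried out in isolation. The following is a minimal sketch, not Heritrix code: Cred and FormCred are hypothetical stand-ins for the real credential classes, and subsetOrEmpty is a hypothetical helper added here only to show one way a caller might guard against the null return.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class SubsetSketch {
    /** Hypothetical stand-in for Credential: only the domain matters here. */
    static class Cred {
        final String domain;
        Cred(String domain) { this.domain = domain; }
    }

    /** Hypothetical stand-in for a more specific credential type. */
    static class FormCred extends Cred {
        FormCred(String domain) { super(domain); }
    }

    /** Same filtering rules as the subset method above; may return null. */
    static Set<Cred> subset(Map<String, Cred> store, Class<?> type, String rootUri) {
        Set<Cred> result = null;
        for (Cred c : store.values()) {
            if (!type.isInstance(c)) {
                continue; // wrong credential type
            }
            if (rootUri != null
                    && (c.domain == null || !rootUri.equalsIgnoreCase(c.domain))) {
                continue; // domain missing or different (case-insensitive)
            }
            if (result == null) {
                result = new HashSet<>(); // created lazily, hence the null return
            }
            result.add(c);
        }
        return result;
    }

    /** Hypothetical null-safe wrapper so callers can iterate unconditionally. */
    static Set<Cred> subsetOrEmpty(Map<String, Cred> store, Class<?> type, String rootUri) {
        Set<Cred> found = subset(store, type, rootUri);
        return found == null ? Collections.<Cred>emptySet() : found;
    }

    public static void main(String[] args) {
        Map<String, Cred> store = new LinkedHashMap<>();
        store.put("form", new FormCred("example.com"));
        store.put("plain", new Cred("Example.COM"));

        System.out.println(subset(store, FormCred.class, "EXAMPLE.com").size()); // 1
        System.out.println(subset(store, Cred.class, null).size());              // 2
        System.out.println(subset(store, FormCred.class, "other.org"));          // null
        System.out.println(subsetOrEmpty(store, FormCred.class, "other.org").isEmpty()); // true
    }
}
```

    Because type.isInstance also accepts subclasses, passing the base type with a null rootUri returns every stored credential.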

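    As the static initializer shows, the system provides two credential types, HtmlFormCredential.class and HttpAuthenticationCredential.class. The former is used for HTML form authentication, the latter for Basic/Digest HTTP authentication.

    Both credential types extend the abstract class Credential. Here are that class's methods first: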

        /**
         * Domain name.
         * The root domain this credential goes against: E.g. www.archive.org
         */
        String domain = "";
        /**
         * @param context Context to use when searching for credential domain.
         * @return The domain/root URI this credential is to go against.
         * @throws AttributeNotFoundException If attribute not found.
         */
        public String getDomain() {
            return this.domain;
        }
        public void setDomain(String domain) {
            this.domain = domain;
        }
        /**
         * (Adds the current credential to the CrawlURI curi.)
         * Attach this credentials avatar to the passed <code>curi</code> .
         *
         * Override if credential knows internally what it wants to attach as
         * payload.  Otherwise, if payload is external, use the below
         * {@link #attach(CrawlURI, String)}.
         *
         * @param curi CrawlURI to load with credentials.
         */
        public void attach(CrawlURI curi) {
            curi.getCredentials().add(this);
        }
    
        /**
         * (Removes the current credential from the CrawlURI curi.)
         * Detach this credential from passed curi.
         *
         * @param curi
         * @return True if we detached a Credential reference.
         */
        public boolean detach(CrawlURI curi) {
            return curi.getCredentials().remove(this);
        }
    
        /**
         * (Removes from the CrawlURI curi all credentials of the same class as this one.)
         * Detach all credentials of this type from passed curi.
         *
         * @param curi
         * @return True if we detached references.
         */
        public boolean detachAll(CrawlURI curi) {
            boolean result = false;
            Iterator<Credential> iter = curi.getCredentials().iterator();
            while (iter.hasNext()) {
                Credential cred = iter.next();
                if (cred.getClass() ==  this.getClass()) {
                    iter.remove();
                    result = true;
                }
            }
            return result;
        }
    
        /**
         * (Checks whether the CrawlURI curi is itself the prerequisite of this
         * credential, for example the login page.)
         * @param curi CrawlURI to look at.
         * @return True if this credential IS a prerequisite for passed
         * CrawlURI.
         */
        public abstract boolean isPrerequisite(CrawlURI curi);
    
        /**
         * (Checks whether an authentication URI exists for the CrawlURI curi.)
         * @param curi CrawlURI to look at.
         * @return True if this credential HAS a prerequisite for passed CrawlURI.
         */
        public abstract boolean hasPrerequisite(CrawlURI curi);
    
        /**
         * (Gets the authentication URI for the CrawlURI curi.)
         * Return the authentication URI, either absolute or relative, that serves
         * as prerequisite the passed <code>curi</code>.
         *
         * @param curi CrawlURI to look at.
         * @return Prerequisite URI for the passed curi.
         */
        public abstract String getPrerequisite(CrawlURI curi);
    
        /**
         * (Gets the key that identifies this credential; HtmlFormCredential
         * returns its login URI.)
         * @param context Context to use when searching for credential domain.
         * @return Key that is unique to this credential type.
         * @throws AttributeNotFoundException
         */
        public abstract String getKey();
    
    
        /**
         * (Checks whether this credential must be offered on every request.)
         * @return True if this credential is of the type that needs to be offered
         * on each visit to the server (e.g. Rfc2617 is such a type).
         */
        public abstract boolean isEveryTime();
    
        /**
         * (Adds the authentication parameters to the HttpMethod method.)
         * @param curi CrawlURI to ask for context.
         * @param http Instance of httpclient.
         * @param method Method to populate.
         * @return True if added a credentials.
         */
        public abstract boolean populate(CrawlURI curi, HttpClient http,
            HttpMethod method);
    
        /**
         * (Whether the credential is submitted with POST.)
         * @param curi CrawlURI to look at.
         * @return True if this credential is to be posted.  Return false if the
         * credential is to be GET'd or if POST'd or GET'd are not pertinent to this
         * credential type.
         */
        public abstract boolean isPost();
    
        /**
         * (Checks whether the name of the CrawlServer for the CrawlURI curi matches
         * this credential's domain; used to exclude CrawlURI objects that do not
         * need this credential.)
         * Test passed curi matches this credentials rootUri.
         * @param controller
         * @param curi CrawlURI to test.
         * @return True if domain for credential matches that of the passed curi.
         */
        public boolean rootUriMatch(ServerCache cache, 
                CrawlURI curi) {
            String cd = getDomain();
    
            CrawlServer serv = cache.getServerFor(curi.getUURI());
            String serverName = serv.getName();
    //        String serverName = controller.getServerCache().getServerFor(curi).
    //            getName();
            logger.fine("RootURI: Comparing " + serverName + " " + cd);
            return cd != null && serverName != null &&
                serverName.equalsIgnoreCase(cd);
        }

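    The methods above attach the current credential to a CrawlURI curi, detach it again, add credential parameters to an HttpMethod method, check whether the domain of the CrawlURI curi matches the credential's domain, and so on.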
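
    The detachAll method above removes entries by exact class and does the removal through the iterator. A minimal sketch of that pattern follows. Cred, FormCred and HttpAuthCred are hypothetical stand-ins, and a plain Set plays the role of the collection returned by curi.getCredentials().

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class DetachSketch {
    /** Hypothetical stand-in for the abstract Credential class. */
    static abstract class Cred {
        /** Removes every attached credential whose class equals this one's. */
        boolean detachAll(Set<Cred> attached) {
            boolean removed = false;
            Iterator<Cred> it = attached.iterator();
            while (it.hasNext()) {
                if (it.next().getClass() == this.getClass()) {
                    it.remove(); // safe removal while iterating
                    removed = true;
                }
            }
            return removed;
        }
    }

    static class FormCred extends Cred {}
    static class HttpAuthCred extends Cred {}

    public static void main(String[] args) {
        Set<Cred> attached = new HashSet<>();
        attached.add(new FormCred());
        attached.add(new FormCred());
        attached.add(new HttpAuthCred());

        boolean removed = new FormCred().detachAll(attached);
        System.out.println(removed + " " + attached.size()); // true 1
    }
}
```

    Comparing with getClass() rather than instanceof means a subclass of a credential type would not be detached by its parent type.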

    HtmlFormCredential extends the Credential class above and provides form authentication for a CrawlURI curi. The relevant method implementations are as follows:

        /**
         * Full URI of page that contains the HTML login form we're to apply these
         * credentials too: E.g. http://www.archive.org
         */
        String loginUri = "";
        public String getLoginUri() {
            return this.loginUri;
        }
        public void setLoginUri(String loginUri) {
            this.loginUri = loginUri;
        }
        
        /**
         * Form items.
         */
        Map<String,String> formItems = new HashMap<String,String>();
        public Map<String,String> getFormItems() {
            return this.formItems;
        }
        public void setFormItems(Map<String,String> formItems) {
            this.formItems = formItems;
        }
        
        
        enum Method {
            GET,
            POST
        }
        /**
         * GET or POST.
         */
        Method httpMethod = Method.POST;
        public Method getHttpMethod() {
            return this.httpMethod;
        }
        public void setHttpMethod(Method method) {
            this.httpMethod = method; 
        }
    
        /**
         * Constructor.
         */
        public HtmlFormCredential() {
        }
    
        public boolean isPrerequisite(final CrawlURI curi) {
            boolean result = false;
            String curiStr = curi.getUURI().toString();
            String loginUri = getPrerequisite(curi);
            if (loginUri != null) {
                try {
                    // Login URL
                    UURI uuri = UURIFactory.getInstance(curi.getUURI(), loginUri);
                    if (uuri != null && curiStr != null &&
                            uuri.toString().equals(curiStr)) {
                        result = true;
                        if (!curi.isPrerequisite()) {
                            curi.setPrerequisite(true);
                            logger.fine(curi + " is prereq.");
                        }
                    }
                } catch (URIException e) {
                    logger.severe("Failed to uuri: " + curi + ", " +
                        e.getMessage());
                }
            }
            return result;
        }

        public boolean hasPrerequisite(CrawlURI curi) {
            return getPrerequisite(curi) != null;
        }

        public String getPrerequisite(CrawlURI curi) {
            return getLoginUri();
        }

        public String getKey() {
            return getLoginUri();
        }

        public boolean isEveryTime() {
            // This authentication is one time only.
            return false;
        }

        public boolean populate(CrawlURI curi, HttpClient http, HttpMethod method) {
            // http is not used
            boolean result = false;
            Map<String,String> formItems = getFormItems();
            if (formItems == null || formItems.size() <= 0) {
                try {
                    logger.severe("No form items for " + method.getURI());
                } catch (URIException e) {
                    logger.severe("No form items and exception getting uri: " +
                        e.getMessage());
                }
                return result;
            }

            NameValuePair[] data = new NameValuePair[formItems.size()];
            int index = 0;
            String key = null;
            for (Iterator<String> i = formItems.keySet().iterator();
                    i.hasNext();) {
                key = i.next();
                data[index++] = new NameValuePair(key,
                    (String)formItems.get(key));
            }
            if (method instanceof PostMethod) {
                ((PostMethod)method).setRequestBody(data);
                result = true;
            } else if (method instanceof GetMethod) {
                // Append these values to the query string.
                // Get current query string, then add data, then get it again
                // only this time its our data only... then append.
                HttpMethodBase hmb = (HttpMethodBase)method;
                String currentQuery = hmb.getQueryString();
                hmb.setQueryString(data);
                String newQuery = hmb.getQueryString();
                hmb.setQueryString(
                    ((StringUtils.isNotEmpty(currentQuery))
                        ? currentQuery + "&"
                        : "")
                    + newQuery);
                result = true;
            } else {
                logger.severe("Unknown method type: " + method);
            }
            return result;
        }

        public boolean isPost() {
            return Method.POST.equals(getHttpMethod());
        }
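
    The check in isPrerequisite comes down to resolving the login URI against the URI being crawled and comparing the two as strings. The sketch below illustrates that idea with java.net.URI instead of Heritrix's UURI and UURIFactory, so normalization details may differ from the real crawler. The class and method names are hypothetical, and treating an empty login URI as "no prerequisite" is an assumption of this sketch (the original only checks for null).

```java
import java.net.URI;

public class PrereqSketch {
    /**
     * Returns true when loginUri, resolved against currentUri, is the same
     * URI as currentUri, i.e. the page being fetched is the login page.
     */
    static boolean isLoginPage(String currentUri, String loginUri) {
        // Assumption of this sketch: an empty login URI counts as unset.
        if (loginUri == null || loginUri.isEmpty()) {
            return false;
        }
        // A relative login URI such as "/login" is resolved against the
        // current URI; an absolute one is returned unchanged by resolve().
        URI resolved = URI.create(currentUri).resolve(loginUri);
        return resolved.toString().equals(currentUri);
    }

    public static void main(String[] args) {
        System.out.println(isLoginPage("http://example.com/login", "/login"));      // true
        System.out.println(isLoginPage("http://example.com/index.html", "/login")); // false
        System.out.println(isLoginPage("http://example.com/login",
                "http://example.com/login"));                                       // true
    }
}
```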

    I already described what these methods do in the comments on the abstract methods they implement, so I will not repeat that here.
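
    The GET branch of populate is the least obvious part: the form items are turned into a query string and appended to whatever query the request already has. Here is a minimal sketch of that step using only the JDK. FormQuerySketch and appendFormItems are hypothetical names, and encoding with java.net.URLEncoder is an assumption of this sketch; in Heritrix the encoding is done by HttpClient's setQueryString(NameValuePair[]).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormQuerySketch {
    /** Form-encodes one name or value as UTF-8. */
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM.
            throw new IllegalStateException(e);
        }
    }

    /** Appends the encoded form items to an existing (possibly empty) query. */
    static String appendFormItems(String currentQuery, Map<String, String> formItems) {
        StringBuilder newQuery = new StringBuilder();
        for (Map.Entry<String, String> e : formItems.entrySet()) {
            if (newQuery.length() > 0) {
                newQuery.append('&');
            }
            newQuery.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        boolean hasCurrent = currentQuery != null && !currentQuery.isEmpty();
        // Same shape as the original: existing query + "&" + new data.
        return (hasCurrent ? currentQuery + "&" : "") + newQuery;
    }

    public static void main(String[] args) {
        Map<String, String> items = new LinkedHashMap<>();
        items.put("login", "mylogin");
        items.put("password", "my password");

        System.out.println(appendFormItems("lang=en", items));
        // lang=en&login=mylogin&password=my+password
        System.out.println(appendFormItems(null, items));
        // login=mylogin&password=my+password
    }
}
```

    A LinkedHashMap keeps the items in insertion order here. The original stores them in a HashMap, so the order of the parameters it sends is not guaranteed.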

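    In addition, the HttpAuthenticationCredential class provides Basic/Digest HTTP authentication. I will not analyze its source in detail; it is easy to understand by comparing it with the HtmlFormCredential class.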
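
    For readers unfamiliar with the Basic scheme, this is what a login and password turn into on the wire. The sketch is not taken from Heritrix, which hands the login and password to HttpClient rather than building headers itself. BasicAuthSketch and basicHeaderValue are hypothetical names, the sample values come from the HTTP authentication RFC, and Digest is not covered.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthSketch {
    /** Builds the value of the Authorization header for the Basic scheme. */
    static String basicHeaderValue(String login, String password) {
        // Basic authentication sends base64("login:password").
        String pair = login + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        System.out.println("Authorization: " + basicHeaderValue("Aladdin", "open sesame"));
        // Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
    }
}
```

    Base64 is an encoding, not encryption, so Basic credentials are only as private as the connection that carries them.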

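    The official Heritrix 3.1.0 reference documentation gives examples of both authentication types for the configuration file crawler-beans.cxml (some keywords in the official examples are wrong):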

    <bean id="credentialStore"
       class="org.archive.modules.credential.CredentialStore">
        <property name="credentials">
            <map>
                <entry key="formCredential" value-ref="formCredential" />
            </map>
        </property>
    </bean>
    <!-- The bean id must match the value-ref used in credentialStore. -->
    <bean id="formCredential"
       class="org.archive.modules.credential.HtmlFormCredential">
        <!-- Property names follow the setters setLoginUri and setFormItems. -->
        <property name="domain" value="example.com" />
        <property name="loginUri" value="http://example.com/login"/>
        <property name="formItems">
            <map>
                <entry key="login" value="mylogin"/>
                <entry key="password" value="mypassword"/>
                <entry key="submit" value="submit"/>
            </map>
        </property>
    </bean>
    <bean id="credential"
      class="org.archive.modules.credential.HttpAuthenticationCredential">
        <property name="domain"><value>domain</value></property>
        <property name="realm"><value>myrealm</value></property>
        <property name="login"><value>mylogin</value></property>
        <property name="password"><value>mypassword</value></property>
    </bean>

    ---------------------------------------------------------------------------

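    This Heritrix 3.1.0 Source Code Analysis series is my own original work.

    When reposting, please credit the source: 博客园 (cnblogs), 刺猬的温驯

    Link to this article: http://www.cnblogs.com/chenying99/archive/2013/04/28/3049042.html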
