• Apache Tika Source Code Study (Part 2)


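    The previous article analyzed the interfaces and implementation classes that Apache Tika uses for encoding detection.

    This article moves on to ParseContext, a key class in Apache Tika. To follow it, keep in mind how Tika parses documents: it represents the content of every file as an XHTML document, and that XHTML is processed in SAX's event-based style. Here is the source of the ParseContext class: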

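    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    import javax.xml.XMLConstants;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;

    import org.apache.tika.exception.TikaException;
    import org.xml.sax.SAXException;
    import org.xml.sax.SAXNotRecognizedException;
    import org.xml.sax.SAXNotSupportedException;

    public class ParseContext implements Serializable {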
    
        /** Serial version UID. */
        private static final long serialVersionUID = -5921436862145826534L;
    
        /** Map of objects in this context */
        private final Map<String, Object> context = new HashMap<String, Object>();
     
        /**
         * Adds the given value to the context as an implementation of the given
         * interface.
         *
         * @param key the interface implemented by the given value
         * @param value the value to be added, or <code>null</code> to remove
         */
        public <T> void set(Class<T> key, T value) {
            if (value != null) {
                context.put(key.getName(), value);
            } else {
                context.remove(key.getName());
            }
        }
    
        /**
         * Returns the object in this context that implements the given interface.
         *
         * @param key the interface implemented by the requested object
         * @return the object that implements the given interface,
         *         or <code>null</code> if not found
         */
        @SuppressWarnings("unchecked")
        public <T> T get(Class<T> key) {
            return (T) context.get(key.getName());
        }
    
        /**
         * Returns the object in this context that implements the given interface,
         * or the given default value if such an object is not found.
         *
         * @param key the interface implemented by the requested object
         * @param defaultValue value to return if the requested object is not found
         * @return the object that implements the given interface,
         *         or the given default value if not found
         */
        public <T> T get(Class<T> key, T defaultValue) {
            T value = get(key);
            if (value != null) {
                return value;
            } else {
                return defaultValue;
            }
        }
    
        /**
         * Returns the SAX parser specified in this parsing context. If a parser
         * is not explicitly specified, then one is created using the specified
         * or the default SAX parser factory.
         *
         * @see #getSAXParserFactory()
         * @since Apache Tika 0.8
         * @return SAX parser
         * @throws TikaException if a SAX parser could not be created
         */
        public SAXParser getSAXParser() throws TikaException {
            SAXParser parser = get(SAXParser.class);
            if (parser != null) {
                return parser;
            } else {
                try {
                    return getSAXParserFactory().newSAXParser();
                } catch (ParserConfigurationException e) {
                    throw new TikaException("Unable to configure a SAX parser", e);
                } catch (SAXException e) {
                    throw new TikaException("Unable to create a SAX parser", e);
                }
            }
        }
    
        /**
         * Returns the SAX parser factory specified in this parsing context.
         * If a factory is not explicitly specified, then a default factory
         * instance is created and returned. The default factory instance is
         * configured to be namespace-aware and to use
         * {@link XMLConstants#FEATURE_SECURE_PROCESSING secure XML processing}.
         *
         * @since Apache Tika 0.8
         * @return SAX parser factory
         */
        public SAXParserFactory getSAXParserFactory() {
            SAXParserFactory factory = get(SAXParserFactory.class);
            if (factory == null) {
                factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                try {
                    factory.setFeature(
                            XMLConstants.FEATURE_SECURE_PROCESSING, true);
                } catch (ParserConfigurationException e) {
                } catch (SAXNotSupportedException e) {
                } catch (SAXNotRecognizedException e) {
                    // TIKA-271: Some XML parsers do not support the
                    // secure-processing feature, even though it's required by
                    // JAXP in Java 5. Ignoring the exception is fine here, as
                    // deployments without this feature are inherently vulnerable
                    // to XML denial-of-service attacks.
                }
            }
            return factory;
        }
    
    }

    As the source shows, ParseContext is a small container keyed by class name, and its main job is to supply the SAXParser used for XML parsing.
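
The lookup-with-fallback behaviour can be tried without Tika on the classpath. The following is only a minimal sketch: `MiniContext` and `ContextSketch` are hypothetical names invented for this illustration, and the class copies just the `set`/`get` logic and a trimmed version of `getSAXParserFactory()` from the source above (the secure-processing step is left out for brevity).

```java
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;

public class ContextSketch {

    /** Hypothetical stand-in for ParseContext: a map keyed by class name. */
    static class MiniContext {
        private final Map<String, Object> context = new HashMap<>();

        /** Stores the value under the class name, or removes the entry if value is null. */
        <T> void set(Class<T> key, T value) {
            if (value != null) {
                context.put(key.getName(), value);
            } else {
                context.remove(key.getName());
            }
        }

        @SuppressWarnings("unchecked")
        <T> T get(Class<T> key) {
            return (T) context.get(key.getName());
        }

        /** Explicitly registered factory first, freshly created default otherwise. */
        SAXParserFactory getSAXParserFactory() {
            SAXParserFactory factory = get(SAXParserFactory.class);
            if (factory == null) {
                factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
            }
            return factory;
        }
    }

    public static void main(String[] args) {
        MiniContext context = new MiniContext();

        // Nothing registered yet, so a default factory is created.
        System.out.println("default: " + context.getSAXParserFactory().getClass().getName());

        // A caller-supplied factory takes precedence once it is registered.
        SAXParserFactory custom = SAXParserFactory.newInstance();
        context.set(SAXParserFactory.class, custom);
        System.out.println("custom returned: " + (context.getSAXParserFactory() == custom));

        // Passing null removes the entry again.
        context.set(SAXParserFactory.class, null);
        System.out.println("after removal: " + context.get(SAXParserFactory.class));
    }
}
```

Note that the default factory is never stored back into the map, so in this sketch, as in the source above, each call without a registered factory builds a new one.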

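    Anyone familiar with JAXP will find the source above easy to follow. Tika parses XML documents with SAX. SAXParserFactory is an abstract class; which concrete implementation is actually used is left for later analysis.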
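
As a starting point for that question, plain JAXP can be asked directly. The sketch below is not Tika code: `SaxSketch` and `EventCollector` are hypothetical names, and the XHTML string is made up for the example. It configures a factory the same way `getSAXParserFactory()` does (namespace-aware, secure processing), prints the concrete class that `SAXParserFactory.newInstance()` resolved to, and then shows the event-based parsing style on a tiny XHTML document. As far as the JAXP documentation describes it, the concrete class is chosen at runtime (a system property, then a provider found on the classpath, then the platform default), so the printed name depends on the environment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {

    /** Records the SAX events it receives, in document order. */
    static class EventCollector extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        // characters() may be called several times for one text node,
        // so text is buffered and emitted at the next element boundary.
        private void flushText() {
            String s = text.toString().trim();
            if (!s.isEmpty()) {
                events.add("text:" + s);
            }
            text.setLength(0);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flushText();
            events.add("start:" + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flushText();
            events.add("end:" + localName);
        }
    }

    /** Parses the given XHTML string and returns the recorded events. */
    static List<String> parse(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();
        EventCollector handler = new EventCollector();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // The concrete factory class is environment-dependent.
        System.out.println(SAXParserFactory.newInstance().getClass().getName());

        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>Hello Tika</p></body></html>";
        for (String event : parse(xhtml)) {
            System.out.println(event);
        }
    }
}
```

Unlike the Tika source, this sketch lets an unsupported secure-processing feature propagate as an exception instead of ignoring it, which keeps the example short.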

  • Original article: https://www.cnblogs.com/chenying99/p/2948423.html