• Apache Tika Source Code Study (Part 4)


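    The previous article walked through the source of the concrete parser class HtmlParser, which parses web pages, and showed how Apache Tika detects character encodings.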

    (HtmlParser does not actually use the SAXParser object from the ParseContext context class to parse web pages; it relies on a separate component, TagSoup.)

    This article continues with the event-handling classes Tika uses for SAX parsing of XML files; the most interesting parts are left for later.

    The SAX API, which is part of JAXP, defines four event-handling interfaces: EntityResolver, DTDHandler, ContentHandler, and ErrorHandler.

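    It also provides a default handler class, DefaultHandler, which implements all four interfaces with default (mostly empty) methods. This makes custom SAX event handlers easy to write: extend the class and override only the callbacks you need.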
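
    As a minimal sketch of extending DefaultHandler, the class below overrides only two callbacks and inherits the rest. The class name ElementCounter and the sample XML are made up for illustration; only JDK classes are used, nothing from Tika.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts elements and collects character data.
public class ElementCounter extends DefaultHandler {

    private int elements = 0;
    private final StringBuilder text = new StringBuilder();

    // Called once for every opening tag.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
    }

    // Called for runs of character data; may fire several times per text node.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public int getElements() {
        return elements;
    }

    public String getText() {
        return text.toString();
    }

    // Parses an XML string with the JDK's SAX parser and returns the filled handler.
    public static ElementCounter parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            ElementCounter handler = new ElementCounter();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        ElementCounter h = parse("<doc><p>Hello</p><p>Tika</p></doc>");
        // prints "3 elements, text = HelloTika"
        System.out.println(h.getElements() + " elements, text = " + h.getText());
    }
}
```

    All other events (prefix mappings, processing instructions, errors, and so on) fall through to the inherited DefaultHandler methods.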

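    Apache Tika's event-handling classes use the decorator pattern, with wrappers nested layer upon layer, which can be dizzying to follow. The analysis below covers only some of the classes; the other handlers work similarly and are not repeated here.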

    First, the UML model of the key classes:

    [Figure: UML class diagram of the key handler classes]

    ContentHandlerDecorator extends DefaultHandler, the default SAX handler class. As its name suggests, it applies the decorator pattern. Its source is below:

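    package org.apache.tika.sax;

    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.Locator;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    /**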
     * Decorator base class for the {@link ContentHandler} interface. This class
     * simply delegates all SAX events calls to an underlying decorated handler
     * instance. Subclasses can provide extra decoration by overriding one or more
     * of the SAX event methods.
     */
    public class ContentHandlerDecorator extends DefaultHandler {
    
        /**
         * Decorated SAX event handler.
         */
        private ContentHandler handler;
    
        /**
         * Creates a decorator for the given SAX event handler.
         *
         * @param handler SAX event handler to be decorated
         */
        public ContentHandlerDecorator(ContentHandler handler) {
            assert handler != null;
            this.handler = handler;
        }
    
        /**
         * Creates a decorator that by default forwards incoming SAX events to
         * a dummy content handler that simply ignores all the events. Subclasses
         * should use the {@link #setContentHandler(ContentHandler)} method to
         * switch to a more usable underlying content handler.
         */
        protected ContentHandlerDecorator() {
            this(new DefaultHandler());
        }
    
        /**
         * Sets the underlying content handler. All future SAX events will be
         * directed to this handler instead of the one that was previously used.
         *
         * @param handler content handler
         */
        protected void setContentHandler(ContentHandler handler) {
            assert handler != null;
            this.handler = handler;
        }
    
        @Override
        public void startPrefixMapping(String prefix, String uri)
                throws SAXException {
            try {
                handler.startPrefixMapping(prefix, uri);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void endPrefixMapping(String prefix) throws SAXException {
            try {
                handler.endPrefixMapping(prefix);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void processingInstruction(String target, String data)
                throws SAXException {
            try {
                handler.processingInstruction(target, data);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void setDocumentLocator(Locator locator) {
            handler.setDocumentLocator(locator);
        }
    
        @Override
        public void startDocument() throws SAXException {
            try {
                handler.startDocument();
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void endDocument() throws SAXException {
            try {
                handler.endDocument();
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void startElement(
                String uri, String localName, String name, Attributes atts)
                throws SAXException {
            try {
                handler.startElement(uri, localName, name, atts);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void endElement(String uri, String localName, String name)
                throws SAXException {
            try {
                handler.endElement(uri, localName, name);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            try {
                handler.characters(ch, start, length);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void ignorableWhitespace(char[] ch, int start, int length)
                throws SAXException {
            try {
                handler.ignorableWhitespace(ch, start, length);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public void skippedEntity(String name) throws SAXException {
            try {
                handler.skippedEntity(name);
            } catch (SAXException e) {
                handleException(e);
            }
        }
    
        @Override
        public String toString() {
            return handler.toString();
        }
    
        /**
         * Handle any exceptions thrown by methods in this class. This method
         * provides a single place to implement custom exception handling. The
         * default behaviour is simply to re-throw the given exception, but
         * subclasses can also provide alternative ways of handling the situation.
         *
         * @param exception the exception that was thrown
         * @throws SAXException the exception (if any) thrown to the client
         */
        protected void handleException(SAXException exception) throws SAXException {
            throw exception;
        }
    
    }

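    The decorator holds a reference to a ContentHandler. Each overridden event method simply calls the method of the same name on that ContentHandler, and any SAXException it throws is routed through handleException, which subclasses can override.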
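
    To make the handleException hook concrete, here is a minimal sketch built only on the JDK's SAX classes. MiniDecorator is a cut-down stand-in for ContentHandlerDecorator that forwards a single event; LenientDecorator and FailingHandler are likewise hypothetical names invented for this example, not Tika classes.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DecoratorSketch {

    // Simplified stand-in for the decorator: forwards one event type and
    // routes failures through a single overridable hook.
    static class MiniDecorator extends DefaultHandler {
        private final ContentHandler handler;

        MiniDecorator(ContentHandler handler) {
            this.handler = handler;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                handler.characters(ch, start, length);
            } catch (SAXException e) {
                handleException(e);
            }
        }

        // Default behaviour: re-throw, as in the listing above.
        protected void handleException(SAXException e) throws SAXException {
            throw e;
        }
    }

    // Subclass that records failures instead of aborting the parse.
    static class LenientDecorator extends MiniDecorator {
        final List<String> errors = new ArrayList<>();

        LenientDecorator(ContentHandler handler) {
            super(handler);
        }

        @Override
        protected void handleException(SAXException e) {
            errors.add(e.getMessage());
        }
    }

    // Wrapped handler that always fails, to trigger the hook.
    static class FailingHandler extends DefaultHandler {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            throw new SAXException("boom");
        }
    }

    // Returns true if the base decorator propagates the wrapped handler's exception.
    public static boolean strictRethrows() {
        try {
            new MiniDecorator(new FailingHandler()).characters("x".toCharArray(), 0, 1);
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    // Returns the messages collected by the lenient subclass.
    public static List<String> lenientErrors() {
        LenientDecorator d = new LenientDecorator(new FailingHandler());
        try {
            d.characters("x".toCharArray(), 0, 1);
        } catch (SAXException e) {
            throw new IllegalStateException("lenient decorator should not throw", e);
        }
        return d.errors;
    }

    public static void main(String[] args) {
        System.out.println(strictRethrows());  // prints "true"
        System.out.println(lenientErrors());   // prints "[boom]"
    }
}
```

    Because every forwarding method funnels into the same hook, a subclass changes the error policy in one place rather than in each event method.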

    Next, the source of a concrete decorator, BodyContentHandler:

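    package org.apache.tika.sax;

    import java.io.OutputStream;
    import java.io.Writer;

    import org.apache.tika.sax.xpath.Matcher;
    import org.apache.tika.sax.xpath.MatchingContentHandler;
    import org.apache.tika.sax.xpath.XPathParser;
    import org.xml.sax.ContentHandler;

    /**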
     * Content handler decorator that only passes everything inside
     * the XHTML &lt;body/&gt; tag to the underlying handler. Note that
     * the &lt;body/&gt; tag itself is <em>not</em> passed on.
     */
    public class BodyContentHandler extends ContentHandlerDecorator {
    
        /**
         * XHTML XPath parser.
         */
        private static final XPathParser PARSER =
            new XPathParser("xhtml", XHTMLContentHandler.XHTML);
    
        /**
         * The XPath matcher used to select the XHTML body contents.
         */
        private static final Matcher MATCHER =
            PARSER.parse("/xhtml:html/xhtml:body/descendant::node()");
    
        /**
         * Creates a content handler that passes all XHTML body events to the
         * given underlying content handler.
         *
         * @param handler content handler
         */
        public BodyContentHandler(ContentHandler handler) {
            super(new MatchingContentHandler(handler, MATCHER));
        }
    
        /**
         * Creates a content handler that writes XHTML body character events to
         * the given writer.
         *
         * @param writer writer
         */
        public BodyContentHandler(Writer writer) {
            this(new WriteOutContentHandler(writer));
        }
    
        /**
         * Creates a content handler that writes XHTML body character events to
         * the given output stream using the default encoding.
         *
         * @param stream output stream
         */
        public BodyContentHandler(OutputStream stream) {
            this(new WriteOutContentHandler(stream));
        }
    
        /**
         * Creates a content handler that writes XHTML body character events to
         * an internal string buffer. The contents of the buffer can be retrieved
         * using the {@link #toString()} method.
         * <p>
         * The internal string buffer is bounded at the given number of characters.
         * If this write limit is reached, then a {@link SAXException} is thrown.
         *
         * @since Apache Tika 0.7
         * @param writeLimit maximum number of characters to include in the string,
         *                   or -1 to disable the write limit
         */
        public BodyContentHandler(int writeLimit) {
            this(new WriteOutContentHandler(writeLimit));
        }
    
        /**
         * Creates a content handler that writes XHTML body character events to
         * an internal string buffer. The contents of the buffer can be retrieved
         * using the {@link #toString()} method.
         * <p>
         * The internal string buffer is bounded at 100k characters. If this write
         * limit is reached, then a {@link SAXException} is thrown.
         */
        public BodyContentHandler() {
            this(new WriteOutContentHandler());
        }
    
    }

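    Every constructor ultimately initializes the decorated object by calling the superclass constructor: the Writer, OutputStream, int, and no-argument constructors all chain to BodyContentHandler(ContentHandler), which wraps the given handler in a MatchingContentHandler and passes it to super.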
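
    The layering can be sketched without Tika on the classpath. The example below is a simplified stand-in, not Tika's implementation: instead of an XPath matcher it tracks element depth to decide whether an event lies inside the body element, and it forwards only character events. The names BodyOnlySketch, BodyOnlyHandler, and WriterHandler are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BodyOnlySketch {

    // Innermost layer: writes character events to a Writer.
    static class WriterHandler extends DefaultHandler {
        private final Writer writer;

        WriterHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                writer.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException("write failed", e);
            }
        }
    }

    // Outer layer: forwards character events only while inside <body>.
    static class BodyOnlyHandler extends DefaultHandler {
        private final ContentHandler handler;
        private int bodyDepth = 0; // 0 means "outside <body>"

        BodyOnlyHandler(ContentHandler handler) {
            this.handler = handler;
        }

        // Convenience constructor that chains to the one above,
        // mirroring the constructor chaining in the listing.
        BodyOnlyHandler(Writer writer) {
            this(new WriterHandler(writer));
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (bodyDepth > 0) {
                handler.characters(ch, start, length);
            }
        }
    }

    // Parses an XHTML string and returns only the text found inside <body>.
    public static String extract(String xhtml) {
        StringWriter out = new StringWriter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new BodyOnlyHandler(out));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>Title</title></head>"
                + "<body><p>Hello </p><p>body</p></body></html>";
        System.out.println(extract(doc)); // prints "Hello body"
    }
}
```

    With Tika itself, the equivalent usage is to pass a BodyContentHandler to a parser's parse method and, for the buffer-backed constructors, read the extracted text back through toString().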

  • Original article: https://www.cnblogs.com/chenying99/p/2949160.html