• Lucene source code analysis (2): a worked example of the read process


    1. Demo code from the official documentation

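            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.standard.StandardAnalyzer;
            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;
            import org.apache.lucene.document.TextField;
            import org.apache.lucene.index.IndexWriter;
            import org.apache.lucene.index.IndexWriterConfig;
            import org.apache.lucene.store.Directory;
            import org.apache.lucene.store.RAMDirectory;

            // The analyzer decides how text is split into index terms.
            Analyzer analyzer = new StandardAnalyzer();
            // Store the index in memory:
            Directory directory = new RAMDirectory();
            // To store an index on disk, use this instead (FSDirectory.open takes a java.nio.file.Path):
            //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
            IndexWriterConfig config = new IndexWriterConfig(analyzer);
            IndexWriter iwriter = new IndexWriter(directory, config);
            Document doc = new Document();
            String text = "This is the text to be indexed.";
            // TYPE_STORED: the field is tokenized and indexed, and its original value is stored.
            doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
            iwriter.addDocument(doc);
            // close() commits pending changes and releases the write lock.
            iwriter.close();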

    2. The classes involved and how they relate

    2.1 TokenStream

    /**
     * A <code>TokenStream</code> enumerates the sequence of tokens, either from
     * {@link Field}s of a {@link Document} or from query text.
     * <p>
     * This is an abstract class; concrete subclasses are:
     * <ul>
     * <li>{@link Tokenizer}, a <code>TokenStream</code> whose input is a Reader; and
     * <li>{@link TokenFilter}, a <code>TokenStream</code> whose input is another
     * <code>TokenStream</code>.
     * </ul>
     * A new <code>TokenStream</code> API has been introduced with Lucene 2.9. This API
     * has moved from being {@link Token}-based to {@link Attribute}-based. While
     * {@link Token} still exists in 2.9 as a convenience class, the preferred way
     * to store the information of a {@link Token} is to use {@link AttributeImpl}s.
     * <p>
     * <code>TokenStream</code> now extends {@link AttributeSource}, which provides
     * access to all of the token {@link Attribute}s for the <code>TokenStream</code>.
     * Note that only one instance per {@link AttributeImpl} is created and reused
     * for every token. This approach reduces object creation and allows local
     * caching of references to the {@link AttributeImpl}s. See
     * {@link #incrementToken()} for further details.
     * <p>
     * <b>The workflow of the new <code>TokenStream</code> API is as follows:</b>
     * <ol>
     * <li>Instantiation of <code>TokenStream</code>/{@link TokenFilter}s which add/get
     * attributes to/from the {@link AttributeSource}.
     * <li>The consumer calls {@link TokenStream#reset()}.
     * <li>The consumer retrieves attributes from the stream and stores local
     * references to all attributes it wants to access.
     * <li>The consumer calls {@link #incrementToken()} until it returns false
     * consuming the attributes after each call.
     * <li>The consumer calls {@link #end()} so that any end-of-stream operations
     * can be performed.
     * <li>The consumer calls {@link #close()} to release any resource when finished
     * using the <code>TokenStream</code>.
     * </ol>
     * To make sure that filters and consumers know which attributes are available,
     * the attributes must be added during instantiation. Filters and consumers are
     * not required to check for availability of attributes in
     * {@link #incrementToken()}.
     * <p>
     * You can find some example code for the new API in the analysis package level
     * Javadoc.
     * <p>
     * Sometimes it is desirable to capture a current state of a <code>TokenStream</code>,
     * e.g., for buffering purposes (see {@link CachingTokenFilter},
     * TeeSinkTokenFilter). For this use case
     * {@link AttributeSource#captureState} and {@link AttributeSource#restoreState}
     * can be used.
     * <p>The {@code TokenStream}-API in Lucene is based on the decorator pattern.
     * Therefore all non-abstract subclasses must be final or have at least a final
     * implementation of {@link #incrementToken}! This is checked when Java
     * assertions are enabled.
     */
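
The consumer workflow in the Javadoc above (reset, incrementToken until false, end, close) and the decorator arrangement of Tokenizer and TokenFilter can be sketched with plain Java. This is a toy model under stated assumptions, not Lucene code: `ToyTokenStream`, `ToyWhitespaceTokenizer`, `ToyLowerCaseFilter` and `analyze` are hypothetical names, and a single shared `StringBuilder` stands in for the reused attribute instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: these nested classes imitate the shape of the TokenStream
// API described above. They are NOT Lucene classes; every name is hypothetical.
final class TokenStreamSketch {

    abstract static class ToyTokenStream {
        // A single shared "attribute" instance, overwritten for every token.
        final StringBuilder term;

        ToyTokenStream(StringBuilder term) { this.term = term; }

        void reset() {}
        abstract boolean incrementToken();
        void end() {}
        void close() {}
    }

    // Plays the role of a Tokenizer: its input is raw text.
    static final class ToyWhitespaceTokenizer extends ToyTokenStream {
        private final String[] parts;
        private int pos;

        ToyWhitespaceTokenizer(String text) {
            super(new StringBuilder());
            String trimmed = text.trim();
            parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        @Override void reset() { pos = 0; }

        @Override boolean incrementToken() {
            if (pos >= parts.length) return false;
            term.setLength(0);            // reuse the attribute instead of allocating
            term.append(parts[pos++]);
            return true;
        }
    }

    // Plays the role of a TokenFilter: it decorates another stream.
    static final class ToyLowerCaseFilter extends ToyTokenStream {
        private final ToyTokenStream input;

        ToyLowerCaseFilter(ToyTokenStream input) {
            super(input.term);            // share the wrapped stream's attribute
            this.input = input;
        }

        @Override void reset() { input.reset(); }

        @Override boolean incrementToken() {
            if (!input.incrementToken()) return false;
            String lower = term.toString().toLowerCase(Locale.ROOT);
            term.setLength(0);
            term.append(lower);
            return true;
        }

        @Override void end() { input.end(); }
        @Override void close() { input.close(); }
    }

    // The consumer side: follows the numbered workflow from the Javadoc.
    static List<String> analyze(String text) {
        ToyTokenStream stream = new ToyLowerCaseFilter(new ToyWhitespaceTokenizer(text));
        List<String> out = new ArrayList<>();
        try {
            stream.reset();
            StringBuilder termAtt = stream.term;   // keep a local reference to the attribute
            while (stream.incrementToken()) {
                out.add(termAtt.toString());       // copy, because the attribute is reused
            }
            stream.end();
        } finally {
            stream.close();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is the TEXT"));
    }
}
```

Because the attribute is overwritten on every call, the consumer has to copy the term out before calling incrementToken again; that is the price of the reduced object creation the Javadoc mentions.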

    2.2 Analyzer

    /**
     * An Analyzer builds TokenStreams, which analyze text.  It thus represents a
     * policy for extracting index terms from text.
     * <p>
     * In order to define what analysis is done, subclasses must define their
     * {@link TokenStreamComponents TokenStreamComponents} in {@link #createComponents(String)}.
     * The components are then reused in each call to {@link #tokenStream(String, Reader)}.
     * <p>
     * Simple example:
     * <pre class="prettyprint">
     * Analyzer analyzer = new Analyzer() {
     *  {@literal @Override}
     *   protected TokenStreamComponents createComponents(String fieldName) {
     *     Tokenizer source = new FooTokenizer();
     *     TokenStream filter = new FooFilter(source);
     *     filter = new BarFilter(filter);
     *     return new TokenStreamComponents(source, filter);
     *   }
     *   {@literal @Override}
     *   protected TokenStream normalize(TokenStream in) {
     *     // Assuming FooFilter is about normalization and BarFilter is about
     *     // stemming, only FooFilter should be applied
     *     return new FooFilter(in);
     *   }
     * };
     * </pre>
     * For more examples, see the {@link org.apache.lucene.analysis Analysis package documentation}.
     * <p>
     * For some concrete implementations bundled with Lucene, look in the analysis modules:
     * <ul>
     *   <li><a href="{@docRoot}/../analyzers-common/overview-summary.html">Common</a>:
     *       Analyzers for indexing content in different languages and domains.
     *   <li><a href="{@docRoot}/../analyzers-icu/overview-summary.html">ICU</a>:
     *       Exposes functionality from ICU to Apache Lucene. 
     *   <li><a href="{@docRoot}/../analyzers-kuromoji/overview-summary.html">Kuromoji</a>:
     *       Morphological analyzer for Japanese text.
     *   <li><a href="{@docRoot}/../analyzers-morfologik/overview-summary.html">Morfologik</a>:
     *       Dictionary-driven lemmatization for the Polish language.
     *   <li><a href="{@docRoot}/../analyzers-phonetic/overview-summary.html">Phonetic</a>:
     *       Analysis for indexing phonetic signatures (for sounds-alike search).
     *   <li><a href="{@docRoot}/../analyzers-smartcn/overview-summary.html">Smart Chinese</a>:
     *       Analyzer for Simplified Chinese, which indexes words.
     *   <li><a href="{@docRoot}/../analyzers-stempel/overview-summary.html">Stempel</a>:
     *       Algorithmic Stemmer for the Polish Language.
     *   <li><a href="{@docRoot}/../analyzers-uima/overview-summary.html">UIMA</a>: 
     *       Analysis integration with Apache UIMA. 
     * </ul>
     */
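
The point that components are created once in createComponents and then reused on every tokenStream call is easy to miss, so here is a small sketch of the idea. It is a simplified model with hypothetical names (`ToyAnalyzer`, `ToyComponents`, `CountingAnalyzer`); it caches one pipeline per field name in a plain map and ignores threading, which the real class has to handle, so it should not be read as Lucene's actual reuse strategy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: hypothetical classes, not part of Lucene.
final class AnalyzerReuseSketch {

    // Stand-in for TokenStreamComponents: a pipeline whose input can be swapped.
    static final class ToyComponents {
        private String input = "";

        void setInput(String text) { this.input = text; }

        List<String> tokens() {
            List<String> out = new ArrayList<>();
            for (String part : input.trim().split("\\s+")) {
                if (!part.isEmpty()) out.add(part.toLowerCase(Locale.ROOT));
            }
            return out;
        }
    }

    abstract static class ToyAnalyzer {
        private final Map<String, ToyComponents> cache = new HashMap<>();

        // Subclasses define the analysis policy here.
        abstract ToyComponents createComponents(String fieldName);

        // Builds the pipeline on first use, then only swaps the input.
        final ToyComponents tokenStream(String fieldName, String text) {
            ToyComponents components = cache.computeIfAbsent(fieldName, this::createComponents);
            components.setInput(text);
            return components;
        }
    }

    // Counts how often the pipeline is actually built.
    static final class CountingAnalyzer extends ToyAnalyzer {
        int created = 0;

        @Override ToyComponents createComponents(String fieldName) {
            created++;
            return new ToyComponents();
        }
    }

    public static void main(String[] args) {
        CountingAnalyzer analyzer = new CountingAnalyzer();
        System.out.println(analyzer.tokenStream("fieldname", "This is the text").tokens());
        System.out.println(analyzer.tokenStream("fieldname", "Another Document").tokens());
        System.out.println("pipelines built: " + analyzer.created);
    }
}
```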

    2.3 Directory

    /** A Directory is a flat list of files.  Files may be written once, when they
     * are created.  Once a file is created it may only be opened for read, or
     * deleted.  Random access is permitted both when reading and writing.
     *
     * <p> Java's i/o APIs are not used directly, but rather all i/o is
     * through this API.  This permits things such as: <ul>
     * <li> implementation of RAM-based indices;
     * <li> implementation of indices stored in a database, via JDBC;
     * <li> implementation of an index as a single file;
     * </ul>
     *
     * Directory locking is implemented by an instance of {@link
     * LockFactory}.
     *
     */
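
The "flat list of files, written once" contract can be illustrated with a tiny in-memory stand-in. This is a sketch under assumptions: `WriteOnceDirectorySketch` and its method names are made up for illustration and do not mirror the real Directory method signatures; files are modelled as whole byte arrays, and locking is left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: a hypothetical in-memory "directory" with write-once files.
final class WriteOnceDirectorySketch {
    // Flat namespace: file name -> content, no sub-directories.
    private final Map<String, byte[]> files = new TreeMap<>();

    // A file is written exactly once, at creation time.
    void createFile(String name, byte[] content) {
        if (files.containsKey(name)) {
            throw new IllegalStateException("file already exists: " + name);
        }
        files.put(name, content.clone());
    }

    // After creation a file can only be read ...
    byte[] openForRead(String name) {
        byte[] data = files.get(name);
        if (data == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return data.clone();
    }

    // ... or deleted.
    void deleteFile(String name) {
        if (files.remove(name) == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
    }

    String[] listAll() {
        return files.keySet().toArray(new String[0]);
    }

    public static void main(String[] args) {
        WriteOnceDirectorySketch dir = new WriteOnceDirectorySketch();
        dir.createFile("_0.cfs", new byte[] {1, 2, 3});
        System.out.println(String.join(",", dir.listAll()));
        try {
            dir.createFile("_0.cfs", new byte[] {9});
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Since nothing is ever modified in place, a reader that has opened a file never sees it change underneath it, which is what makes the point-in-time snapshots described in the next section possible.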

    2.4 IndexWriter

    /**
      An <code>IndexWriter</code> creates and maintains an index.
    
      <p>The {@link OpenMode} option on 
      {@link IndexWriterConfig#setOpenMode(OpenMode)} determines 
      whether a new index is created, or whether an existing index is
      opened. Note that you can open an index with {@link OpenMode#CREATE}
      even while readers are using the index. The old readers will 
      continue to search the "point in time" snapshot they had opened, 
      and won't see the newly created index until they re-open. If 
      {@link OpenMode#CREATE_OR_APPEND} is used IndexWriter will create a 
      new index if there is not already an index at the provided path
      and otherwise open the existing index.</p>
    
      <p>In either case, documents are added with {@link #addDocument(Iterable)
      addDocument} and removed with {@link #deleteDocuments(Term...)} or {@link
      #deleteDocuments(Query...)}. A document can be updated with {@link
      #updateDocument(Term, Iterable) updateDocument} (which just deletes
      and then adds the entire document). When finished adding, deleting 
      and updating documents, {@link #close() close} should be called.</p>
    
      <a name="sequence_numbers"></a>
      <p>Each method that changes the index returns a {@code long} sequence number, which
      expresses the effective order in which each change was applied.
      {@link #commit} also returns a sequence number, describing which
      changes are in the commit point and which are not.  Sequence numbers
      are transient (not saved into the index in any way) and only valid
      within a single {@code IndexWriter} instance.</p>
    
      <a name="flush"></a>
      <p>These changes are buffered in memory and periodically
      flushed to the {@link Directory} (during the above method
      calls). A flush is triggered when there are enough added documents
      since the last flush. Flushing is triggered either by RAM usage of the
      documents (see {@link IndexWriterConfig#setRAMBufferSizeMB}) or the
      number of added documents (see {@link IndexWriterConfig#setMaxBufferedDocs(int)}).
      The default is to flush when RAM usage hits
      {@link IndexWriterConfig#DEFAULT_RAM_BUFFER_SIZE_MB} MB. For
      best indexing speed you should flush by RAM usage with a
      large RAM buffer. Additionally, if IndexWriter reaches the configured number of
      buffered deletes (see {@link IndexWriterConfig#setMaxBufferedDeleteTerms})
      the deleted terms and queries are flushed and applied to existing segments.
      In contrast to the other flush options {@link IndexWriterConfig#setRAMBufferSizeMB} and 
      {@link IndexWriterConfig#setMaxBufferedDocs(int)}, deleted terms
      won't trigger a segment flush. Note that flushing just moves the
      internal buffered state in IndexWriter into the index, but
      these changes are not visible to IndexReader until either
      {@link #commit()} or {@link #close} is called.  A flush may
      also trigger one or more segment merges which by default
      run with a background thread so as not to block the
      addDocument calls (see <a href="#mergePolicy">below</a>
      for changing the {@link MergeScheduler}).</p>
    
      <p>Opening an <code>IndexWriter</code> creates a lock file for the directory in use. Trying to open
      another <code>IndexWriter</code> on the same directory will lead to a
      {@link LockObtainFailedException}.</p>
      
      <a name="deletionPolicy"></a>
      <p>Expert: <code>IndexWriter</code> allows an optional
      {@link IndexDeletionPolicy} implementation to be specified.  You
      can use this to control when prior commits are deleted from
      the index.  The default policy is {@link KeepOnlyLastCommitDeletionPolicy}
      which removes all prior commits as soon as a new commit is
      done.  Creating your own policy can allow you to explicitly
      keep previous "point in time" commits alive in the index for
      some time, either because this is useful for your application,
      or to give readers enough time to refresh to the new commit
      without having the old commit deleted out from under them.
      The latter is necessary when multiple computers take turns opening
      their own {@code IndexWriter} and {@code IndexReader}s
      against a single shared index mounted via remote filesystems
      like NFS which do not support "delete on last close" semantics.
      A single computer accessing an index via NFS is fine with the
      default deletion policy since NFS clients emulate "delete on
      last close" locally.  That said, accessing an index via NFS
      will likely result in poor performance compared to a local IO
      device. </p>
    
      <a name="mergePolicy"></a> <p>Expert:
      <code>IndexWriter</code> allows you to separately change
      the {@link MergePolicy} and the {@link MergeScheduler}.
      The {@link MergePolicy} is invoked whenever there are
      changes to the segments in the index.  Its role is to
      select which merges to do, if any, and return a {@link
      MergePolicy.MergeSpecification} describing the merges.
      The default is {@link LogByteSizeMergePolicy}.  Then, the {@link
      MergeScheduler} is invoked with the requested merges and
      it decides when and how to run the merges.  The default is
      {@link ConcurrentMergeScheduler}. </p>
    
      <a name="OOME"></a><p><b>NOTE</b>: if you hit a
      VirtualMachineError, or disaster strikes during a checkpoint
      then IndexWriter will close itself.  This is a
      defensive measure in case any internal state (buffered
      documents, deletions, reference counts) were corrupted.  
      Any subsequent calls will throw an AlreadyClosedException.</p>
    
      <a name="thread-safety"></a><p><b>NOTE</b>: {@link
      IndexWriter} instances are completely thread
      safe, meaning multiple threads can call any of its
      methods, concurrently.  If your application requires
      external synchronization, you should <b>not</b>
      synchronize on the <code>IndexWriter</code> instance as
      this may cause deadlock; use your own (non-Lucene) objects
      instead. </p>
      
      <p><b>NOTE</b>: If you call
      <code>Thread.interrupt()</code> on a thread that's within
      IndexWriter, IndexWriter will try to catch this (eg, if
      it's in a wait() or Thread.sleep()), and will then throw
      the unchecked exception {@link ThreadInterruptedException}
      and <b>clear</b> the interrupt status on the thread.</p>
    */
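
Three points from the Javadoc above are easier to see in code: changes return increasing sequence numbers, a flush moves buffered documents out of RAM, and nothing becomes visible to readers until commit (or close). The following is a toy model with hypothetical names (`FlushCommitSketch`, `openReader`); it only models the flush-by-document-count trigger and treats a document as a plain string.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: imitates the buffering behaviour described above, not Lucene itself.
final class FlushCommitSketch {
    private final int maxBufferedDocs;
    private final List<String> buffered = new ArrayList<>();   // still in RAM
    private final List<String> flushed = new ArrayList<>();    // written, but not committed
    private final List<String> committed = new ArrayList<>();  // visible to new readers
    private long nextSeqNo = 1;
    int flushCount = 0;

    FlushCommitSketch(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    // Every change returns a sequence number expressing its effective order.
    long addDocument(String doc) {
        buffered.add(doc);
        if (buffered.size() >= maxBufferedDocs) {
            flush();   // triggered by the number of buffered documents
        }
        return nextSeqNo++;
    }

    // Flushing moves buffered state into the index but does not publish it.
    void flush() {
        if (buffered.isEmpty()) return;
        flushed.addAll(buffered);
        buffered.clear();
        flushCount++;
    }

    // Commit publishes everything flushed so far and also returns a sequence number.
    long commit() {
        flush();
        committed.addAll(flushed);
        flushed.clear();
        return nextSeqNo++;
    }

    // A reader sees a point-in-time snapshot of the last commit.
    List<String> openReader() {
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        FlushCommitSketch writer = new FlushCommitSketch(2);
        writer.addDocument("doc-a");
        writer.addDocument("doc-b");
        System.out.println("flushes=" + writer.flushCount + " visible=" + writer.openReader());
        writer.commit();
        System.out.println("after commit visible=" + writer.openReader());
    }
}
```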
    
    /*
     * Clarification: Check Points (and commits)
     * IndexWriter writes new index files to the directory without writing a new segments_N
     * file which references these new files. It also means that the state of
     * the in memory SegmentInfos object is different than the most recent
     * segments_N file written to the directory.
     *
     * Each time the SegmentInfos is changed, and matches the (possibly
     * modified) directory files, we have a new "check point".
     * If the modified/new SegmentInfos is written to disk - as a new
     * (generation of) segments_N file - this check point is also an
     * IndexCommit.
     *
     * A new checkpoint always replaces the previous checkpoint and
     * becomes the new "front" of the index. This allows the IndexFileDeleter
     * to delete files that are referenced only by stale checkpoints.
     * (files that were created since the last commit, but are no longer
     * referenced by the "front" of the index). For this, IndexFileDeleter
     * keeps track of the last non commit checkpoint.
     */
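
The checkpoint comment above boils down to a rule: a file may be deleted once neither the newest checkpoint nor the last commit references it. Below is a minimal sketch of that rule, assuming a keep-only-last-commit policy. `CheckpointSketch`, `checkpoint` and `setOf` are hypothetical names, and the real IndexFileDeleter works with reference counts and a pluggable deletion policy rather than the plain sets used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: tracks which files the last checkpoint and last commit reference.
final class CheckpointSketch {
    private final Set<String> directory = new TreeSet<>();   // files currently on "disk"
    private Set<String> lastCommit = new HashSet<>();
    private Set<String> lastCheckpoint = new HashSet<>();

    void writeFile(String name) {
        directory.add(name);
    }

    // Registers a new checkpoint; if isCommit is true it also becomes the last commit.
    void checkpoint(Set<String> referenced, boolean isCommit) {
        // Files of the stale checkpoint (and of the old commit, when committing)
        // are candidates for deletion.
        Set<String> candidates = new HashSet<>(lastCheckpoint);
        lastCheckpoint = new HashSet<>(referenced);
        if (isCommit) {
            candidates.addAll(lastCommit);
            lastCommit = new HashSet<>(referenced);
        }
        for (String file : candidates) {
            if (!lastCheckpoint.contains(file) && !lastCommit.contains(file)) {
                directory.remove(file);
            }
        }
    }

    List<String> listAll() {
        return new ArrayList<>(directory);
    }

    static Set<String> setOf(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        CheckpointSketch index = new CheckpointSketch();
        index.writeFile("_0");
        index.checkpoint(setOf("_0"), true);          // commit referencing _0
        index.writeFile("_1");
        index.checkpoint(setOf("_0", "_1"), false);   // new segment, not committed
        index.writeFile("_2");                        // say _2 is the merge of _0 and _1
        index.checkpoint(setOf("_2"), false);
        System.out.println(index.listAll());          // _0 survives: the last commit needs it
        index.checkpoint(setOf("_2"), true);
        System.out.println(index.listAll());
    }
}
```

In the walk-through, `_1` was created after the last commit and is dropped as soon as the front of the index stops referencing it, while `_0` has to wait for the next commit.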
  • Original post: https://www.cnblogs.com/davidwang456/p/9935786.html