• How does SharePoint search work in a multilingual environment?


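    In a multilingual deployment, how do you make sure that content in a given language is indexed and can be found by queries?

    What rules govern the indexing and querying of site content, Office documents, and TXT documents under different language settings?

    The following text comes from a white paper for SharePoint 2003.

    In my own testing, its description of multilingual search remains accurate for SharePoint 2010.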

    How Search Service Manages Multilingual Contents

    ===========================================

    The following sections provide information on how the SharePointPSSearch service treats content at index time and at query time, depending on the type of the content, its location, and user language settings.

     

    Language Considerations for SharePoint Portal Server Search Service

    -----------------------

    Indexing

    • For documents that have a locale ID (LCID) and a language ID that are supported by the SharePointPSSearch service (such as Microsoft Office documents), the SharePointPSSearch service uses the locale ID and language ID associated with the document to identify the language resources that are used to index the document.
    • For documents that have an LCID and a language ID that are not supported by the SharePointPSSearch service, the neutral word breaker is used to index the document.
    • For documents that use a .txt extension, the SharePointPSSearch service uses the default locale of the indexing server to index the content.
    • For Windows SharePoint Services sites that are in the Site Directory, the SharePointPSSearch service uses the language of the Windows SharePoint Services site.
    • For sites based on SharePoint Portal Server, the SharePointPSSearch service uses the language of the site.
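
    The document-related rules in the list above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the service's implementation: the function name, the `NEUTRAL_LCID` stand-in value, and the idea of passing the supported LCIDs as a set are all hypothetical.

```python
NEUTRAL_LCID = 0  # Stand-in value meaning "use the neutral word breaker".


def pick_index_lcid(filename, document_lcid, server_default_lcid, supported_lcids):
    """Choose the LCID whose language resources index a document."""
    # .txt files are indexed with the default locale of the indexing server.
    if filename.lower().endswith(".txt"):
        return server_default_lcid
    # A supported LCID on the document selects that language's resources.
    if document_lcid in supported_lcids:
        return document_lcid
    # An unsupported LCID falls back to the neutral word breaker.
    return NEUTRAL_LCID


supported = {1033, 1041}  # en-US and ja-JP, chosen only for this example
print(pick_index_lcid("report.docx", 1041, 1033, supported))
print(pick_index_lcid("notes.txt", 1041, 1033, supported))
```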

     

    Querying

    • Where possible, the SharePoint Portal Server Search service tries to identify the language of the user by using either the default language of the user's Web browser, or by using the optional LCID parameter in the query string to determine the language resources that are used to break the query terms.

    If the language ID is not supported, the neutral language resources are used.
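
    The query-time rule above can be sketched as follows. This is a rough illustration, not how the service is implemented: the `lcid` parameter name, the `SUPPORTED_LCIDS` set, the `TAG_TO_LCID` table, and the choice to let the query-string value win over the browser language are all assumptions. The white paper does not say which of the two takes precedence.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical set of LCIDs that have language resources installed.
SUPPORTED_LCIDS = {1033, 1031, 1041, 2052}
# Hypothetical mapping from browser language tags to LCIDs.
TAG_TO_LCID = {"en-us": 1033, "de-de": 1031, "ja-jp": 1041, "zh-cn": 2052}
NEUTRAL_LCID = 0  # Stand-in value for the neutral language resources.


def resolve_query_lcid(url, browser_language):
    """Pick the LCID whose resources break the query terms."""
    params = parse_qs(urlparse(url).query)
    candidate = None
    # Assumption: an explicit lcid parameter wins over the browser language.
    if "lcid" in params:
        try:
            candidate = int(params["lcid"][0])
        except ValueError:
            candidate = None
    if candidate is None and browser_language:
        candidate = TAG_TO_LCID.get(browser_language.lower())
    # Unsupported or unknown languages use the neutral resources.
    return candidate if candidate in SUPPORTED_LCIDS else NEUTRAL_LCID
```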

     

    Language Considerations for Windows SharePoint Services Search Service

    -------------------

    Indexing

    • For text that is stored directly in list elements (and not stored in documents), the column language is used to determine the language resources that are used for indexing.
    • For documents that are stored in document libraries, the SharePointPSSearch service indexing rules apply.

    Querying

    The default language of the column is used to determine the language resources that are used to break the query terms.

     

    ===========================================================================

    The following text comes from another white paper, "Plan for building multilingual solutions", whose explanation of the search components and processes is a classic.

    Overview of the Language Features in Search

    ==============

    Content crawling and querying are different processes provided by Search. Each process occurs at different times and uses resources in different ways.

     

    Content crawling   

    The index engine uses a pipe of shared memory to request that the Filter Daemon begin filtering the content source. For the crawl process to succeed, the content source must have an associated protocol handler that can read its protocol.

     

    The Filter Daemon invokes the appropriate protocol handler for the content source based on the start address provided by the index engine. The Filter Daemon uses protocol handlers and IFilters to extract and filter individual items from the content source. Appropriate IFilters for each document are applied, and the Filter Daemon passes the extracted text and metadata to the index engine through the pipe.

     

    At this point, the index engine saves the document properties to a property store separate from that of the content index. The property store consists of a table of properties and their values. The properties in this store can be retrieved and sorted as needed. In addition, simple queries against the properties are supported by the store. Each row in the table corresponds to a separate document in the full-text index. The actual text of a content item is stored in the content index, so it can be used for content queries. The property store also maintains and enforces the document-level security settings that are gathered when a document is crawled. After the initial crawl, the index engine uses word breakers and stemmers to further process the text and properties gathered during the crawl. The word breaker component is used to split the text into logical words and phrases. The index engine also removes "noise words" (that is, words that do not add value to a query) and creates an inverted index for full-text searching.
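
    To make the last step concrete, here is a minimal sketch of an inverted index built after noise words are removed. It is a toy model, not the index engine's real data structure: the `NOISE_WORDS` set is hypothetical, and a lowercase white-space split stands in for a real word breaker.

```python
from collections import defaultdict

# Hypothetical noise words; the real lists live in language-specific files.
NOISE_WORDS = {"a", "and", "the"}


def build_inverted_index(documents):
    """Map each non-noise word to the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Stand-in for a word breaker: lowercase and split on white space.
        for word in text.lower().split():
            if word not in NOISE_WORDS:
                postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


docs = {1: "the quick fox", 2: "a quick dog and the fox"}
index = build_inverted_index(docs)
print(index["quick"])  # IDs of the documents that contain "quick"
```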

     

    Search query execution   

    When a search query is executed for a given language, the query engine passes that query to a word breaker for that language.

    If there is no word breaker for the language of the query, the neutral word breaker is used; the neutral word breaker uses white spaces to split the words.

    After the word-breaking process, results are passed, optionally, through a language-specific stemmer (when one is available). The stemming component is used to generate inflected forms of a given word, such as from a plural to a singular ("dogs" versus "dog") or the tense of a verb ("spoke" versus "spoken").

    The use of the word breaker in both the crawling and query processes enhances the effectiveness of Search because more relevant alternatives to a user’s query phrasing can be generated.
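
    The query-side steps described above (word breaking, then optional stemming) can be sketched like this. The `INFLECTIONS` table is a hypothetical stand-in for a language-specific stemmer and only covers the examples from the text; the neutral word breaker is modeled as a plain white-space split.

```python
# Hypothetical inflection table; a real stemmer is language-specific.
INFLECTIONS = {
    "dog": {"dog", "dogs"},
    "dogs": {"dog", "dogs"},
    "spoke": {"spoke", "spoken"},
    "spoken": {"spoke", "spoken"},
}


def neutral_word_break(query):
    """Neutral word breaker: split only where white space occurs."""
    return query.split()


def expand_query(query):
    """Break the query, then attach inflected forms where the table has them."""
    expanded = []
    for term in neutral_word_break(query.lower()):
        # Terms without an entry are searched as-is.
        expanded.append(sorted(INFLECTIONS.get(term, {term})))
    return expanded


print(expand_query("Dogs spoke loudly"))
```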

    When the query engine executes a property value query, the index is not touched and the query is performed directly against Microsoft SQL Server™ where properties are stored to ensure a proper match.

    The result of the query is a list of all matching documents, ordered by their relevance to the query words. If the user does not have permissions to a matching document, the query engine removes that document from the returned list.
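
    The final filtering and ordering step can be pictured with a short sketch. The data shapes here (a list of `(doc_id, relevance)` pairs and a set of readable IDs) are assumptions made for illustration; the white paper does not describe how the query engine represents them.

```python
def trim_and_rank(matches, readable_ids):
    """Drop documents the user cannot read, then order by relevance.

    matches: list of (doc_id, relevance) pairs.
    readable_ids: set of document IDs the user has permission to see.
    """
    visible = [(doc_id, score) for doc_id, score in matches if doc_id in readable_ids]
    # Highest relevance first.
    ranked = sorted(visible, key=lambda match: match[1], reverse=True)
    return [doc_id for doc_id, _ in ranked]
```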

    The following diagram provides a detailed view of the language features in Search at both index and query time:

    [Diagram: language features in Search at index time and query time (image not available)]

     
    Language-specific Features Provided and Used by the Search Service

    ===========

    Word breakers   

    A word breaker is a component used by the query and index engines to break compound words and phrases into individual words or tokens. If there is no word breaker for a specific language, the neutral word breaker is used, in which case word breaking occurs where there are white spaces between the words and phrases. At indexing time, if there is any locale information associated with the document (for example, a Word document contains locale information for each text chunk), the index engine will try to use the word breaker for that locale. If the document does not contain any locale information, the user locale of the computer the indexer is installed on is used instead. At query time, the locale (HTTP_ACCEPT_LANGUAGE) of the browser from which the query was sent is used to perform word breaking on the query. Additional information about the language availability of the word breaker component is available in Appendix B: Search Language Considerations.
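
    As an illustration of the query-time rule, the sketch below parses an Accept-Language header value and picks a word breaker. The white paper only says that the browser locale is used; how multiple entries are handled is not stated, so taking the first entry that has an installed word breaker (and trying the primary language subtag) is an assumption, as are the function names and the `installed` set.

```python
def preferred_languages(accept_language):
    """Return language tags from an Accept-Language value, best first."""
    weighted = []
    for position, part in enumerate(accept_language.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # Default weight when no q-value is given.
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # Sort by descending quality, then by original position.
        weighted.append((-quality, position, tag.strip().lower()))
    return [tag for _, _, tag in sorted(weighted)]


def pick_word_breaker(accept_language, installed):
    """installed: hypothetical set of language tags that have a word breaker."""
    for tag in preferred_languages(accept_language):
        if tag in installed:
            return tag
        primary = tag.split("-")[0]
        if primary in installed:
            return primary
    # No match: fall back to the neutral (white-space) word breaker.
    return "neutral"
```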

     

    Noise words dictionary

    Noise words are words that do not add value to a query, such as “and,” “the,” and “a.” The indexing engine filters them to save index space and to increase performance. Noise word files are customizable, language-specific text files. These files are a simple list of words, one per line. If a noise word file is changed, you must perform a full update of the index to incorporate the changes. Additional information about the noise words dictionary and how to customize it is available at www.microsoft.com.
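
    Since the file format is just one word per line, reading and applying such a list is easy to sketch. The helper names are hypothetical, and an in-memory `io.StringIO` stands in for an actual noise word file so that no file name has to be assumed.

```python
import io


def load_noise_words(lines):
    """Read one noise word per line, ignoring blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def strip_noise(tokens, noise_words):
    """Keep only the tokens that are not noise words."""
    return [token for token in tokens if token.lower() not in noise_words]


# In-memory stand-in for a noise word file.
noise_file = io.StringIO("and\nthe\na\n")
noise = load_noise_words(noise_file)
print(strip_noise("The cat and a dog".split(), noise))
```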

     

    Custom dictionary

    The custom dictionary file contains values that the search server must include at index and query times. Custom dictionary lists are customizable, language-specific text files. These files are used by Search in both the index and query processes to identify exceptions to the noise word dictionaries. A word such as “AT&T,” for example, will never be indexed by default because the word breaker breaks it into single noise words. To avoid this, the user can add “AT&T” to the custom dictionary file; as a result, this word will be treated as an exception by the word breaker and will be indexed and queried. These files contain a simple list of words, one per line. If the custom dictionary file is changed, you must perform a full update of the index to incorporate the changes. By default, no custom dictionary file is installed during Office SharePoint Server 2007 Setup. Additional information about the custom dictionary file and how to customize it is available at www.microsoft.com.
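
    The “AT&T” example can be reproduced with a toy word breaker. Everything here is a simplified assumption: splitting on non-word characters only approximates what a real word breaker does, and the `NOISE_WORDS` set is hypothetical.

```python
import re

# Hypothetical noise words, including the fragments that "AT&T" breaks into.
NOISE_WORDS = {"at", "t", "the"}


def break_words(text, custom_dictionary):
    """Toy word breaker that leaves custom dictionary entries intact."""
    tokens = []
    for chunk in text.lower().split():
        if chunk in custom_dictionary:
            # Exception: keep the entry as one token.
            tokens.append(chunk)
        else:
            # Otherwise split on anything that is not a letter, digit, or underscore.
            tokens.extend(part for part in re.split(r"\W+", chunk) if part)
    return tokens


def index_terms(text, custom_dictionary):
    """Break the text, then drop noise words."""
    return [t for t in break_words(text, custom_dictionary) if t not in NOISE_WORDS]


print(index_terms("Call AT&T support", set()))
print(index_terms("Call AT&T support", {"at&t"}))
```

    Without the custom dictionary entry, the fragments of “AT&T” are all noise words and nothing of it reaches the index; with the entry, the term survives as a single token.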

     

    Thesaurus   

    There is a configurable thesaurus file for each language that Search supports. Using the thesaurus, you can specify synonyms for words and also automatically replace words in a query with other words that you specify. The thesaurus used will always be in the language of the query, not necessarily the server’s user locale. If a language-specific thesaurus is not available, a neutral thesaurus (tseneu.xml) is used. Additional information about the thesaurus file and how to customize it is available at www.microsoft.com.
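
    The two thesaurus behaviors described above, synonym expansion and word replacement, can be sketched with plain Python data. This does not model the real thesaurus file format; the function name and the sample entries are hypothetical.

```python
def apply_thesaurus(terms, expansions, replacements):
    """Return, for each query term, the sorted list of words to search for.

    expansions: list of synonym groups (sets of words).
    replacements: dict mapping a word to the word that replaces it.
    """
    result = []
    for term in terms:
        # Replacement: swap the word outright.
        term = replacements.get(term, term)
        synonyms = {term}
        # Expansion: add every synonym from any group containing the term.
        for group in expansions:
            if term in group:
                synonyms |= group
        result.append(sorted(synonyms))
    return result


print(apply_thesaurus(["colour", "car"], [{"car", "automobile"}], {"colour": "color"}))
```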

     

    References:

    Whitepaper (legacy): Using SPS 2003 in multilingual scenarios

     

    White paper: Plan for building multilingual solutions

    http://technet.microsoft.com/en-us/library/cc262942%28office.12%29.aspx

  • Original post: https://www.cnblogs.com/awpatp/p/1815014.html