【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二：内容分析

【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二：内容分析
请先參见“集成Nutch/Hbase/Solr构建搜索引擎之中的一个：安装及执行”。搭建測试环境

http://blog.csdn.net/jediael_lu/article/details/37329731

一、被索引的域 Schema.xml

1、文档基本内容

在使用solr对Nutch抓取到的网页进行索引时，schema.xml被替换成下面内容。

文件里指定了哪些域被索引、存储等内容。
```
<?xml version="1.0" encoding="UTF-8" ?
>
   
<schema name="nutch" version="1.5">
    <types>
        <fieldType name="string" class="solr.StrField" sortMissingLast="true"
            omitNorms="true"/> 
        <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="float" class="solr.TrieFloatField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>
        <fieldType name="date" class="solr.TrieDateField" precisionStep="0"
            omitNorms="true" positionIncrementGap="0"/>


        <fieldType name="text" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.StopFilterFactory"
                    ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0"
                    splitOnCaseChange="1"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
            </analyzer>
        </fieldType>
        <fieldType name="url" class="solr.TextField"
            positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"/>
            </analyzer>
        </fieldType>
    </types>
    <fields>
        <field name="id" type="string" stored="true" indexed="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
        
        <field name="batchId" type="string" stored="true" indexed="false"/>
        <field name="digest" type="string" stored="true" indexed="false"/>
        <field name="boost" type="float" stored="true" indexed="false"/>


        
        <field name="host" type="url" stored="false" indexed="true"/>
        <field name="url" type="url" stored="true" indexed="true"
            required="true"/>
        <field name="content" type="text" stored="false" indexed="true"/>
        <field name="title" type="text" stored="true" indexed="true"/>
        <field name="cache" type="string" stored="true" indexed="false"/>
        <field name="tstamp" type="date" stored="true" indexed="false"/>


        
        <field name="anchor" type="string" stored="true" indexed="true"
            multiValued="true"/>


        
        <field name="type" type="string" stored="true" indexed="true"
            multiValued="true"/>
        <field name="contentLength" type="long" stored="true"
            indexed="false"/>
        <field name="lastModified" type="date" stored="true"
            indexed="false"/>
        <field name="date" type="date" stored="true" indexed="true"/>


        
        <field name="lang" type="string" stored="true" indexed="true"/>


        
        <field name="subcollection" type="string" stored="true"
            indexed="true" multiValued="true"/>


        
        <field name="author" type="string" stored="true" indexed="true"/>
        <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
        <field name="feed" type="string" stored="true" indexed="true"/>
        <field name="publishedDate" type="date" stored="true"
            indexed="true"/>
        <field name="updatedDate" type="date" stored="true"
            indexed="true"/>


        
        <field name="cc" type="string" stored="true" indexed="true"
            multiValued="true"/>
            
            
        <field name="tld" type="string" stored="false" indexed="false"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>content</defaultSearchField>
    <solrQueryParser defaultOperator="OR"/>
</schema>
```
分析上述文件，主要指定了下面内容：

（1）FiledType：域的类型

（2）Field：哪些域被索引、存储等，以及这个域是什么类型。

（3）uniqueKey：哪个域作为id，即文章的唯一标识。

（4）defaultSearchField：默认的搜索域

（5）solrQueryParser：OR。即使用OR来构建Query。

2、Fileds属性分析

（1）Fileds中有2个Field是必填值
- uniqueKey中指定的：id
- 使用了required指定的：url
（2）Fields中有8个基本Field，经过索引后，它们的状态例如以下：

由上可见，boost与tstamp是不被索引的，但被存储了，因此不能够被搜索，但能够呈现。

而content则被索引，但不被存储，因此能够搜索。但不能够在结果中呈现。

二、Nutch抓取的基本流程
```
关于更具体的抓取流程说明。请见 Nutch2.2.1抓取流程
```
```
./bin/crawl urls csdnitpub http://ip:8983/solr/ 5
```
```
 crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
```
```
<seedDir>：放置种子文件的文件夹
```
```
<crawlID> ：这个抓取任务的ID
```
```
<solrURL>：用于索引及搜索的solr地址
```
```
<numberOfRounds>：迭代次数，即抓取深度
```
```
一个nutch抓取流程例如以下：
```
```
（1）InjectorJob
```
```
開始第一个迭代
```
```
（2）GeneratorJob
```
```
（3）FetcherJob
```
```
（4）ParserJob
```
```
（5）DbUpdaterJob
```
```
（6）SolrIndexerJob
```
```
開始第二个迭代
```
```
（2）GeneratorJob
（3）FetcherJob
（4）ParserJob
（5）DbUpdaterJob
（6）SolrIndexerJob
```
```
開始第三个迭代
```
```
……
```
```
一个典型任务的日志例如以下：
```
```
InjectorJob: starting at 2014-07-08 10:41:27
```
```
InjectorJob: Injecting urlDir: urls
```
```
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
```
```
InjectorJob: total number of urls rejected by filters: 0
```
```
InjectorJob: total number of urls injected after normalization and filtering: 2
```
```
Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
```
```
Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
```
```
Generating batchId
```
```
Generating a new fetchlist
```
```
GeneratorJob: starting at 2014-07-08 10:41:34
```
```
GeneratorJob: Selecting best-scoring urls due for fetch.
```
```
GeneratorJob: starting
```
```
GeneratorJob: filtering: false
```
```
GeneratorJob: normalizing: false
```
```
GeneratorJob: topN: 50000
```
```
GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
```
```
GeneratorJob: generated batch id: 1404787293-26339
```
```
Fetching : 
```
```
FetcherJob: starting
```
```
FetcherJob: batchId: 1404787293-26339
```
```
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
```
```
FetcherJob: threads: 50
```
```
FetcherJob: parsing: false
```
```
FetcherJob: resuming: false
```
```
FetcherJob : timelimit set for : 1404798101129
```
```
Using queue mode : byHost
```
```
Fetcher: threads: 50
```
```
QueueFeeder finished: total 2 records. Hit by time limit :0
```
```
fetching http://www.csdn.net/ (queue crawl delay=5000ms)
```
```
Fetcher: throughput threshold: -1
```
```
Fetcher: throughput threshold sequence: 5
```
```
fetching http://www.itpub.net/ (queue crawl delay=5000ms)
```
```
-finishing thread FetcherThread47, activeThreads=48
```
```
-finishing thread FetcherThread46, activeThreads=47
```
```
-finishing thread FetcherThread45, activeThreads=46
```
```
-finishing thread FetcherThread44, activeThreads=45
```
```
-finishing thread FetcherThread43, activeThreads=44
```
```
-finishing thread FetcherThread42, activeThreads=43
```
```
-finishing thread FetcherThread41, activeThreads=42
```
```
-finishing thread FetcherThread40, activeThreads=41
```
```
-finishing thread FetcherThread39, activeThreads=40
```
```
-finishing thread FetcherThread38, activeThreads=39
```
```
-finishing thread FetcherThread37, activeThreads=38
```
```
-finishing thread FetcherThread36, activeThreads=37
```
```
-finishing thread FetcherThread35, activeThreads=36
```
```
-finishing thread FetcherThread34, activeThreads=35
```
```
-finishing thread FetcherThread33, activeThreads=34
```
```
-finishing thread FetcherThread32, activeThreads=33
```
```
-finishing thread FetcherThread31, activeThreads=32
```
```
-finishing thread FetcherThread30, activeThreads=31
```
```
-finishing thread FetcherThread29, activeThreads=30
```
```
-finishing thread FetcherThread48, activeThreads=29
```
```
-finishing thread FetcherThread27, activeThreads=29
```
```
-finishing thread FetcherThread26, activeThreads=28
```
```
-finishing thread FetcherThread25, activeThreads=27
```
```
-finishing thread FetcherThread24, activeThreads=26
```
```
-finishing thread FetcherThread23, activeThreads=25
```
```
-finishing thread FetcherThread22, activeThreads=24
```
```
-finishing thread FetcherThread21, activeThreads=23
```
```
-finishing thread FetcherThread20, activeThreads=22
```
```
-finishing thread FetcherThread19, activeThreads=21
```
```
-finishing thread FetcherThread18, activeThreads=20
```
```
-finishing thread FetcherThread17, activeThreads=19
```
```
-finishing thread FetcherThread16, activeThreads=18
```
```
-finishing thread FetcherThread15, activeThreads=17
```
```
-finishing thread FetcherThread14, activeThreads=16
```
```
-finishing thread FetcherThread13, activeThreads=15
```
```
-finishing thread FetcherThread12, activeThreads=14
```
```
-finishing thread FetcherThread11, activeThreads=13
```
```
-finishing thread FetcherThread10, activeThreads=12
```
```
-finishing thread FetcherThread9, activeThreads=11
```
```
-finishing thread FetcherThread8, activeThreads=10
```
```
-finishing thread FetcherThread7, activeThreads=9
```
```
-finishing thread FetcherThread5, activeThreads=8
```
```
-finishing thread FetcherThread4, activeThreads=7
```
```
-finishing thread FetcherThread3, activeThreads=6
```
```
-finishing thread FetcherThread2, activeThreads=5
```
```
-finishing thread FetcherThread49, activeThreads=4
```
```
-finishing thread FetcherThread6, activeThreads=3
```
```
-finishing thread FetcherThread28, activeThreads=2
```
```
-finishing thread FetcherThread0, activeThreads=1
```
```
fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null
```
```
-finishing thread FetcherThread1, activeThreads=0
```
```
0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
```
```
-activeThreads=0
```
```
FetcherJob: done
```
```
Parsing : 
```
```
ParserJob: starting
```
```
ParserJob: resuming:    false
```
```
ParserJob: forced reparse:      false
```
```
ParserJob: batchId:     1404787293-26339
```
```
Parsing http://www.csdn.net/
```
```
http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
```
```
Parsing http://www.itpub.net/
```
```
ParserJob: success
```
```
CrawlDB update for csdnitpub
```
```
DbUpdaterJob: starting
```
```
DbUpdaterJob: done
```
```
Indexing csdnitpub on SOLR index -> http://ip:8983/solr/
```
```
SolrIndexerJob: starting
```
```
SolrIndexerJob: done.
```
```
SOLR dedup -> http://ip:8983/solr/
```
```
Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
```
```
Generating batchId
```
```
Generating a new fetchlist
```
```
GeneratorJob: starting at 2014-07-08 10:42:19
```
```
GeneratorJob: Selecting best-scoring urls due for fetch.
```
```
GeneratorJob: starting
```
```
GeneratorJob: filtering: false
```
```
GeneratorJob: normalizing: false
```
```
GeneratorJob: topN: 50000
```
```
GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
```
```
GeneratorJob: generated batch id: 1404787338-30453
```
```
Fetching : 
```
```
FetcherJob: starting
```
```
FetcherJob: batchId: 1404787338-30453
```
```
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
```
```
FetcherJob: threads: 50
```
```
FetcherJob: parsing: false
```
```
FetcherJob: resuming: false
```
```
FetcherJob : timelimit set for : 1404798146676
```
```
Using queue mode : byHost
```
```
Fetcher: threads: 50
```
```
QueueFeeder finished: total 0 records. Hit by time limit :0
```
附一些说明：

Crawling the Web is already explained above. You can add more URLs in the seed.txt file and crawl the same.
When a user invokes a crawling command in Apache Nutch 1.x, CrawlDB is generated by Apache Nutch which is nothing but a directory and which contains details about crawling. In Apache 2.x, CrawlDB is not present. Instead, Apache Nutch keeps all the crawling data directly in the database. In our case, we have used Apache HBase, so all crawling data would go inside Apache HBase. The following are details of how each function of crawling works.

A crawling cycle has four steps, in which each is implemented as a Hadoop MapReduce job:
• GeneratorJob
• FetcherJob
• ParserJob (optionally done while fetching using 'fetch.parse')
• DbUpdaterJob
Additionally, the following processes need to be understood:
• InjectorJob
• Invertlinks
• Indexing with Apache Solr
First of all, the job of an Injector is to populate initial rows for the web table. The InjectorJob will initialize crawldb with the URLs that we have provided. We need to run the InjectorJob by providing certain URLs, which will then be inserted into crawlDB.

Then the GeneratorJob will use these injected URLs and perform the operation. The table which is used for input and output for these jobs is called webpage, in which
every row is a URL (web page). The row key is stored as a URL with reversed host components so that URLs from the same TLD and domain can be kept together and
form a group. In most NoSQL stores, row keys are sorted and give an advantage.
Using specific rowkey filtering, scanning will be faster over a subset, rather than scanning over the entire table. Following are the examples of rowkey listing:
• org.apache..www:http/
• org.apache.gora:http/
Let's define each step in depth so that we can understand crawling step-by-step.
Apache Nutch contains three main directories, crawlDB, linkdb, and a set of segments. crawlDB is the directory which contains information about every URL that is known to Apache Nutch. If it is fetched, crawlDB contains the details when it was fetched. The linkdatabase or linkdb contains all the links to each URL which will include source URL and also the anchor text of the link. A set of segments is a URL set, which is fetched as a unit. This directory will contain the following subdirectories:
• A crawl_generate job will be used for a set of URLs to be fetched
• A crawl_fetch job will contain the status of fetching each URL
• A content will contain the content of rows retrieved from every URL
Now let's understand each job of crawling in detail.

三、与抓取相关的一些hbase操作

（1）进入bhase shell

[root@jediael44 hbase-0.90.4]# ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

（2）查看全部表
hbase(main):001:0> list
TABLE
20140710_webpage
TestCrawl_webpage
csdnitpub_webpage
stackoverflow_webpage
4 row(s) in 0.7620 seconds

（3）查看某个表的结构
hbase(main):002:0> describe '20140710_webpage'
DESCRIPTION ENABLED
{NAME => '20140710_webpage', FAMILIES => [{NAME => 'f', BLOOMFILTER => 'NONE', true
REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '21474
83647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAM
E => 'h', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COM
PRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'fa
lse', BLOCKCACHE => 'true'}, {NAME => 'il', BLOOMFILTER => 'NONE', REPLICATION_
SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOC
KSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'mk', B
LOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION =>
'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCK
CACHE => 'true'}, {NAME => 'mtdt', BLOOMFILTER => 'NONE', REPLICATION_SCOPE =>
'0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE =>
'65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'ol', BLOOMFILTE
R => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE =>
'true'}, {NAME => 'p', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSION
S => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_
MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 's', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '21474
83647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.0880 seconds

（4）查看某列的HTML文件

hbase(main):010:0* get '20140710_webpage', 'ar.com.admival.www:http/', 'f:cnt'
COLUMN CELL
f:cnt timestamp=1404983104070, value=x0Dx0A<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:
o="urn:schemas-microsoft-com:office:office" xmlns="http://www.w3.org/TR/REC-html40">x0Dx0
Ax0Dx0A<head>x0Dx0A<meta http-equiv="Content-Type" content="text/html; charset=windows-
1252">x0Dx0A<meta http-equiv="Content-Language" content="es">x0Dx0Ax0Dx0A<title>CAMBI
O DE CHEQUES | cambio de cheque | cheque | cheques | compra de cheques de terceros | cambio
cheques | cheques no a la orden </title>x0Dx0A<link rel="shortcut icon" ref="favivon.ico
" />x0Dx0A<meta name="GENERATOR" content="Microsoft FrontPage 5.0">x0Dx0A<meta name="Pr
ogId" content="FrontPage.Editor.Document">x0Dx0A<meta name=keywords content="cambio de ch
eques, compro cheques, canje de cheques, Compro cheques de terceros, cambio cheques">x0Dx
0A<meta name="description" http-equiv="description" content="Admival.com.ar cambia por efec
tivo sus cheques de terceros, posdatados, al dxECa, de Sociedades, personales, cheques no
a la orden, cheques de indemnizacixF3n por despido, etc. " /> x0Dx0A<meta name="robots"
content="index,follow">

未完……

当中行的名称能够通过scan '20140710_webpage'得到。

四、ant runtime所构建的内容

1、构建Nutch

tar -zxvf apache-nutch-2.2.1-src.tar.gz

cd apache-nutch-2.2.1

ant runtime

2、    ant构建之后，生成runtime目录，该目录以下有deploy和local目录，分别代表了nutch的两种执行方式：

Deploy：的数据必须执行在Hadoop的HDFS中

local：是执行在本地文件夹中。

（1）二者的文件夹结构例如以下：

[jediael@jediael44 runtime]$ ls deploy/ local/

deploy/:

apache-nutch-2.2.1.job bin

local/:

bin conf lib logs plugins test

在deploy中，文件被打包成一个Job，作为Hadoop的一个Job来执行。

（2）二者文件夹下均有一个bin的文件夹。其内包括同样的crawl与nutch两个运行文件。

我们查看nutch文件的最后几行

if $local; then

# fix for the external Xerceslib issue with SAXParserFactory

NUTCH_OPTS="-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl$NUTCH_OPTS"

EXEC_CALL="$JAVA$JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"

else

# check that hadoop can befound on the path

if [ $(which hadoop | wc -l )-eq 0 ]; then

    echo "Can't findHadoop executable. Add HADOOP_HOME/bin to the path or run in local mode."

    exit -1;

fi

# distributed mode

EXEC_CALL="hadoop jar$NUTCH_JOB"

fi

# run it

exec $EXEC_CALL $CLASS "$@"

即默认情况下为 EXEC_CALL="hadoop jar$NUTCH_JOB"。若为Local，则 EXEC_CALL="$JAVA$JAVA_HEAP_MAX $NUTCH_OPTS -classpath $CLASSPATH"。若未local，且hadoop不存在。则报错。

（3）依据參数确定类文件

if [ "$COMMAND" = "crawl" ] ; then
CLASS=org.apache.nutch.crawl.Crawler
elif [ "$COMMAND" = "inject" ] ; then
CLASS=org.apache.nutch.crawl.InjectorJob
elif [ "$COMMAND" = "hostinject" ] ; then
CLASS=org.apache.nutch.host.HostInjectorJob
elif [ "$COMMAND" = "generate" ] ; then
CLASS=org.apache.nutch.crawl.GeneratorJob
elif [ "$COMMAND" = "fetch" ] ; then
CLASS=org.apache.nutch.fetcher.FetcherJob
elif [ "$COMMAND" = "parse" ] ; then
CLASS=org.apache.nutch.parse.ParserJob
elif [ "$COMMAND" = "updatedb" ] ; then
CLASS=org.apache.nutch.crawl.DbUpdaterJob
elif [ "$COMMAND" = "updatehostdb" ] ; then
CLASS=org.apache.nutch.host.HostDbUpdateJob
elif [ "$COMMAND" = "readdb" ] ; then
CLASS=org.apache.nutch.crawl.WebTableReader
elif [ "$COMMAND" = "readhostdb" ] ; then
CLASS=org.apache.nutch.host.HostDbReader
elif [ "$COMMAND" = "elasticindex" ] ; then
CLASS=org.apache.nutch.indexer.elastic.ElasticIndexerJob
elif [ "$COMMAND" = "solrindex" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrIndexerJob
elif [ "$COMMAND" = "solrdedup" ] ; then
CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates
elif [ "$COMMAND" = "parsechecker" ] ; then
CLASS=org.apache.nutch.parse.ParserChecker
elif [ "$COMMAND" = "indexchecker" ] ; then
CLASS=org.apache.nutch.indexer.IndexingFiltersChecker
elif [ "$COMMAND" = "plugin" ] ; then
CLASS=org.apache.nutch.plugin.PluginRepository

如。对于nutch fetch命令，相应的类文件应该是：org.apache.nutch.fetcher.FetcherJob

[jediael@jediael44 java]$ cat org/apache/nutch/fetcher/FetcherJob.java

能够查看类文件。
此方法能够查看一切的shell相应的源文件。

五、改动Solr中browse返回内容

1、从content域中搜索

从solr的example中得到的solrConfig.xml中，qf的定义例如以下：
[html] view plain copy

<str name="qf">
   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
   title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>

因为content不占不论什么的权重，因此假设某个文档仅仅在content中包括keyword的话，搜索结果并不会返回这个文档。因此，对于nutch提取的索引来说，要添加content的权重，以及url的权重（假设须要的话）：
[html] view plain copy

<str name="qf">
   content^1.0 text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
   title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>
2、保存网页的content内容

将schema.xml中的
```
 <field name="content" type="text" stored="false" indexed="true"/>
```
改为
```
        <field name="content" type="text" stored="true" indexed="true"/>
```
3、同一时候显示网页文件与一般文本

velocity/results_list.vm
```
##parse("hit_plain.vm")
```
将凝视去掉。

4、调整每一个搜索返回项的显示内容

vi richtest_doc.vm
```
<div>
  Id: #field('id')
</div>
```
改成：
```
<div>
  time: #field('tstamp')
</div>
<div>
  score: #field('score')
</div>
```
这种方法能够改动其他字段，详见http://blog.csdn.net/jediael_lu/article/details/38039267
1、从content域中搜索

从solr的example中得到的solrConfig.xml中，qf的定义例如以下：

[html] view plain copy

<str name="qf">
   text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
   title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>

因为content不占不论什么的权重，因此假设某个文档仅仅在content中包括keyword的话。搜索结果并不会返回这个文档。因此，对于nutch提取的索引来说，要添加content的权重，以及url的权重（假设须要的话）：

[html] view plain copy

<str name="qf">
   content^1.0 text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
   title^10.0 description^5.0 keywords^5.0 author^2.0 resourcename^1.0
</str>

2、保存网页的content内容

将schema.xml中的

<field name="content" type="text" stored="false" indexed="true"/>
改为

<field name="content" type="text" stored="true" indexed="true"/>

3、同一时候显示网页文件与一般文本

velocity/results_list.vm

##parse("hit_plain.vm")
将凝视去掉。

4、调整每一个搜索返回项的显示内容

vi richtest_doc.vm

<div> Id: #field('id') </div>
改成：

<div> time: #field('tstamp') </div> <div> score: #field('score') </div>
这种方法能够改动其他字段，详见http://blog.csdn.net/jediael_lu/article/details/38039267
相关阅读:
【2017 Multi-University Training Contest
【“玲珑杯”ACM比赛 Round #20 H】康娜的数学课
 【hdu 6181】Two Paths
Cache coherence protocol
C#实现图片的无损压缩
 收集一些常用的正则表达式
 收集一些常用的正则表达式
 收集一些常用的正则表达式
 Sql中存储过程的定义、修改和删除操作
 Sql中存储过程的定义、修改和删除操作
原文地址：https://www.cnblogs.com/gccbuaa/p/6994881.html

【Nutch2.2.1基础教程之2.2】集成Nutch/Hbase/Solr构建搜索引擎之二：内容分析

Nutch2.2.1抓取流程