• Nutch Categorized Search


Environment:

Ubuntu 11.10
Tomcat 6.0.35
Nutch 1.2

My approach to categorized search is to build a separate crawl database for each set of URLs. For example, a vertical search engine for the power industry can be split into news, products, and talent (recruitment): three crawl databases are created, each with its own list of seed URLs, and URL filter rules are then configured to get the desired results.

Below I explain the implementation step by step.

First, collect a list of seed URLs for each category. You can search Baidu by category and compile the three lists from the results.

Here are the three lists I put together.

News (file name: newsURL):

    http://www.cpnn.com.cn/

    http://news.bjx.com.cn/

    http://www.chinapower.com.cn/news/


Products (file name: productURL):

    http://www.powerproduct.com/

    http://www.epapi.com/

    http://cnc.powerproduct.com/

Talent (file name: talentURL):

    http://www.cphr.com.cn/

    http://www.ephr.com.cn/

    http://www.myepjob.com/

    http://www.epjob88.com/

    http://hr.bjx.com.cn/

    http://www.epjob.com.cn/

    http://ep.baidajob.com/

    http://www.01hr.com/

Since this is only a test, I did not collect too many seed URLs.

For a vertical search engine you can no longer crawl with the all-in-one nutch crawl command (nutch crawl urls -dir crawl -depth <d> -topN <n> -threads <t>); that command is intended for intranet/whole-site search and cannot do incremental crawling. Instead I use an incremental crawl script that others have already written, available at the address below; for comparison, the one-shot form of the command is also shown after the link.

Script source: http://wiki.apache.org/nutch/Crawl
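A typical invocation of the one-shot command, just for comparison (the depth/topN/threads values here are placeholders, not recommendations):

# single-pass crawl: inject + generate/fetch/update loop + invertlinks + index, all in one go
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000 -threads 10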

Because three crawl databases are needed, the script has to be modified slightly. My crawl databases live in /crawldb/news, /crawldb/product and /crawldb/talent, and the three seed URL files go into the corresponding directories: /crawldb/news/newsURL, /crawldb/product/productURL and /crawldb/talent/talentURL. My modified crawl script is listed below; it requires the NUTCH_HOME and CATALINA_HOME environment variables to be set. A short setup sketch follows this paragraph.
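Note that the script calls bin/nutch inject crawl/crawldb urls, and in Nutch 1.x the second argument to inject is a directory of seed files, so each crawl root needs a urls subdirectory holding its seed list. A minimal setup sketch under that assumption (the directory layout is mine, adjust to your own):

# create the three crawl roots, each with a urls/ directory for its seed list
mkdir -p /crawldb/news/urls /crawldb/product/urls /crawldb/talent/urls

# drop each seed file into the matching urls/ directory
mv newsURL    /crawldb/news/urls/
mv productURL /crawldb/product/urls/
mv talentURL  /crawldb/talent/urls/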



#!/bin/bash
############################ Power news: incremental crawl ############################
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

echo "----- Starting incremental crawl of power industry news -----"
cd /crawldb/news

depth=5
threads=100
adddays=5
topN=5000 # Comment this statement if you don't want to set topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
if [ "$1" == "safe" ]
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index

echo "runbot: FINISHED: ----- Power news incremental crawl complete! -----"
echo ""

############################ Power products: incremental crawl ############################
echo "----- Starting incremental crawl of power products -----"
cd /crawldb/product

steps=8

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index

echo "runbot: FINISHED: ----- Power products incremental crawl complete! -----"
echo ""

############################ Power talent (recruitment): incremental crawl ############################
echo "----- Starting incremental crawl of power industry talent sites -----"
cd /crawldb/talent

steps=8

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN \
      -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi
mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb \
    crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"
${CATALINA_HOME}/bin/shutdown.sh
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi
mv $MVARGS crawl/NEWindex crawl/index
${CATALINA_HOME}/bin/startup.sh

echo "runbot: FINISHED: ----- Power talent incremental crawl complete! -----"
echo ""

Copy the script above to your Linux machine and make it executable with chmod 755.
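For example (the script name runbot.sh and the cron schedule below are only an illustration, not part of the original setup):

chmod 755 /crawldb/runbot.sh

# one-off run, keeping temporary directories for inspection
/crawldb/runbot.sh safe

# or re-crawl every night at 01:00 (crontab -e); make sure NUTCH_HOME and
# CATALINA_HOME are also set in the cron environment
0 1 * * * /crawldb/runbot.sh >> /var/log/runbot.log 2>&1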

Before crawling, URL filter rules still need to be configured in $NUTCH_HOME/conf/regex-urlfilter.txt.

My configuration is as follows:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# accept URLs containing characters that usually indicate queries
# (note: the stock Nutch filter uses '-' here to skip them)
+[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
-.*\.js

# accept only hosts belonging to the three categories above
+^http://([a-z0-9]*\.)*cpnn.com.cn/
+^http://([a-z0-9]*\.)*cphr.com.cn/
+^http://([a-z0-9]*\.)*powerproduct.com/
+^http://([a-z0-9]*\.)*bjx.com.cn/
+^http://([a-z0-9]*\.)*renhe.cn/
+^http://([a-z0-9]*\.)*chinapower.com.cn/
+^http://([a-z0-9]*\.)*ephr.com.cn/
+^http://([a-z0-9]*\.)*epapi.com/
+^http://([a-z0-9]*\.)*myepjob.com/
+^http://([a-z0-9]*\.)*epjob88.com/
+^http://([a-z0-9]*\.)*xindianli.com/
+^http://([a-z0-9]*\.)*epjob.com.cn/
+^http://([a-z0-9]*\.)*baidajob.com/
+^http://([a-z0-9]*\.)*01hr.com/
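To sanity-check these rules before starting a crawl, you can pipe a few URLs through Nutch's URL filter checker. The class name and the -allCombined option below come from the Nutch 1.x code base; if they differ in your build, treat this as a sketch only:

# prints +URL for accepted URLs and -URL for rejected ones
echo "http://news.bjx.com.cn/some-article.html" | \
  $NUTCH_HOME/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined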

Next, configure $NUTCH_HOME/conf/nutch-site.xml as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>just a test</value>
    <description>Test</description>
  </property>
</configuration>

If all the steps above succeeded, you can now crawl with the script. Pay attention to where your crawl data is stored, and adjust the corresponding paths in the script to match your own directory layout.
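After a run, a quick way to confirm that pages actually landed in each crawl database is Nutch's readdb tool, for example (paths follow the layout used above):

cd /crawldb/news
$NUTCH_HOME/bin/nutch readdb crawl/crawldb -stats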

Once crawling has finished, the next step is to set up the search front end.

Copy the WAR file from the Nutch directory into Tomcat's webapps directory and let Tomcat unpack it. Delete everything already in the ROOT directory, copy the contents of the freshly unpacked directory into ROOT, and edit WEB-INF/classes/nutch-site.xml as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>searcher.dir</name>
    <value>/crawldb/news/crawl</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>tangmiSpider</value>
    <description>My Search Engine</description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js)|analysis-(zh)|index-basic|query-(basic|site|url)|summary-lucene|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>

The searcher.dir value is the directory where your crawl data lives; change it to match your setup. Then create two more directories under webapps, named talent and product, copy the contents of the unpacked Nutch web app into each of them, and edit their WEB-INF/classes/nutch-site.xml files so that searcher.dir points to /crawldb/talent/crawl and /crawldb/product/crawl respectively. Categorized search is now available: to search a given category, simply open the corresponding URL. A rough deployment sketch follows.
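To pull the deployment steps together, here is a rough sketch of the Tomcat side. The WAR file name and port are the stock Nutch 1.2 / Tomcat defaults and may differ in your installation:

# deploy the Nutch web app as ROOT (this copy serves the news index)
cp $NUTCH_HOME/nutch-1.2.war $CATALINA_HOME/webapps/
# wait for Tomcat to unpack the WAR, then:
rm -rf $CATALINA_HOME/webapps/ROOT/*
cp -r $CATALINA_HOME/webapps/nutch-1.2/* $CATALINA_HOME/webapps/ROOT/

# two more copies for the other categories
cp -r $CATALINA_HOME/webapps/nutch-1.2 $CATALINA_HOME/webapps/product
cp -r $CATALINA_HOME/webapps/nutch-1.2 $CATALINA_HOME/webapps/talent

# point searcher.dir in each copy's WEB-INF/classes/nutch-site.xml at
# /crawldb/news/crawl, /crawldb/product/crawl and /crawldb/talent/crawl,
# restart Tomcat, then search each category at:
#   http://localhost:8080/          (news)
#   http://localhost:8080/product/
#   http://localhost:8080/talent/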

Finally, my search results pages for the three categories (screenshots).
