• Deploying Nutch 2.3 with Hadoop 2.7.1, HBase 1.0.1.1 and Solr 5.2.1 (Part 3)


    

    Prerequisites:

    hadoop 2.7.1
    hbase 0.98.13
    solr 5.2.1 / Apache Solr 4.8.1
    http://archive.apache.org/dist/lucene/solr/4.8.1/
    gora 0.6.1


    Building Gora, then building and deploying Nutch

    1. Download Gora

    The latest Gora release is 0.6.1. Download it, or fetch the source directly with git: git clone https://github.com/apache/gora.git

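    2. Edit Gora's pom.xml

    The settings below are probably what finally got Nutch 2.3 to run. There is no 1.0.1.1-hadoop2 version of HBase, so 0.98.13-hadoop2 is used :)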

    <hadoop-1.version>1.2.1</hadoop-1.version>
    <hadoop-2.version>2.7.1</hadoop-2.version>
    <hadoop-1.test.version>1.2.1</hadoop-1.test.version>
    <hadoop-2.test.version>2.7.1</hadoop-2.test.version>
    <hbase.version>0.98.13-hadoop2</hbase.version>
    <hbase.test.version>0.98.13-hadoop2</hbase.test.version>
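The property edits above can also be scripted. The following is a minimal sketch, assuming each property sits on a single line in the form `<name>value</name>`; the helper name `set_pom_property` is hypothetical and is not part of Gora or Maven.

```shell
#!/bin/sh
# Hypothetical helper (not part of Gora or Maven): replace the value of a
# single-line <name>value</name> property in a pom.xml file.
# Assumption: the value contains no "|" or "&" characters.
set_pom_property() {
    pom="$1"; name="$2"; value="$3"
    # Write to a temporary file first so the original survives a failed sed.
    tmp="${pom}.tmp.$$"
    sed "s|<${name}>[^<]*</${name}>|<${name}>${value}</${name}>|" "$pom" > "$tmp" \
        && mv "$tmp" "$pom"
}

# Example calls (run from the Gora source root):
#   set_pom_property pom.xml hadoop-2.version 2.7.1
#   set_pom_property pom.xml hbase.version 0.98.13-hadoop2
#   set_pom_property pom.xml hbase.test.version 0.98.13-hadoop2
```

Checking the result with `grep version pom.xml` before building is a cheap way to confirm that only the intended properties changed.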

    3. Build Gora

    mvn clean install -DskipTests
    mvn install -DskipTests

    4. Edit $NUTCH_HOME/conf/nutch-site.xml

    <configuration>
    <property>
    	<name>storage.data.store.class</name>
    	<value>org.apache.gora.hbase.store.HBaseStore</value>
    	<description>Default class for storing data</description>
    </property>
    <property>
    	<name>http.agent.name</name>
    	<value>My Nutch Spider</value>
    </property>
    <property>
    	<name>plugin.includes</name>
    	<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    </configuration>

    5. Edit $NUTCH_HOME/ivy/ivy.xml

    Change the rev of every "org.apache.gora" dependency to 0.6. For example:

    <dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" /> =>
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.6" conf="*->default" />

    Remove the "org.apache.hadoop" dependencies and add:

    <dependency org="org.apache.hadoop" name="hadoop-client" rev="2.7.1" conf="*->default"/> 
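The rev change for the Gora dependencies can be applied in one pass rather than line by line. This is only a sketch, assuming every `<dependency>` element in ivy.xml is written on a single line; the function name `bump_gora_rev` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: set rev="..." on every single-line dependency whose
# org is org.apache.gora, leaving all other dependencies untouched.
bump_gora_rev() {
    ivy="$1"; new_rev="$2"
    tmp="${ivy}.tmp.$$"
    # The address restricts the substitution to lines mentioning the Gora org.
    sed "/org=\"org.apache.gora\"/ s|rev=\"[^\"]*\"|rev=\"${new_rev}\"|" "$ivy" > "$tmp" \
        && mv "$tmp" "$ivy"
}

# Example call:
#   bump_gora_rev "$NUTCH_HOME/ivy/ivy.xml" 0.6
```

Dependencies that are commented out in ivy.xml are rewritten too if they match, which is harmless but worth knowing when reviewing the diff.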
    

    6. Edit $NUTCH_HOME/ivy/ivysettings.xml

    <ivysettings>     
    <settings defaultResolver="default"/>     
    <property name="m2-pattern" value="${user.home}/.m2/repository/[organisation]/[module]/[revision]/[module]-[revision](-[classifier]).[ext]" override="false" />     
    <resolvers>         
    <chain name="default">             
    <filesystem name="local-maven2" m2compatible="true" >                 
    <artifact pattern="${m2-pattern}"/>                 
    <ivy pattern="${m2-pattern}"/>             
    </filesystem>             
    <ibiblio name="central" m2compatible="true"/>         
    </chain>     
    </resolvers> 
    </ivysettings> 

    7. Add the following to $NUTCH_HOME/conf/gora.properties

    gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
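If this step is repeated, appending the line blindly leaves duplicate entries in the file. A minimal sketch of an idempotent variant follows; the function name `set_gora_datastore` is hypothetical, and commented-out lines in the file are deliberately left alone.

```shell
#!/bin/sh
# Hypothetical helper: make sure gora.properties holds exactly one active
# gora.datastore.default line, set to the given store class.
set_gora_datastore() {
    props="$1"; store="$2"
    touch "$props"
    tmp="${props}.tmp.$$"
    # Drop any existing active assignment; grep exits non-zero when nothing
    # is left to print, which is not an error here.
    grep -v '^gora\.datastore\.default=' "$props" > "$tmp" || true
    printf '%s=%s\n' "gora.datastore.default" "$store" >> "$tmp"
    mv "$tmp" "$props"
}

# Example call:
#   set_gora_datastore "$NUTCH_HOME/conf/gora.properties" org.apache.gora.hbase.store.HBaseStore
```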

    8. Edit $NUTCH_HOME/conf/regex-urlfilter.txt and $NUTCH_HOME/conf/nutch-default.xml as needed

    This is optional; both files can be left unchanged.

    9. Build. This takes a very long time.

    ant runtime

    10. Copy the hadoop*.jar files from the Gora tree into runtime/local/lib/

    cp /disk/gora/gora-core/lib/hadoop* /disk2/nutch/nutch-2.3/runtime/local/lib/

    11. Create the seed URL list

    mkdir urls
    echo http://nutch.apache.org/ >> urls/seek.txt

    12. Test run

    cd runtime/local/

    bin/nutch inject urls/seek.txt


    Deploying and running Solr 5.2.1

    1. Download and unpack

    2. example/example-DIH contains a complete Solr home configuration; copy it into server/solr

    cp -rf /disk2/solr/solr-5.2.1/example/example-DIH/solr/* /disk2/solr/solr-5.2.1/server/solr/

    3. Work around an error Nutch may hit at run time: Error 404: Prob accessing /solr/solr/update. Reason: Not Found

    cd /disk2/solr/solr-5.2.1/server/solr

    cp /disk2/solr/solr-5.2.1/example/exampledocs/monitor.xml .

    curl http://127.0.0.1:8983/solr/solr/update --data-binary @monitor.xml -H 'Content-type:application/xml'

    3. For nutch crawl to run, also edit /disk2/solr/solr-5.2.1/server/solr/solr/conf/schema.xml and add:

    	<field name="host" type="string" stored="false" indexed="true"/>
    	<field name="site" type="string" stored="false" indexed="true"/>
    	<field name="cache" type="string" stored="true" indexed="false"/>
    	<field name="digest" type="string" stored="true" indexed="false"/>
    	<field name="segment" type="string" stored="true" indexed="false"/>
    	<field name="boost" type="float" stored="true" indexed="false"/>
    	<field name="tstamp" type="date" stored="true" indexed="false"/>
    	<field name="stamp" type="date" stored="true" indexed="false"/>  
    	<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>  
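Before starting a crawl it can help to confirm that none of these fields was missed. The sketch below assumes each `<field>` element is on one line with its `name` attribute inside the opening tag; `missing_schema_fields` is a hypothetical helper, not a Solr or Nutch tool.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every field from the list above that
# is not declared in the given schema.xml (one name per line, nothing if all
# are present).
missing_schema_fields() {
    schema="$1"
    for f in host site cache digest segment boost tstamp stamp anchor; do
        # The quote before the name keeps "stamp" from matching "tstamp".
        grep -q "<field[^>]*name=\"${f}\"" "$schema" || echo "$f"
    done
}

# Example call:
#   missing_schema_fields /disk2/solr/solr-5.2.1/server/solr/solr/conf/schema.xml
```

Empty output means all nine fields are declared; any names printed still need to be added to schema.xml.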
    
    4. bin/solr start

    5. http://192.168.1.106:8983/solr

    6. bin/crawl urls/seek.txt TestCrawl http://192.168.1.106:8983/solr/solr 2


    FAQ

    Below are the infuriating problems I ran into along the way...

    1. Error: Could not find or load main class org.apache.nutch.crawl.InjectorJob:
    Cause: ant runtime was not run.

    2. Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration

    Nutch 2.3 needs the hbase-common*.jar, hbase-client*.jar and hbase-protocol*.jar files from HBase 0.98.13. Do not use the ones from HBase 1.0.1.1.
    cd /disk2/hbase/hbase-0.98.13-hadoop2/lib
    cp hbase-common* /disk2/nutch/nutch-2.3/runtime/local/lib/

    cp hbase-client-0.98.13-hadoop2.jar /disk2/nutch/nutch-2.3/runtime/local/lib/
    cp hbase-protocol* /disk2/nutch/nutch-2.3/runtime/local/lib/
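Since this error and the next one both come from mixed HBase versions, it may be worth scanning the lib directory for stray jars after copying. This is a rough sketch under the assumption that the version string appears in each jar's file name; `mismatched_hbase_jars` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: list hbase-*.jar files in a lib directory whose file
# name does not contain the expected version string.
mismatched_hbase_jars() {
    lib_dir="$1"; want="$2"
    for jar in "$lib_dir"/hbase-*.jar; do
        [ -e "$jar" ] || continue            # the glob matched nothing
        name=$(basename "$jar")
        case "$name" in
            *-"$want"*.jar) ;;               # expected version, keep quiet
            *) echo "$name" ;;               # anything else is reported
        esac
    done
}

# Example call:
#   mismatched_hbase_jars /disk2/nutch/nutch-2.3/runtime/local/lib 0.98.13-hadoop2
```

Any jar it prints is a candidate for removal from runtime/local/lib/.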

    3. Exception in thread "main" java.lang.NoSuchFieldError: HBASE_CLIENT_PREFETCH_LIMIT
    Same cause as above: the HBase and Nutch versions do not match.

    4. 2015-07-21 13:53:53,238 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

    Just give it the native libraries:
    mkdir -p /disk2/nutch/nutch-2.3/runtime/local/lib/native/Linux-amd64-64/
    cd /disk2/hadoop/hadoop-2.7.1/lib/native/
    cp * /disk2/nutch/nutch-2.3/runtime/local/lib/native/Linux-amd64-64/
    cp * /disk2/nutch/nutch-2.3/runtime/local/lib/native/


    
    
    
  • Original article: https://www.cnblogs.com/yangykaifa/p/6781351.html