• Running Nutch 2.3 in Eclipse



    Reference: http://wiki.apache.org/nutch/RunNutchInEclipse


    I. Environment setup

    1. Download the Nutch 2.3 source code

        wget http://mirror.bit.edu.cn/apache/nutch/2.3/apache-nutch-2.3-src.tar.gz

    Or check out the latest version under development:

        svn co https://svn.apache.org/repos/asf/nutch/branches/2.x


    2. Choose the storage backend. This walkthrough uses HBase.
    Add the following property to conf/nutch-site.xml:

        <property>
          <name>storage.data.store.class</name>
          <value>org.apache.gora.hbase.store.HBaseStore</value>
          <description>Default class for storing data</description>
        </property>


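    3. Enable the HBase dependency in ivy/ivy.xml. The entry is already present but commented out, so removing the comment markers is enough:

        <dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

    Note: rev=0.5 corresponds to HBase 0.94, and rev=0.3 corresponds to HBase 0.90.4.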


    4. Add the following 3 properties to conf/nutch-site.xml

        <property>
          <name>http.agent.name</name>
          <value>My Nutch Spider</value>
        </property>
        <property>
          <name>http.robots.agents</name>
          <value>none</value>
        </property>
        <property>
          <name>plugin.folders</name>
          <value>/Users/liaoliuqing/0_Search/1_Nutch/1_Official/apache-nutch-2.3/build/plugins</value>
        </property>

    The value of plugin.folders is $NUTCH_HOME/build/plugins, written as an absolute path; replace the example path above with the one on your machine.


    5. Run ant eclipse in the Nutch source root to generate the Eclipse project files.


    II. Importing the project

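    1. Import the project into Eclipse.


    2. In the project's build path (Order and Export tab), move apache-nutch-2.3/conf to the top of the list by selecting it and clicking the Top button.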



    III. Running the program

    1. Open Run As > Run Configurations and select the project and the main class (the output in step 3 comes from InjectorJob).


    2. Fill in the arguments

    Program arguments (path to the seed file):

        /Users/liaoliuqing/Downloads/seed.txt

    VM arguments:

        -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
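
    The seed.txt path passed above has to exist before the run. As a minimal sketch, assuming the seed list is a plain text file with one URL per line, and assuming it contained http://www.163.com/ (inferred from the row key in the HBase scan at the end of this post, not stated explicitly), such a file could be produced as follows. The class name SeedFileSketch and the default file name are illustrative only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: writes a seed list, one URL per line.
final class SeedFileSketch {

    static Path writeSeedFile(Path seedFile, List<String> urls) throws IOException {
        // Create the parent directory if it does not exist yet.
        Path parent = seedFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // Files.write puts each list element on its own line.
        return Files.write(seedFile, urls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass a different path as the first argument.
        Path seed = Paths.get(args.length > 0 ? args[0] : "seed.txt");
        // Assumed seed URL, inferred from the scan output in this post.
        writeSeedFile(seed, Arrays.asList("http://www.163.com/"));
        System.out.println("Wrote seed list to " + seed.toAbsolutePath());
    }
}
```

    The resulting path is what goes into the program arguments field of the run configuration.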


    3. Click Run. The output looks like this:

    InjectorJob: starting at 2015-01-28 16:27:43
    InjectorJob: Injecting urlDir: /Users/liaoliuqing/Downloads/seed.txt
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    InjectorJob: total number of urls rejected by filters: 0
    InjectorJob: total number of urls injected after normalization and filtering: 1
    Injector: finished at 2015-01-28 16:27:47, elapsed: 00:00:04


    Note: HBase must already be running on the local machine before you run the program.


    4. Inspect the data in HBase

        hbase(main):003:0> scan 'webpage'
        ROW                                         COLUMN+CELL
         com.163.www:http/                          column=f:fi, timestamp=1422433667377, value=\x00'\x8D\x00
         com.163.www:http/                          column=f:ts, timestamp=1422433667377, value=\x00\x00\x01K/\xA7:\x14
         com.163.www:http/                          column=mk:_injmrk_, timestamp=1422433667377, value=y
         com.163.www:http/                          column=mk:dist, timestamp=1422433667377, value=0
         com.163.www:http/                          column=mtdt:_csh_, timestamp=1422433667377, value=?\x80\x00\x00
         com.163.www:http/                          column=s:s, timestamp=1422433667377, value=?\x80\x00\x00
        1 row(s) in 0.2970 seconds
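
    The row key com.163.www:http/ in the scan is the injected URL with its host labels reversed, followed by the scheme and the path. Nutch 2.x has its own helper for this (TableUtil.reverseUrl, as far as I recall); the class below is a standalone approximation written for illustration, not Nutch's code, and the name RowKeySketch is made up.

```java
import java.net.URI;

// Illustrative only: approximates how a URL maps to a reversed-host row key.
final class RowKeySketch {

    static String reverseUrl(String urlString) {
        URI uri = URI.create(urlString);

        // Reverse the dot-separated host labels: www.163.com -> com.163.www
        String[] labels = uri.getHost().split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }

        // Append the scheme, then the port only when one is given explicitly.
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }

        // Append the path and, if present, the query string unchanged.
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // prints "com.163.www:http/"
        System.out.println(reverseUrl("http://www.163.com/"));
    }
}
```

    Keeping the host reversed means pages from the same domain sort next to each other in the table, which is why the key looks unusual at first glance.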






  • Original article: https://www.cnblogs.com/jpfss/p/7885887.html