• Importing Nutch 1.2 into NetBeans


    How to import Nutch 1.2 into NetBeans on Windows.

    Test environment:

       Nutch 1.2

       NetBeans 7.4

       Java 1.8.0_20

       Cygwin

    Installation steps:

      1. Install Cygwin

      • Download Nutch 1.2 (download address: http://archive.apache.org/dist/nutch/)

      • Install it by following the tutorial (tutorial address: http://wiki.apache.org/nutch/NutchTutorial)

     The environment variables are configured as follows:

               classpath: .;%JAVA_HOME%\lib;%JAVA_HOME%\lib\tools.jar

               JAVA_HOME: D:\Program Files\Java\jdk1.8.0_20

               CATALINA_HOME: D:\terry\Software\Java\apache-tomcat-7.0.55

               ANT_HOME: D:\terry\Work\Java\apache-ant-1.9.4-bin\apache-ant-1.9.4

               Path: %JAVA_HOME%\bin;%CATALINA_HOME%\bin;%CATALINA_HOME%\lib;%ANT_HOME%\bin;%JAVA_HOME%\jre\bin;

               System environment variables: Path: C:\Windows\System32;D:\cygwin\bin;
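     Before starting NetBeans, it can help to confirm that these variables are visible to the JVM and point at real directories. The following is only a minimal sketch: the class name EnvCheck and the helper missingHomes are hypothetical, and the variable names are simply the ones listed above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class EnvCheck {
    /** Returns the names of variables that are unset or do not point to an existing directory. */
    static List<String> missingHomes(Map<String, String> env, String... names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            String value = env.get(name);
            if (value == null || !new File(value).isDirectory()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Variable names taken from the list above; adjust to your own setup.
        List<String> missing = missingHomes(System.getenv(), "JAVA_HOME", "ANT_HOME", "CATALINA_HOME");
        System.out.println(missing.isEmpty()
                ? "All variables point to existing directories."
                : "Unset or invalid: " + missing);
    }
}
```

     Running it from a freshly opened command prompt shows whether a change made in the Windows environment dialog has actually taken effect.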

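    2. Create the project in NetBeans

    Creating the project:

               Start NetBeans 7.4 and choose File -> New Project -> General;

      • Select "Java Project with Existing Sources" -> Next, then choose a project name and project folder to suit your needs;

      • Click Next, click the "Add Folder" button to the right of the "Source Package Folders" box, and browse to the src folder under the Nutch installation directory;

      • Click Finish. The project now exists, but it still has to be configured before the code can be debugged;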

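           Importing the files and JARs:

      • In the "Projects" window on the left, right-click the Libraries node and choose Properties; the Project Properties dialog opens

      • Select the "Libraries" category, click "Add JAR/Folder", and add everything under the "conf" directory of the Nutch installation directory;

      • Then add every JAR in the "lib" and "plugin" folders of the Nutch installation directory. This step is tedious: NetBeans does not scan a folder for the JARs it contains, so each one has to be added by hand.

      At this point you can right-click the project icon in the "Projects" window and choose "Build Project"; the project should compile without errors. Nutch still has to be configured before its crawl code can be debugged.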
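      Since the JARs have to be added one by one, a full list of them is handy to work from. The sketch below walks one or more directories recursively and prints every .jar file it finds; the class name JarLister and the method findJars are hypothetical, and which directories to pass (for example the Nutch "lib" and "plugin" folders) is up to you.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class JarLister {
    /** Recursively collects every *.jar file below the given root, sorted by path. */
    static List<Path> findJars(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".jar"))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is a directory to scan.
        for (String dir : args) {
            for (Path jar : findJars(Paths.get(dir))) {
                System.out.println(jar.toAbsolutePath());
            }
        }
    }
}
```

      The printed list can then be used as a checklist while selecting files in the "Add JAR/Folder" dialog.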

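     Configuring Nutch

      In the Nutch installation directory, open conf/nutch-default.xml and change the value of the "plugin.folders" property to "<Nutch installation directory>/src/plugin".

            Running

     Running the Nutch crawler in NetBeans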

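      • Right-click the project icon in the "Projects" window, choose "Properties", select "Run" in the Categories pane on the left, and configure the run parameters

      • Main Class: select org.apache.nutch.crawl.Crawl

      • Arguments: enter urls/microsoft.txt -dir crawl -depth 3 -topN 50

      • VM Options: enter -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

      • Click "OK" to close the dialog

      • Right-click the project icon in the "Projects" window and choose "Run Project"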
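      To make the argument string easier to read, the sketch below splits it into named options. It only follows the usage string that Crawl prints (shown further down) and is not the parsing code Nutch itself uses; the class name CrawlArgs and the key "urlDir" are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CrawlArgs {
    /** Splits Crawl-style arguments into a map; the bare argument is stored under "urlDir". */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("Missing value for " + args[i]);
                }
                // Store the flag without its leading dash and consume its value.
                options.put(args[i].substring(1), args[++i]);
            } else {
                options.put("urlDir", args[i]);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        String line = "urls/microsoft.txt -dir crawl -depth 3 -topN 50";
        // prints {urlDir=urls/microsoft.txt, dir=crawl, depth=3, topN=50}
        System.out.println(parse(line.split("\\s+")));
    }
}
```

      Read this way, the run configuration above names the URL list to inject (urls/microsoft.txt), the output directory (crawl), a depth of 3, and a topN of 50.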

     Errors encountered while running, and how to fix them:

      1. WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

    2014-10-17 13:20:45,384 INFO  crawl.Crawl - crawl started in: crawl

    2014-10-17 13:20:45,384 INFO  crawl.Crawl - rootUrlDir = urls/microsoft_de-DE.txt
    2014-10-17 13:20:45,384 INFO  crawl.Crawl - threads = 10
    2014-10-17 13:20:45,384 INFO  crawl.Crawl - depth = 3
    2014-10-17 13:20:45,384 INFO  crawl.Crawl - indexer=lucene
    2014-10-17 13:20:45,384 INFO  crawl.Crawl - topN = 50
    2014-10-17 13:20:59,721 INFO  crawl.Injector - Injector: starting at 2014-10-17 13:20:55
    2014-10-17 13:21:00,719 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
    2014-10-17 13:21:01,530 INFO  crawl.Injector - Injector: urlDir: urls/microsoft_de-DE.txt
    2014-10-17 13:21:11,187 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
    2014-10-17 13:21:45,241 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    2014-10-17 13:21:45,850 WARN  mapred.JobClient - No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

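    Cause: the Mapper and Reducer classes cannot be found.

    Solution:

      Export this project as a JAR, place it in the project root directory, and in Crawl.java set mapred.jar to that JAR right after the configuration is obtained.

      conf.set("mapred.jar", "NutchSource.jar");

      Here NutchSource.jar is the JAR produced by compiling the Nutch project.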

     The code is as follows:

      public static void main(String args[]) throws Exception {

        if (args.length < 1) {
          System.out.println
          ("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]" +
            " [-solr solrURL]");
          return;
        }

        Configuration conf = NutchConfiguration.createCrawlConfiguration();
        // Added line: tell Hadoop which JAR holds the job classes.
        conf.set("mapred.jar", "NutchSource.jar");
        JobConf job = new NutchJob(conf);
        // ... the rest of Crawl.main() is unchanged ...
      }
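    The JAR name passed to mapred.jar is a relative path, so it is presumably resolved against the working directory of the run, which for a NetBeans project is normally the project root. A small check like the one below can confirm the file is where it is expected before a crawl is started. This is a sketch under that assumption; the class name JobJarLocator and the method requireJar are hypothetical and not part of Nutch or Hadoop.

```java
import java.io.File;

final class JobJarLocator {
    /** Resolves jarName against baseDir and fails fast if the file is missing. */
    static String requireJar(File baseDir, String jarName) {
        File jar = new File(baseDir, jarName);
        if (!jar.isFile()) {
            throw new IllegalStateException("Job jar not found: " + jar.getAbsolutePath());
        }
        return jar.getAbsolutePath();
    }

    public static void main(String[] args) {
        // "user.dir" is the working directory the JVM was started in.
        File workingDir = new File(System.getProperty("user.dir"));
        try {
            System.out.println("mapred.jar candidate: " + requireJar(workingDir, "NutchSource.jar"));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

    If the check fails, either copy the exported JAR into the reported directory or pass an absolute path to conf.set instead.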

      2. WARN  plugin.PluginRepository - Plugins: directory not found: plugins

    2014-10-17 13:43:29,729 INFO  crawl.Crawl - crawl started in: crawl
    2014-10-17 13:43:29,741 INFO  crawl.Crawl - rootUrlDir = urls/microsoft_de-DE.txt
    2014-10-17 13:43:29,743 INFO  crawl.Crawl - threads = 10
    2014-10-17 13:43:29,744 INFO  crawl.Crawl - depth = 3
    2014-10-17 13:43:29,745 INFO  crawl.Crawl - indexer=lucene
    2014-10-17 13:43:29,747 INFO  crawl.Crawl - topN = 50
    2014-10-17 13:43:32,080 INFO  crawl.Injector - Injector: starting at 2014-10-17 13:43:32
    2014-10-17 13:43:32,083 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
    2014-10-17 13:43:32,085 INFO  crawl.Injector - Injector: urlDir: urls/microsoft_de-DE.txt
    2014-10-17 13:43:32,220 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
    2014-10-17 13:43:45,367 WARN  mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    2014-10-17 13:43:57,923 WARN  plugin.PluginRepository - Plugins: directory not found: plugins
    2014-10-17 13:43:57,924 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
    2014-10-17 13:43:57,924 INFO  plugin.PluginRepository - Registered Plugins:
    2014-10-17 13:43:57,929 INFO  plugin.PluginRepository - NONE
    2014-10-17 13:43:57,930 INFO  plugin.PluginRepository - Registered Extension-Points:
    2014-10-17 13:43:57,930 INFO  plugin.PluginRepository - NONE
    2014-10-17 13:43:58,010 WARN  mapred.LocalJobRunner - job_local_0001
    java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
    Caused by: java.lang.reflect.InvocationTargetException

    Cause: the plugins directory cannot be found.

    Solution: check whether the value of plugin.folders in nutch-default.xml (in the conf directory) actually exists under the project path.

    <property>
      <name>plugin.folders</name>
      <value>plugins</value>
      <description>Directories where nutch plugins are located.  Each
      element may be a relative or absolute path.  If absolute, it is used
      as is.  If relative, it is searched for on the classpath.</description>

    </property> 

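    Change this value to the project's src/plugin directory. Because a relative value is searched for on the classpath, an absolute path (for example <Nutch installation directory>/src/plugin) is the safer choice.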
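    To see which value is currently in effect without opening the file, the property can be read with the JDK's built-in XML parser. This is a minimal sketch: the class name PluginFoldersCheck and the method readProperty are hypothetical, and it assumes plugin.folders holds a single directory rather than a list.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PluginFoldersCheck {
    /** Returns the <value> of the named <property>, or null if the property is absent. */
    static String readProperty(File configFile, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(configFile);
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            NodeList names = property.getElementsByTagName("name");
            NodeList values = property.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to nutch-default.xml
        String value = readProperty(new File(args[0]), "plugin.folders");
        if (value == null) {
            System.out.println("plugin.folders is not set");
        } else if (new File(value).isAbsolute()) {
            System.out.println(value + (new File(value).isDirectory() ? " exists" : " does NOT exist"));
        } else {
            System.out.println(value + " is relative and will be searched for on the classpath");
        }
    }
}
```

    With the default value "plugins" the last branch is taken, which matches the "directory not found: plugins" warning in the log when no such folder is on the classpath.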

  • Original article: https://www.cnblogs.com/abcdwxc/p/4031158.html