• An In-Depth Look at the Injector Job  Category: H3_NUTCH  2015-03-10 15:44




    The Injector Job creates a table in HBase based on the crawlId and injects the seed URLs from a text file into that table.
    (1) Running the command
    1. Run the command
    [jediael@master local]$ bin/nutch inject seeds/ -crawlId sourcetest
    InjectorJob: starting at 2015-03-10 14:59:19
    InjectorJob: Injecting urlDir: seeds
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    InjectorJob: total number of urls rejected by filters: 0
    InjectorJob: total number of urls injected after normalization and filtering: 1
    Injector: finished at 2015-03-10 14:59:26, elapsed: 00:00:06

    2. Inspect the table contents
    hbase(main):004:0> scan 'sourcetest_webpage'
    ROW                                       COLUMN+CELL                                                                                                           
     com.163.money:http/                      column=f:fi, timestamp=1425970761871, value=\x00'\x8D\x00
     com.163.money:http/                      column=f:ts, timestamp=1425970761871, value=\x00\x00\x01L\x02{\x08_
     com.163.money:http/                      column=mk:_injmrk_, timestamp=1425970761871, value=y
     com.163.money:http/                      column=mk:dist, timestamp=1425970761871, value=0
     com.163.money:http/                      column=mtdt:_csh_, timestamp=1425970761871, value=?\x80\x00\x00
     com.163.money:http/                      column=s:s, timestamp=1425970761871, value=?\x80\x00\x00
    1 row(s) in 0.0430 seconds
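    The cell values in the scan are raw bytes: the HBase shell prints printable bytes as ASCII characters and everything else as hex escapes. As a minimal sketch, assuming the store writes numbers in big-endian order (the decoded values agree with the readdb dump in step 3, which supports that assumption), the three numeric cells can be decoded as follows. The class and method names are invented for illustration, and the byte arrays were transcribed by hand from the scan output (' is 0x27, L is 0x4C, { is 0x7B, _ is 0x5F, ? is 0x3F).

```java
import java.nio.ByteBuffer;

// Illustrative helper, not part of Nutch, Gora or HBase.
final class HBaseCellDecodeSketch {
    // ByteBuffer reads big-endian by default, which is the assumption here.
    static int decodeInt(byte[] raw) {
        return ByteBuffer.wrap(raw).getInt();
    }

    static long decodeLong(byte[] raw) {
        return ByteBuffer.wrap(raw).getLong();
    }

    static float decodeFloat(byte[] raw) {
        return ByteBuffer.wrap(raw).getFloat();
    }

    public static void main(String[] args) {
        // f:fi (fetch interval), 4 bytes
        byte[] fi = {0x00, 0x27, (byte) 0x8D, 0x00};
        // f:ts (fetch time in milliseconds), 8 bytes
        byte[] ts = {0x00, 0x00, 0x01, 0x4C, 0x02, 0x7B, 0x08, 0x5F};
        // s:s (score), 4 bytes
        byte[] score = {0x3F, (byte) 0x80, 0x00, 0x00};

        System.out.println(decodeInt(fi));      // prints 2592000
        System.out.println(decodeLong(ts));     // prints 1425970759775
        System.out.println(decodeFloat(score)); // prints 1.0
    }
}
```

    The decoded numbers are the same fetchInterval, fetchTime and score that readdb reports, so the dump command is simply a more convenient way to read them.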

    3. Read the database contents
    HBase stores the cell values as raw bytes, so use the following command to see them in readable form.
    [jediael@master local]$ bin/nutch readdb  -dump ./test -crawlId sourcetest -content
    WebTable dump: starting
    WebTable dump: done
    [jediael@master local]$ cat test/part-r-00000
    http://money.163.com/   key:    com.163.money:http/
    baseUrl:        null
    status: 0 (null)
    fetchTime:      1425970759775
    prevFetchTime:  0
    fetchInterval:  2592000
    retriesSinceFetch:      0
    modifiedTime:   0
    prevModifiedTime:       0
    protocolStatus: (null)
    parseStatus:    (null)
    title:  null
    score:  1.0
    marker _injmrk_ :       y
    marker dist :   0
    reprUrl:        null
    metadata _csh_ :        ?\x80\x00\x00
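    The dump pairs the URL http://money.163.com/ with the row key com.163.money:http/. The sketch below reproduces that key format as it appears in the output: host labels in reverse order, then the scheme, then an optional port, then the path and query. It is an illustration written against java.net.URI, not the code of TableUtil.reverseUrl, and the class name is invented.

```java
import java.net.URI;

// Illustrative only; Nutch's own implementation lives in TableUtil.reverseUrl.
final class ReverseUrlSketch {
    static String reverseUrl(String url) {
        URI uri = URI.create(url);

        // Reverse the dot-separated host labels: money.163.com -> com.163.money
        String[] labels = uri.getHost().split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }

        // Scheme, then the port only when the URL states one explicitly.
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }

        // Path and query are appended unchanged.
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseUrl("http://money.163.com/")); // prints com.163.money:http/
    }
}
```

    Putting the reversed host first keeps all pages of one domain next to each other in the sorted HBase key space, which makes per-domain scans cheap.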


    (2) Source code walkthrough
    Class: org.apache.nutch.crawl.InjectorJob
    1. Program entry point
     
    public static void main(String[] args) throws Exception {
      int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(),
          args);
      System.exit(res);
    }

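    2. ToolRunner.run(String[] args)
    ToolRunner invokes InjectorJob.run(String[] args). This step mainly calls the inject method; the rest of the method only validates the arguments.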
     
    public int run(String[] args) throws Exception {
      …………
        inject(new Path(args[0]));
       …………
      }


    3. The inject() method
    Nutch runs every concrete job through Map<String, Object> run(Map<String, Object> args): the method takes its parameters as a Map and returns its results as a Map.
    public void inject(Path urlDir) throws Exception {
      run(ToolUtil.toArgMap(Nutch.ARG_SEEDDIR, urlDir));
    }
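    The Map-in, Map-out convention can be sketched without any Nutch dependency. Everything below is a stand-in written for illustration: toArgMap only imitates the idea of turning alternating key/value arguments into a Map, the key string "seedDir" is an assumed value for Nutch.ARG_SEEDDIR, and the "seedDirSeen" result entry is invented.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in classes for illustration; not the Nutch ToolUtil or InjectorJob.
final class ArgMapSketch {
    // Assumed key name; the excerpt only shows the constant Nutch.ARG_SEEDDIR.
    static final String ARG_SEEDDIR = "seedDir";

    // Build a Map from alternating key/value arguments.
    static Map<String, Object> toArgMap(Object... keyValues) {
        if (keyValues.length % 2 != 0) {
            throw new IllegalArgumentException("expected key/value pairs");
        }
        Map<String, Object> args = new HashMap<>();
        for (int i = 0; i < keyValues.length; i += 2) {
            args.put(String.valueOf(keyValues[i]), keyValues[i + 1]);
        }
        return args;
    }

    // A job reads what it needs from the argument Map and reports back in a result Map.
    static Map<String, Object> run(Map<String, Object> args) {
        Object seedDir = args.get(ARG_SEEDDIR);
        Map<String, Object> results = new HashMap<>();
        results.put("seedDirSeen", String.valueOf(seedDir)); // hypothetical result entry
        return results;
    }

    public static void main(String[] args) {
        Map<String, Object> results = run(toArgMap(ARG_SEEDDIR, "seeds/"));
        System.out.println(results.get("seedDirSeen")); // prints seeds/
    }
}
```

    A uniform run(Map) signature lets callers drive any job the same way, whatever arguments that particular job needs.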


    
    

    4. Job configuration and creation of the HBase table
    public Map<String, Object> run(Map<String, Object> args) throws Exception {
        // Excerpt: `input` is the seed directory Path read from args
        // (key Nutch.ARG_SEEDDIR); that lookup is omitted here.
        numJobs = 1;
        currentJobNum = 0;
        currentJob = new NutchJob(getConf(), "inject " + input);
        FileInputFormat.addInputPath(currentJob, input);
        currentJob.setMapperClass(UrlMapper.class);
        // The mapper emits (reversed URL, WebPage) pairs.
        currentJob.setMapOutputKeyClass(String.class);
        currentJob.setMapOutputValueClass(WebPage.class);
        currentJob.setOutputFormatClass(GoraOutputFormat.class);

        // Create the Gora data store; with the HBase backend this creates the table.
        DataStore<String, WebPage> store = StorageUtils.createWebStore(
            currentJob.getConfiguration(), String.class, WebPage.class);
        GoraOutputFormat.setOutput(currentJob, store, true);

        // Map-only job: with zero reduce tasks the mapper output goes straight to the store.
        currentJob.setReducerClass(Reducer.class);
        currentJob.setNumReduceTasks(0);

        currentJob.waitForCompletion(true);
        ToolUtil.recordJobStatus(null, currentJob, results);
        return results;
    }
    
    
      


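    5. The mapper
    The Injector Job has no reducer, so the mapper is the only part that needs attention.
    The mapper does the following:
    (1) Parses each line of the seed file and extracts its parameters.
    (2) Normalizes the URL and runs it through the URL filters.
    (3) Uses the reversed URL as the key, builds a WebPage object as the value, and writes the pair to the table.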
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String url = value.toString().trim(); // value is line of text
    
          if (url != null && (url.length() == 0 || url.startsWith("#"))) {
            /* Ignore line that start with # */
            return;
          }
    
          // if tabs : metadata that could be stored
          // must be name=value and separated by \t
          float customScore = -1f;
          int customInterval = interval;
          Map<String, String> metadata = new TreeMap<String, String>();
          if (url.indexOf("\t") != -1) {
            String[] splits = url.split("\t");
            url = splits[0];
            for (int s = 1; s < splits.length; s++) {
              // find separation between name and value
              int indexEquals = splits[s].indexOf("=");
              if (indexEquals == -1) {
                // skip anything without a =
                continue;
              }
              String metaname = splits[s].substring(0, indexEquals);
              String metavalue = splits[s].substring(indexEquals + 1);
              if (metaname.equals(nutchScoreMDName)) {
                try {
                  customScore = Float.parseFloat(metavalue);
                } catch (NumberFormatException nfe) {
                }
              } else if (metaname.equals(nutchFetchIntervalMDName)) {
                try {
                  customInterval = Integer.parseInt(metavalue);
                } catch (NumberFormatException nfe) {
                }
              } else
                metadata.put(metaname, metavalue);
            }
          }
          try {
            url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
            url = filters.filter(url); // filter the url
          } catch (Exception e) {
            LOG.warn("Skipping " + url + ":" + e);
            url = null;
          }
          if (url == null) {
            context.getCounter("injector", "urls_filtered").increment(1);
            return;
          } else { // if it passes
            String reversedUrl = TableUtil.reverseUrl(url); // collect it
            WebPage row = WebPage.newBuilder().build();
            row.setFetchTime(curTime);
            row.setFetchInterval(customInterval);
    
            // now add the metadata
            Iterator<String> keysIter = metadata.keySet().iterator();
            while (keysIter.hasNext()) {
              String keymd = keysIter.next();
              String valuemd = metadata.get(keymd);
              row.getMetadata().put(new Utf8(keymd),
                  ByteBuffer.wrap(valuemd.getBytes()));
            }
    
            if (customScore != -1)
              row.setScore(customScore);
            else
              row.setScore(scoreInjected);
    
            try {
              scfilters.injectedScore(url, row);
            } catch (ScoringFilterException e) {
              if (LOG.isWarnEnabled()) {
                LOG.warn("Cannot filter injected score for url " + url
                    + ", using default (" + e.getMessage() + ")");
              }
            }
            context.getCounter("injector", "urls_injected").increment(1);
            row.getMarkers()
                .put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
            Mark.INJECT_MARK.putMark(row, YES_STRING);
            context.write(reversedUrl, row);
          }
        }
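    The parsing half of the mapper (step 1 above) can be tried out in isolation. The following is a simplified sketch, not the Nutch code: it handles one seed line of the form url, then tab-separated name=value pairs. The key names "nutch.score" and "nutch.fetchInterval" are assumed values for nutchScoreMDName and nutchFetchIntervalMDName, which the excerpt references without showing, and the class name is invented. Unlike the mapper, it falls back to the defaults directly instead of using -1 as a sentinel.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative parser for one seed-file line; not part of Nutch.
final class SeedLineSketch {
    // Assumed metadata names for the custom score and fetch interval.
    static final String SCORE_KEY = "nutch.score";
    static final String INTERVAL_KEY = "nutch.fetchInterval";

    final String url;
    final float score;
    final int fetchInterval;
    final Map<String, String> metadata;

    private SeedLineSketch(String url, float score, int fetchInterval,
                           Map<String, String> metadata) {
        this.url = url;
        this.score = score;
        this.fetchInterval = fetchInterval;
        this.metadata = metadata;
    }

    // Returns null for blank lines and for comment lines starting with '#'.
    static SeedLineSketch parse(String line, float defaultScore, int defaultInterval) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return null;
        }
        String[] fields = trimmed.split("\t");
        float score = defaultScore;
        int interval = defaultInterval;
        Map<String, String> metadata = new TreeMap<>();
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq == -1) {
                continue; // skip anything that is not name=value
            }
            String name = fields[i].substring(0, eq);
            String value = fields[i].substring(eq + 1);
            try {
                if (name.equals(SCORE_KEY)) {
                    score = Float.parseFloat(value);
                } else if (name.equals(INTERVAL_KEY)) {
                    interval = Integer.parseInt(value);
                } else {
                    metadata.put(name, value);
                }
            } catch (NumberFormatException e) {
                // Malformed number: keep the default, as the mapper does.
            }
        }
        return new SeedLineSketch(fields[0], score, interval, metadata);
    }

    public static void main(String[] args) {
        SeedLineSketch seed = parse(
            "http://money.163.com/\tnutch.score=2.5\tcategory=finance", 1.0f, 2592000);
        System.out.println(seed.url + " " + seed.score + " " + seed.fetchInterval
            + " " + seed.metadata);
    }
}
```

    Under these assumptions, a seed file can override the score or fetch interval per URL just by appending tab-separated pairs to the line; any other pair ends up in the row's metadata.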



    (3) Key source code to study


    Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission.

  • Original URL: https://www.cnblogs.com/lujinhong2/p/4637207.html