The Injector Job's main function is to create a table in HBase named after the crawlId, and to inject the seed URLs from a text file into that table.
(1) Command Execution
1. Run the command
[jediael@master local]$ bin/nutch inject seeds/ -crawlId sourcetest
InjectorJob: starting at 2015-03-10 14:59:19
InjectorJob: Injecting urlDir: seeds
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2015-03-10 14:59:26, elapsed: 00:00:06
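The seeds/ directory holds plain-text files with one URL per line; lines starting with # are skipped, and optional name=value metadata may follow the URL, with fields separated by tabs (this is exactly what the mapper in section (2) parses). A hypothetical seed file for illustration -- www.example.com and its values are made up, while nutch.score and nutch.fetchInterval are the reserved keys checked as nutchScoreMDName and nutchFetchIntervalMDName in the mapper:

http://money.163.com/
# comment lines are ignored
http://www.example.com/	nutch.score=2.5	nutch.fetchInterval=86400	foo=bar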
2. View the table contents
hbase(main):004:0> scan 'sourcetest_webpage'
ROW                   COLUMN+CELL
 com.163.money:http/  column=f:fi, timestamp=1425970761871, value=\x00'\x8D\x00
 com.163.money:http/  column=f:ts, timestamp=1425970761871, value=\x00\x00\x01L\x02{\x08_
 com.163.money:http/  column=mk:_injmrk_, timestamp=1425970761871, value=y
 com.163.money:http/  column=mk:dist, timestamp=1425970761871, value=0
 com.163.money:http/  column=mtdt:_csh_, timestamp=1425970761871, value=?\x80\x00\x00
 com.163.money:http/  column=s:s, timestamp=1425970761871, value=?\x80\x00\x00
1 row(s) in 0.0430 seconds
3. Read the contents from the database
Because the HBase table stores its contents as raw bytes, use the following command to dump them in readable form:
[jediael@master local]$ bin/nutch readdb -dump ./test -crawlId sourcetest -content
WebTable dump: starting
WebTable dump: done
[jediael@master local]$ cat test/part-r-00000
http://money.163.com/  key: com.163.money:http/
baseUrl: null
status: 0 (null)
fetchTime: 1425970759775
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 1.0
marker _injmrk_ : y
marker dist : 0
reprUrl: null
metadata _csh_ : ?锟
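The raw cells from the scan in step 2 line up with this dump: the f:fi value \x00'\x8D\x00 is the big-endian int 0x00278D00 = 2592000 (fetchInterval), and the s:s value ?\x80\x00\x00 is the IEEE-754 float 0x3F800000 = 1.0 (score; the 0x3F byte prints as '?'). A minimal sketch decoding them with HBase's Bytes helper (the class name DecodeCells is just for illustration):

import org.apache.hadoop.hbase.util.Bytes;

public class DecodeCells {
  public static void main(String[] args) {
    // f:fi cell \x00'\x8D\x00 -> big-endian int
    byte[] fi = { 0x00, 0x27, (byte) 0x8D, 0x00 };
    System.out.println(Bytes.toInt(fi));   // 2592000, the fetchInterval in the dump

    // s:s cell ?\x80\x00\x00 -> big-endian IEEE-754 float
    byte[] s = { 0x3F, (byte) 0x80, 0x00, 0x00 };
    System.out.println(Bytes.toFloat(s));  // 1.0, the score in the dump
  }
}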
(2) Source Code Flow Analysis
Class: org.apache.nutch.crawl.InjectorJob
1. Program entry point
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(), args);
  System.exit(res);
}
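This is the standard Hadoop Tool/ToolRunner pattern: ToolRunner strips the generic Hadoop options (-D, -files, ...) out of args, applies them to the Configuration, and then calls the tool's run(String[]) with the remaining arguments. A minimal standalone sketch of the pattern (MyTool is a made-up class, not Nutch code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // by the time we get here, ToolRunner has already applied any
    // -D key=value generic options to getConf()
    System.out.println("remaining args: " + java.util.Arrays.toString(args));
    return 0; // becomes the process exit code via System.exit below
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
  }
}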
2. run(String[] args), invoked via ToolRunner
This step mainly calls the inject method; everything else is argument validation.
public int run(String[] args) throws Exception {
  ...
  inject(new Path(args[0]));
  ...
}
3. The inject() method
Nutch runs all of its concrete jobs through Map<String, Object> run(Map<String, Object> args): jobs take a Map of arguments and return a Map of results.
public void inject(Path urlDir) throws Exception {
  run(ToolUtil.toArgMap(Nutch.ARG_SEEDDIR, urlDir));
}
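ToolUtil.toArgMap packs a varargs list of alternating keys and values into a Map. A minimal sketch of what such a helper does (an assumption about its behavior for illustration, not the actual Nutch source):

import java.util.HashMap;
import java.util.Map;

public class ArgMapSketch {
  // Sketch of a toArgMap-style helper: alternating key/value varargs -> Map.
  public static Map<String, Object> toArgMap(Object... keysAndValues) {
    if (keysAndValues.length % 2 != 0) {
      throw new IllegalArgumentException("expected alternating key/value pairs");
    }
    Map<String, Object> args = new HashMap<String, Object>();
    for (int i = 0; i < keysAndValues.length; i += 2) {
      args.put((String) keysAndValues[i], keysAndValues[i + 1]);
    }
    return args;
  }

  public static void main(String[] args) {
    // inject(urlDir) above builds a one-entry map (key name illustrative)
    System.out.println(toArgMap("seed_dir", "seeds/"));
  }
}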
4. Configuring the job and creating the HBase table
public Map<String, Object> run(Map<String, Object> args) throws Exception {
  numJobs = 1;
  currentJobNum = 0;
  currentJob = new NutchJob(getConf(), "inject " + input);
  FileInputFormat.addInputPath(currentJob, input);
  currentJob.setMapperClass(UrlMapper.class);
  currentJob.setMapOutputKeyClass(String.class);
  currentJob.setMapOutputValueClass(WebPage.class);
  currentJob.setOutputFormatClass(GoraOutputFormat.class);

  DataStore<String, WebPage> store = StorageUtils.createWebStore(
      currentJob.getConfiguration(), String.class, WebPage.class);
  GoraOutputFormat.setOutput(currentJob, store, true);

  currentJob.setReducerClass(Reducer.class);
  currentJob.setNumReduceTasks(0);

  currentJob.waitForCompletion(true);
  ToolUtil.recordJobStatus(null, currentJob, results);
  return results;
}
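The HBase table ('<crawlId>_webpage', here sourcetest_webpage) comes into existence when the Gora DataStore is instantiated: by default Gora auto-creates the schema for a new store. A minimal sketch of doing the same thing explicitly through Gora's generic DataStore API (a sketch under that auto-create assumption, not the exact Nutch code path):

import org.apache.gora.store.DataStore;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;

public class EnsureSchema {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    DataStore<String, WebPage> store = StorageUtils.createWebStore(
        conf, String.class, WebPage.class);
    if (!store.schemaExists()) { // for HBaseStore, "schema" means the HBase table
      store.createSchema();      // creates '<crawlId>_webpage' and its column families
    }
    store.close();
  }
}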
5. The mapper method
Since the Injector Job has no reducer, we only need to look at the mapper.
The mapper does three main things:
(1) Parses each line of the seed file and extracts any name=value metadata in it.
(2) Normalizes the URL and filters it through the configured URL filters.
(3) Reverses the URL to use as the row key, creates a WebPage object as the value, and writes the pair to the table (see the reverseUrl sketch after the code below).
protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String url = value.toString().trim(); // value is line of text

  if (url != null && (url.length() == 0 || url.startsWith("#"))) {
    /* Ignore line that start with # */
    return;
  }

  // if tabs : metadata that could be stored
  // must be name=value and separated by \t
  float customScore = -1f;
  int customInterval = interval;
  Map<String, String> metadata = new TreeMap<String, String>();
  if (url.indexOf("\t") != -1) {
    String[] splits = url.split("\t");
    url = splits[0];
    for (int s = 1; s < splits.length; s++) {
      // find separation between name and value
      int indexEquals = splits[s].indexOf("=");
      if (indexEquals == -1) {
        // skip anything without a =
        continue;
      }
      String metaname = splits[s].substring(0, indexEquals);
      String metavalue = splits[s].substring(indexEquals + 1);
      if (metaname.equals(nutchScoreMDName)) {
        try {
          customScore = Float.parseFloat(metavalue);
        } catch (NumberFormatException nfe) {
        }
      } else if (metaname.equals(nutchFetchIntervalMDName)) {
        try {
          customInterval = Integer.parseInt(metavalue);
        } catch (NumberFormatException nfe) {
        }
      } else
        metadata.put(metaname, metavalue);
    }
  }
  try {
    url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
    url = filters.filter(url); // filter the url
  } catch (Exception e) {
    LOG.warn("Skipping " + url + ":" + e);
    url = null;
  }
  if (url == null) {
    context.getCounter("injector", "urls_filtered").increment(1);
    return;
  } else { // if it passes
    String reversedUrl = TableUtil.reverseUrl(url); // collect it
    WebPage row = WebPage.newBuilder().build();
    row.setFetchTime(curTime);
    row.setFetchInterval(customInterval);

    // now add the metadata
    Iterator<String> keysIter = metadata.keySet().iterator();
    while (keysIter.hasNext()) {
      String keymd = keysIter.next();
      String valuemd = metadata.get(keymd);
      row.getMetadata().put(new Utf8(keymd),
          ByteBuffer.wrap(valuemd.getBytes()));
    }

    if (customScore != -1)
      row.setScore(customScore);
    else
      row.setScore(scoreInjected);

    try {
      scfilters.injectedScore(url, row);
    } catch (ScoringFilterException e) {
      if (LOG.isWarnEnabled()) {
        LOG.warn("Cannot filter injected score for url " + url
            + ", using default (" + e.getMessage() + ")");
      }
    }
    context.getCounter("injector", "urls_injected").increment(1);
    row.getMarkers().put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
    Mark.INJECT_MARK.putMark(row, YES_STRING);
    context.write(reversedUrl, row);
  }
}
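TableUtil.reverseUrl is what produces the com.163.money:http/ row key seen in the scan in step 2: the host labels are reversed so that pages from the same domain get adjacent row keys in HBase. A simplified sketch of the idea (ignoring port handling and other corner cases that the real TableUtil covers):

import java.net.URL;

public class ReverseUrlSketch {
  // Simplified take on TableUtil.reverseUrl: reverse the dotted host labels
  // so that pages of the same domain sort together in HBase.
  public static String reverseUrl(String urlString) throws Exception {
    URL url = new URL(urlString);
    String[] labels = url.getHost().split("\\.");
    StringBuilder rowKey = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) {
      rowKey.append(labels[i]);
      if (i > 0) rowKey.append('.');
    }
    // "<reversed-host>:<protocol><path>"
    rowKey.append(':').append(url.getProtocol()).append(url.getFile());
    return rowKey.toString();
  }

  public static void main(String[] args) throws Exception {
    // prints "com.163.money:http/" -- the row key seen in the scan in step 2
    System.out.println(reverseUrl("http://money.163.com/"));
  }
}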
(3) Key Source Code Study