• [Nutch 2.2.1 Basics, Part 6] The Nutch 2.2.1 Crawl Workflow

    I. Overview of the crawl workflow
    1. The Nutch crawl cycle
    When a crawl task is run with the crawl command, it executes the following jobs in order (a shell sketch of the equivalent manual loop follows the list):
    (1) InjectorJob
    First iteration begins:
    (2) GeneratorJob
    (3) FetcherJob
    (4) ParserJob
    (5) DbUpdaterJob
    (6) SolrIndexerJob
    Second iteration begins:
    (2) GeneratorJob
    (3) FetcherJob
    (4) ParserJob
    (5) DbUpdaterJob
    (6) SolrIndexerJob
    Third iteration begins:
    ...
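
    The crawl command is essentially a driver for these jobs. As a rough sketch, an equivalent manual loop looks like the following shell script; the seed directory, crawl id, Solr URL, and iteration count are illustrative assumptions, not values taken from the actual crawl script:

    #!/bin/bash
    # Sketch: inject once, then repeat the generate/fetch/parse/update/index cycle.
    SEEDDIR=urls                          # directory containing seed.txt (assumed)
    CRAWL_ID=334                          # illustrative crawl id
    SOLR_URL=http://localhost:8983/solr/  # illustrative Solr URL
    ITERATIONS=5

    bin/nutch inject $SEEDDIR -crawlId $CRAWL_ID
    for ((i=1; i<=ITERATIONS; i++)); do
        echo "Iteration $i of $ITERATIONS"
        bin/nutch generate -topN 50000 -crawlId $CRAWL_ID
        bin/nutch fetch -all -crawlId $CRAWL_ID
        bin/nutch parse -all -crawlId $CRAWL_ID
        bin/nutch updatedb -crawlId $CRAWL_ID
        bin/nutch solrindex $SOLR_URL -all -crawlId $CRAWL_ID
    done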

    2. Crawl log
    When crawling with the crawl command, the console output looks like this:
    InjectorJob: starting at 2014-07-08 10:41:27
    InjectorJob: Injecting urlDir: urls
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    InjectorJob: total number of urls rejected by filters: 0
    InjectorJob: total number of urls injected after normalization and filtering: 2
    Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
    Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
    Generating batchId
    Generating a new fetchlist
    GeneratorJob: starting at 2014-07-08 10:41:34
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    GeneratorJob: filtering: false
    GeneratorJob: normalizing: false
    GeneratorJob: topN: 50000
    GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
    GeneratorJob: generated batch id: 1404787293-26339
    Fetching :
    FetcherJob: starting
    FetcherJob: batchId: 1404787293-26339
    Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
    FetcherJob: threads: 50
    FetcherJob: parsing: false
    FetcherJob: resuming: false
    FetcherJob : timelimit set for : 1404798101129
    Using queue mode : byHost
    Fetcher: threads: 50
    QueueFeeder finished: total 2 records. Hit by time limit :0
    fetching http://www.csdn.net/ (queue crawl delay=5000ms)
    Fetcher: throughput threshold: -1
    Fetcher: throughput threshold sequence: 5
    fetching http://www.itpub.net/ (queue crawl delay=5000ms)
    -finishing thread FetcherThread47, activeThreads=48
    -finishing thread FetcherThread46, activeThreads=47
    -finishing thread FetcherThread45, activeThreads=46
    -finishing thread FetcherThread44, activeThreads=45
    -finishing thread FetcherThread43, activeThreads=44
    -finishing thread FetcherThread42, activeThreads=43
    -finishing thread FetcherThread41, activeThreads=42
    -finishing thread FetcherThread40, activeThreads=41
    -finishing thread FetcherThread39, activeThreads=40
    -finishing thread FetcherThread38, activeThreads=39
    -finishing thread FetcherThread37, activeThreads=38
    -finishing thread FetcherThread36, activeThreads=37
    -finishing thread FetcherThread35, activeThreads=36
    -finishing thread FetcherThread34, activeThreads=35
    -finishing thread FetcherThread33, activeThreads=34
    -finishing thread FetcherThread32, activeThreads=33
    -finishing thread FetcherThread31, activeThreads=32
    -finishing thread FetcherThread30, activeThreads=31
    -finishing thread FetcherThread29, activeThreads=30
    -finishing thread FetcherThread48, activeThreads=29
    -finishing thread FetcherThread27, activeThreads=29
    -finishing thread FetcherThread26, activeThreads=28
    -finishing thread FetcherThread25, activeThreads=27
    -finishing thread FetcherThread24, activeThreads=26
    -finishing thread FetcherThread23, activeThreads=25
    -finishing thread FetcherThread22, activeThreads=24
    -finishing thread FetcherThread21, activeThreads=23
    -finishing thread FetcherThread20, activeThreads=22
    -finishing thread FetcherThread19, activeThreads=21
    -finishing thread FetcherThread18, activeThreads=20
    -finishing thread FetcherThread17, activeThreads=19
    -finishing thread FetcherThread16, activeThreads=18
    -finishing thread FetcherThread15, activeThreads=17
    -finishing thread FetcherThread14, activeThreads=16
    -finishing thread FetcherThread13, activeThreads=15
    -finishing thread FetcherThread12, activeThreads=14
    -finishing thread FetcherThread11, activeThreads=13
    -finishing thread FetcherThread10, activeThreads=12
    -finishing thread FetcherThread9, activeThreads=11
    -finishing thread FetcherThread8, activeThreads=10
    -finishing thread FetcherThread7, activeThreads=9
    -finishing thread FetcherThread5, activeThreads=8
    -finishing thread FetcherThread4, activeThreads=7
    -finishing thread FetcherThread3, activeThreads=6
    -finishing thread FetcherThread2, activeThreads=5
    -finishing thread FetcherThread49, activeThreads=4
    -finishing thread FetcherThread6, activeThreads=3
    -finishing thread FetcherThread28, activeThreads=2
    -finishing thread FetcherThread0, activeThreads=1
    fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null
    -finishing thread FetcherThread1, activeThreads=0
    0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
    -activeThreads=0
    FetcherJob: done
    Parsing :
    ParserJob: starting
    ParserJob: resuming:    false
    ParserJob: forced reparse:      false
    ParserJob: batchId:     1404787293-26339
    Parsing http://www.csdn.net/
    http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
    Parsing http://www.itpub.net/
    ParserJob: success
    CrawlDB update for csdnitpub
    DbUpdaterJob: starting
    DbUpdaterJob: done
    Indexing csdnitpub on SOLR index -> http://ip:8983/solr/
    SolrIndexerJob: starting
    SolrIndexerJob: done.
    SOLR dedup -> http://ip:8983/solr/
    Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
    Generating batchId
    Generating a new fetchlist
    GeneratorJob: starting at 2014-07-08 10:42:19
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    GeneratorJob: filtering: false
    GeneratorJob: normalizing: false
    GeneratorJob: topN: 50000
    GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
    GeneratorJob: generated batch id: 1404787338-30453
    Fetching :
    FetcherJob: starting
    FetcherJob: batchId: 1404787338-30453
    Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
    FetcherJob: threads: 50
    FetcherJob: parsing: false
    FetcherJob: resuming: false
    FetcherJob : timelimit set for : 1404798146676
    Using queue mode : byHost
    Fetcher: threads: 50
    QueueFeeder finished: total 0 records. Hit by time limit :0
    
    II. Running the crawl step by step with individual commands
    
    
    1. InjectorJob
    This step injects the URLs listed in seed.txt into the crawl database to initialize the crawl.
    (1) Basic command
    

    $ bin/nutch inject
    Usage: InjectorJob <url_dir> [-crawlId <id>]
    $ bin/nutch inject urls
    InjectorJob: starting at 2014-12-20 22:32:01
    InjectorJob: Injecting urlDir: urls
    InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
    InjectorJob: total number of urls rejected by filters: 0
    InjectorJob: total number of urls injected after normalization and filtering: 1
    Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14

    Here, urls/seed.txt contains a single line: http://stackoverflow.com/
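
    If the seed directory does not exist yet, creating it is a one-liner (paths match the example above):

    $ mkdir -p urls
    $ echo "http://stackoverflow.com/" > urls/seed.txt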
    (2) Checking the injected URL
    This step creates a new table in HBase named <crawlId>_webpage (here, 334_webpage), and the injected URL's data is written into that table:
    hbase(main):002:0> scan '334_webpage'
    ROW                              COLUMN+CELL
     com.stackoverflow:http/         column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00
     com.stackoverflow:http/         column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D
     com.stackoverflow:http/         column=mk:_injmrk_, timestamp=1408953100271, value=y
     com.stackoverflow:http/         column=mk:dist, timestamp=1408953100271, value=0
     com.stackoverflow:http/         column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00
     com.stackoverflow:http/         column=s:s, timestamp=1408953100271, value=?\x80\x00\x00
    1 row(s) in 0.3020 seconds
    (3) About the <crawlId>_webpage table
    For every crawl job, a table named <crawlId>_webpage is created; information about every URL, whether fetched or not, is stored in it.
    A URL that has not yet been fetched has only a few columns in its row; once it has been fetched, the fetched data (such as the page content) is added to the same row. Two ways to inspect this table are sketched below.
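
    To inspect a single URL's row instead of scanning the whole table, the HBase shell's get can be used with the reversed-URL row key seen in the scans above:

    hbase(main):002:0> get '334_webpage', 'com.stackoverflow:http/'

    Nutch also ships a WebTableReader tool (bin/nutch readdb) that can dump and query this table; run bin/nutch readdb with no arguments to see the exact options supported by your build.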
    
    
    2. GeneratorJob
    (1) Basic command
    [jediael@jediael local]$  bin/nutch generate -crawlId 334
    GeneratorJob: starting at 2014-08-25 15:57:12
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    GeneratorJob: filtering: true
    GeneratorJob: normalizing: true
    GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
    GeneratorJob: generated batch id: 1408953432-1171377744
    (2) Command options
    [root@jediael local]# bin/nutch generate
    Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
        -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
        -crawlId <id>  - the id to prefix the schemas to operate on, (default: storage.crawl.id)
        -noFilter      - do not activate the filter plugin to filter the url, default is true
        -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
        -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
        -batchId       - the batch id
    ----------------------
    Please set the params.
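
    For example, to limit the fetchlist to the 1000 best-scoring URLs of crawl 334 (values are illustrative), the options combine as:

    $ bin/nutch generate -topN 1000 -crawlId 334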
    (3) Checking the database
    hbase(main):003:0> scan '334_webpage'
    ROW                              COLUMN+CELL
     com.stackoverflow:http/         column=f:bid, timestamp=1408953437910, value=1408953432-1171377744
     com.stackoverflow:http/         column=f:fi, timestamp=1408953100271, value=\x00'\x8D\x00
     com.stackoverflow:http/         column=f:ts, timestamp=1408953100271, value=\x00\x00\x01H\x0C&\x11\x8D
     com.stackoverflow:http/         column=mk:_gnmrk_, timestamp=1408953437910, value=1408953432-1171377744
     com.stackoverflow:http/         column=mk:_injmrk_, timestamp=1408953100271, value=y
     com.stackoverflow:http/         column=mk:dist, timestamp=1408953100271, value=0
     com.stackoverflow:http/         column=mtdt:_csh_, timestamp=1408953100271, value=?\x80\x00\x00
     com.stackoverflow:http/         column=s:s, timestamp=1408953100271, value=?\x80\x00\x00
    1 row(s) in 0.0490 seconds
    This step added two new columns: f:bid and mk:_gnmrk_.
    3. FetcherJob
    (1) Basic command (a fresh batch is generated first, then fetched with -all)
    [jediael@jediael local]$  bin/nutch generate -crawlId 334
    GeneratorJob: starting at 2014-08-25 15:57:12
    GeneratorJob: Selecting best-scoring urls due for fetch.
    GeneratorJob: starting
    GeneratorJob: filtering: true
    GeneratorJob: normalizing: true
    GeneratorJob: finished at 2014-08-25 15:57:18, time elapsed: 00:00:06
    GeneratorJob: generated batch id: 1408953432-1171377744
    [jediael@jediael local]$  bin/nutch fetch -all -crawlId 334
    FetcherJob: starting
    FetcherJob: fetching all
    Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
    FetcherJob: threads: 10
    FetcherJob: parsing: false
    FetcherJob: resuming: false
    FetcherJob : timelimit set for : -1
    Using queue mode : byHost
    Fetcher: threads: 10
    QueueFeeder finished: total 1 records. Hit by time limit :0
    Fetcher: throughput threshold: -1
    Fetcher: throughput threshold sequence: 5
    fetching http://stackoverflow.com/ (queue crawl delay=5000ms)
    -finishing thread FetcherThread1, activeThreads=8
    -finishing thread FetcherThread7, activeThreads=7
    -finishing thread FetcherThread6, activeThreads=6
    -finishing thread FetcherThread5, activeThreads=5
    -finishing thread FetcherThread4, activeThreads=4
    -finishing thread FetcherThread3, activeThreads=3
    -finishing thread FetcherThread2, activeThreads=2
    -finishing thread FetcherThread8, activeThreads=1
    -finishing thread FetcherThread9, activeThreads=1
    -finishing thread FetcherThread0, activeThreads=0
    0/0 spinwaiting/active, 1 pages, 0 errors, 0.2 0 pages/s, 102 102 kb/s, 0 URLs in 0 queues
    -activeThreads=0
    FetcherJob: done
    
    (2) Checking the database
    See db1.txt.
    New columns added: f:bas, f:cnt, f:prot, f:pts, f:st, f:ts, f:typ, h:Cache-Control, h:Connection, h:Content-Encoding, h:Content-Length, h:Content-Type, h:Date, h:Expires, h:Last-Modified, h:Set-Cookie, h:Vary, h:X-Frame-Options, mk:_ftcmrk_, and others.
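
    By analogy with the ParserJob usage shown below, FetcherJob also accepts a specific batch id in place of -all (run bin/nutch fetch with no arguments to confirm the exact usage of your build), e.g. using the batch id generated above:

    $ bin/nutch fetch 1408953432-1171377744 -crawlId 334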
    4. ParserJob
    (1) Basic command
    [jediael@jediael local]$ bin/nutch parse  -all -crawlId 334
    ParserJob: starting
    ParserJob: resuming:    false
    ParserJob: forced reparse:      false
    ParserJob: parsing all
    Parsing http://stackoverflow.com/
    ParserJob: success
    (2) Command options
    [root@jediael local]# bin/nutch parse 
    Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
        <batchId>     - symbolic batch ID created by Generator
        -crawlId <id> - the id to prefix the schemas to operate on, 
                        (default: storage.crawl.id)
        -all          - consider pages from all crawl jobs
        -resume       - resume a previous incomplete job
        -force        - force re-parsing even if a page is already parsed
    (3) Checking the database
    See db_parse.txt.
    Many new columns of the form column=ol:http://stackoverflow.com/help (the extracted outlinks) were added; in this example there are 115 of them.
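
    To look at just the extracted outlinks without wading through the whole row, the HBase scan can be restricted to the ol column family:

    hbase(main):004:0> scan '334_webpage', {COLUMNS => 'ol'}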
    
    
    5. DbUpdaterJob
    (1) Basic command
    [jediael@jediael local]$ bin/nutch updatedb -crawlId 334
    DbUpdaterJob: starting
    DbUpdaterJob: done
    (2) Checking the database
    See db_updatedb.txt.
    This step resolved the 115 ol: outlink columns described above into 115 new rows, one per outlink. Two of the new rows look like this:
     com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00
     com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01
     com.stackoverflow:http/users/3944974/silviu-oncioiu  column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09
     com.stackoverflow:http/users/3944974/silviu-oncioiu  column=mk:dist, timestamp=1408954979355, value=1
     com.stackoverflow:http/users/3944974/silviu-oncioiu  column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5
     com.stackoverflow:http/users/3944974/silviu-oncioiu  column=s:s, timestamp=1408954979355, value=<\x0Ex5
     com.stackoverflow:http/users/3974525/laosi           column=f:fi, timestamp=1408954979355, value=\x00'\x8D\x00
     com.stackoverflow:http/users/3974525/laosi           column=f:st, timestamp=1408954979355, value=\x00\x00\x00\x01
     com.stackoverflow:http/users/3974525/laosi           column=f:ts, timestamp=1408954979355, value=\x00\x00\x01H\x0CB\xD4\x09
     com.stackoverflow:http/users/3974525/laosi           column=mk:dist, timestamp=1408954979355, value=1
     com.stackoverflow:http/users/3974525/laosi           column=mtdt:_csh_, timestamp=1408954979355, value=<\x0Ex5
     com.stackoverflow:http/users/3974525/laosi           column=s:s, timestamp=1408954979355, value=<\x0Ex5
    At this point the data is ready and waiting for the next round of crawling.
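
    A quick sanity check that the outlinks became new rows is to count the table in the HBase shell; in this example it should report 116 rows (the original seed row plus the 115 new outlink rows):

    hbase(main):005:0> count '334_webpage'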
    6. SolrIndexerJob
    (1) Basic command
    [jediael@jediael local]$  bin/nutch solrindex http://****/solr/  -all -crawlId 334
    SolrIndexerJob: starting
    Adding 1 documents
    SolrIndexerJob: done.
    (2) Command options
    [root@jediael local]# bin/nutch solrindex 
    Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
    (3) Checking the database
    No changes.
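
    To confirm that the document actually reached Solr, a plain query against the same Solr core works (host placeholder as in the commands above):

    $ curl "http://ip:8983/solr/select?q=*:*&wt=json"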