• Nutch: configuring the fetch interval policy


    http://caols.diandian.com/post/2012-06-05/40028026285


    http://blog.csdn.net/witsmakemen/article/details/7799546 (an analysis of the relevant code)
    I misread this yesterday. In fact, for a successfully fetched URL, update() stores FetchTime + FetchInterval as the new FetchTime. From that point on, FetchTime no longer records when the page was successfully fetched; it is the time of the next fetch. If the URL comes up for crawling before this new FetchTime has passed, it is filtered out.
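    The update-then-filter behaviour can be sketched as follows. This is a minimal standalone illustration; the class and method names are made up and are not the actual Nutch classes.

```java
// Sketch (hypothetical names) of how update() turns fetchTime into the *next*
// fetch time, and how a later generate pass skips URLs whose stored fetchTime
// is still in the future.
public class FetchTimeSketch {
    static long nextFetchTime(long fetchTimeMs, int fetchIntervalSec) {
        // update() stores fetchTime + fetchInterval as the new fetchTime
        return fetchTimeMs + (long) fetchIntervalSec * 1000L;
    }

    static boolean shouldFetch(long storedFetchTimeMs, long nowMs) {
        // generate only selects URLs whose stored fetchTime has already passed
        return storedFetchTimeMs <= nowMs;
    }

    public static void main(String[] args) {
        long fetchedAt = 1_358_490_000_000L;        // time of the successful fetch
        long next = nextFetchTime(fetchedAt, 1800); // 30-minute interval
        System.out.println(next - fetchedAt);       // 1800000 ms = 30 minutes later
        System.out.println(shouldFetch(next, fetchedAt));   // false: filtered out
        System.out.println(shouldFetch(next, next + 1000)); // true: due again
    }
}
```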
    
    In the reduce() function of CrawlDbReducer:
    
        case CrawlDatum.STATUS_FETCH_SUCCESS:         // successful fetch  
        case CrawlDatum.STATUS_FETCH_REDIR_TEMP:      // successful fetch, redirected  
        case CrawlDatum.STATUS_FETCH_REDIR_PERM:  
        case CrawlDatum.STATUS_FETCH_NOTMODIFIED:     // successful fetch, notmodified  
          // determine the modification status  
          int modified = FetchSchedule.STATUS_UNKNOWN;  
          if (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {  
            modified = FetchSchedule.STATUS_NOTMODIFIED;  
          } else {  
            if (oldSet && old.getSignature() != null && signature != null) {  
              if (SignatureComparator._compare(old.getSignature(), signature) != 0) {  
                modified = FetchSchedule.STATUS_MODIFIED;  
              } else {  
                modified = FetchSchedule.STATUS_NOTMODIFIED;  
              }  
            }  
          }  
          // set the schedule  
          System.err.println("1:result.fetchtime="+result.getFetchTime());  
          result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,  
              prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);  
          // set the result status and signature  
          System.err.println("2:result.fetchtime="+result.getFetchTime());  
          
          if (modified == FetchSchedule.STATUS_NOTMODIFIED) {  
            result.setStatus(CrawlDatum.STATUS_DB_NOTMODIFIED);  
            if (oldSet) result.setSignature(old.getSignature());  
          } else {  
            switch (fetch.getStatus()) {  
            case CrawlDatum.STATUS_FETCH_SUCCESS:  
              result.setStatus(CrawlDatum.STATUS_DB_FETCHED);  
              break;  
            case CrawlDatum.STATUS_FETCH_REDIR_PERM:  
              result.setStatus(CrawlDatum.STATUS_DB_REDIR_PERM);  
              break;  
            case CrawlDatum.STATUS_FETCH_REDIR_TEMP:  
              result.setStatus(CrawlDatum.STATUS_DB_REDIR_TEMP);  
              break;  
            default:  
              LOG.warn("Unexpected status: " + fetch.getStatus() + " resetting to old status.");  
              if (oldSet) result.setStatus(old.getStatus());  
              else result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);  
            }  
            result.setSignature(signature);  
            if (metaFromParse != null) {  
                for (Entry<Writable, Writable> e : metaFromParse.entrySet()) {  
                  result.getMetaData().put(e.getKey(), e.getValue());  
                }  
              }  
          }  
          // if fetchInterval is larger than the system-wide maximum, trigger  
          // an unconditional recrawl. This prevents the page to be stuck at  
          // NOTMODIFIED state, when the old fetched copy was already removed with  
          // old segments.  
          if (maxInterval < result.getFetchInterval())  
            result = schedule.forceRefetch((Text)key, result, false);  
          break;  
    
    By tracing the printed FetchTime of result, we can see that the value changes after the call to schedule.setFetchSchedule(), so it must be this function that updates the FetchTime of the URL's CrawlDatum.
    
    In CrawlDbReducer, the FetchSchedule implementation used is DefaultFetchSchedule. Its source:
    
        public class DefaultFetchSchedule extends AbstractFetchSchedule {  
          
          @Override  
          public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,  
                  long prevFetchTime, long prevModifiedTime,  
                  long fetchTime, long modifiedTime, int state) {  
            datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,  
                fetchTime, modifiedTime, state);  
            if (datum.getFetchInterval() == 0 ) {  
              datum.setFetchInterval(defaultInterval);  
            }  
            datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);  
            datum.setModifiedTime(modifiedTime);  
            return datum;  
          }  
        }  
    
    As shown, the class has a single method, setFetchSchedule(), which ultimately sets the datum's FetchTime via datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
    Main references
    1 http://caols.diandian.com/post/2012-06-05/40028026285
    2 http://blog.csdn.net/witsmakemen/article/details/7799546
    3 The documentation of the AdaptiveFetchSchedule class

    Key property walkthrough: Interval
    1 In InjectMapper: interval = jobConf.getInt("db.fetch.interval.default", 2592000);
    2 int customInterval = interval; customInterval = Integer.parseInt(metavalue); or the interval is set from the URL's own metadata
    3 CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, customInterval); the interval is stored in the CrawlDatum
    4 In CrawlDbReducer during CrawlDb update():
    result = schedule.setFetchSchedule((Text)key, result, prevFetchTime, prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);
    5 schedule defaults to DefaultFetchSchedule. This is where the next fetch date can be set, e.g. an expiry of 1800 s (half an hour) for index pages and 7776000 s (90 days) for content pages.
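    Steps 1-3 above can be condensed into a small sketch. The method and class names below are hypothetical; only the config key, the default value, and the Integer.parseInt(metavalue) override come from the steps above.

```java
// Condensed sketch of how the interval travels from the config (or a per-URL
// metadatum) into the CrawlDatum at inject time.
public class InjectIntervalSketch {
    static int resolveInterval(int defaultInterval, String metavalue) {
        // step 2: a per-URL metadata value, when present, overrides the default
        int customInterval = defaultInterval;
        if (metavalue != null) customInterval = Integer.parseInt(metavalue);
        return customInterval;
    }

    public static void main(String[] args) {
        // step 1: db.fetch.interval.default falls back to 2592000 s (30 days)
        System.out.println(resolveInterval(2592000, null));   // 2592000
        // per-URL override, as in step 2
        System.out.println(resolveInterval(2592000, "1800")); // 1800
    }
}
```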

    <property>
      <name>db.fetch.interval.default</name>
      <value>86400</value>
      <description>Repurposed as the re-fetch interval for index pages
      (changed by wqj, 2013-01-18). Original meaning: the default number of
      seconds between re-fetches of a page (30 days).
      </description>
    </property>

    <property>
      <name>db.fetch.interval.content</name>
      <value>7776000</value>
      <description>Re-fetch interval for content pages; the default is 7776000 seconds (90 days).
      </description>
    </property>


            /* modified: 2013-01-18, author wqj */
            Configuration confCurrent = super.getConf();
            int interval_index = confCurrent.getInt("db.fetch.interval.default", 86400);     // defaults to 24 hours
            int interval_content = confCurrent.getInt("db.fetch.interval.content", 7776000); // defaults to 90 days
            Writable regiValue = datum.getMetaData().get(new Text("regi"));
            String regi = (regiValue == null) ? "" : regiValue.toString(); // guard: no "regi" metadatum means no index-page match
            if (url.toString().matches(regi)) {
                datum.setFetchInterval(interval_index);   // index page: short interval
            } else {
                datum.setFetchInterval(interval_content); // content page: long interval
            }
            datum.setFetchTime(fetchTime + (long) datum.getFetchInterval()
                    * 1000);
            datum.setModifiedTime(modifiedTime);
            return datum;
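    The regex split above can be exercised in isolation. The pattern and URLs below are illustrative assumptions; in the real code the pattern comes from the "regi" metadatum of the CrawlDatum.

```java
// Hypothetical illustration of the index-vs-content split: URLs matching the
// "regi" pattern get the short (index) interval, everything else the long
// (content) interval.
public class IntervalSplit {
    static int chooseInterval(String url, String regi,
                              int indexInterval, int contentInterval) {
        return url.matches(regi) ? indexInterval : contentInterval;
    }

    public static void main(String[] args) {
        String regi = ".*MoreInfo\\.do.*"; // assumed index-page pattern
        System.out.println(chooseInterval(
            "http://glcx.moc.gov.cn/CsckManageAction/cxckMoreInfo.do?page=9",
            regi, 1800, 7776000));   // 1800 -> half an hour
        System.out.println(chooseInterval(
            "http://glcx.moc.gov.cn/CsckManageAction/cxckInformationAction.do?infoId=1",
            regi, 1800, 7776000));   // 7776000 -> 90 days
    }
}
```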


    Crawled around 2013-01-18 14:30
    http://glcx.moc.gov.cn/CsckManageAction/cxckMoreInfo.do?byName=lkbs_newRoad&page=9    Version: 7
    Status: 2 (db_fetched)
    Fetch time: Fri Jan 18 14:59:53 CST 2013     (the month is still January, but half an hour has been added)
    Modified time: Thu Jan 01 08:00:00 CST 1970
    Retries since fetch: 0
    Retry interval: 1800 seconds (0 days)

    http://glcx.moc.gov.cn/CsckManageAction/cxckInformationAction.do?infoId=8a8181d532b43eb40132b9076dd20253    Version: 7
    Status: 2 (db_fetched)
    Fetch time: Thu Apr 18 14:26:26 CST 2013            (crawled in January 2013, but the fetch time has moved to April: the 90-day interval has been added)
    Modified time: Thu Jan 01 08:00:00 CST 1970
    Retries since fetch: 0
    Retry interval: 7776000 seconds (90 days)
    Score: 0.009259259
    Signature: e901870589199bb918c45c9c9fad0782

    From the above we can conclude:
    fetchTime is in fact computed during the previous crawl;
    the next pass simply reads from CrawlDb the URLs whose fetchTime is earlier than the current time and fetches them.
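    As a sanity check on the dump above: adding the 90-day content interval (7776000 s) to a fetch on 2013-01-18 lands exactly on 2013-04-18, matching the "Fetch time: Thu Apr 18 14:26:26 CST 2013" line. The class name below is made up; the dates and interval are taken from the dump.

```java
// Verify that 2013-01-18 + 7776000 s = 2013-04-18 (China Standard Time has no
// daylight saving, so the interval is exactly 90 calendar days).
import java.util.Calendar;
import java.util.TimeZone;

public class NextFetchDate {
    public static Calendar plusSeconds(Calendar start, long seconds) {
        Calendar c = (Calendar) start.clone();
        c.setTimeInMillis(start.getTimeInMillis() + seconds * 1000L);
        return c;
    }

    public static void main(String[] args) {
        Calendar fetched = Calendar.getInstance(TimeZone.getTimeZone("Asia/Shanghai"));
        fetched.clear();
        fetched.set(2013, Calendar.JANUARY, 18, 14, 26, 26);
        Calendar next = plusSeconds(fetched, 7776000L); // db.fetch.interval.content
        System.out.println((next.get(Calendar.MONTH) + 1) + "-"
                + next.get(Calendar.DAY_OF_MONTH)); // prints 4-18
    }
}
```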

    Follow-up:
    Suppose we want to catch updates to content pages, e.g. the content changes after a while. But on our side deduplication is done with a URL Bloom filter, so even if a page is updated, the change will not show up in the database.
    Could RegexDumpParser help here? Check whether the DUMP file of an updated page carries any special marker; if an article's content has been updated, change its status to UPDATE.
  • Original post: https://www.cnblogs.com/i80386/p/2864900.html