Nutch: setting the fetch interval policy
Published: 2019-06-20


Related analyses of the relevant code:
http://caols.diandian.com/post/2012-06-05/40028026285
http://blog.csdn.net/witsmakemen/article/details/7799546

 

I misread this yesterday. In fact, for a URL that was fetched successfully, the update() phase stores the URL's FetchTime + FetchInterval as its new FetchTime. That FetchTime no longer records when the page was successfully fetched; it is the time of the next fetch, and if the URL comes up for crawling before this new FetchTime, the program filters it out. In the reduce() function of CrawlDbReducer:

[java]
case CrawlDatum.STATUS_FETCH_SUCCESS:         // successful fetch
case CrawlDatum.STATUS_FETCH_REDIR_TEMP:      // successful fetch, redirected
case CrawlDatum.STATUS_FETCH_REDIR_PERM:
case CrawlDatum.STATUS_FETCH_NOTMODIFIED:     // successful fetch, notmodified
  // determine the modification status
  int modified = FetchSchedule.STATUS_UNKNOWN;
  if (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
    modified = FetchSchedule.STATUS_NOTMODIFIED;
  } else {
    if (oldSet && old.getSignature() != null && signature != null) {
      if (SignatureComparator._compare(old.getSignature(), signature) != 0) {
        modified = FetchSchedule.STATUS_MODIFIED;
      } else {
        modified = FetchSchedule.STATUS_NOTMODIFIED;
      }
    }
  }
  // set the schedule
  System.err.println("1:result.fetchtime=" + result.getFetchTime());
  result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
      prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);
  // set the result status and signature
  System.err.println("2:result.fetchtime=" + result.getFetchTime());
  if (modified == FetchSchedule.STATUS_NOTMODIFIED) {
    result.setStatus(CrawlDatum.STATUS_DB_NOTMODIFIED);
    if (oldSet) result.setSignature(old.getSignature());
  } else {
    switch (fetch.getStatus()) {
    case CrawlDatum.STATUS_FETCH_SUCCESS:
      result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
      break;
    case CrawlDatum.STATUS_FETCH_REDIR_PERM:
      result.setStatus(CrawlDatum.STATUS_DB_REDIR_PERM);
      break;
    case CrawlDatum.STATUS_FETCH_REDIR_TEMP:
      result.setStatus(CrawlDatum.STATUS_DB_REDIR_TEMP);
      break;
    default:
      LOG.warn("Unexpected status: " + fetch.getStatus() + " resetting to old status.");
      if (oldSet) result.setStatus(old.getStatus());
      else result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
    }
    result.setSignature(signature);
    if (metaFromParse != null) {
      for (Entry<Writable, Writable> e : metaFromParse.entrySet()) {
        result.getMetaData().put(e.getKey(), e.getValue());
      }
    }
  }
  // if fetchInterval is larger than the system-wide maximum, trigger
  // an unconditional recrawl. This prevents the page to be stuck at
  // NOTMODIFIED state, when the old fetched copy was already removed with
  // old segments.
  if (maxInterval < result.getFetchInterval())
    result = schedule.forceRefetch((Text)key, result, false);
  break;

By tracing the printed FetchTime of result, you can see that its value changes right after the call to schedule.setFetchSchedule(), so it is this function that changes the FetchTime held in the current URL's CrawlDatum. The FetchSchedule implementation called from CrawlDbReducer is the DefaultFetchSchedule class; its source:

[java]
public class DefaultFetchSchedule extends AbstractFetchSchedule {

  @Override
  public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
          long prevFetchTime, long prevModifiedTime,
          long fetchTime, long modifiedTime, int state) {
    datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
        fetchTime, modifiedTime, state);
    if (datum.getFetchInterval() == 0) {
      datum.setFetchInterval(defaultInterval);
    }
    datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
    datum.setModifiedTime(modifiedTime);
    return datum;
  }
}

As you can see, the class has only one method, setFetchSchedule(), and it ultimately sets the datum's FetchTime with:

datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
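To make that arithmetic concrete, here is a small self-contained sketch (not Nutch code; the 86400-second interval is simply the index-page value used later in this post) showing how the next fetch time is derived from the current fetch time plus the interval in seconds:

[java]
import java.util.Date;

public class NextFetchTimeDemo {
  public static void main(String[] args) {
    long fetchTime = System.currentTimeMillis();   // time of the current, successful fetch
    int fetchInterval = 86400;                     // interval in seconds (here: one day)

    // Same formula as in DefaultFetchSchedule.setFetchSchedule(): the value
    // stored back into the CrawlDb is "this fetch + interval", i.e. the time
    // at which the NEXT fetch becomes due.
    long nextFetchTime = fetchTime + (long) fetchInterval * 1000;

    System.out.println("fetched at:     " + new Date(fetchTime));
    System.out.println("next fetch due: " + new Date(nextFetchTime));  // one day later
  }
}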
Main references:
1. http://caols.diandian.com/post/2012-06-05/40028026285
2. http://blog.csdn.net/witsmakemen/article/details/7799546
3. The documentation of the AdaptiveFetchSchedule class

Key property walkthrough: Interval (steps 1-3 are sketched in code right after this list)
1. In InjectMapper: interval = jobConf.getInt("db.fetch.interval.default", 2592000);
2. int customInterval = interval; customInterval = Integer.parseInt(metavalue); (the interval can also be set from the URL's own metadata)
3. CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, customInterval); (the interval is stored in the CrawlDatum)
4. In CrawlDbReducer during CrawlDb update(): result = schedule.setFetchSchedule((Text)key, result, prevFetchTime, prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);
5. schedule defaults to DefaultFetchSchedule. This is where the next fetch date can be set: give index pages an expiration of 1800 seconds (half an hour) and content pages an expiration of 7776000 seconds (90 days).
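Below is a minimal sketch of steps 1-3, assuming the per-URL metadata from the seed line (for example a nutch.fetchInterval entry) has already been parsed into a plain map; the names are illustrative and this is not Nutch's actual InjectMapper code:

[java]
import java.util.HashMap;
import java.util.Map;

public class InjectIntervalSketch {
  public static void main(String[] args) {
    // Stand-in for jobConf.getInt("db.fetch.interval.default", 2592000)
    int interval = 2592000;                        // 30 days, the global default

    // Assumption: per-URL metadata parsed from a seed line such as
    // "http://example.com/  nutch.fetchInterval=1800"
    Map<String, String> urlMeta = new HashMap<>();
    urlMeta.put("nutch.fetchInterval", "1800");

    int customInterval = interval;
    String metavalue = urlMeta.get("nutch.fetchInterval");
    if (metavalue != null) {
      customInterval = Integer.parseInt(metavalue); // a per-URL override wins
    }

    // In Nutch this interval would then be stored in the injected CrawlDatum:
    //   CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, customInterval);
    System.out.println("effective fetch interval = " + customInterval + " seconds");
  }
}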
 
The two configuration properties involved:

<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
  <description>This usage was changed to be the interval for index pages; the original usage was dropped by wqj, 2013-1-18. (Originally: the default number of seconds between re-fetches of a page, 30 days.)</description>
</property>

<property>
  <name>db.fetch.interval.content</name>
  <value>7776000</value>
  <description>Interval for content pages; the default is 7776000 seconds (90 days).</description>
</property>
The modified setFetchSchedule() then picks the interval according to whether the URL is an index page or a content page:

[java]
/* modified: 2013-01-18, author wqj */
Configuration confCurrent = super.getConf();
int interval_index = confCurrent.getInt("db.fetch.interval.default", 86400);      // defaults to 24 hours
int interval_content = confCurrent.getInt("db.fetch.interval.content", 7776000);  // defaults to 90 days
String regi = datum.getMetaData().get(new Text("regi")).toString();
if (url.toString().matches(regi)) {
    datum.setFetchInterval(interval_index);
} else {
    datum.setFetchInterval(interval_content);
}
datum.setFetchTime(fetchTime + (long) datum.getFetchInterval() * 1000);
datum.setModifiedTime(modifiedTime);
return datum;

Crawl run: around 14:30 on 2013-1-18. The resulting CrawlDb entries for two of the URLs:

http://glcx.moc.gov.cn/CsckManageAction/cxckMoreInfo.do?byName=lkbs_newRoad&page=9
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jan 18 14:59:53 CST 2013    (still January here, but half an hour has been added)
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 1800 seconds (0 days)

http://glcx.moc.gov.cn/CsckManageAction/cxckInformationAction.do?infoId=8a8181d532b43eb40132b9076dd20253
Version: 7
Status: 2 (db_fetched)
Fetch time: Thu Apr 18 14:26:26 CST 2013    (crawled in January 2013, yet the fetch time has jumped to April, i.e. the fetch time plus the retry interval)
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 7776000 seconds (90 days)
Score: 0.009259259
Signature: e901870589199bb918c45c9c9fad0782

From the above you can see that fetchTime is in fact already computed at the previous fetch (during update); the next round simply reads it back from the CrawlDb.
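On the next generate round, Nutch only has to compare that stored fetch time with the current time. The snippet below is a simplified, self-contained sketch of that "is this URL due yet?" decision (the real check lives in the FetchSchedule implementation consulted during generation and also applies the system-wide maximum interval), not a copy of the actual implementation:

[java]
import java.util.Date;

public class DueCheckSketch {
  // Simplified sketch: a URL is skipped by the generator as long as the fetch
  // time stored in the CrawlDb (previous fetch + interval) is still in the future.
  static boolean shouldFetch(long storedFetchTime, long curTime) {
    return storedFetchTime <= curTime;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long indexPageDue   = now + 1800L * 1000;     // index page: 1800-second interval
    long contentPageDue = now + 7776000L * 1000;  // content page: 90-day interval

    System.out.println("index page due now?   " + shouldFetch(indexPageDue, now));   // false
    System.out.println("content page due now? " + shouldFetch(contentPageDue, now)); // false
    System.out.println("index page becomes due at " + new Date(indexPageDue));
  }
}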

 

 
