How Nutch handles robots.txt. Category: H3_NUTCH, 2015-01-28 11:20

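By default, Nutch respects robots.txt, and it provides no configuration option for ignoring it.
One explanation is quoted below. In short: as an Apache open-source project, Nutch has to follow certain rules, and because the source code is open, anyone can bypass the robots.txt restrictions simply by modifying the source.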

From the point of view of research and crawling certain pieces of the web, and i strongly agree with you that it should be configurable. But because Nutch being an Apache project, i dismiss it (arguments available upon request). We should adhere to some ethics, it is bad enough that we can just DoS a server by setting some options to a high level. We publish source code, it leaves the option open to everyone to change it, and i think the current situation is balanced enough.
Patching it is simple, i think we should keep it like that :)

The source can be modified as follows (unverified):
Edit the class org.apache.nutch.fetcher.FetcherReducer (in FetcherReducer.java)
and comment out the following block:
```
if (!rules.isAllowed(fit.u.toString())) {
  // unblock
  fetchQueues.finishFetchItem(fit, true);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Denied by robots.txt: " + fit.url);
  }
  output(fit, null, ProtocolStatusUtils.STATUS_ROBOTS_DENIED,
      CrawlStatus.STATUS_GONE);
  continue;
}
```
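To make clear what the block above does, here is a minimal, self-contained sketch of the same pattern: check each URL against the robots.txt rules and skip it with `continue` when it is denied. This is not Nutch code. `RobotsCheckSketch`, `SimpleRobotRules`, `filterAllowed` and the `honorRobots` flag are hypothetical names made up for illustration, and the matcher only understands `Disallow` path prefixes, whereas Nutch's real rules object handles far more.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RobotsCheckSketch {

  /** Hypothetical stand-in for parsed robots.txt rules: Disallow prefixes only. */
  static final class SimpleRobotRules {
    private final List<String> disallowedPrefixes;

    SimpleRobotRules(List<String> disallowedPrefixes) {
      this.disallowedPrefixes = new ArrayList<>(disallowedPrefixes);
    }

    /** Returns false when the URL's path starts with any Disallow prefix. */
    boolean isAllowed(String url) {
      String path = URI.create(url).getPath();
      if (path == null || path.isEmpty()) {
        path = "/"; // "http://example.com" has an empty path
      }
      for (String prefix : disallowedPrefixes) {
        // An empty "Disallow:" line means "allow everything", so skip it.
        if (!prefix.isEmpty() && path.startsWith(prefix)) {
          return false;
        }
      }
      return true;
    }
  }

  /**
   * Mirrors the shape of the fetch loop: denied URLs are skipped with
   * "continue". Passing honorRobots = false has the same effect as
   * commenting the check out (assumption: illustrative flag, not a Nutch option).
   */
  static List<String> filterAllowed(SimpleRobotRules rules, List<String> urls,
      boolean honorRobots) {
    List<String> toFetch = new ArrayList<>();
    for (String url : urls) {
      if (honorRobots && !rules.isAllowed(url)) {
        continue; // denied by robots.txt: do not fetch
      }
      toFetch.add(url);
    }
    return toFetch;
  }

  public static void main(String[] args) {
    SimpleRobotRules rules = new SimpleRobotRules(Arrays.asList("/private/"));
    List<String> urls = Arrays.asList(
        "http://example.com/index.html",
        "http://example.com/private/report.html");

    // prints [http://example.com/index.html]
    System.out.println(filterAllowed(rules, urls, true));
    // prints both URLs, because the check is bypassed
    System.out.println(filterAllowed(rules, urls, false));
  }
}
```

With the check in place only the allowed URL survives; removing the check lets every URL through, which is exactly why the quoted reply asks users to weigh the ethics before patching.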
Original article: https://www.cnblogs.com/lujinhong2/p/4637239.html
