Hadoop 0.20.2 + Nutch 1.2 + Tomcat 7: Distributed Search Configuration
As Nutch has evolved, its modules have grown increasingly independent. I have worked my way down from 2.1 to 1.6 without ever getting the full feature set running end to end. Today I am installing Nutch 1.2, which should be the last stable release to ship a war file.
1. Preparation
Download apache-nutch-1.2-bin.zip, apache-tomcat-7.0.39.tar.gz, and hadoop-0.20.2.tar.gz.
Extract hadoop-0.20.2.tar.gz into /opt.
Extract apache-nutch-1.2-bin.zip into /opt.
Extract apache-tomcat-7.0.39.tar.gz into /opt.
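A minimal shell sketch of these steps (assuming the three archives sit in the current directory, and that the Nutch zip unpacks as /opt/nutch-1.2; rename the folder if your archive unpacks under a different name):

# extract all three packages under /opt
tar -xzf hadoop-0.20.2.tar.gz -C /opt
unzip apache-nutch-1.2-bin.zip -d /opt
tar -xzf apache-tomcat-7.0.39.tar.gz -C /opt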
2. Configuring hadoop-0.20.2
(1) Edit conf/hadoop-env.sh and append:
export JAVA_HOME=/opt/java-7-sun
export HADOOP_HEAPSIZE=1000
export HADOOP_CLASSPATH=.:/opt/nutch-1.2/lib:/opt/hadoop-0.20.2
export NUTCH_HOME=/opt/nutch-1.2/lib
(2) Edit /etc/profile and add:
# Hadoop
export HADOOP_HOME=/opt/hadoop-0.20.2
export PATH=$PATH:$HADOOP_HOME/bin
(3) Edit conf/core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://m2:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-0.20.2/tempdata/var</value>
  </property>
  <property>
    <name>hadoop.native.lib</name>
    <value>true</value>
    <description>Should native hadoop libraries, if present, be used.</description>
  </property>
</configuration>
(4) Edit conf/hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- paths for the NameNode's name directories -->
    <value>/opt/hadoop-0.20.2/tempdata/name1,/opt/hadoop-1.0.4/tempdata/name2</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hadoop-0.20.2/tempdata/data1,/opt/hadoop-1.0.4/tempdata/data2</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
(5) Edit conf/mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>m2:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/opt/hadoop-0.20.2/tempdata/var</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
    <description>If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK.</description>
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
    <description>Should the job outputs be compressed?</description>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
</configuration>
(6) Fill in the conf/masters and conf/slaves files, then start Hadoop (the hadoop command lives in bin under the Hadoop home):
a) hadoop namenode -format
b) start-all.sh
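As a quick sanity check (not in the original post, but a standard habit): once start-all.sh returns, jps should list the expected daemons on each machine.

# on the master m2
jps    # expect NameNode, SecondaryNameNode, JobTracker
# on each slave (s6, s7)
jps    # expect DataNode, TaskTracker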
(7) Check on Hadoop through the web UIs:
a) http://localhost:50070 (NameNode / HDFS status)
b) http://localhost:50030 (JobTracker / MapReduce status)
3. Configuring nutch-1.2
(1) Create nutch-1.2/urls/url.txt and add the seed URLs:
http://www.163.com/
http://www.tianya.cn/
http://www.renren.com/
http://www.iteye.com/
(2) Edit conf/crawl-urlfilter.txt:
# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*163.com/
+^http://([a-z0-9]*\.)*tianya.cn/
+^http://([a-z0-9]*\.)*renren.com/
+^http://([a-z0-9]*\.)*iteye.com/
(3) Configure conf/nutch-site.xml:
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>aniu-search</value>
    <description>aniu.160</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value></value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value></value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value></value>
  </property>
  <property>
    <name>searcher.dir</name>
    <!-- this directory resolves to hdfs://m2:9000/user/root/search-dir -->
    <value>search-dir</value>
  </property>
</configuration>
(4) Copy the files in hadoop/conf over those in nutch/conf (a one-line sketch follows).
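Assuming the paths from the earlier steps:

cp -f /opt/hadoop-0.20.2/conf/* /opt/nutch-1.2/conf/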
(5) Create the file search-dir/search-servers.txt with the following content (one host and search port per line):
m2 9999
s6 9999
s7 9999
4. Configuring Tomcat
(1) Copy nutch-1.2.war from the nutch-1.2 directory into Tomcat's webapps directory, then visit http://localhost:8080/nutch-1.2/ in a browser: you should see the Nutch search page. (Do not skip this test; it makes Tomcat expand the war and create the corresponding directory.) A sketch of the copy follows.
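Assuming the paths used throughout this post:

cp /opt/nutch-1.2/nutch-1.2.war /opt/apache-tomcat-7.0.39/webapps/
/opt/apache-tomcat-7.0.39/bin/catalina.sh start    # Tomcat expands the war on startup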
(2) To fix garbled Chinese characters, edit tomcat/conf/server.xml, locate the HTTP connector, and add the two encoding attributes:
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"
           useBodyEncodingForURI="true"/>
(3) Copy the files in nutch/conf into /opt/apache-tomcat-7.0.39/webapps/nutch-1.2/WEB-INF/classes.
5. Testing distributed crawling and distributed search
(1) Upload the urls folder to hdfs://m2:9000/user/root (sketched below).
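Run from /opt/nutch-1.2 so the local urls folder is picked up:

hadoop fs -put urls urls
hadoop fs -ls urls    # verify url.txt arrived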
(2) Run the crawl command: bin/nutch crawl urls -dir crawl -depth 5 -threads 10 -topN 100. If a crawl directory with five subdirectories is created under hdfs://m2:9000/user/root, the crawl succeeded; you can verify it as shown below.
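In Nutch 1.x those five subdirectories are typically crawldb, linkdb, segments, indexes, and index; listing the output shows them:

hadoop fs -ls hdfs://m2:9000/user/root/crawl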
(3) Start the search server on each node:
[root@m2 nutch-1.2]# bin/nutch server 9999 hdfs://m2:9000/user/root/crawl
[root@s6 nutch-1.2]# bin/nutch server 9999 hdfs://m2:9000/user/root/crawl
[root@s7 nutch-1.2]# bin/nutch server 9999 hdfs://m2:9000/user/root/crawl
(4) Upload search-dir/search-servers.txt to HDFS as hdfs://m2:9000/user/root/search-dir/search-servers.txt (sketched below).
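Again run from /opt/nutch-1.2:

hadoop fs -put search-dir search-dir
hadoop fs -cat search-dir/search-servers.txt    # verify the host/port list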
(5) Restart Tomcat:
/opt/apache-tomcat-7.0.39/bin/catalina.sh stop
/opt/apache-tomcat-7.0.39/bin/catalina.sh start
(6) Test at http://10.1.50.160:8080/nutch-1.2/search.jsp in a browser; searching succeeds.
6. FAQ
(1) The crawl command fails with java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.SnappyCodec was not found.
Fix: remove the Snappy-related settings from hadoop-0.20.2/conf/mapred-site.xml; an example of the kind of entry to look for follows.
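For illustration only (the property below is my assumption of what such a setting looks like, not taken from the original post): hadoop-0.20.2 predates built-in Snappy support, so any codec entry naming SnappyCodec must be deleted, or switched to org.apache.hadoop.io.compress.DefaultCodec.

<!-- hypothetical example of the kind of entry to remove -->
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>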
(2) Running the crawl command bin/nutch crawl hdfs://m2:9000/user/root/nutch-1.2/urls -dir hdfs://m2:9000/user/root/nutch-1.2/crawl -depth 5 -threads 10 -topN 100 fails with:
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
After asking 大脸 for advice, I got this error solved. The fix is to restore the relevant entry in conf/nutch-default.xml to its default:
<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>
(3) Error:
Error: org.apache.nutch.scoring.ScoringFilters.injectedScore(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/CrawlDatum;)V
Fix:
I checked the TaskTracker log on s6 and found one such entry flagged FATAL; I looked the word up, and it does indeed mean "deadly"... but the log still gave no concrete cause.
Continuing the search, I went in through the browser at m2:50030, located the task that had just run (Hadoop had to be restarted along the way), and found the error:
2013-04-13 07:45:38,046 ERROR org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.NoSuchMethodError: org.apache.nutch.scoring.ScoringFilters.injectedScore(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/CrawlDatum;)V
at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:141)
at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:59)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Searching online turned up that NoSuchMethodError usually means two jars providing the same functionality are on the classpath at different versions. Reading bin/nutch, I noticed it also loads the plugins under nutch-1.2/build, which conflicts with nutch-1.2/plugins; renaming the build folder solved the problem (sketch below).
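A minimal sketch of the rename (the .bak suffix is my choice; any name that bin/nutch will not pick up works):

mv /opt/nutch-1.2/build /opt/nutch-1.2/build.bak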
(4) Distributed search cannot find the content on HDFS.
I finally spotted the relevant WARN in the logs: a file could not be found.
Fix: upload that folder (search-dir) to HDFS, as in step 5(4).
7. Main references:
1. nutch入门学习.pdf (an introductory Nutch tutorial)
2. http://hi.baidu.com/erliang20088
3. http://lendfating.blog.163.com/blog/static/1820743672012111311532359/
4. http://blog.csdn.net/witsmakemen/article/details/8256369