Yichuan Cai's technique blog: First Experience with Nutch (repost)

http://blog.csdn.net/setok/articles/499791.aspx

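A few days ago I read Lu Liang's article "Larbin: an efficient search engine crawler tool", which mentions Nutch [...] The blog 竹笋炒肉 (http://hedong.3322.org/) gives a brief introduction to Nutch.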

Nutch vs Lucene
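Lucene is not a complete application; it is a software library for implementing full-text search.
Nutch is an application that builds a working search engine on top of Lucene.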

Nutch vs GRUB
[here: http://www.wespoke.com/archives/000879.php]
Nutch, by contrast, can also store the data in a database and build an index.
Nutch Architecture.png
[Quoted from here]

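Early versions of Nutch did not support Chinese search, but the latest version does (2004-Aug-04 saw the release of [...]). Taking http://www.dbanotes.net as an example, I first ran a test in the intranet crawling mode.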

In the nutch directory, create a file named urls that holds the site's top-level URL, with the following content:

http://www.dbanotes.net/

Then edit conf/crawl-urlfilter.txt to set the filter rules. Here I only changed MY.DOMAIN.NAME:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*dbanotes.net/
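As a rough sanity check of the filter line above, the sketch below tests a few URLs against the same pattern. It is only an approximation: `grep -E` stands in for Nutch's own regular-expression matching, the helper name `url_accepted` is made up for illustration, and the example URLs other than http://www.dbanotes.net/ are assumptions. The leading `+` in the filter file only marks the rule as "accept", so it is not part of the pattern.

```shell
#!/bin/sh
# Pattern copied from conf/crawl-urlfilter.txt, without the leading "+".
PATTERN='^http://([a-z0-9]*\.)*dbanotes.net/'

# Hypothetical helper: succeeds when the URL matches the accept pattern.
# grep -E is an assumption standing in for Nutch's regex engine.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Try the seed URL, a bare-domain URL, and an unrelated host.
for url in \
    "http://www.dbanotes.net/" \
    "http://dbanotes.net/archives/" \
    "http://www.example.com/"
do
    if url_accepted "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

The `([a-z0-9]*\.)*` group allows any number of subdomain labels, including none, so both the www host and the bare domain pass while other hosts are rejected.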

Run the following command to start crawling and analyzing the site:

[root@fc3 nutch]# bin/nutch crawl urls -dir crawl.demo -depth 2 -threads 4 >& crawl.log

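The depth parameter sets the crawl depth; since this is only a test, I chose a depth of 2.
The threads parameter sets the number of concurrent fetcher threads, set to 4 here.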

While the command is running, you can follow what nutch is doing in crawl.log:

......
050102 200336 loading file:/u01/nutch/conf/nutch-site.xml
050102 200336 crawl started in: crawl.demo
050102 200336 rootUrlFile = urls
050102 200336 threads = 4
050102 200336 depth = 2
050102 200336 Created webdb at crawl.demo/db
050102 200336 Starting URL processing
050102 200336 Using URL filter: net.nutch.net.RegexURLFilter
......
050102 200337 Plugins: looking in: /u01/nutch/plugins
050102 200337 parsing: /u01/nutch/plugins/parse-html/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/parse-pdf/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/parse-ext/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/parse-msword/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/query-site/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/protocol-http/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/creativecommons/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/language-identifier/plugin.xml
050102 200337 parsing: /u01/nutch/plugins/query-basic/plugin.xml
050102 200337 logging at INFO
050102 200337 fetching http://www.dbanotes.net/
050102 200337 http.proxy.host = null
050102 200337 http.proxy.port = 8080
050102 200337 http.timeout = 10000
050102 200337 http.content.limit = 65536
050102 200337 http.agent = NutchCVS/0.05 (Nutch; http://www.nutch.org/docs/en/bot.html; nutch-agent@lists.sourceforge.net)
050102 200337 fetcher.server.delay = 1000
050102 200337 http.max.delays = 100
050102 200338 http://www.dbanotes.net/: setting encoding to GB18030
050102 200338 CC: found http://creativecommons.org/licenses/by-nc-sa/2.0/ in rdf of http://www.dbanotes.net/
050102 200338 CC: found text in http://www.dbanotes.net/
050102 200338 status: 1 pages, 0 errors, 12445 bytes, 1067 ms
050102 200338 status: 0.9372071 pages/s, 91.12142 kb/s, 12445.0 bytes/page
050102 200339 Updating crawl.demo/db
050102 200339 Updating for crawl.demo/segments/20050102200336
050102 200339 Finishing update
......
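Since the log is long, it can help to pull out just the cumulative status lines. The sketch below is one possible way to do that with awk, assuming the line layout seen in the excerpt (date, time, `status:`, then the counters). The helper name `status_counts` is hypothetical, and the sample file only contains lines copied from the excerpt; on a real run you would point it at crawl.log instead.

```shell
#!/bin/sh
# Sample lines copied from the log excerpt; a real run would read crawl.log.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
050102 200337 fetching http://www.dbanotes.net/
050102 200338 status: 1 pages, 0 errors, 12445 bytes, 1067 ms
050102 200338 status: 0.9372071 pages/s, 91.12142 kb/s, 12445.0 bytes/page
050102 200339 Finishing update
EOF

# Hypothetical helper: print "<pages> <errors>" for each cumulative status
# line. The rate line ("pages/s") is skipped because its fifth field differs.
status_counts() {
    awk '$3 == "status:" && $5 == "pages," { print $4, $6 }' "$1"
}

status_counts "$sample_log"
rm -f "$sample_log"
```

Matching on whitespace-separated fields rather than a full regular expression keeps the filter short, at the cost of depending on the exact field positions shown in this log.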

Next, configure Tomcat (mine is installed in /opt/Tomcat):

[root@fc3 nutch]# rm -rf /opt/Tomcat/webapps/ROOT*
[root@fc3 nutch]# cp nutch*.war /opt/Tomcat/webapps/ROOT.war
[root@fc3 webapps]# cd /opt/Tomcat/webapps/
[root@fc3 webapps]# jar xvf ROOT.war
[root@fc3 webapps]# ../bin/catalina.sh start

Open http://localhost:8080/ in a browser to see the result (for remote access, replace localhost with the server's IP):

nutch web search interface.png

Search test:

nutch web search result.png

As you can see, Nutch also provides a cached snapshot feature. Next, a Chinese search test:

nutch web Chinese search result.png

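Note the "评分详解" (score explanation) link in the results. It is an interesting feature (Nutch has a link analysis module), and the figures it shows help in understanding the ranking algorithm.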

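Given my bandwidth limits, I have not tested whole-Web crawling for now. It is worth mentioning that during the test nutch's crawl speed was quite good (relative to my poor bandwidth).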

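Nutch does not yet support searching PDF files (in development, not yet mature) or objects such as images. Its Chinese word segmentation is also not good enough: the score explanation shows that Chinese text such as "数据库管理员" is split into single characters for processing. Still, for an open-source search engine, its features are commendable. After all, the main developer, Doug Cutting, is the developer of Lucene.

"试用Nutch" (Trying Out Nutch): http://hedong.3322.org/archives/000247.html

Che Dong's "Lucene：基于Java的全文检索引擎简介" (Lucene: an introduction to the Java-based full-text search engine): http://www.chedong.com/tech/lucene.html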


Friday, March 2, 2007
