Building a Search Engine with Nutch, MongoDB, and ElasticSearch

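1. Goal

Build a simple search engine with Nutch, MongoDB, and ElasticSearch: the Nutch crawler fetches web pages and stores them in MongoDB, mongo-connector syncs them into ElasticSearch, where they are indexed, and content is retrieved from ElasticSearch through its RESTful API.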

2. Environment

CentOS7 Linux x86_64

JDK 1.8.0_161

apache-ant-1.9.4

apache-nutch-2.3.1

mongodb 2.6.12-6

elasticsearch 6.2.2

mongo-connector 6.2.2

3. Install Oracle JDK

3.1 Uninstall OpenJDK

yum list installed | grep java

yum remove java-1.8.0-openjdk-headless

yum remove javapackages-tools

yum remove python-javapackages

yum remove tzdata-java

3.2 Download, install, and configure Oracle JDK 1.8.0_161

Download address: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

rpm -ivh jdk-8u161-linux-x64.rpm

/etc/profile:

export JAVA_HOME=/usr/java/jdk1.8.0_161/

export JRE_HOME=$JAVA_HOME/jre

export CLASSPATH=$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH

source /etc/profile

4. Install and configure MongoDB

yum install mongodb-server

yum install mongodb

/etc/mongodb/shard1.conf:

port=47017

replSet=rs1

fork=true

dbpath=/data/mongodb/shard1

logpath=/data/mongodb/logs/shard1.log

shardsvr=true

directoryperdb=true

/etc/mongodb/shard2.conf:

port=47018

replSet=rs1

fork=true

dbpath=/data/mongodb/shard2

logpath=/data/mongodb/logs/shard2.log

shardsvr=true

directoryperdb=true

/etc/mongodb/shard3.conf:

port=47019

replSet=rs1

fork=true

dbpath=/data/mongodb/shard3

logpath=/data/mongodb/logs/shard3.log

shardsvr=true

directoryperdb=true

/etc/mongodb/config1.conf:

port=37017

fork=true

dbpath=/data/mongodb/config1

logpath=/data/mongodb/logs/config1.log

configsvr=true

directoryperdb=true

/etc/mongodb/config2.conf:

port=37018

fork=true

dbpath=/data/mongodb/config2

logpath=/data/mongodb/logs/config2.log

configsvr=true

directoryperdb=true

/etc/mongodb/config3.conf:

port=37019

fork=true

dbpath=/data/mongodb/config3

logpath=/data/mongodb/logs/config3.log

configsvr=true

directoryperdb=true

/etc/mongodb/router1.conf:

port = 27017

fork = true

logpath = /data/mongodb/logs/router1.log

configdb = vminger:37017,vminger:37018,vminger:37019

maxConns =

logappend = true

/etc/mongodb/router2.conf:

port = 27018

fork = true

logpath = /data/mongodb/logs/router2.log

configdb = vminger:37017,vminger:37018,vminger:37019

maxConns =

logappend = true

/etc/mongodb/router3.conf:

port = 27019

fork = true

logpath = /data/mongodb/logs/router3.log

configdb = vminger:37017,vminger:37018,vminger:37019

maxConns =

logappend = true
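
The shard and config-server files above differ only in their index, port, and role flag, so they can be generated instead of typed by hand. The following is a minimal sketch that assumes the same paths and port ranges as above. The helper names `render_conf` and `all_confs` are hypothetical, and the router files are left out because this article does not give a value for `maxConns`.

```python
def render_conf(role, index):
    """Return the text of one mongod config file in the key=value style used above.

    role is "shard" or "config"; index is 1, 2, or 3.
    """
    if role == "shard":
        port = 47016 + index  # 47017..47019, as in shard1-3.conf
        lines = [
            f"port={port}",
            "replSet=rs1",
            "fork=true",
            f"dbpath=/data/mongodb/shard{index}",
            f"logpath=/data/mongodb/logs/shard{index}.log",
            "shardsvr=true",
            "directoryperdb=true",
        ]
    elif role == "config":
        port = 37016 + index  # 37017..37019, as in config1-3.conf
        lines = [
            f"port={port}",
            "fork=true",
            f"dbpath=/data/mongodb/config{index}",
            f"logpath=/data/mongodb/logs/config{index}.log",
            "configsvr=true",
            "directoryperdb=true",
        ]
    else:
        raise ValueError(f"unknown role: {role!r}")
    return "\n".join(lines) + "\n"


def all_confs():
    """Map each target path to its file contents (nothing is written to disk)."""
    return {
        f"/etc/mongodb/{role}{i}.conf": render_conf(role, i)
        for role in ("shard", "config")
        for i in (1, 2, 3)
    }
```

Writing each value of `all_confs()` to its key would reproduce the six files; the sketch only returns the text so it can be inspected first.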

Start shard1-3, config1-3, and router1-3:

mongod -f /etc/mongodb/shard1.conf

mongod -f /etc/mongodb/shard2.conf

mongod -f /etc/mongodb/shard3.conf

mongod -f /etc/mongodb/config1.conf

mongod -f /etc/mongodb/config2.conf

mongod -f /etc/mongodb/config3.conf

mongos -f /etc/mongodb/router1.conf

mongos -f /etc/mongodb/router2.conf

mongos -f /etc/mongodb/router3.conf
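
Because all nine processes fork into the background, a quick way to confirm they came up is to check that each configured port accepts TCP connections. This is a minimal sketch; the helper names are hypothetical, and it assumes the host name `vminger` used in the router configs resolves from where the check runs.

```python
import socket

# Ports taken from the nine config files above.
EXPECTED_PORTS = {
    "shard": [47017, 47018, 47019],
    "config": [37017, 37018, 37019],
    "router": [27017, 27018, 27019],
}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def report(host="vminger"):
    """Return a {(role, port): bool} map covering every expected process."""
    return {
        (role, port): is_listening(host, port)
        for role, ports in EXPECTED_PORTS.items()
        for port in ports
    }
```

Any entry of `report()` that is `False` points at the matching log file under /data/mongodb/logs.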

Enable sharding:

mongo --port 27017

>use admin

>db.runCommand({addshard:"vminger:47017",allowLocal:true })

>db.runCommand({addshard:"vminger:47018",allowLocal:true })

>db.runCommand({addshard:"vminger:47019",allowLocal:true })

Enable replica sets (mongo-connector needs them to sync data):

mongo --port 47017

> config={_id:"rs1",members:[{_id:0,host:"vminger:47017"}]}

> rs.initiate(config)

> rs.add("vminger:47018")

> rs.add("vminger:47019")
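
The replica-set config passed to rs.initiate() is a plain document with an `_id` and a list of members, so all three members can also be listed up front instead of being added one by one. The sketch below only builds that document and prints it as a shell command; the helper names are hypothetical, and it does not talk to MongoDB.

```python
import json


def build_rs_config(set_name, hosts):
    """Build a replica-set config document in the shape rs.initiate() accepts."""
    return {
        "_id": set_name,
        "members": [{"_id": i, "host": host} for i, host in enumerate(hosts)],
    }


def as_shell_command(config):
    """Render the config as a single line that can be pasted into the mongo shell."""
    return "rs.initiate(" + json.dumps(config) + ")"


if __name__ == "__main__":
    cfg = build_rs_config("rs1", ["vminger:47017", "vminger:47018", "vminger:47019"])
    print(as_shell_command(cfg))
```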

Create the database and user:

mongo --port 27017

>use test1

>db.createUser({user: "root", pwd: "root", roles: [{ role: "dbOwner", db: "test1" }]})

5. Install and configure ElasticSearch

Download elasticsearch-6.2.2.tar.gz and extract it. Download address: https://www.elastic.co/downloads

Configure config/elasticsearch.yml:

cluster.name: vminger

node.name: node-1

path.data: /var/lib/elasticsearch

path.logs: /var/log/elasticsearch

network.host: 0.0.0.0

http.port: 9200

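Install the Chinese word-segmentation plugin elasticsearch-analysis-ik-6.2.2.zip: https://github.com/medcl/elasticsearch-analysis-ik

Unzip it into the elasticsearch-6.2.2/plugins directory.

Change into the elasticsearch-6.2.2 directory and run ./bin/elasticsearch -d (note: start it as a non-root user).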

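Create the test1 index with the ik Chinese analyzer enabled. (Note: when mongo-connector creates the index during sync it does not specify ik and uses the default standard analyzer, which cannot be changed afterwards in ElasticSearch, so one approach is to create the index in advance with ik configured.)

PUT http://192.168.132.33:9200/test1

{
  "settings" : {
    "index" : {
      "analysis.analyzer.default.type": "ik_max_word"
    }
  }
}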
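
The index-creation call can also be scripted. In ElasticSearch 6.x an index is created with PUT on the index name, and request bodies need an explicit Content-Type header. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the address is the one used in this article.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, analyzer="ik_max_word"):
    """Build (but do not send) the request that creates an index with a default analyzer."""
    body = {"settings": {"index": {"analysis.analyzer.default.type": analyzer}}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_index(base_url, index):
    """Send the request and return the decoded JSON reply (needs a running node)."""
    request = build_create_index_request(base_url, index)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `create_index("http://192.168.132.33:9200", "test1")` before the first sync means mongo-connector writes into an index that already uses ik.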

6. Install and configure mongo-connector

pip install mongo-connector

pip install elastic2-doc-manager

mongo-connector -m vminger:27017 -t vminger:9200 -d elastic2_doc_manager
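
In the command above, -m is the MongoDB address to read from, -t is the target URL to write to, and -d names the doc manager that does the writing. If the connector should be started from a script, the same command line can be assembled as an argument list. This is a sketch under the assumption that `mongo-connector` is on the PATH; the helper names are hypothetical.

```python
import shlex
import subprocess


def build_connector_argv(mongo_addr, target_url, doc_manager="elastic2_doc_manager"):
    """Assemble the mongo-connector command line used in this article."""
    return [
        "mongo-connector",
        "-m", mongo_addr,    # MongoDB address to read changes from
        "-t", target_url,    # ElasticSearch address to write to
        "-d", doc_manager,   # doc manager module that performs the writes
    ]


def start_connector(argv):
    """Launch mongo-connector as a child process and return the Popen handle."""
    return subprocess.Popen(argv)


if __name__ == "__main__":
    argv = build_connector_argv("vminger:27017", "vminger:9200")
    print(shlex.join(argv))
```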

7. Install and configure Nutch

Download apache-ant-1.9.4-bin.tar.gz and extract it. Download address:

/etc/profile:

export ANT_HOME=/home/vminger/workspace/sysapp/ant/apache-ant-1.9.4

export PATH=$PATH:$ANT_HOME/bin

source /etc/profile

Download apache-nutch-2.3.1-src.tar.gz and extract it. Download address:

http://nutch.apache.org/downloads.html

/etc/profile:

export NUTCH_HOME=/home/vminger/workspace/sysapp/nutch/apache-nutch-2.3.1/runtime/local

export PATH=$PATH:$NUTCH_HOME/bin

source /etc/profile

conf/nutch-site.xml:

<configuration>

<property>

<name>storage.data.store.class</name>

<value>org.apache.gora.mongodb.store.MongoStore</value>

<description>Default class for storing data</description>

</property>

<property>

<name>http.agent.name</name>

<value>Hist Crawler</value>

</property>

</configuration>

ivy/ivy.xml:

<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />

conf/gora.properties:

gora.datastore.default=org.apache.gora.mongodb.store.MongoStore

gora.mongodb.override_hadoop_configuration=false

gora.mongodb.mapping.file=/gora-mongodb-mapping.xml

gora.mongodb.servers=vminger:27017

gora.mongodb.db=test1

gora.mongodb.login=root

gora.mongodb.secret=root

Build Nutch: ant runtime

Set the URL filter rule for the crawl:

conf/regex-urlfilter.txt:

+^http://([a-z0-9]*\.)*sina\.com\.cn/
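
To see which URLs such a rule lets through, the filter logic can be approximated outside Nutch. In regex-urlfilter.txt each line starts with + (accept) or - (reject), the first matching rule decides, and a URL that matches no rule is ignored. The sketch below mimics that with Python's `re` module, which is an assumption: Nutch itself uses Java regular expressions, although this particular pattern behaves the same in both. The rule is written with the dots escaped so that they match literal dots, and the helper names are hypothetical.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = parse_rules(r"+^http://([a-z0-9]*\.)*sina\.com\.cn/")
```

With this rule, `accepts(RULES, "http://news.sina.com.cn/china/")` is true, while an https:// URL on the same host is rejected because the pattern only allows http://.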

Set the seed URLs:

runtime/local/urls/seed.ini

Change into the runtime/local directory and start the crawl with crawl ID id1 and a depth of 3:

./bin/crawl urls/ id1 3

Query the content through the RESTful API:

POST

{
  "query" : { "match" : { "text" : "中国" }},
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : {
      "text" : {}
    }
  }
}
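
The same query can be sent from a script and the highlighted fragments pulled out of the reply. This is a minimal sketch using only the standard library. The request URL is not given above, so the `/test1/_search` endpoint on the node from section 5 is an assumption, as are the function names.

```python
import json
import urllib.request


def build_search_request(base_url, index, text):
    """Build (but do not send) a match query on "text" with highlighting enabled."""
    body = {
        "query": {"match": {"text": text}},
        "highlight": {
            "pre_tags": ["<tag1>", "<tag2>"],
            "post_tags": ["</tag1>", "</tag2>"],
            "fields": {"text": {}},
        },
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_search",  # assumed endpoint
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_highlights(response):
    """Collect the highlighted fragments of the "text" field from a search reply."""
    fragments = []
    for hit in response.get("hits", {}).get("hits", []):
        fragments.extend(hit.get("highlight", {}).get("text", []))
    return fragments


def search(base_url, index, text):
    """Send the query and return the highlighted fragments (needs a running node)."""
    request = build_search_request(base_url, index, text)
    with urllib.request.urlopen(request) as response:
        return extract_highlights(json.loads(response.read().decode("utf-8")))
```

For example, `search("http://192.168.132.33:9200", "test1", "中国")` would return the fragments in which the query terms are wrapped in the configured tags.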

Query result: (image not available)
