Setting Up a Pseudo-Distributed Hive on Spark Development Environment


Preface

My work requires the Hive on Spark mode for building a data warehouse, but server resources in the development environment are tight and we cannot deploy CDH there for now; a full CDH install would eat up most of the 32 GB of RAM by itself. So I decided to install vanilla Hadoop, Hive, and Spark instead. It has been quite a while since I set up a native environment by hand.

Here is the installation process.

Development server specs: 16 CPU cores, 32 GB of RAM.

Hadoop prerequisites

Installing the JDK

mkdir -p /home/module/java/

Download jdk-8u202-linux-x64.tar.gz into that directory.

tar -zvxf jdk-8u202-linux-x64.tar.gz

The JDK path is /home/module/java/jdk1.8.0_202.

Edit /etc/profile and add:

export JAVA_HOME=/home/module/java/jdk1.8.0_202

export PATH=$PATH:$JAVA_HOME/bin

Run source /etc/profile, then verify with java -version.

Installing MySQL

The MySQL installation itself is skipped here; there are plenty of tutorials online, and you can also run it in Docker.

See any Docker MySQL installation guide for details; a minimal sketch is shown below.
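As an illustration only (the image tag, container name, data directory, and root password below are assumptions, not part of the original setup; the host port 33061 matches the JDBC URL used later in hive-site.xml):

# Minimal sketch: MySQL in Docker for the Hive metastore
docker run -d --name mysql-hive \
  -p 33061:3306 \
  -e MYSQL_ROOT_PASSWORD=123456 \
  -v /home/module/mysql/data:/var/lib/mysql \
  mysql:5.7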


Passwordless SSH login

ssh-keygen -t rsa

cd ~/.ssh/

cat id_rsa.pub >> authorized_keys

chmod 600 ./authorized_keys

Verify with ssh localhost.

Installing hadoop-3.3.1.tar.gz

Place the tarball in /home/module.

tar -zvxf hadoop-3.3.1.tar.gz

cd /home/module/hadoop-3.3.1/etc/hadoop

Edit /etc/hosts:

[root@data-dev-server hadoop]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.10 data-dev-server

Main contents of core-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/home/hadoop_data/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://data-dev-server:9000</value>
  </property>
</configuration>

Main contents of hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>data-dev-server:50070</value>
  </property>
</configuration>

mapred-site.xml

<configuration>
  <!-- Default execution mode for MapReduce jobs: yarn (cluster mode) or local -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- MapReduce JobHistory server address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>data-dev-server:10020</value>
  </property>
  <!-- JobHistory server web UI address -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>data-dev-server:19888</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <!-- Site specific YARN configuration properties -->
  <!-- Host that runs the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>data-dev-server</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Whether to enforce physical memory limits on containers -->
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <!-- Whether to enforce virtual memory limits on containers -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <!-- Enable log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Log server URL (JobHistory server) -->
  <property>
    <name>yarn.log.server.url</name>
    <value>http://data-dev-server:19888/jobhistory/logs</value>
  </property>
  <!-- Retain aggregated logs for 7 days (604800 seconds) -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
</configuration>

Add the following to /etc/profile:

export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/home/module/hadoop-3.3.1
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HIVE_HOME=/home/module/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin
export SPARK_HOME=/home/module/spark-2.3.0-bin-without-hive
export PATH=$SPARK_HOME/bin:$PATH

Initializing and starting Hadoop

hdfs namenode -format

start-dfs.sh to start HDFS

start-yarn.sh to start YARN
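Since mapred-site.xml and yarn-site.xml above point job logs at the JobHistory server on ports 10020/19888, you will probably also want it running. The original steps do not show this, but in Hadoop 3 it can be started with:

# optional: start the MapReduce JobHistory server used by the log aggregation config above
mapred --daemon start historyserver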

Check the processes with jps:

NameNode, SecondaryNameNode, and DataNode are the HDFS processes.

NodeManager and ResourceManager are the YARN processes.

Verify the installation with hadoop fs -ls /.
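For a slightly fuller smoke test (a sketch; the file and directory names are arbitrary), round-trip a small file through HDFS:

# write a local file, push it into HDFS, and read it back
echo "hello hdfs" > /tmp/hello.txt
hadoop fs -mkdir -p /tmp/smoke-test
hadoop fs -put -f /tmp/hello.txt /tmp/smoke-test/
hadoop fs -cat /tmp/smoke-test/hello.txt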

Installing apache-hive-3.1.2-bin.tar.gz

tar -zvxf apache-hive-3.1.2-bin.tar.gz

cd /home/module/apache-hive-3.1.2-bin/conf

Create hive-site.xml with the following contents:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>Default warehouse directory in HDFS</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://10.20.29.52:33061/hive?createDatabaseIfNotExist=true&amp;useSSL=false</value>
    <description>JDBC connection for the metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>JDBC driver; the jar must be copied into ${HIVE_HOME}/lib</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Metastore database username</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>Metastore database password</description>
  </property>
  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>
  <!-- Whether to verify the metastore schema version -->
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>system:user.name</name>
    <value>root</value>
    <description>user name</description>
  </property>
  <!-- Bind host for HiveServer2 -->
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>data-dev-server</value>
    <description>Bind host on which to run the HiveServer2 Thrift service.</description>
  </property>
  <!-- HiveServer2 Thrift port -->
  <property>
    <name>hive.server2.thrift.port</name>
    <value>11000</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://data-dev-server:9083</value>
  </property>
  <!-- Spark dependencies: the HDFS path the Spark jars are uploaded to (see the Spark section below) -->
  <property>
    <name>spark.yarn.jars</name>
    <value>hdfs://data-dev-server:9000/spark/jars/*.jar</value>
  </property>
  <!-- Hive execution engine: spark -->
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
  <!-- Timeout for the Hive-to-Spark client connection -->
  <property>
    <name>hive.spark.client.connect.timeout</name>
    <value>10000ms</value>
  </property>
</configuration>
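As the ConnectionDriverName description above notes, the MySQL JDBC driver jar has to be present in Hive's lib directory before the metastore can talk to MySQL. A sketch (the connector file name and version are assumptions):

# copy the MySQL JDBC driver into Hive's classpath
cp mysql-connector-java-5.1.49.jar /home/module/apache-hive-3.1.2-bin/lib/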

Create the hive database and user in MySQL:

CREATE DATABASE `hive` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */;
CREATE USER 'hive'@'%' IDENTIFIED BY '123456';
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%';
FLUSH PRIVILEGES;

Starting Hive

cd /home/module/apache-hive-3.1.2-bin/bin

Initialize the Hive metastore schema: ./schematool -dbType mysql -initSchema

Start the metastore service:

nohup hive --service metastore &

Then enter the Hive CLI with the hive command.
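hive-site.xml above also binds HiveServer2 to port 11000; if you want JDBC access as well, a sketch (not part of the original steps) is to start HiveServer2 and connect with Beeline:

# optional: start HiveServer2 and connect over JDBC
nohup hive --service hiveserver2 &
beeline -u jdbc:hive2://data-dev-server:11000 -n root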

Installing Spark and switching the execution engine from MR to Spark

Download spark-2.3.0-bin-without-hive.tgz.

tar -zvxf spark-2.3.0-bin-without-hive.tgz

cd /home/module/spark-2.3.0-bin-without-hive/conf

cp spark-defaults.conf.template spark-defaults.conf

Edit spark-defaults.conf so that it contains:

spark.master                     yarn
spark.home                       /home/module/spark-2.3.0-bin-without-hive
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://data-dev-server:9000/tmp/spark
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.executor.memory            1g
spark.driver.memory              1g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.yarn.archive               hdfs://data-dev-server:9000/spark/jars/spark2.3.0-without-hive-libs.jar
spark.yarn.jars                  hdfs://data-dev-server:9000/spark/jars/spark2.3.0-without-hive-libs.jar

Upload the Spark jars to HDFS:

hadoop fs -mkdir -p /tmp/spark
hadoop fs -mkdir -p /spark/jars
cd /home/module/spark-2.3.0-bin-without-hive
hadoop fs -put ./jars/* /spark/jars/
jar cv0f spark2.3.0-without-hive-libs.jar -C ./jars/ .
hadoop fs -put spark2.3.0-without-hive-libs.jar /spark/jars/

cd /home/module/spark-2.3.0-bin-without-hive/conf

cp spark-env.sh.template spark-env.sh, then edit spark-env.sh and add:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/

Then copy the Spark jars into Hive's lib directory:

cp /home/module/spark-2.3.0-bin-without-hive/jars/* /home/module/apache-hive-3.1.2-bin/lib/

Verifying Spark

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --driver-memory 1G \
  --num-executors 3 \
  --executor-memory 1G \
  --executor-cores 1 \
  /home/module/spark-2.3.0-bin-without-hive/examples/jars/spark-examples_*.jar 10
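If the job completes, you can also confirm it on the YARN side; as a quick sketch (the ResourceManager web UI is on the default port 8088 unless you changed it):

# list finished YARN applications, or browse http://data-dev-server:8088
yarn application -list -appStates FINISHED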

Verifying Hive on Spark

Create a table and run a query such as select count(*) from dws_user; a sketch follows.
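A minimal sketch in the Hive CLI (the dws_user schema and test row here are assumptions for illustration):

-- create a small test table and run a query that triggers a Spark job
CREATE TABLE IF NOT EXISTS dws_user (user_id BIGINT, user_name STRING);
INSERT INTO dws_user VALUES (1, 'test');
SELECT COUNT(*) FROM dws_user;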

The query runs on the Spark engine, so the integration is working.
