I. Hive Installation and Deployment
1. Download the apache-hive-1.2.2-bin.tar.gz package and upload it to the /usr/src directory on the slave2 server.
2. Extract it: tar -zxvf apache-hive-1.2.2-bin.tar.gz
3. Rename it: mv apache-hive-1.2.2-bin hive-1.2.2
4. Edit the Hive configuration:
cd /usr/src/hive-1.2.2/conf
cp hive-env.sh.template hive-env.sh
touch hive-site.xml
5. Open hive-site.xml with vi and add the following configuration:
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://slave1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/usr/hive/warehouse</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/usr/src/hive-1.2.2/data/tmp</value>
</property>
<property>
<name>hive.querylog.location</name>
<value>/usr/src/hive-1.2.2/data/log</value>
</property>
</configuration>
Create the data directories:
cd /usr/src/hive-1.2.2
mkdir data
cd data
mkdir tmp
mkdir log
6. Open hive-env.sh with vi and add the following:
export JAVA_HOME=/usr/src/jdk1.8.0_191
export HADOOP_HOME=/usr/src/hadoop-2.7.3
export HIVE_HOME=/usr/src/hive-1.2.2
export HIVE_CONF_DIR=/usr/src/hive-1.2.2/conf
7. Add environment variables
#slave2
Edit /etc/profile with vi and add the Hive environment variables:
export HIVE_HOME=/usr/src/hive-1.2.2
export PATH=$PATH:$HIVE_HOME/bin
Reload the environment: source /etc/profile
II. MySQL Installation and Deployment
1. Installation
Prerequisites: the virtual machine can reach the Internet and wget is already installed.
#slave1
1) wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
2) rpm -ivh mysql-community-release-el7-5.noarch.rpm
3) yum install mysql-server
2. Start MySQL
service mysqld start
mysqladmin -uroot password hadoop    # set the root user's password
mysql -uroot -phadoop
use mysql
select host,user,password from user;
mysql> update user set host='%' where user='root' and host='localhost';
mysql> flush privileges;
mysql> exit;
If you want to create a dedicated MySQL user for Hive instead, note the following:
create user 'hive' identified by 'hive123';
grant all privileges on *.* to 'hive'@'%' with grant option;
flush privileges;
III. Starting Hive
1. Download the MySQL driver mysql-connector-java-5.1.47.jar and upload it to /usr/src/hive-1.2.2/lib on slave2, the server where Hive is installed.
2. Run the hive command to start the Hive CLI.
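A quick way to confirm that the CLI can talk to the MySQL metastore is to run a trivial statement once the prompt appears (a minimal sketch; any simple statement will do):
show databases;
show tables;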
IV. Using Hive
1. The Hive shell command line
2. The local system command line (see the sketch below)
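From the local system command line, HiveQL can also be executed without entering the interactive shell, typically with hive -e for an inline statement or hive -f for a script file. A minimal sketch (the script path below is hypothetical):
hive -e "show databases;"
hive -f /root/query.sql    # query.sql is a hypothetical file of HiveQL statements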
Create tables
create table userinfo (id int, name string) row format delimited fields terminated by '\t';
create table choice (userid int, classname string) row format delimited fields terminated by '\t';
create table classinfo (teacher string, classname string) row format delimited fields terminated by '\t';
row format delimited fields terminated by '\t' is a Hive-specific clause that specifies how the data is delimited. If you do not specify it explicitly, the defaults are:
row format delimited fields terminated by '\001' collection items terminated by '\002' map keys terminated by '\003' lines terminated by '\n' stored as textfile
Create a log table
create table if not exists loginfo11 (rdate string, time array<string>, type string, relateclass string, infomation1 string, infomation2 string, infomation3 string) row format delimited fields terminated by ' ' collection items terminated by ',' map keys terminated by ':';
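Since the time column is declared as array<string>, individual elements can be read by index once data has been loaded; a small sketch (assuming loginfo11 already contains rows):
select rdate, time[0] as first_time, size(time) as n from loginfo11 limit 5;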
Load data
--1. Create a database
create database badou ;
--2. Create a table (an internal/managed table; its data lives under Hive's warehouse directory)
create table article (sentence string)
row format delimited fields terminated by '\n';
--3. View the table schema
desc article ;
--4. View the table data
select * from article limit 3 ;
--5. Load data
load data local inpath '/root/The_man_of_property.txt' into table article;
--6. wordcount
--the split operation
Split the text on spaces:
select split ('we may be sure that tribal instinct was even then the prime force',' ');
Result:
OK
["we","may","be","sure","that","tribal","instinct","was","even","then","the","prime","force"]
Time taken: 0.661 seconds, Fetched: 1 row(s)
--explode: turn each element of the array into its own row
-- one row becomes multiple rows, one per element (row-to-rows conversion)
select explode (split ('we may be sure that tribal instinct was even then the prime force',' '));
-- Tip:
set hive.cli.print.header=true; --after this, column names are printed with query results
--check that the column name now appears
select explode (split(sentence,' ')) as word from article;
--count how many times each word appears
select word ,count(*) cnt from (
select explode (split(sentence,' ')) as word from article
) t
group by word ;
--use a regular expression to clean up the output
select regexp_extract('"prime','[A-Za-z]+',0) as word;
--strip extra characters so the output is cleaner
select regexp_extract(word,'[A-Za-z]+',0), count(*) cnt from
( select explode (split(sentence,' ')) as word from article ) t
group by word;
--convert to lowercase and sort in descending order
--desc means descending
select lower(regexp_extract(word,'[A-Za-z]+',0)), count(*) cnt from
( select explode (split(sentence,' ')) as word from article ) t
group by word order by cnt desc limit 10;
--or
select regexp_extract(word,'[A-Za-z]+',0), count(*) cnt from
( select explode (split(lower(sentence),' ')) as word from article ) t
group by word;
--create an external table
--the syntax is:
--create external table ... location 'hdfs_path' (hdfs_path points to data that is already on HDFS)
First, upload the file from the server to HDFS:
hadoop fs -put /root/The_man_of_property.txt /input/
create external TABLE art_ext (sentence string)
ROW format delimited fields terminated by '\n' location '/input/';
--if location is '/input/The_man_of_property.txt', creating the table fails with an error:
create external TABLE art_ext (sentence string)
ROW format delimited fields terminated by '\n' location '/input/The_man_of_property.txt';
Conclusion: the path after location must be a directory, not a file.
Differences between external and internal (managed) tables
1. Creation statement
Internal table: create table ...;
External table: create external table ... location 'hdfs_path';
2. Behavior on drop
Dropping an internal table removes both the metadata and the table's data on HDFS.
Dropping an external table removes only the metadata; the data stays on HDFS.
In other words, with external the table disappears from Hive when dropped, but the files on HDFS are kept.
Summary:
1. When data is loaded into an external table, the data is not moved into Hive's own warehouse directory; it stays where it is on HDFS, i.e. the external table does not manage its data itself.
2. On drop, an internal table deletes both metadata and data, while an external table deletes only the metadata and the data remains on HDFS, as the example below shows.
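A quick way to see the difference with the tables created above (a sketch; the warehouse path follows the hive.metastore.warehouse.dir value configured earlier):
drop table article;   -- managed: metadata and the files under /usr/hive/warehouse/badou.db/article are removed
drop table art_ext;   -- external: only the metadata is removed; /input/The_man_of_property.txt stays on HDFS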
--partition
--1. Create a partitioned table (declare the partitioning when the table is created)
create table art_dt(sentence string)
partitioned by (dt string)
row format delimited fields terminated by '\n';
--2. Insert data
insert into table art_dt partition(dt='20200104')
select * from art_ext limit 100;
insert into art_dt partition(dt='20200105')
select * from art_ext limit 100;
--query
select * from art_dt where dt='20200107' limit 10;
insert overwrite table art_dt partition (dt='20200106')
select * from art_ext limit 100;
--list the table's partitions
show partitions art_dt;
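Data can also be loaded straight into a specific partition instead of going through insert ... select; a sketch reusing the text file from earlier (the dt value here is just an example):
load data local inpath '/root/The_man_of_property.txt'
into table art_dt partition (dt='20200108');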
A Hive table can be split into partitions; a table or a partition can be further divided into buckets with 'clustered by', and the rows inside a bucket can be ordered with 'sort by'.
Running set hive.enforce.bucketing=true in the Hive CLI lets Hive automatically set the number of reducers in the final stage to match the number of buckets.
You can also match the number of buckets yourself via mapred.reduce.tasks.
Main uses of buckets:
1. Sampling data by bucket
2. Speeding up certain queries, e.g. map-side joins.
set hive.enforce.bucketing=true;
create table bucket_user (id int) clustered by (id) into 32 buckets ;
create table bucket_test (id int);
load data local inpath
'/root/bucket_test/bucket_test.txt'
into table bucket_test ;
insert overwrite table bucket_user
select id from bucket_test;
hadoop fs -ls /usr/hive/warehouse/badou.db/bucket_user
hadoop fs -cat /usr/hive/warehouse/badou.db/bucket_user/000000_0
(The directory holds 32 files, one per bucket; ids such as 32, which hash to bucket 0, end up in 000000_0.)
Differences between partitioning and bucketing
1. Partitions are separate directories; buckets are separate files.
2. Bucketing is finer-grained than partitioning; the two can also be combined in one table, as the sketch below shows.
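A hedged sketch of a table that is both partitioned and bucketed (the table and column names are made up for illustration):
create table user_action (user_id int, action string)
partitioned by (dt string)                 -- one HDFS directory per dt value
clustered by (user_id) into 8 buckets      -- 8 files inside each partition directory
row format delimited fields terminated by '\t';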
--sampling
select * from bucket_user tablesample (bucket 1 out of 32 on id );
tablesample (bucket x out of y on id );
y: must be a multiple or a factor of the table's total bucket count; Hive uses y to decide the sampling fraction.
x: the bucket to start sampling from.
For example, if a table has 32 buckets in total, tablesample (bucket 1 out of 16 on id) draws 32/16 = 2 buckets of data, namely bucket 1 and bucket 17, as in the query below.
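Written out against the bucket_user table created above, that example is:
select * from bucket_user tablesample (bucket 1 out of 16 on id);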
Hive data types
Primitive types: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY (Hive 0.8 and later), TIMESTAMP (Hive 0.8 and later)
Complex types: Arrays, Maps, Structs, Union (see the sketch below)
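A table definition that exercises the complex types might look like this (a hedged sketch; the table and columns are invented for illustration):
create table employee (
name string,
skills array<string>,                       -- e.g. ["hive","hadoop"]
scores map<string,int>,                     -- e.g. {"sql":90}
address struct<city:string, street:string>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';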
--exercise
--data (orders)
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
431534,1,prior,5,4,15,28.0
3367565,1,prior,6,2,07,19.0
550135,1,prior,7,1,09,20.0
3108588,1,prior,8,1,14,14.0
2295261,1,prior,9,1,16,0.0
2550362,1,prior,10,4,08,30.0
--create the orders table
create table orders (
order_id string,
user_id string,
eval_set string,
order_number string,
order_dow string,
order_hour_of_day string,
days_since_prior_order string
) row format delimited fields terminated by ',';
load data local inpath
'/root/hive/data/orders.csv'
into table orders ;
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
2,30035,5,0
2,17794,6,1
2,40141,7,1
2,1819,8,1
2,43668,9,0
3,33754,1,1
--create the priors table
create table priors (
order_id string,
product_id string,
add_to_cart_order string,
reordered string
) row format delimited fields terminated by ',';
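The original steps do not show the matching load for this table; a sketch, assuming the file sits alongside orders.csv (the path is an assumption):
load data local inpath
'/root/hive/data/priors.csv'
into table priors;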
Note: if mapred.max.split.size < dfs.block.size, files are ultimately split at dfs.block.size;
if mapred.max.split.size > dfs.block.size, large files are ultimately split at mapred.max.split.size.
In general, dfs.block.size should be settled when the cluster is first set up, because changing it later is troublesome;
what you can realistically tune is mapred.max.split.size.
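For example, the split size can be overridden per session in the Hive CLI (the value here, 256 MB, is only an illustration):
set mapred.max.split.size=268435456;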