Hive分区表添加列
1ALTER TABLE table_name
2 [PARTITION partition_spec] -- (Note: Hive 0.14.0 and later)
3 ADD COLUMNS (col_name data_type [COMMENT col_comment], ...)
4 [CASCADE|RESTRICT] -- (Note: Hive 1.1.0 and later)
用户可以给分区表增加新的字段,默认是加在分区字段之前,已有字段之后。
1hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
2OK
3Time taken: 0.152 seconds
4hive (test) desc test_add_col_part;
5OK
6cardid string
7tx_date string
8
9# Partition Information
10# col_name data_type comment
11
12tx_date string
13Time taken: 0.183 seconds, Fetched: 7 row(s)
14hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段');
15OK
16Time taken: 0.274 seconds
17hive (test)> desc test_add_col_part;
18OK
19cardid string
20busid string 新增字段
21tx_date string
22
23# Partition Information
24# col_name data_type comment
25
26tx_date string
27Time taken: 0.181 seconds, Fetched: 8 row(s)
通过这种方式,无法在已有字段的开始或者中间增加新字段。如果需要调整可以使用语句ALTER TABLE table_name CHANGE COLUMN old_name new_name TYPE [FIRST|AFTER col_name]
此语句还可对字段进行重命名,类型,注释。即使字段名或者字段类型没有改变,用户也需要完全指定旧的字段名,并给出新的字段名及新的字段类型。将字段移到第一个位置使用FIRSTR
关键字,移到某个字段名之后使用AFTER col_name
。
1hive (test)> alter table test_add_col_part change busid busid string first;
2OK
3Time taken: 0.606 seconds
4hive (test)> desc test_add_col_part;
5OK
6busid string 新增字段
7cardid string
8tx_date string
9
10# Partition Information
11# col_name data_type comment
12
13tx_date string
14Time taken: 0.162 seconds, Fetched: 8 row(s)
分区表增加字段需加上关键字CASCADE
,加上后会更新表和所有分区元数据信息,默认是RESTRICT
,只改变表的元数据信息。
The CASCADE|RESTRICT clause is available in Hive 1.1.0. ALTER TABLE ADD|REPLACE COLUMNS with CASCADE command changes the columns of a table’s metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column changes only to table metadata.
下面从数据、表结构、元数据信息查看CASCADE
作用:
准备数据。
1hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
2OK
3Time taken: 0.145 seconds
4hive (test)> insert into table test_add_col_part partition (tx_date='20180325')
5 > select cardid from test_csv;
6OK
7Time taken: 19.493 seconds
8hive (test)> select * from test_add_col_part where tx_date='20180325' limit 4;
9OK
1070F8CF7C5C507F2516147ADAA7765A80 20180325
118AA64E54F0D85F9E3A11D8BEC6684D73 20180325
1238E9497B43759F699AC0158700F862A5 20180325
13CC07331E11F3412953B2EC8E181D92D4 20180325
增加列不带CASCADE
,之前已存在的分区,该字段返回NULL
,即使是使用OVERWRITE
重新覆盖分区,仍为空。
1hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段');
2OK
3Time taken: 0.307 seconds
4hive (test)> select * from test_add_col_part where tx_date='20180325' limit 4;
5OK
670F8CF7C5C507F2516147ADAA7765A80 NULL 20180325
78AA64E54F0D85F9E3A11D8BEC6684D73 NULL 20180325
838E9497B43759F699AC0158700F862A5 NULL 20180325
9CC07331E11F3412953B2EC8E181D92D4 NULL 20180325
10Time taken: 0.118 seconds, Fetched: 4 row(s)
11hive (test)> insert overwrite table test_add_col_part partition (tx_date='20180325')
12 > select cardid,busid from test_csv;
13OK
14Time taken: 19.198 seconds
15hive (test)> select * from test_add_col_part where tx_date='20180325' limit 4;
16OK
1770F8CF7C5C507F2516147ADAA7765A80 NULL 20180325
188AA64E54F0D85F9E3A11D8BEC6684D73 NULL 20180325
1938E9497B43759F699AC0158700F862A5 NULL 20180325
20CC07331E11F3412953B2EC8E181D92D4 NULL 20180325
21Time taken: 0.458 seconds, Fetched: 4 row(s)
新产生的分区插入数据才会有具体的值。
1hive (test)> insert into table test_add_col_part partition (tx_date='20180326')
2 > select cardid,busid from test_csv;
3OK
4Time taken: 20.641 seconds
5hive (test)> select * from test_add_col_part where tx_date='20180326' limit 4;
6OK
770F8CF7C5C507F2516147ADAA7765A80 00017499 20180326
88AA64E54F0D85F9E3A11D8BEC6684D73 00017499 20180326
938E9497B43759F699AC0158700F862A5 00017499 20180326
10CC07331E11F3412953B2EC8E181D92D4 00017499 20180326
11Time taken: 0.178 seconds, Fetched: 4 row(s)
如果添加列时增加关键字CASCADE
,旧分区overwrite覆盖时会有数据。
1hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段') CASCADE;
2OK
3Time taken: 0.392 seconds
4hive (test)> insert overwrite table test_add_col_part partition (tx_date='20180325')
5 > select cardid,busid from test_csv;
6OK
7Time taken: 18.802 seconds
8hive (test)> select * from test_add_col_part where tx_date='20180325' limit 4;
9OK
1070F8CF7C5C507F2516147ADAA7765A80 00017499 20180325
118AA64E54F0D85F9E3A11D8BEC6684D73 00017499 20180325
1238E9497B43759F699AC0158700F862A5 00017499 20180325
13CC07331E11F3412953B2EC8E181D92D4 00017499 20180325
14Time taken: 0.125 seconds, Fetched: 4 row(s)
观察增加列前后的表结构,不加CASCADE
主体表表结构改变,新分区表结构改变。旧分区表结构没有改变。
1hive (test)> drop table test_add_col_part;
2OK
3Time taken: 0.66 seconds
4hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
5OK
6Time taken: 0.124 seconds
7hive (test)> alter table test_add_col_part add partition (tx_date='20180325');
8OK
9Time taken: 0.235 seconds
10hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段') ;
11OK
12Time taken: 0.553 seconds
13hive (test)> desc test_add_col_part;
14OK
15cardid string
16busid string 新增字段
17tx_date string
18
19# Partition Information
20# col_name data_type comment
21
22tx_date string
23Time taken: 0.152 seconds, Fetched: 8 row(s)
24hive (test)> desc test_add_col_part partition (tx_date='20180325');
25OK
26cardid string
27tx_date string
28
29# Partition Information
30# col_name data_type comment
31
32tx_date string
33Time taken: 0.175 seconds, Fetched: 7 row(s)
34hive (test)> alter table test_add_col_part add partition (tx_date='20180326');
35OK
36Time taken: 0.214 seconds
37hive (test)> desc test_add_col_part partition (tx_date='20180326');
38OK
39cardid string
40busid string 新增字段
41tx_date string
42
43# Partition Information
44# col_name data_type comment
45
46tx_date string
47Time taken: 0.385 seconds, Fetched: 8 row(s)
观察Hive元数据表SDS,不加CASCADE
修改分区表结构后,元数据库表SDS的CD_ID会改变,但是该表旧分区下面对应的CD_ID还是原来表的CD_ID。
1hive (test)> drop table test_add_col_part;
2OK
3Time taken: 0.557 seconds
4hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
5OK
6Time taken: 0.163 seconds
7hive (test)> alter table test_add_col_part add partition(tx_date='20180325');
8OK
9Time taken: 0.299 seconds
10hive (test)> alter table test_add_col_part add partition(tx_date='20180326');
11OK
12Time taken: 0.201 seconds
13hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段');
14OK
15Time taken: 0.272 seconds
16hive (test)> alter table test_add_col_part add partition(tx_date='20180327');
17OK
18Time taken: 0.383 seconds
MySQL数据库中Hive元数据表SDS字段CD_ID变化情况:
1root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
2+-------+-------+-------------------------------------------------------------------+
3| sd_id | cd_id | location |
4+-------+-------+-------------------------------------------------------------------+
5| 1920 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
6+-------+-------+-------------------------------------------------------------------+
71 row in set (0.00 sec)
8
9root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
10+-------+-------+------------------------------------------------------------------------------------+
11| sd_id | cd_id | location |
12+-------+-------+------------------------------------------------------------------------------------+
13| 1920 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
14| 1921 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
15+-------+-------+------------------------------------------------------------------------------------+
162 rows in set (0.00 sec)
17
18root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
19+-------+-------+------------------------------------------------------------------------------------+
20| sd_id | cd_id | location |
21+-------+-------+------------------------------------------------------------------------------------+
22| 1920 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
23| 1921 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
24| 1922 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
25+-------+-------+------------------------------------------------------------------------------------+
263 rows in set (0.00 sec)
27
28root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
29+-------+-------+------------------------------------------------------------------------------------+
30| sd_id | cd_id | location |
31+-------+-------+------------------------------------------------------------------------------------+
32| 1920 | 740 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
33| 1921 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
34| 1922 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
35+-------+-------+------------------------------------------------------------------------------------+
363 rows in set (0.00 sec)
37
38root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
39+-------+-------+------------------------------------------------------------------------------------+
40| sd_id | cd_id | location |
41+-------+-------+------------------------------------------------------------------------------------+
42| 1920 | 740 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
43| 1921 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
44| 1922 | 739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
45| 1923 | 740 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180327 |
46+-------+-------+------------------------------------------------------------------------------------+
474 rows in set (0.02 sec)
带CASCADE
时,旧分区对应的CD_ID与主体表的CD_ID都改变且依次递增,新分区的CD_ID沿用主体表更新后的CD_ID。
1hive (test)> drop table test_add_col_part;
2OK
3Time taken: 0.336 seconds
4hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
5OK
6Time taken: 0.119 seconds
7hive (test)> alter table test_add_col_part add partition(tx_date='20180325');
8OK
9Time taken: 0.283 seconds
10hive (test)> alter table test_add_col_part add partition(tx_date='20180326');
11OK
12Time taken: 0.182 seconds
13hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段') cascade;
14OK
15Time taken: 0.306 seconds
16hive (test)> alter table test_add_col_part add partition(tx_date='20180327');
17OK
18Time taken: 0.426 seconds
MySQL数据库中Hive元数据表SDS字段CD_ID变化情况:
1root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
2+-------+-------+-------------------------------------------------------------------+
3| sd_id | cd_id | location |
4+-------+-------+-------------------------------------------------------------------+
5| 1930 | 743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
6+-------+-------+-------------------------------------------------------------------+
71 row in set (0.00 sec)
8
9root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
10+-------+-------+------------------------------------------------------------------------------------+
11| sd_id | cd_id | location |
12+-------+-------+------------------------------------------------------------------------------------+
13| 1930 | 743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
14| 1931 | 743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
15+-------+-------+------------------------------------------------------------------------------------+
162 rows in set (0.00 sec)
17
18root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
19+-------+-------+------------------------------------------------------------------------------------+
20| sd_id | cd_id | location |
21+-------+-------+------------------------------------------------------------------------------------+
22| 1930 | 743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
23| 1931 | 743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
24| 1932 | 743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
25+-------+-------+------------------------------------------------------------------------------------+
263 rows in set (0.00 sec)
27
28root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
29+-------+-------+------------------------------------------------------------------------------------+
30| sd_id | cd_id | location |
31+-------+-------+------------------------------------------------------------------------------------+
32| 1930 | 746 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
33| 1931 | 744 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
34| 1932 | 745 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
35+-------+-------+------------------------------------------------------------------------------------+
363 rows in set (0.02 sec)
37
38root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
39+-------+-------+------------------------------------------------------------------------------------+
40| sd_id | cd_id | location |
41+-------+-------+------------------------------------------------------------------------------------+
42| 1930 | 746 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
43| 1931 | 744 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
44| 1932 | 745 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
45| 1933 | 746 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180327 |
46+-------+-------+------------------------------------------------------------------------------------+
474 rows in set (0.00 sec)
总结:
Hive分区表增加字段会导致新增字段无法显示时解决方案:
- 方案1:表结构变更时增加关键字
CASCADE
。 - 方案2:删除原分区表,重新创建分区插入数据。若旧分区太多,则考虑重建原表吧。
- 方案3:将新字段数据拼接到旧字段上去,不修改表结构。
- 方案4:修改Hive元数据表SDS中cd_id的值。不建议!