Franky's Blog

Diligence is the path up the mountain of books; hard work is the boat across the boundless sea of learning…

Adding Columns to a Hive Partitioned Table

ALTER TABLE table_name
  [PARTITION partition_spec]                 -- (Note: Hive 0.14.0 and later)
  ADD COLUMNS (col_name data_type [COMMENT col_comment], ...)
  [CASCADE|RESTRICT]                         -- (Note: Hive 1.1.0 and later)

You can add new columns to a partitioned table; by default they are appended after the existing columns and before the partition columns.

hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
OK
Time taken: 0.152 seconds
hive (test)> desc test_add_col_part;
OK
cardid                  string
tx_date                 string

# Partition Information
# col_name              data_type               comment

tx_date                 string
Time taken: 0.183 seconds, Fetched: 7 row(s)
hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段');
OK
Time taken: 0.274 seconds
hive (test)> desc test_add_col_part;
OK
cardid                  string
busid                   string                  新增字段
tx_date                 string

# Partition Information
# col_name              data_type               comment

tx_date                 string
Time taken: 0.181 seconds, Fetched: 8 row(s)

This approach cannot add a new column at the beginning or in the middle of the existing columns. To reposition a column, use ALTER TABLE table_name CHANGE COLUMN old_name new_name TYPE [FIRST|AFTER col_name].

This statement can also rename a column and change its type and comment. Even if the column name or type is unchanged, you must fully specify the old column name and supply the new name and new type. Use the FIRST keyword to move a column to the first position, or AFTER col_name to move it after a given column.

hive (test)> alter table test_add_col_part change busid busid string first;
OK
Time taken: 0.606 seconds
hive (test)> desc test_add_col_part;
OK
busid                   string                  新增字段
cardid                  string
tx_date                 string

# Partition Information
# col_name              data_type               comment

tx_date                 string
Time taken: 0.162 seconds, Fetched: 8 row(s)

When adding columns to a partitioned table, include the CASCADE keyword: it updates the metadata of the table and of all its partitions. The default is RESTRICT, which changes only the table's metadata.

The CASCADE|RESTRICT clause is available in Hive 1.1.0. ALTER TABLE ADD|REPLACE COLUMNS with CASCADE command changes the columns of a table’s metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column changes only to table metadata.

The following sections examine the effect of CASCADE from three angles: the data, the table schema, and the metastore metadata.

Prepare some test data.

hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
OK
Time taken: 0.145 seconds
hive (test)> insert into table test_add_col_part  partition (tx_date='20180325')
           > select cardid  from test_csv;
OK
Time taken: 19.493 seconds
hive (test)> select * from test_add_col_part  where tx_date='20180325' limit 4;
OK
70F8CF7C5C507F2516147ADAA7765A80        20180325
8AA64E54F0D85F9E3A11D8BEC6684D73        20180325
38E9497B43759F699AC0158700F862A5        20180325
CC07331E11F3412953B2EC8E181D92D4        20180325

If the column is added without CASCADE, it returns NULL for partitions that already existed, and stays NULL even after the partition is rewritten with INSERT OVERWRITE.

hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段');
OK
Time taken: 0.307 seconds
hive (test)> select * from test_add_col_part  where tx_date='20180325' limit 4;
OK
70F8CF7C5C507F2516147ADAA7765A80        NULL    20180325
8AA64E54F0D85F9E3A11D8BEC6684D73        NULL    20180325
38E9497B43759F699AC0158700F862A5        NULL    20180325
CC07331E11F3412953B2EC8E181D92D4        NULL    20180325
Time taken: 0.118 seconds, Fetched: 4 row(s)
hive (test)> insert overwrite table test_add_col_part  partition (tx_date='20180325')
           > select cardid,busid  from test_csv;
OK
Time taken: 19.198 seconds
hive (test)> select * from test_add_col_part where tx_date='20180325' limit 4;
OK
70F8CF7C5C507F2516147ADAA7765A80        NULL    20180325
8AA64E54F0D85F9E3A11D8BEC6684D73        NULL    20180325
38E9497B43759F699AC0158700F862A5        NULL    20180325
CC07331E11F3412953B2EC8E181D92D4        NULL    20180325
Time taken: 0.458 seconds, Fetched: 4 row(s)

Only partitions created after the change receive actual values when data is inserted.

hive (test)> insert into table test_add_col_part  partition (tx_date='20180326')
           > select cardid,busid  from test_csv;
OK
Time taken: 20.641 seconds
hive (test)> select * from test_add_col_part where tx_date='20180326' limit 4;
OK
70F8CF7C5C507F2516147ADAA7765A80        00017499        20180326
8AA64E54F0D85F9E3A11D8BEC6684D73        00017499        20180326
38E9497B43759F699AC0158700F862A5        00017499        20180326
CC07331E11F3412953B2EC8E181D92D4        00017499        20180326
Time taken: 0.178 seconds, Fetched: 4 row(s)

If the CASCADE keyword is included when adding the column, overwriting an old partition does populate the data.

hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段') CASCADE;
OK
Time taken: 0.392 seconds
hive (test)> insert overwrite table test_add_col_part  partition (tx_date='20180325')
           > select cardid,busid  from test_csv;
OK
Time taken: 18.802 seconds
hive (test)> select * from test_add_col_part where tx_date='20180325' limit 4;
OK
70F8CF7C5C507F2516147ADAA7765A80        00017499        20180325
8AA64E54F0D85F9E3A11D8BEC6684D73        00017499        20180325
38E9497B43759F699AC0158700F862A5        00017499        20180325
CC07331E11F3412953B2EC8E181D92D4        00017499        20180325
Time taken: 0.125 seconds, Fetched: 4 row(s)

Now compare the table schema before and after adding the column. Without CASCADE, the schema of the table itself changes, and partitions created afterwards pick up the new schema, but the schema of pre-existing partitions stays unchanged.

hive (test)> drop table test_add_col_part;
OK
Time taken: 0.66 seconds
hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
OK
Time taken: 0.124 seconds
hive (test)> alter table test_add_col_part add partition (tx_date='20180325');
OK
Time taken: 0.235 seconds
hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段') ;
OK
Time taken: 0.553 seconds
hive (test)> desc test_add_col_part;
OK
cardid                  string
busid                   string                  新增字段
tx_date                 string

# Partition Information
# col_name              data_type               comment

tx_date                 string
Time taken: 0.152 seconds, Fetched: 8 row(s)
hive (test)> desc test_add_col_part partition (tx_date='20180325');
OK
cardid                  string
tx_date                 string

# Partition Information
# col_name              data_type               comment

tx_date                 string
Time taken: 0.175 seconds, Fetched: 7 row(s)
hive (test)> alter table test_add_col_part add partition (tx_date='20180326');
OK
Time taken: 0.214 seconds
hive (test)> desc test_add_col_part partition (tx_date='20180326');
OK
cardid                  string
busid                   string                  新增字段
tx_date                 string

# Partition Information
# col_name              data_type               comment

tx_date                 string
Time taken: 0.385 seconds, Fetched: 8 row(s)

Next, look at the Hive metastore table SDS. After changing the schema of a partitioned table without CASCADE, the table's CD_ID in SDS changes, but the old partitions under the table still reference the table's original CD_ID.

hive (test)> drop table test_add_col_part;
OK
Time taken: 0.557 seconds
hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
OK
Time taken: 0.163 seconds
hive (test)> alter table test_add_col_part add partition(tx_date='20180325');
OK
Time taken: 0.299 seconds
hive (test)> alter table test_add_col_part add partition(tx_date='20180326');
OK
Time taken: 0.201 seconds
hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段');
OK
Time taken: 0.272 seconds
hive (test)> alter table test_add_col_part add partition(tx_date='20180327');
OK
Time taken: 0.383 seconds

How the CD_ID column of the Hive metastore table SDS (in MySQL) changes after each step:

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+-------------------------------------------------------------------+
| sd_id | cd_id | location                                                          |
+-------+-------+-------------------------------------------------------------------+
|  1920 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
+-------+-------+-------------------------------------------------------------------+
1 row in set (0.00 sec)

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+------------------------------------------------------------------------------------+
| sd_id | cd_id | location                                                                           |
+-------+-------+------------------------------------------------------------------------------------+
|  1920 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part                  |
|  1921 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
+-------+-------+------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+------------------------------------------------------------------------------------+
| sd_id | cd_id | location                                                                           |
+-------+-------+------------------------------------------------------------------------------------+
|  1920 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part                  |
|  1921 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
|  1922 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
+-------+-------+------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+------------------------------------------------------------------------------------+
| sd_id | cd_id | location                                                                           |
+-------+-------+------------------------------------------------------------------------------------+
|  1920 |   740 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part                  |
|  1921 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
|  1922 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
+-------+-------+------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+------------------------------------------------------------------------------------+
| sd_id | cd_id | location                                                                           |
+-------+-------+------------------------------------------------------------------------------------+
|  1920 |   740 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part                  |
|  1921 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
|  1922 |   739 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
|  1923 |   740 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180327 |
+-------+-------+------------------------------------------------------------------------------------+
4 rows in set (0.02 sec)
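To see which column set each CD_ID refers to, you can also query the metastore's COLUMNS_V2 table. This is a sketch assuming the standard Hive metastore schema; the CD_ID values 739 and 740 are taken from the session above:

```sql
-- Columns belonging to each column descriptor (CD_ID), in declared order.
-- COLUMNS_V2 is part of the standard Hive metastore schema.
SELECT CD_ID, INTEGER_IDX, COLUMN_NAME, TYPE_NAME
FROM COLUMNS_V2
WHERE CD_ID IN (739, 740)
ORDER BY CD_ID, INTEGER_IDX;
```

In this scenario CD_ID 739 should list only cardid, while CD_ID 740 should also contain busid, which is why the old partitions keep the pre-change schema.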

With CASCADE, the CD_ID of each old partition and of the table itself all change, incrementing in sequence, and new partitions reuse the table's updated CD_ID.

hive (test)> drop table test_add_col_part;
OK
Time taken: 0.336 seconds
hive (test)> create table test_add_col_part ( cardid string) partitioned by (tx_date string);
OK
Time taken: 0.119 seconds
hive (test)> alter table test_add_col_part add partition(tx_date='20180325');
OK
Time taken: 0.283 seconds
hive (test)> alter table test_add_col_part add partition(tx_date='20180326');
OK
Time taken: 0.182 seconds
hive (test)> alter table test_add_col_part add columns(busid string comment '新增字段') cascade;
OK
Time taken: 0.306 seconds
hive (test)> alter table test_add_col_part add partition(tx_date='20180327');
OK
Time taken: 0.426 seconds

How the CD_ID column of the Hive metastore table SDS (in MySQL) changes after each step:

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+-------------------------------------------------------------------+
| sd_id | cd_id | location                                                          |
+-------+-------+-------------------------------------------------------------------+
|  1930 |   743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part |
+-------+-------+-------------------------------------------------------------------+
1 row in set (0.00 sec)

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+------------------------------------------------------------------------------------+
| sd_id | cd_id | location                                                                           |
+-------+-------+------------------------------------------------------------------------------------+
|  1930 |   743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part                  |
|  1931 |   743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
+-------+-------+------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+------------------------------------------------------------------------------------+
| sd_id | cd_id | location                                                                           |
+-------+-------+------------------------------------------------------------------------------------+
|  1930 |   743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part                  |
|  1931 |   743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
|  1932 |   743 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
+-------+-------+------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+------------------------------------------------------------------------------------+
| sd_id | cd_id | location                                                                           |
+-------+-------+------------------------------------------------------------------------------------+
|  1930 |   746 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part                  |
|  1931 |   744 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
|  1932 |   745 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
+-------+-------+------------------------------------------------------------------------------------+
3 rows in set (0.02 sec)

root@localhost [hive]>select sd_id,cd_id,location from SDS where LOCATION like 'hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part%';
+-------+-------+------------------------------------------------------------------------------------+
| sd_id | cd_id | location                                                                           |
+-------+-------+------------------------------------------------------------------------------------+
|  1930 |   746 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part                  |
|  1931 |   744 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180325 |
|  1932 |   745 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180326 |
|  1933 |   746 | hdfs://nameservice1/user/hive/warehouse/test.db/test_add_col_part/tx_date=20180327 |
+-------+-------+------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
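Instead of matching on LOCATION, the partition-to-CD_ID mapping can be read directly by joining PARTITIONS to SDS and TBLS. Again a sketch assuming the standard Hive metastore schema:

```sql
-- Which CD_ID each partition of the table currently points to.
-- TBLS, PARTITIONS, and SDS are standard Hive metastore tables.
SELECT p.PART_NAME, s.SD_ID, s.CD_ID
FROM PARTITIONS p
JOIN SDS s ON p.SD_ID = s.SD_ID
JOIN TBLS t ON p.TBL_ID = t.TBL_ID
WHERE t.TBL_NAME = 'test_add_col_part'
ORDER BY p.PART_NAME;
```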

Summary:

When a column added to a Hive partitioned table shows no data for existing partitions, the options are:

  • Option 1: include the CASCADE keyword when changing the table schema.
  • Option 2: drop the affected partitions, recreate them, and reinsert the data. If there are too many old partitions, consider rebuilding the whole table instead.
  • Option 3: concatenate the new data onto an existing column and leave the schema unchanged.
  • Option 4: edit the cd_id values in the Hive metastore table SDS directly. Not recommended!
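As a narrower variant of Option 1, the PARTITION clause from the syntax at the top of this post (Hive 0.14.0 and later) should in principle let you repair a single already-affected partition in place, without touching the others. A sketch, not verified in the session above:

```sql
-- Align one old partition's schema with the table (Hive 0.14.0+).
ALTER TABLE test_add_col_part PARTITION (tx_date='20180325')
  ADD COLUMNS (busid string COMMENT '新增字段');
```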
