Understanding the MERGE Statement

You can use the MERGE statement to perform record-level INSERT and UPDATE operations efficiently within Hive tables.

The MERGE statement can be a key tool for Data Fabriccluster data management. It is based on ANSI-standard SQL.

Simple merge.maprdb.column.id is the join key

Consider merging the following example source and target tables:
Table 1. customer_source
id first_name last_name age
001 Dorothi Hogward 7777
002 Alex Bowee 7777
088 Robert Dowson 25
Table 2. customer_db_json_target
id first_name last_name age
001 John Smith 45
002 Michael Watson 27
003 Den Brown 33
You can use the following SQL-standard MERGE statement:
MERGE into customer_db_json_target trg 
USING customer_source src 
ON src.id = trg.id 
WHEN MATCHED THEN UPDATE SET age = src.age 
WHEN NOT MATCHED THEN 
INSERT VALUES (src.id, src.first_name, src.last_name, src.age);
The result is:
id first_name last_name age
001 John Smith 7777
002 Michael Watson 7777
003 Den Brown 33
088 Robert Dowson 25
NOTE
The age column is updated and a new id column is inserted.

Simple merge.maprdb.column.id is not the join key

Merging when merge.maprdb.column is not the join key is not recommended.

DELETE syntax in the MERGE statement

This section describes how to use the DELETE syntax in the MERGE statement for HPE Ezmeral Data Fabric Database JSON tables. Included are examples of usage and limitations.

Consider two tables: tgt which is the target table of the MERGE statement, and src, which is the source table from which data will be taken. Both tables use MapRDBJsonStorageHandler to store data. The following table shows the initial contents of the tgt table:
Table 3. Initial contents of tgt table before MERGE
id Value
1 AAA
2 BBB
3 CCC
4 DDD
5 EEE
The following table shows the initial contents of the src table:
Table 4. Initial contents of src table
id Value
1 AAA
222 BBB---
3 CCC
444 DDD---
5 EEE
The following merge statement contains a WHEN MATCHED THEN DELETE clause. It means that if the id from the tgt table equals the id from the src table, the row is removed from the tgt table. When the value of id does not match, a new row is inserted into the tgt table:
MERGE INTO tgt 
USING src ON tgt.id=src.id 
WHEN MATCHED THEN DELETE 
WHEN NOT MATCHED THEN INSERT VALUES (src.id, src.value);
The following table shows the result of the merge:
Table 5. Contents of tgt table after merge
id Value
2 BBB
222 BBB---
4 DDD
444 DDD---
Here we removed rows with id 1, 3, and 5 from the tgt table because they existed in the src table, and they matched values from the tgt table. We did not touch rows with id 2 and 4, because there were no such values in the src table. We inserted new rows with id values 222 and 444 because they existed in the src table and did not exist in the tgt table.

Limitations

The preceding solution has three limitations:
  1. Subqueries are not supported as a source when DELETE is used.
  2. The source table should be a HPE Ezmeral Data Fabric Database JSON table when deletion is used in a MERGE operator.
  3. The DELETE operator is not supported with additional conditions after WHEN MATCHED. Use either a single UPDATE or DELETE.
Limitation #3 means that queries like the following are not supported:
MERGE INTO tgt
USING src
ON tgt._id = src._id
WHEN MATCHED AND [boolean expression1] THEN DELETE
WHEN MATCHED AND [boolean expression2] THEN UPDATE
WHEN NOT MATCHED THEN INSERT

Multiple source rows match a given target row (cardinality violation)

Consider merging the two tables customer_db_json and customer_new:
Table 6. customer_db_json
id first_name last_name age
001 John Smith 45
002 Michael Watson 27
003 Den Brown 33
And:
Table 7. customer_new
id first_name last_name age
001 Dorothi Hogward 77
001 Dorothi Hogward 77
088 Robert Dowson 25
To MERGE customer_new and customer_db_json:
MERGE INTO customer_db_json trg 
USING customer_new src ON src.id = trg.id 
WHEN MATCHED THEN UPDATE 
SET first_name = src.first_name, 
last_name = src.last_name 
WHEN NOT MATCHED THEN INSERT VALUES
(src.id, src.first_name, src.last_name, src.age);

This example causes an exception because of duplicate values in the id column in the customer_new table:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Error evaluating cardinality_violation(_col0)

To avoid cardinality violation, set hive.merge.cardinality.check=false, but in this case the result is unpredictable because there is no rule that defines the order of duplicated data that will be inserted by using the MERGE statement.

Merge on mixed data types

The merge operation also supports mixed data types, such as arrays, maps, and structures.

Consider two tables mixed_types_source and mixed_types_target:
Table 8. mixed_types_source
doc_id user_info
1 {"name":"Brandon","surname":"Lee","age":31,"gender":"MALE"}
2 {"name":"Johnson","surname":"Fall","age":23,"gender":"MALE"}
3 {"name":"Mary","surname":"Dowson","age":11,"gender":"FEMALE"}
4 {"name":"Paul","surname":"Rodgers","age":41,"gender":"MALE"}
And:
Table 9. mixed_types_target
id user_info
1 {"name":"Lexx","surname":"Comfuzer","age":31,"gender":"MALE"}
To merge mixed_types_source and mixed_types_target:
MERGE INTO mixed_types_target trg 
USING mixed_types_source src 
ON src.doc_id = old.doc_id 
WHEN MATCHED THEN UPDATE 
SET user_info = src.user_info 
WHEN NOT MATCHED THEN INSERT VALUES 
(src.doc_id, src.user_info);
The result is:
Table 10. Result of MERGE operator
id first_name
1 {"name":"Brandon","surname":"Lee","age":31,"gender":"MALE"}
2 {"name":"Johnson","surname":"Fall","age":23,"gender":"MALE"}
3 {"name":"Mary","surname":"Dowson","age":11,"gender":"FEMALE"}
4 {"name":"Paul","surname":"Rodgers","age":41,"gender":"MALE"}
Note that you cannot update only a part of a complex structure field. For example, suppose you have a structure stored as one field in a Hive table:
{"name":"Johnson","surname":"Fall","age":23,"gender":"MALE"}

You cannot update only the age field in the structure. You can only replace all values of the structure with new ones. For details, see Understanding the UPDATE Statement.

Merge into external HPE Ezmeral Data Fabric Database JSON tables

The MERGE operator is also available for external HPE Ezmeral Data Fabric Database JSON tables. You can use the MERGE statement to insert and update values in external Data Fabric database JSON table targets.

Merge into partitioned HPE Ezmeral Data Fabric Database JSON tables

Partitioned HPE Ezmeral Data Fabric Database JSON tables are not supported.

Merge into temporary HPE Ezmeral Data Fabric Database JSON tables

The MERGE operator is also available for temporary HPE Ezmeral Data Fabric Database JSON tables. Use temporary tables as target tables for merge. No additional syntax is needed.