Writing Drill Queries that Leverage Indexes on Array Fields
Starting in EEP 6.0, the query planner in
Drill can leverage indexes created on Data Fabric
Database JSON document fields with array data types, such as "NUMBERS": [1,
2, 3, 4, 5]
and "ADDRESSES": [{"CITY" : "SAN JOSE"}, {"CITY" : "PALO
ALTO"}]
.
See JSON Document Data Types and Data Types and Secondary Index Fields for definitions and detailed examples.
SELECT NAME, PHONE
FROM CUSTOMERS
WHERE _id IN ( SELECT _id
FROM ( SELECT _id, FLATTEN(ADDRESSES) as f
FROM CUSTOMERS) as t
WHERE t.f.CITY = 'SAN JOSE' and t.f.STATE = 'CA')
;
The specific SQL syntax indicates (to the query planner in Drill) that the query is eligible for an index-based query plan.
The FLATTEN function separates elements in an array into individual records in a table. For example, if an array consists of five elements, FLATTEN separates each element into a single row, creating a table with five rows.
The IN operator prevents Drill from returning duplicate rows. For example, when an array is flattened into a table, duplicate values may exist for a particular _id (rowkey). Using IN prevents Drill from returning rows with duplicate values.
Example
{ "_id": "001",
"NAME": "ALICE",
"PHONE": "408-555-1212",
"ADDRESSES": [{"CITY" : "SAN JOSE" , "ZIPCODE" : 95124, "STATE" : "CA", "UNITS" : [{"UNIT_NO":555, "FLOOR": 5}, {"UNIT_NO":777, "FLOOR": 7}]}, {"CITY" : "PALO ALTO", "ZIPCODE" : 94020, "STATE" : "CA", "UNITS" : [{"UNIT_NO":555, "FLOOR": 5}, {"UNIT_NO":777, "FLOOR": 7}]}, {"CITY" : "SANTA CLARA", "ZIPCODE" : 95050, "STATE" : "CA", "UNITS" : [{"UNIT_NO":555, "FLOOR": 5}, {"UNIT_NO":777, "FLOOR": 7}]}],
"QTY": [11, 25, 16, 2, 10, 39, 5, 8, 7, 11]
}
{ "_id": "002",
"NAME": "BOB",
"PHONE": "408-555-1313",
"ADDRESSES": [{"CITY" : "SAN JOSE" , "ZIPCODE" : 95132, "STATE" : "CA", "UNITS" : [{"UNIT_NO":838, "FLOOR": 8}, {"UNIT_NO":888, "FLOOR": 8}]}, {"CITY" : "SAN JOSE", "ZIPCODE" : 95127, "STATE" : "CA", "UNITS" : [{"UNIT_NO":555, "FLOOR": 5}, {"UNIT_NO":777, "FLOOR": 7}]}, {"CITY" : "SAN RAMON", "ZIPCODE" : 94582, "STATE" : "CA", "UNITS" : [{"UNIT_NO":123, "FLOOR": 1}, {"UNIT_NO":124, "FLOOR": 1}]}],
"QTY": [2, 8, 1, 4, 3, 10, 2, 23]
}
{ "_id": "003",
"NAME": "CHRIS",
"PHONE": "408-555-1414",
"ADDRESSES": [{"CITY" : "MOUNTAIN VIEW", "ZIPCODE" : 94043, "STATE" : "CA", "UNITS" : [{"UNIT_NO":922, "FLOOR": 9}, {"UNIT_NO":958, "FLOOR": 9}]}, {"CITY" : "PALO ALTO", "ZIPCODE" : 94020, "STATE" : "CA", "UNITS" : [{"UNIT_NO":666, "FLOOR": 6}, {"UNIT_NO":728, "FLOOR": 7}]}, {"CITY" : "SUNNYVALE", "ZIPCODE" : 94086, "STATE" : "CA", "UNITS" : [{"UNIT_NO":226, "FLOOR": 2}, {"UNIT_NO":333, "FLOOR": 3}]}],
"QTY": [56, 19, 45, 25, 4, 77, 110, 3, 2, 1]
}
SELECT NAME, PHONE, f FROM (SELECT NAME, PHONE, FLATTEN(ADDRESSES) AS f FROM CUSTOMERS);
+---------+--------------+--------------------------------------------------------------------------------------------------------------------+
| NAME | PHONE | f |
+---------+--------------+--------------------------------------------------------------------------------------------------------------------+
| ALICE | 408-555-1212 | {"CITY":"SAN JOSE","STATE":"CA","UNITS":[{"FLOOR":5,"UNIT_NO":555},{"FLOOR":7,"UNIT_NO":777}],"ZIPCODE":95124} |
| ALICE | 408-555-1212 | {"CITY":"PALO ALTO","STATE":"CA","UNITS":[{"FLOOR":5,"UNIT_NO":555},{"FLOOR":7,"UNIT_NO":777}],"ZIPCODE":94020} |
| ALICE | 408-555-1212 | {"CITY":"SANTA CLARA","STATE":"CA","UNITS":[{"FLOOR":5,"UNIT_NO":555},{"FLOOR":7,"UNIT_NO":777}],"ZIPCODE":95050} |
| BOB | 408-555-1313 | {"CITY":"SAN JOSE","STATE":"CA","UNITS":[{"FLOOR":8,"UNIT_NO":838},{"FLOOR":8,"UNIT_NO":888}],"ZIPCODE":95132} |
| BOB | 408-555-1313 | {"CITY":"SAN JOSE","STATE":"CA","UNITS":[{"FLOOR":5,"UNIT_NO":555},{"FLOOR":7,"UNIT_NO":777}],"ZIPCODE":95127} |
| BOB | 408-555-1313 | {"CITY":"SAN RAMON","STATE":"CA","UNITS":[{"FLOOR":1,"UNIT_NO":123},{"FLOOR":1,"UNIT_NO":124}],"ZIPCODE":94582} |
| CHRIS | 408-555-1414 | {"CITY":"MOUNTAIN VIEW","STATE":"CA","UNITS":[{"FLOOR":9,"UNIT_NO":922},{"FLOOR":9,"UNIT_NO":958}],"ZIPCODE":94043} |
| CHRIS | 408-555-1414 | {"CITY":"PALO ALTO","STATE":"CA","UNITS":[{"FLOOR":6,"UNIT_NO":666},{"FLOOR":7,"UNIT_NO":728}],"ZIPCODE":94020} |
| CHRIS | 408-555-1414 | {"CITY":"SUNNYVALE","STATE":"CA","UNITS":[{"FLOOR":2,"UNIT_NO":226},{"FLOOR":3,"UNIT_NO":333}],"ZIPCODE":94086} |
+---------+--------------+-------------------------------------------------------------------------------------------------------------------+
SELECT NAME, PHONE, f FROM (SELECT NAME, PHONE, FLATTEN(ADDRESSES) AS f FROM CUSTOMERS) AS t WHERE t.f.CITY = 'SAN JOSE' and t.f.STATE = 'CA';
+-------+--------------+---------------------------------------------------------------------------------------------------------------+
| NAME | PHONE | f |
+-------+--------------+---------------------------------------------------------------------------------------------------------------+
| ALICE | 408-555-1212 | {"CITY":"SAN JOSE","STATE":"CA","UNITS":[{"FLOOR":5,"UNIT_NO":555},{"FLOOR":7,"UNIT_NO":777}],"ZIPCODE":95124} |
| BOB | 408-555-1313 | {"CITY":"SAN JOSE","STATE":"CA","UNITS":[{"FLOOR":8,"UNIT_NO":838},{"FLOOR":8,"UNIT_NO":888}],"ZIPCODE":95132} |
| BOB | 408-555-1313 | {"CITY":"SAN JOSE","STATE":"CA","UNITS":[{"FLOOR":5,"UNIT_NO":555},{"FLOOR":7,"UNIT_NO":777}],"ZIPCODE":95127} |
+-------+--------------+---------------------------------------------------------------------------------------------------------------+
ADDRESSES[].CITY
and
ADDRESSES[].STATE
with "NAME" as an included field. For the query planner
to use the index, you must write the query using the specific SQL syntax that indicates that
the query is eligible for an index-based query plan, as shown:
SELECT NAME, PHONE
FROM CUSTOMERS
WHERE _id IN ( SELECT _id
FROM ( SELECT _id, FLATTEN(ADDRESSES) as f
FROM CUSTOMERS) as t
WHERE t.f.CITY = 'SAN JOSE' and t.f.STATE = 'CA');
//Issuing this query against the data in the CUSTOMERS table returns the following results:
+--------+---------------+
| NAME | PHONE |
+--------+---------------+
| ALICE | 408-555-1212 |
| BOB | 408-555-1313 |
+--------+---------------+
The following list summarizes key points about this query:
- The innermost subquery projects on the _id field (rowkey) and includes the FLATTEN function to separate the array elements in the “ADDRESSES” field. The field “ADDRESSES” is flattened into a table aliased as “t”, in a column aliased as “f”.
- The query uses the IN operator to ensure that the results returned contain unique values only; no duplicates. DISTINCT on the subquery to the right of IN is implicit. The SQL query pattern indicates to the query planner that the query is eligible for an index-based query plan.
- The query projects on column “NAME” and “PHONE”. “PHONE” requires a join back to the primary table on the _id field (rowkey) because it is not included in the composite index.
- The query planner recognizes that t.f.CITY references t.ADDRESSES[].CITY and t.f.STATE references t.ADDRESSES[].STATE and creates an index-based query plan.
- The index table in HPE Ezmeral Data Fabric Database is already flattened for the array field, “ADDRESSES”. Flatten is not evaluated in Drill. Drill pushes the filter conditions on the array field into HPE Ezmeral Data Fabric Database.
Filter Conditions on Various Types of Array Fields
Filter condition on … | Example using HPE Ezmeral Data Fabric Database notation (not SQL notation) | SQL |
Array of scalar values | QTY[] < 10 |
This query returns the following
results:
|
Map field within an array of maps | ADDRESSES[].ZIPCODE > 94000 and ADDRESSES[].ZIPCODE < 95000 |
This query returns the following
results:
|
AND-ed condition on 2 fields of the same array element | elementAND(ADDRESSES[], CITY=SAN JOSE, STATE = CA) |
This query returns the following
results:
|
AND-ed condition on 2 fields of different array elements | ADDRESSES[].CITY = SAN JOSE AND ADDRESSES[].ZIPCODE = 94020 |
This query returns the following results:
|
AND-ed condition on scalar field and array field | PHONE = 408-555-1212 AND ADDRESSES[].ZIPCODE = 94020 |
This query returns the following
results:
|
Map field within nested array of maps | ADDRESSES[].UNITS[].FLOOR < 5 |
This query returns the following
results:
|
Exact match for lists or maps | col = ADDRESSES[].UNITS[].{"FLOOR":7,"UNIT_NO":777} |
This query returns the following
results:
|
Performance Considerations
When writing queries that leverage indexes on array fields, consider the following points about performance:
- Query patterns that match those described previously in this document are pushed down to HPE Ezmeral Data Fabric Database. Drill does not evaluate the filter conditions, which adds considerable performance benefits even when the query planner does not select an index-based query plan.
- Deduplication on the _id is an extra operation (compared to regular, non-complex, indexes) that requires the overhead of hash aggregation.
- Try to avoid array columns in included fields within an index table, as they add a significant amount of storage overhead. However, this may result in the query planner selecting non-covering plans.
- Indexes with deeply nested array elements, such as
a[].b[].c[]...x.y
, can add to the HPE Ezmeral Data Fabric Database storage overhead and can potentially make Drill queries longer and more complex.
Limitations
Drill queries that leverage indexes on array fields have the following limitations:
- Only queries with patterns similar to those described previously in this document are eligible for index planning, assuming that the index is defined on an array field.
- The following conditions do not produce a covering index plan:
- Pushdown conditions on indexed fields and included fields on same array element. For example, if an index has indexed fields a[].b and included fields a[].c, elementAND(a[], b > 10, c > 20) does not produce a covering index plan.
- Pushdown conditions on scalar indexed fields and included fields containing an array element. For example, an index with indexed field m and included fields a[].b, m = 10 AND a[].b > 20 does not produce a covering index plan.
- For included array fields, the element must be provided without the [] for the query
planner to pick covering plans. For example,
a
and nota[]
. Note that HPE Ezmeral Data Fabric Database considers both a and a[] syntaxes as equivalent for included fields. - Index planning is disabled for queries with multi-level flattens and intermediate filters that reference multi-level flattens. A filter can reference the root level flatten, but not the intermediate flattens.