Lesson 2: Run Queries with ANSI SQL

Goal

This lesson shows how to do some standard SQL analysis in Drill: for example, summarizing data by using simple aggregate functions and connecting data sources by using joins. Note that Drill provides ANSI SQL support, not a “SQL-like” interface.

Queries in This Lesson

Now that you know what the data sources look like in their raw form, using select * queries, try running some simple but more useful queries on each data source. These queries demonstrate how Drill supports ANSI SQL constructs and also how you can combine data from different data sources in a single SELECT statement.

  • Show an aggregate query on a single file or table. Use GROUP BY, WHERE, HAVING, and ORDER BY clauses.
  • Perform joins between Hive, HPE Ezmeral Data Fabric Database, and file system data sources.
  • Use table and column aliases.
  • Create a Drill view.

Aggregation

Set the schema to hive:

0: jdbc:drill:> use hive;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | Default schema changed to 'hive' |
+------------+------------+
1 row selected

Return sales totals by month:

0: jdbc:drill:> select `month`, sum(order_total)
from orders group by `month` order by 2 desc;
+------------+------------+
|   month    |   EXPR$1   |
+------------+------------+
| June       | 950481     |
| May        | 947796     |
| March      | 836809     |
| April      | 807291     |
| July       | 757395     |
| October    | 676236     |
| August     | 572269     |
| February   | 532901     |
| September  | 373100     |
| January    | 346536     |
+------------+------------+

Drill supports SQL aggregate functions such as SUM, MAX, AVG, and MIN. Standard SQL clauses work in the same way in Drill queries as in relational database queries.

Note that backticks are required around the month column name only because MONTH is a reserved word in SQL.
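
For example, a query along the following lines combines a WHERE filter with several aggregate functions. This is only a sketch that reuses the orders columns shown above; the June filter is just an illustration:

select state, count(*) as num_orders, min(order_total) as smallest_order,
max(order_total) as largest_order, avg(order_total) as avg_order
from orders
where `month` = 'June'
group by state
order by 2 desc;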

Return the top 20 sales totals by month and state:

0: jdbc:drill:> select `month`, state, sum(order_total) as sales from orders group by `month`, state
order by 3 desc limit 20;
+------------+------------+------------+
|   month    |   state    |   sales    |
+------------+------------+------------+
| May        | ca         | 119586     |
| June       | ca         | 116322     |
| April      | ca         | 101363     |
| March      | ca         | 99540      |
| July       | ca         | 90285      |
| October    | ca         | 80090      |
| June       | tx         | 78363      |
| May        | tx         | 77247      |
| March      | tx         | 73815      |
| August     | ca         | 71255      |
| April      | tx         | 68385      |
| July       | tx         | 63858      |
| February   | ca         | 63527      |
| June       | fl         | 62199      |
| June       | ny         | 62052      |
| May        | fl         | 61651      |
| May        | ny         | 59369      |
| October    | tx         | 55076      |
| March      | fl         | 54867      |
| March      | ny         | 52101      |
+------------+------------+------------+
20 rows selected

Note the alias for the result of the SUM function. Drill supports column aliases and table aliases.
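
As a minimal sketch, the same query can also be written with a table alias, o, qualifying each column reference:

select o.`month`, o.state, sum(o.order_total) as sales
from orders o
group by o.`month`, o.state
order by 3 desc limit 20;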

HAVING Clause

This query uses the HAVING clause to constrain an aggregate result.

Set the workspace to dfs.clicks:

0: jdbc:drill:> use dfs.clicks;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | Default schema changed to 'dfs.clicks' |
+------------+------------+
1 row selected

Return the total number of clicks for devices that show high click-through activity:

0: jdbc:drill:> select t.user_info.device, count(*) from `clicks/clicks.json` t  
group by t.user_info.device
having count(*) > 1000;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| IOS5       | 11814      |
| AOS4.2     | 5986       |
| IOS6       | 4464       |
| IOS7       | 3135       |
| AOS4.4     | 1562       |
| AOS4.3     | 3039       |
+------------+------------+

The aggregate is a count of the records for each distinct mobile device in the clickstream data. Only devices that registered more than 1,000 transactions qualify for the result set.

UNION Operator

Use the same workspace as before (dfs.clicks).

Combine click activity from before and after the marketing campaign:

0: jdbc:drill:> select t.trans_id transaction, t.user_info.cust_id customer from `clicks/clicks.campaign.json` t 
union all 
select u.trans_id, u.user_info.cust_id  from `clicks/clicks.json` u limit 5;
+-------------+------------+
| transaction |  customer  |
+-------------+------------+
| 35232       | 18520      |
| 31995       | 17182      |
| 35760       | 18228      |
| 37090       | 17015      |
| 37838       | 18737      |
+-------------+------------+

This UNION ALL query returns rows from both files, clicks.campaign.json and clicks.json, including any duplicate rows that appear in them; the LIMIT 5 applies to the combined result.
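
If you want duplicates removed instead, UNION without ALL should work as well; a sketch against the same two files:

select t.trans_id, t.user_info.cust_id from `clicks/clicks.campaign.json` t
union
select u.trans_id, u.user_info.cust_id from `clicks/clicks.json` u
limit 5;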

Subqueries

Set the workspace to hive:

0: jdbc:drill:> use hive;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | Default schema changed to 'hive' |
+------------+------------+

Compare order totals across states:

0: jdbc:drill:> select ny_sales.cust_id, ny_sales.total_orders as ny_sales, ca_sales.total_orders as ca_sales
    from
    (select o.cust_id, sum(o.order_total) as total_orders 
    from hive.orders o where state = 'ny' group by o.cust_id) ny_sales
    left outer join
    (select o.cust_id, sum(o.order_total) as total_orders 
    from 
    hive.orders o where state = 'ca' group by o.cust_id) ca_sales
    on ny_sales.cust_id = ca_sales.cust_id
    order by ny_sales.cust_id
    limit 20;
+------------+------------+------------+
|  cust_id   |  ny_sales  |  ca_sales  |
+------------+------------+------------+
| 1001       | 72         | 47         |
| 1002       | 108        | 198        |
| 1003       | 83         | null       |
| 1004       | 86         | 210        |
| 1005       | 168        | 153        |
| 1006       | 29         | 326        |
| 1008       | 105        | 168        |
| 1009       | 443        | 127        |
| 1010       | 75         | 18         |
| 1012       | 110        | null       |
| 1013       | 19         | null       |
| 1014       | 106        | 162        |
| 1015       | 220        | 153        |
| 1016       | 85         | 159        |
| 1017       | 82         | 56         |
| 1019       | 37         | 196        |
| 1020       | 193        | 165        |
| 1022       | 124        | null       |
| 1023       | 166        | 149        |
| 1024       | 233        | null       |
+------------+------------+------------+

This example demonstrates Drill support for subqueries in the FROM clause (derived tables). Each subquery sums the order totals per customer for one state, New York in the first and California in the second, and the outer query joins the two result sets on the cust_id column. The left outer join keeps every New York customer, even those with no matching California sales.

The result set is sorted by the cust_id and presents the sales totals side by side for easy comparison. Null values indicate customer IDs that did not register any sales in that state.
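
To compare the two totals directly, you could extend the same derived tables with an expression in the outer select list. This is only a sketch; COALESCE (which Drill supports) treats missing California sales as zero:

select ny_sales.cust_id,
       ny_sales.total_orders as ny_total,
       coalesce(ca_sales.total_orders, 0) as ca_total,
       ny_sales.total_orders - coalesce(ca_sales.total_orders, 0) as ny_minus_ca
from
  (select o.cust_id, sum(o.order_total) as total_orders
   from hive.orders o where state = 'ny' group by o.cust_id) ny_sales
left outer join
  (select o.cust_id, sum(o.order_total) as total_orders
   from hive.orders o where state = 'ca' group by o.cust_id) ca_sales
on ny_sales.cust_id = ca_sales.cust_id
order by ny_sales.cust_id
limit 20;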

CAST Function

Use the maprdb workspace:

0: jdbc:drill:> use maprdb;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | Default schema changed to 'maprdb' |
+------------+------------+
1 row selected

Return customer data with appropriate data types:

0: jdbc:drill:> select cast(row_key as int) as cust_id, cast(t.personal.name as varchar(20)) as name, 
cast(t.personal.gender as varchar(10)) as gender, cast(t.personal.age as varchar(10)) as age,
cast(t.address.state as varchar(4)) as state, cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev, 
cast(t.loyalty.membership as varchar(20)) as membership
from customers t limit 5;
+------------+---------------------+------------+------------+------------+------------+------------+
|  cust_id   |        name         |   gender   |    age     |   state    |  agg_rev   | membership |
+------------+---------------------+------------+------------+------------+------------+------------+
| 10001      | "Corrine Mecham"    | "FEMALE"   | "15-20"    | "va"       | 197.00     | "silver"   |
| 10005      | "Brittany Park"     | "MALE"     | "26-35"    | "in"       | 230.00     | "silver"   |
| 10006      | "Rose Lokey"        | "MALE"     | "26-35"    | "ca"       | 250.00     | "silver"   |
| 10007      | "James Fowler"      | "FEMALE"   | "51-100"   | "me"       | 263.00     | "silver"   |
| 10010      | "Guillermo Koehler" | "OTHER"    | "51-100"   | "mn"       | 202.00     | "silver"   |
+------------+---------------------+------------+------------+------------+------------+------------+
5 rows selected

Note the following features of this query:

  • The CAST function is required for every column in the table. This function converts the HPE Ezmeral Data Fabric Database/HBase binary data into readable integers and strings. Depending on the encoding used when the HPE Ezmeral Data Fabric Database binary tables/HBase tables were populated, you might have to use the CONVERT_TO/CONVERT_FROM functions instead to decode the data (see the sketch after this list).
  • The row_key column functions as the primary key of the table (a customer ID in this case).
  • The table alias t is required; otherwise the column family names would be parsed as table names and the query would return an error.
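
A minimal CONVERT_FROM sketch, assuming the name column was stored as UTF-8-encoded bytes (the encoding in your own table may differ):

select convert_from(t.personal.name, 'UTF8') as name
from customers t limit 5;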

Remove the quotes from the strings:

You can use the regexp_replace function to remove the quotes around the strings in the query results. For example, to return a state name va instead of “va”:

0: jdbc:drill:> select cast(row_key as int), regexp_replace(cast(t.address.state as varchar(10)),'"','')
from customers t limit 1;
+------------+------------+
|   EXPR$0   |   EXPR$1   |
+------------+------------+
| 10001      | va         |
+------------+------------+
1 row selected

CREATE VIEW Command

Use a mutable workspace:

0: jdbc:drill:> use dfs.views;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | Default schema changed to 'dfs.views' |
+------------+------------+

A mutable (or writable) workspace is a workspace that is enabled for “write” operations. This attribute is part of the storage plugin configuration. You can create Drill views and tables in mutable workspaces.
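
Because the workspace is writable, a CREATE TABLE AS (CTAS) statement along the following lines should also be accepted here. This is only a sketch; the table name orders_copy is just an illustration:

create table dfs.views.orders_copy as
select `month`, state, order_total from hive.orders;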

Create a view on an HPE Ezmeral Data Fabric Database binary table:

0: jdbc:drill:> create or replace view custview as select cast(row_key as int) as cust_id,
cast(t.personal.name as varchar(20)) as name, 
cast(t.personal.gender as varchar(10)) as gender, 
cast(t.personal.age as varchar(10)) as age, 
cast(t.address.state as varchar(4)) as state,
cast(t.loyalty.agg_rev as dec(7,2)) as agg_rev,
cast(t.loyalty.membership as varchar(20)) as membership
from maprdb.customers t;
+------------+------------+
|     ok     |  summary   |
+------------+------------+
| true       | View 'custview' replaced successfully in 'dfs.views' schema |
+------------+------------+
1 row selected

Drill provides CREATE OR REPLACE VIEW syntax, similar to that of relational databases, for creating views. Use the OR REPLACE option to make it easier to update the view later without having to remove it first. Note that the FROM clause in this example must refer to maprdb.customers; the HPE Ezmeral Data Fabric Database binary tables are not directly visible to the dfs.views workspace.
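
If you do need to remove a view, Drill provides a DROP VIEW command; for example (do not drop custview yet, because the remaining queries in this lesson use it):

drop view dfs.views.custview;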

Unlike in a traditional database, where creating views is typically a DBA or developer task, file-system-based views in Drill are very lightweight. A view is simply a special file with a specific extension (.drill). You can store views in your local file system or in a specific workspace, and you can specify any query against any Drill data source in the body of the CREATE VIEW statement.

Drill provides a decentralized metadata model. Drill is able to query metadata defined in data sources such as Hive, HBase, and the file system. Drill also supports the creation of metadata in the file system.

Query data from the view:

0: jdbc:drill:> select * from custview limit 1;
+------------+------------------+------------+------------+------------+------------+------------+
|  cust_id   |       name       |   gender   |    age     |   state    |  agg_rev   | membership |
+------------+------------------+------------+------------+------------+------------+------------+
| 10001      | "Corrine Mecham" | "FEMALE"   | "15-20"    | "va"       | 197.00     | "silver"   |
+------------+------------------+------------+------------+------------+------------+------------+

Once users know what data is available by exploring it directly in the file system, they can use views to read the data into downstream tools such as Tableau and MicroStrategy for analysis and visualization. To these tools, a view appears simply as a “table” with selectable “columns” in it.

Query Across Data Sources

Continue using dfs.views for this query.

Join the customers view and the orders table:

0: jdbc:drill:> select membership, sum(order_total) as sales from hive.orders, custview
where orders.cust_id=custview.cust_id
group by membership order by 2;
+------------+------------+
| membership |   sales    |
+------------+------------+
| "basic"    | 380665     |
| "silver"   | 708438     |
| "gold"     | 2787682    |
+------------+------------+
3 rows selected

In this query, we read data from an HPE Ezmeral Data Fabric Database binary table (represented by custview) and combine it with the order information in Hive. When running cross-data-source queries such as this one, you need to use fully qualified table and view names. For example, the orders table is prefixed by “hive,” which is the storage plugin name registered with Drill. We do not use a prefix for “custview” because we explicitly switched to the dfs.views workspace where custview is stored.
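
Equivalently, here is a sketch of the same join written with fully qualified names and explicit JOIN syntax, which does not depend on the current workspace:

select cv.membership, sum(o.order_total) as sales
from hive.orders o
join dfs.views.custview cv on o.cust_id = cv.cust_id
group by cv.membership
order by 2;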

Note: If the results of any of your queries appear to be truncated because the rows are wide, set the maximum display width to 10000:
0: jdbc:drill:> !set maxwidth 10000

Do not end this sqlline command with a semicolon.

Join the customers, orders, and clickstream data:

0: jdbc:drill:> select custview.membership, sum(orders.order_total) as sales from hive.orders, custview,
dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json` c 
where orders.cust_id=custview.cust_id and orders.cust_id=c.user_info.cust_id 
group by custview.membership order by 2;
+------------+------------+
| membership |   sales    |
+------------+------------+
| "basic"    | 372866     |
| "silver"   | 728424     |
| "gold"     | 7050198    |
+------------+------------+
3 rows selected

This three-way join selects from three different data sources in one query:

  • hive.orders table
  • custview (a view of the HPE Ezmeral Data Fabric Database binary customers table)
  • clicks.json file

The join column for both join conditions is the cust_id column. The dfs.views workspace is used for this query so that custview can be accessed; the hive.orders table is also visible to the query because it is fully qualified.

However, note that the JSON file is not directly visible from the views workspace, so the query specifies the full path to the file:

dfs.`/mapr/demo.mapr.com/data/nested/clicks/clicks.json`

What's Next

Go to Lesson 3: Run Queries on Complex Data Types