Schema provisioning using a table function

Schema provisioning defines explicit schemas so that Apache Drill reliably interprets text/JSON/MapR‑DB JSON data.

You can define a schema for text files, JSON files, and MapR-DB JSON tables. The schema can be provided in two ways:
  1. Placed in the root directory of the text table. For more information, see https://drill.apache.org/docs/create-or-replace-schema/.
  2. Supplied to individual queries using a table function.

This topic describes the second approach, in which the schema is provided through a table function.

Use cases

In Apache Drill, users can query different storage and data formats. Some formats have a schema (Parquet), some have an inferred schema (JSON), and for some formats data is read as VARCHAR only (CSV). In the latter case, the user has to add casts to indicate which type is actually stored in the file, although there are cases where casts cannot help and the query fails. Even when casts do help, at some point the user may want to avoid repeating the same casts in every query (casting an HBase column to DOUBLE, casting a CSV column to DOUBLE, and so on). Providing the schema up front helps resolve these issues.

The following are the major error cases: incorrect schema inference and schema change exceptions.

Incorrect schema inference
This example assumes a JSON file with two records:
{a: 10} {a: 10.1}

Apache Drill infers the column type BIGINT from the first record and fails with the following exception when it reads the second:

Error: INTERNAL_ERROR ERROR: You tried to write a Float8 type when you are using a ValueWriter of type NullableBigIntWriterImpl.

Because casts are not pushed down into the scan, adding casts also does not help and the same error is returned:
SELECT cast(a AS DOUBLE) FROM file.json

The same schema change issues occur if the column has a mix of data types that do not belong to the same group (for example, integer, decimal, and varchar). There is currently an option that helps the user query such data by treating all values as VARCHAR, but it is applicable only to the JSON format.
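The option referenced here is Drill's `store.json.all_text_mode` session option. A minimal sketch of using it (the file path is hypothetical):

```sql
-- Read all JSON values as VARCHAR so mixed numeric types do not
-- trigger a schema change error in the reader
ALTER SESSION SET `store.json.all_text_mode` = true;

-- Values now come back as text and must be cast explicitly
SELECT CAST(a AS DOUBLE) FROM dfs.`/tmp/file.json`;
```

Note that this trades the schema change error for per-query casts, which is exactly the repetition that schema provisioning avoids.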

Schema evolution
This example assumes that there are two json files, both of which have the same column but different numeric types.
myDir /
              |- File 1: {a: 10}
              |- File 2: {a: 10.1}
If a user queries either file individually, the query succeeds; however, when querying the files together, a SchemaChangeException may occur on blocking operators.
SELECT a, count(*) FROM myDir GROUP BY a

UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type.

Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`$f1` (INT:REQUIRED)]], selectionVector=NONE]

Incoming schema BatchSchema [fields=[[`a` (FLOAT8:OPTIONAL)], [`$f1` (INT:REQUIRED)]], selectionVector=NONE]

This can be resolved by adding casts in the query:
SELECT cast(a AS DOUBLE), count(*) FROM myDir GROUP BY 1
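Alternatively, the fix can be expressed once by provisioning the schema through the table function described in the next section, so every query against the directory reads column `a` with a single type. A sketch, assuming the files live under a `dfs`-mounted `myDir`:

```sql
-- Provision column `a` as DOUBLE so both files are read with one type
SELECT a, count(*)
FROM table(dfs.`myDir`(schema => 'inline=(a DOUBLE)'))
GROUP BY a
```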

Syntax

The schema can be provided using the Drill table function with the schema parameter. You can define the schema in a file and specify the path to that file in the table function.

For details on how to create a schema file, see https://drill.apache.org/docs/create-or-replace-schema/. Syntax:
table(<table_name>(schema => 'path=`<path to the schema>`'))

For example:

select * from table(dfs.tmp.`text_table`(schema => 'path=`/tmp/my_schema`'))
Or the schema can be defined inline, directly within the query:
<table_name>(schema => 'inline=(<column-name> <type> [NOT NULL] [default <value>] [properties {<col-prop>=<value>, ...}], ...) properties {<schema-prop>=<value>, ...}')
For example:
inline=(
    `user name` VARCHAR NOT NULL default `unknown`,
    `credit card number` VARCHAR properties {`drill.blank-as`=`null`, `drill.special`=`true`}
) properties {`drill.special`=`true`}
The general syntax for an inline schema is as follows:
inline=(<Column Definition>, ...) <Schema properties>

The inline schema is recognized based on the following ANTLR grammar: https://github.com/apache/drill/blob/master/exec/vector/src/main/antlr4/org/apache/drill/exec/record/metadata/schema/parser/SchemaParser.g4

Column Definition
The column definition follows this structure:
<column-name> <type> [NOT NULL] [default <value>] [properties {<key>=<value>, ...}]

Column Name (required)

The name of the column. It must be enclosed in backticks if it contains spaces or special characters. For example:
`user name`
email

Column Type (required)

Specifies the data type of the column (case-insensitive). Data types can be primitive or complex. Apache Drill supports most of the primitive data types. Use parentheses to provide the type length, precision, and scale. Indicating the length for a VARCHAR is optional.
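For instance, a minimal inline schema using both a length and a precision/scale declaration (the table and column names are hypothetical):

```sql
-- amount uses DECIMAL precision/scale; the VARCHAR length is optional
SELECT * FROM table(dfs.tmp.`text_table`(
    schema => 'inline=(amount DECIMAL(10,2) NOT NULL, name VARCHAR(30))'))
```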

Table 1. Simple types
SQL Data Type | Description | Example
UINT1 | Unsigned 1-byte integer | 255
UINT2 | Unsigned 2-byte integer | 65535
UINT4 | Unsigned 4-byte integer | 4294967295
UINT8 | Unsigned 8-byte integer | 18446744073709551615
TINYINT | 1-byte signed integer | 127
SMALLINT | 2-byte signed integer in the range -32,768 to 32,767 | 32767
INT, INTEGER | 4-byte signed integer in the range -2,147,483,648 to 2,147,483,647 | 2147483646
BIGINT | 8-byte signed integer in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | 9223372036854775807
BINARY(p), VARBINARY(p) | Variable-length byte string (p is the precision) | B@e6d9eb7
BOOLEAN | True or false | true
DATE | Years, months, and days in YYYY-MM-DD format since 4713 BC | 2015-12-30
DEC(p,s), DECIMAL(p,s), NUMERIC(p,s) | 38-digit precision number (p is the precision, s is the scale) | DECIMAL(6,2) is 1234.56: 4 digits before and 2 digits after the decimal point
FLOAT | 4-byte floating point number | 0.456
DOUBLE | 8-byte floating point number, precision-scalable | 0.456
INTERVAL | A day-time or year-month interval | '1 10:20:30.123' (day-time) or '1-2' year to month (year-month)
TIME | 24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss | 22:55:55.23
TIMESTAMP | JDBC timestamp in year, month, date, hour, minute, second, and optional milliseconds format: yyyy-MM-dd HH:mm:ss.SSS | 2015-12-30 22:55:55.23
CHAR(length), CHARACTER(length), CHARACTER VARYING(length), VARCHAR(length) | UTF8-encoded variable-length string; the default limit is 1 character and the maximum is 2,147,483,647 characters | CHAR(30) casts data to a 30-character string maximum
LATE, DYNAMIC | An internal data type that may appear in schemas but is not intended for direct use by end users | -

When specifying complex types in a schema, the diamond (<>) syntax must be used.

Table 2. Complex types
SQL Data Type | Description | Example
ARRAY | A repeated list of values. Each element can be a scalar type, such as VARCHAR or INT, or a complex type, such as a MAP or another ARRAY. | ARRAY<INT>; ARRAY<VARCHAR>; ARRAY<ARRAY<INT>>; ARRAY<STRUCT<s1 INT NOT NULL, s2 INT>>
STRUCT | A set of named values, each with its own data type, similar to an object. Each field within a STRUCT can hold a simple or a complex type. | STRUCT<s1 INT, s2 INT>; STRUCT<s1 INT NOT NULL, s2 VARCHAR(10)>; STRUCT<s1 INT, s2 ARRAY<INT>>; STRUCT<s1 INT NOT NULL, s2 STRUCT<ss1 INT, ss2 INT>>
UNION | An internal data type that may appear in schemas but is not intended for direct use by end users | -
MAP | An internal data type that may appear in schemas but is not intended for direct use by end users | -

Nullability (optional)

The keyword NOT NULL indicates that the column cannot contain null values. If this keyword is not specified, the column is considered nullable by default.

Column properties (optional)

There are two built-in column properties: FORMAT and DEFAULT, which are defined using the corresponding case-insensitive keywords. The value for each property must be enclosed in single quotes. Both keywords are optional; however, when used together, their declaration order is strict. The FORMAT keyword must appear before DEFAULT, and both must be specified after the column definition. Examples:
col DATE NOT NULL FORMAT `yyyy-MM-dd`
col DATE NOT NULL DEFAULT `2019-01-01`
col DATE NOT NULL FORMAT `yyyy-MM-dd` DEFAULT `2019-01-01`

Other properties can be indicated in a PROPERTIES clause:

col DATE NOT NULL FORMAT `yyyy-MM-dd` DEFAULT `2019-01-01` PROPERTIES {`drill.blank-as` = `null`}
The parser places no limitation on which column types FORMAT, DEFAULT, and PROPERTIES can be defined for. If the reader implementation is able to use any of them, they are applied; otherwise, they are ignored.
Table 3. Column properties
Property name | Description | Example
FORMAT / drill.format | Specifies the date/time format pattern (Joda-Time syntax) used to convert string values to date/time types. Recommended for accurate parsing of date/time columns. If not set, Drill uses ISO date/time formats by default (for example, yyyy-MM-dd'T'HH:mm:ssZZ for timestamps). The format can be indicated in two ways: via the FORMAT keyword (before the DEFAULT keyword) or in the column properties section. | start_date DATE FORMAT `yyyy-MM-dd`; start_date DATE PROPERTIES {`drill.format` = `yyyy-MM-dd`}
DEFAULT / drill.default | Specifies a custom default value for non-nullable (required) columns that are missing in the data. Drill uses this value instead of the natural default (for example, 0 for numbers). Has no effect on nullable columns. Can be set using the DEFAULT keyword or in the column properties. | id INT NOT NULL DEFAULT `-1`; id INT NOT NULL PROPERTIES {`drill.default` = `-1`}
drill.special | Indicates whether to project the column in a wildcard (*) query. Special columns may be excluded from projection and available only when explicitly requested. Default: false. | col VARCHAR PROPERTIES {`drill.special` = `true`}
drill.blank-as | Indicates how to handle blanks. The value should be either null or 0. 0 converts blanks in numeric fields to 0; for non-numeric fields, blanks become null (nullable columns) or the default value (non-nullable columns). null converts blanks to null values (nullable columns) or fills them with the default value (non-nullable columns). | col VARCHAR PROPERTIES {`drill.blank-as` = `null`}

Schema properties (optional)

Schema properties are defined in an inline schema using the PROPERTIES clause, which appears after the column definition list enclosed in parentheses. The syntax is identical to the Column properties syntax. For example:
(col VARCHAR PROPERTIES {`drill.blank-as` = `null`}) PROPERTIES {`drill.strict` = `true`}
Table 4. Schema properties
Property name | Description | Example
drill.strict | If true, Drill enforces a strict schema: any columns present in the data but not defined in the schema are ignored, and queries for such columns return NULL. If false, Drill includes extra columns found in the data after the schema-defined columns. Default: false. | (col VARCHAR PROPERTIES {`drill.blank-as` = `null`}) PROPERTIES {`drill.strict` = `true`}

Schema provisioning for MapR-DB JSON tables

Starting with DEP10.0.0, Drill supports schema provisioning for MapR-DB JSON tables. Unlike the text and standard JSON formats, the Drill MapR-DB JSON reader uses the provided schema only when a column's type is ambiguous, for example, when all values in the first batch are missing or null. All other columns and schema properties are ignored.
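As a sketch (the table path and column name are hypothetical), a query that provisions a type hint for one potentially ambiguous column might look like:

```sql
-- Hint the type of `age` in case its values are all null or missing
-- in the first batch; other columns keep their MapR-DB inferred types
SELECT * FROM table(dfs.`/tables/users`(schema => 'inline=(age INT)'))
```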