Schema provisioning using a table function
Schema provisioning defines explicit schemas so that Apache Drill can reliably interpret text, JSON, and MapR-DB JSON data. A schema can be provided in one of two ways:
- Placed in the root directory of the text table. For more information, see https://drill.apache.org/docs/create-or-replace-schema/.
- Supplied to individual queries using a table function.
This topic describes the second approach, in which the schema is provided through a table function.
Use cases
In Apache Drill, users can query many different storage and data formats. Some formats have a well-defined schema (Parquet), some have an inferred schema (JSON), and for some the data is read as VARCHAR only (CSV). In the latter case, the user has to add casts to indicate which types are actually stored in the file, although in some cases casts cannot help and the query fails. Even when casts do help, at some point the user may want to avoid repeating the same casts in every query (casting an HBase column to a DOUBLE, casting a CSV column to a DOUBLE, and so on). Providing the schema up front resolves these issues.
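For instance, querying a headerless CSV file without a provided schema forces the same casts into every query, because Drill's text reader returns all columns as VARCHAR (the file and column names below are illustrative):

```sql
-- Each query must repeat these casts because the text reader
-- exposes CSV data only as VARCHAR values.
SELECT cast(columns[0] AS INT)    AS id,
       cast(columns[1] AS DOUBLE) AS price
FROM dfs.tmp.`orders.csv`
```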
The following are the major error cases caused by incorrect schema inference and schema change exceptions.
Incorrect schema inference. Given a JSON file whose rows contain `{a: 10}` and `{a: 10.1}`, Apache Drill infers from the first row that the column type is BigInt and fails with the following exception:

Error: INTERNAL_ERROR ERROR: You tried to write a Float8 type when you are using a ValueWriter of type NullableBigIntWriterImpl.

Workaround: `SELECT cast(a AS DOUBLE) FROM file.json`. The same schema change issues occur if the column has a mix of data types that do not belong to the same group, for example, integer, decimal, and varchar. There is currently an option that lets the user query such data by treating all values as varchar, but it is applicable only to the JSON format.

Schema change exception. Given a directory with two files whose schemas differ:

myDir /
|- File 1: {a: 10}
|- File 2: {a: 10.1}

a SchemaChangeException may occur on blocking operators. For example, `SELECT a, count(*) FROM myDir GROUP BY a` fails with:

UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External Sort. Please enable Union type.
Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`$f1` (INT:REQUIRED)]], selectionVector=NONE]
Incoming schema BatchSchema [fields=[[`a` (FLOAT8:OPTIONAL)], [`$f1` (INT:REQUIRED)]], selectionVector=NONE]

Workaround: `SELECT cast(a AS DOUBLE), count(*) FROM myDir GROUP BY 1`

Syntax
The schema can be provided using the Drill table function with the schema parameter. You can define the schema in a file and specify the path to that file in the table function.
table(<table_name>(schema => 'path=`<path to the schema>`'))

For example:

select * from table(dfs.tmp.`text_table`(schema => 'path=`/tmp/my_schema`'))
Alternatively, the schema can be defined inline:

<table_name>(schema => 'inline=(<column-name> <type> [NOT NULL] [default <value>] [properties {<col-prop>=<value>, ...}], ...) properties {<schema-prop>=<value>, ...}')

For example:

inline=(
`user name` VARCHAR NOT NULL default `unknown`,
`credit card number` VARCHAR properties {`drill.blank-as`=`null`, `drill.special`=`true`}
) properties {`drill.special`=`true`}

The general structure is: inline=(<Column Definition>, ...) <Schema properties>

The inline schema is recognized based on the following ANTLR grammar: https://github.com/apache/drill/blob/master/exec/vector/src/main/antlr4/org/apache/drill/exec/record/metadata/schema/parser/SchemaParser.g4
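Putting the syntax together, a complete query that supplies an inline schema through the table function might look like this (the table and column names are illustrative):

```sql
select * from table(dfs.tmp.`text_table`(
    schema => 'inline=(`id` INT not null default `0`, `name` VARCHAR)'))
```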
Each column definition has the following form:

<column-name> <type> [NOT NULL] [default <value>] [properties {<key>=<value>, ...}]

Column Name
required
The name of the column. Enclose the name in back ticks if it contains spaces or special characters, for example `user name`; otherwise back ticks can be omitted, for example email.

Column Type
required
Specifies the data type of the column (case-insensitive). Data types can be primitive or complex. Apache Drill supports most of the primitive data types. Use parentheses to provide type length, precision, and scale. Indicating the length for a VARCHAR is optional.
| SQL Data Type | Description | Example |
|---|---|---|
| UINT1 | Unsigned 1-byte integer | 255 |
| UINT2 | Unsigned 2-byte integer | 65535 |
| UINT4 | Unsigned 4-byte integer | 4294967295 |
| UINT8 | Unsigned 8-byte integer | 18446744073709551615 |
| TINYINT | 1-byte signed integer | 127 |
| SMALLINT | 2-byte signed integer in the range -32,768 to 32,767 | 32767 |
| INT, INTEGER | 4-byte signed integer in the range -2,147,483,648 to 2,147,483,647 | 2147483646 |
| BIGINT | 8-byte signed integer in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | 9223372036854775807 |
| BINARY(p), VARBINARY(p) | Variable-length byte string; p is the precision | B@e6d9eb7 |
| BOOLEAN | True or false | true |
| DATE | Years, months, and days in YYYY-MM-DD format since 4713 BC | 2015-12-30 |
| DEC(p,s), DECIMAL(p,s), NUMERIC(p,s) | 38-digit precision number; p is the precision, s is the scale | DECIMAL(6,2) is 1234.56, 4 digits before and 2 digits after the decimal point |
| FLOAT | 4-byte floating point number | 0.456 |
| DOUBLE | 8-byte floating point number, precision-scalable | 0.456 |
| INTERVAL | A day-time or year-month interval | '1 10:20:30.123' (day-time) or '1-2' year to month (year-month) |
| TIME | 24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss | 22:55:55.23 |
| TIMESTAMP | JDBC timestamp in year, month, date hour, minute, second, and optional milliseconds format: yyyy-MM-dd HH:mm:ss.SSS | 2015-12-30 22:55:55.23 |
| CHAR(length), CHARACTER(length), CHARACTER VARYING(length), VARCHAR(length) | UTF8-encoded variable-length string. The default limit is 1 character. The maximum character limit is 2,147,483,647. | CHAR(30) casts data to a 30-character string maximum. |
| LATE, DYNAMIC | An internal data type that may appear in schemas but is not intended for direct use by end users. | - |
When specifying complex types in a schema, the diamond (<>) syntax must
be used.
| SQL Data Type | Description | Example |
|---|---|---|
| ARRAY | A repeated list of values. Each element can be a scalar or complex type. | ARRAY<INT>, ARRAY<VARCHAR>, ARRAY<ARRAY<INT>>, ARRAY<STRUCT<s1 INT NOT NULL, s2 INT>> |
| STRUCT | A set of named values, each with its own data type. It is similar to an object, where each field can contain a scalar or complex value. | STRUCT<s1 INT, s2 INT>, STRUCT<s1 INT NOT NULL, s2 VARCHAR(10)>, STRUCT<s1 INT, s2 ARRAY<INT>>, STRUCT<s1 INT NOT NULL, s2 STRUCT<ss1 INT, ss2 INT>> |
| UNION | An internal data type that may appear in schemas but is not intended for direct use by end users. | - |
| MAP | An internal data type that may appear in schemas but is not intended for direct use by end users. | - |
Nullability
optional
The keyword NOT NULL indicates that the column cannot contain null values.
If this keyword is not specified, the column is considered nullable by default.
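For example, the following inline schema fragment (column names are illustrative) declares one required and one nullable column:

```
inline=(`id` INT NOT NULL, `comment` VARCHAR)
```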
Column properties
optional
Two column properties, FORMAT and DEFAULT, are defined using the corresponding case-insensitive keywords. The value for each property must be enclosed in quotes. Both keywords are optional; however, when used together, their declaration order is strict: the FORMAT keyword must appear before DEFAULT, and both must be specified after the column definition.
Examples:

col DATE NOT NULL FORMAT `yyyy-MM-dd`
col DATE NOT NULL DEFAULT `2019-01-01`
col DATE NOT NULL FORMAT `yyyy-MM-dd` DEFAULT `2019-01-01`

Other properties can be indicated in the PROPERTIES clause:

col DATE NOT NULL FORMAT `yyyy-MM-dd` DEFAULT `2019-01-01` PROPERTIES {`drill.blank-as` = `null`}
Any combination of FORMAT, DEFAULT, and PROPERTIES can be defined. If the reader implementation is able to use any of them, they are applied; otherwise, they are ignored.

| Property name | Description | Example |
|---|---|---|
| FORMAT / drill.format | Specifies the date/time format pattern (Joda-Time syntax) used to convert string values to date/time types. Recommended for accurate parsing of date/time columns. If not set, Drill uses ISO date/time formats by default (e.g., yyyy-MM-dd for dates). The format can be indicated in two ways: via the FORMAT keyword or via the drill.format property. | start_date date format `yyyy-MM-dd` or start_date date properties {`drill.format` = `yyyy-MM-dd`} |
| DEFAULT / drill.default | Specifies a custom default value for non-nullable (required) columns that are missing in the data. Drill uses this value instead of the natural default (e.g., 0 for numbers). Has no effect for nullable columns. Can be set using the DEFAULT keyword or the drill.default property. | id int not null default '-1' or id int not null properties {'drill.default' = '-1'} |
| drill.special | Indicates whether to project the column in a wildcard (*) query. Special columns may be excluded from projection, and certain "special" columns may be available only when explicitly requested. Defaults to false. | col VARCHAR PROPERTIES {`drill.special` = `true`} |
| drill.blank-as | Indicates how to handle blank values (for example, `drill.blank-as` = `null` treats blanks as null). This property is not intended for direct use by end users. | - |
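Combining the properties above, a hypothetical column list for a table function call (column names are illustrative) could look like:

```
inline=(
  `start_date` DATE NOT NULL FORMAT `yyyy-MM-dd` DEFAULT `2019-01-01`,
  `id` INT NOT NULL PROPERTIES {`drill.default` = `-1`}
)
```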
Schema properties
optional
Schema properties are defined in a PROPERTIES clause, which appears after the column definition list enclosed in parentheses. The syntax is identical to the column properties syntax. For example:

(col VARCHAR PROPERTIES {`drill.blank-as` = `null`}) PROPERTIES {`drill.strict` = `true`}

| Property name | Description | Example |
|---|---|---|
| drill.strict | If true, Drill enforces a strict schema: any columns present in the data but not defined in the schema are ignored, and queries for such columns return NULL. | (col VARCHAR PROPERTIES {`drill.blank-as` = `null`}) PROPERTIES {`drill.strict` = `true`} |
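For example, a query that supplies the drill.strict schema property through the table function (the table and column names are illustrative) might look like:

```sql
select * from table(dfs.tmp.`text_table`(
    schema => 'inline=(`id` INT not null) properties {`drill.strict` = `true`}'))
```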
Schema provisioning for MapR-DB JSON tables
Starting with DEP10.0.0, Drill supports schema provisioning for MapR-DB JSON tables. Unlike text and standard JSON formats, the Drill MapR-DB JSON reader uses the provided schema only when a column's type is ambiguous, for example, when all values in the first batch are missing or null. All other columns and schema properties are ignored.