Drill Table Function

Explains Apache Drill’s table functions, also known as table-valued functions (TVFs).

Apache Drill supports table functions, also known as table-valued functions (TVFs). A table function returns a table and can be used anywhere a table can be used. Table functions behave similarly to views, but they can take parameters.

Syntax

To invoke a table function, you must wrap it with the special TABLE() construct. This wrapper signals that the function returns a table rather than a scalar value:
table(table_function_name())

This construct is case-insensitive.

Drill table functions

Currently, Drill supports two types of table functions.
  • Schema table function
  • Format plugin table function

These functions are called using the target table name, but they define different parameters.

The function arguments are passed using the parameter names and the => syntax:
table_name(parameter_name => some_value)
CAUTION
Function parameter names are case-sensitive.
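
As a minimal sketch (the file path here is hypothetical; the type and extractHeader parameters belong to the format plugin table function described below):

SELECT *
FROM TABLE(dfs.tmp.`example.csv`(type => 'text', extractHeader => true))

Passing extractheader or EXTRACTHEADER instead would not match, because parameter names are case-sensitive.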

Schema table function

A schema table function allows you to define a schema for an individual query. You can specify a path to the schema or provide the schema directly in the query. This is useful when the table’s schema, which normally resides in the table’s root folder, is located elsewhere, or when you want to reference an alternate schema or define the schema inline.

The schema table function works with tables from any data storage, but schema provisioning is supported only for text files.

Starting with DEP version 10.0.0, schema provisioning is also supported for JSON MapR-DB tables. However, the schema is used only as a hint to resolve ambiguous column types, such as columns that contain only null values. Schema provisioning for JSON MapR-DB tables is not a replacement for the CAST function. If a column is stored as STRING and you want to read it as INT, use the CAST function: Data Type Conversion - Apache Drill.
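
For example, assuming a hypothetical table with an `age` column stored as STRING:

SELECT CAST(t.`age` AS INT) AS age
FROM dfs.`/tables/users` t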

The schema table function has one parameter: schema.

For more details about the schema provisioning feature and the schema table function syntax, see Schema provisioning using a table function.

Parameters
  • schema - accepts a string value of two kinds: path and inline.
The path value is used to specify a path to the table schema:
table_name(schema => 'path=`/path/to/the/schema`')
For example:
SELECT *
FROM TABLE(dfs.tmp.`myTable.csv`(schema => 'path=`/tmp/schemas/myTableSchema`'))
The inline value is used to provide a schema directly in the query:
table_name(schema => 'inline=(column_name data_type [nullability] [format] [default] [properties {`prop` = `val`, ...}])')
For example:
SELECT *
FROM TABLE(dfs.tmp.`text_table`(
  schema => 'inline=(col1 date properties {`drill.format` = `yyyy-MM-dd`})
  properties {`drill.strict` = `false`}'))

Format plugin table function

The format plugin table function allows you to override the format options from the file system storage plugin or replace the table’s detected format with your own.

The format specified here takes precedence over the format automatically detected by Drill. If the defined format matches the format detected by Drill, the storage format options from the storage plugin are used as defaults and overridden by any parameters specified in the table function.

Parameters
  • type - required. The file system table format. Valid values are the names of Drill format plugins; they are case-sensitive and include: delta, spss, logregex, syslog, parquet, sequencefile, pcap, xml, iceberg, json, ltsv, text, maprdb, avro, image, sas, pcapng, streams, httpd, excel, pdf, shp, hdf5, msaccess.

  • dynamic parameters - optional. Parameters that correspond to the Drill format plugin configuration variables, excluding extensions (which do not apply in this context).
  • schema - optional. The value corresponds to the value for the same option in the schema table function.
Examples
Set the extractHeader option for a CSV file:
SELECT *
FROM TABLE(dfs.tmp.`users.csv`(type => 'text', extractHeader => true))
Set fieldDelimiter, extractHeader, and schema for a CSV file:
SELECT *
FROM TABLE(dfs.tmp.`cars.csvh-test`(
  type => 'text', 
  fieldDelimiter => ',', 
  extractHeader => true, 
  schema => 'inline=(col1 date)'))
Use allTextMode for a JSON file:
SELECT *
FROM TABLE(dfs.tmp.`students`(type => 'json', allTextMode => true))
Dynamic parameters
This is a list of parameters for common formats.
CAUTION
The default value is used if the parameter is set neither in the storage plugin configuration nor in the format plugin table function.
Table 1. text
  • lineDelimiter - default: \n. A string used as a line delimiter in the text file.
  • fieldDelimiter - default: ,. A character used as a field delimiter in the text file.
  • quote - default: ". Defines the character used to enclose text fields that may contain delimiters, line breaks, or other special characters. For example, when quote is ", a field value such as "New York, NY" is treated as a single column value even though it contains a comma.
  • escape - default: ". Specifies the character used to escape a quotation mark inside a quoted field. For example, with escape set to ", the value "He said ""Hello""" is parsed as He said "Hello".
  • comment - default: #. Identifies the character that marks the beginning of a comment line. Any line starting with this character is ignored during parsing.
  • skipFirstLine - default: false. A Boolean value that indicates whether the first line in the text file should be omitted.
  • extractHeader - default: false. A Boolean value that indicates whether the first valid record parsed from the text file should be treated as the row containing the column names.
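
For example, several of the text options above can be combined in one call; the file path is hypothetical:

SELECT *
FROM TABLE(dfs.tmp.`data.psv`(
  type => 'text',
  fieldDelimiter => '|',
  extractHeader => true,
  comment => '#'))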

Table 2. json
  • allTextMode - default: the value of store.json.all_text_mode (false). A Boolean option that forces all fields in JSON files to be read as VARCHAR. This helps prevent schema change errors when field types vary between records.
  • readNumbersAsDouble - default: the value of store.json.read_numbers_as_double (false). A Boolean value. If true, all numeric values are read as DOUBLE, regardless of whether they include a decimal point.
  • skipMalformedJSONRecords - default: the value of store.json.reader.skip_invalid_records (false). A Boolean value that enables the JSON reader to skip malformed or invalid records in JSON files. When set to true, Drill ignores these records and continues processing the remaining data, rather than terminating the query.
  • escapeAnyChar - default: the value of store.json.reader.allow_escape_any_char (false). A Boolean value that enables the JSON record reader to escape any character when processing JSON data.
  • nanInf - default: the value of store.json.reader.allow_nan_inf (true). A Boolean value that enables the JSON record reader to interpret NaN and Infinity tokens in JSON data as numeric values.
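
For example, several of the json options above can be set together; the file path is hypothetical:

SELECT *
FROM TABLE(dfs.tmp.`events.json`(
  type => 'json',
  readNumbersAsDouble => true,
  skipMalformedJSONRecords => true))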

Table 3. maprdb
  • allTextMode - default: false. A Boolean option that forces all fields in the MapR-DB table to be read as VARCHAR. This helps prevent schema change errors when field types vary between records.
  • enablePushdown - default: true. Enables query pushdown to the MapR-DB engine. When set to true, Drill delegates supported filters and projections directly to MapR-DB for improved performance.
  • ignoreSchemaChange - default: false. Allows queries to continue even when schema changes are detected in MapR-DB tables. When enabled, Drill ignores schema inconsistencies instead of failing the query.
  • readAllNumbersAsDouble - default: false. A Boolean value. If set to true, all numeric values are read as DOUBLE, regardless of whether they include a decimal point.
  • disableCountOptimization - default: false. Disables the optimization that uses table metadata to compute COUNT(*) queries more efficiently. When set to true, Drill performs a full scan instead of using the optimization.
  • readTimestampWithZoneOffset - default: false. When set to true, Drill converts timestamp values from UTC to the local time zone.
  • nonExistentFieldSupport - default: true. Enables support for queries that reference fields not present in all documents. When set to true, Drill treats missing fields in MapR-DB records as NULL.
  • index - default: none. A string value that specifies a hint for the secondary index table to use when executing queries on MapR-DB. If not set, Drill automatically evaluates available indexes, computes their costs, and selects the most efficient one.
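
For example, the index hint is passed as a string; the table and index names here are hypothetical:

SELECT *
FROM TABLE(dfs.`/tables/users`(
  type => 'maprdb',
  index => 'users_name_idx'))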