Connecting to CSV and Parquet Data in an External S3 Data Source via Hive Connector

Describes how to use the Hive connector with Presto in HPE Ezmeral Unified Analytics Software to connect to CSV and Parquet data in S3-based external data sources.

You can connect HPE Ezmeral Unified Analytics Software to any external S3-based data source through the Hive connector and Presto to access CSV and Parquet data. For example, you can connect HPE Ezmeral Unified Analytics Software to an external Data Fabric, Iceberg, or Spark cluster to access CSV and Parquet data in S3 storage within these data sources.

Connection Requirements

Connecting to an S3-based external data source has the following requirements:
  • You must have read/write access on the S3 data source.
  • You must provide the required information to connect, including the:
    • Access key
    • Secret key
    • S3 directory where files are stored
    • S3 endpoint
Connection fields vary depending on the type of data source you connect to (Data Fabric, Iceberg, Spark, and so on); however, the S3-related fields (credentials, directory, and endpoint) are always required, regardless of which data source holds the S3 data.
IMPORTANT
A Hive Metastore is not required. HPE Ezmeral Unified Analytics Software scans the files in the S3 data source to get metadata and determine data types.

For more information about the PrestoDB Hive connector, see Hive Connector.
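Under the hood, the S3 fields in the connection drawer correspond to standard PrestoDB Hive connector catalog properties. The following is a hedged sketch of an equivalent catalog properties file; the property names come from the PrestoDB Hive connector documentation, while the connector name and the metastore discovery behavior are specific to your deployment, and all values shown are placeholders:

```properties
# Sketch of a PrestoDB Hive catalog properties file with the S3 settings
# that correspond to the connection fields above. Values are placeholders.
connector.name=hive-hadoop2
hive.s3.aws-access-key=<access-key>
hive.s3.aws-secret-key=<secret-key>
hive.s3.endpoint=https://10.10.10.100:9000
hive.s3.path-style-access=true
```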

Connecting to an S3-Based Data Source

To connect to an S3-based data source:

  1. Sign in to HPE Ezmeral Unified Analytics Software.
  2. In the left navigation bar, select Data Engineering > Data Sources.
  3. On the Structured Data tab, click Add New.
  4. In the Hive tile, click Create Connection.
  5. In the drawer that opens, enter values in the required fields. An asterisk (*) denotes a required field.
    • For Hive Metastore, select Discovery.
    • In the Optional Fields search field, find and add the following fields:
      Field Name                   Description
      Hive S3 AWS Access Key       Enter the AWS access key.
      Hive S3 AWS Secret Key       Enter the AWS secret key.
      Hive S3 Endpoint             Enter the S3 endpoint.
      Hive S3 Path Style Access    Select the checkbox.
  6. Click Connect. Upon successful connection, the data source is available on the Data Sources page.
  7. In the left navigation bar, select Data Engineering > Data Sources.
  8. In the data source tile, click Query using Data Catalog to access the S3 data.
  9. On the Data Catalog page, identify the data source and select the schema to view the data.
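The S3 directory value used in the steps above is an s3:// URI that names a bucket and a key prefix within it. A minimal sketch of how such a URI decomposes, using Python's standard library:

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/prefix URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, prefix = split_s3_uri("s3://prestodemo/tpch002")
print(bucket, prefix)  # prestodemo tpch002
```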

Example: Connect to HPE Ezmeral Data Fabric Object Store to access Parquet files

This example demonstrates how to connect HPE Ezmeral Unified Analytics Software to Parquet data in HPE Ezmeral Data Fabric Object Store (S3-based data store).
TIP
To connect to HPE Ezmeral Data Fabric Object Store, you need working knowledge of HPE Ezmeral Data Fabric Object Store, read/write access to a bucket (granted via IAM policies), and the ability to generate the access key and secret key in HPE Ezmeral Data Fabric Object Store.

In this example, a bucket named prestodemo exists in HPE Ezmeral Data Fabric Object Store. Inside the bucket is a tpch002 directory with nation and nationalhistory sub-directories that contain Parquet files.

The following image shows the HPE Ezmeral Data Fabric Object Store prestodemo bucket and directories within the bucket:

Creating the HPE Ezmeral Unified Analytics Software connection to HPE Ezmeral Data Fabric Object Store requires the following information from HPE Ezmeral Data Fabric Object Store:
  • Access key
  • Secret key
  • S3 directory where files are stored (Example: s3://prestodemo/tpch002)
  • S3 endpoint (This is the HPE Ezmeral Data Fabric Object Store IP address and port, for example: https://10.10.10.100:9000)
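Because the endpoint here is a raw IP address rather than a DNS name, path-style access matters: virtual-hosted addressing would prepend the bucket name as a DNS subdomain of the endpoint, which cannot resolve against an IP. A hypothetical illustration of the two request URL styles (the object key shown is a made-up example):

```python
def s3_object_url(endpoint: str, bucket: str, key: str, path_style: bool = True) -> str:
    """Build the request URL for an object under the two S3 addressing styles."""
    scheme, host = endpoint.split("://", 1)
    if path_style:
        # Path-style: the bucket appears in the URL path (works with IP endpoints).
        return f"{scheme}://{host}/{bucket}/{key}"
    # Virtual-hosted style: the bucket becomes a DNS subdomain of the endpoint.
    return f"{scheme}://{bucket}.{host}/{key}"

print(s3_object_url("https://10.10.10.100:9000", "prestodemo", "tpch002/nation/part-0.parquet"))
# https://10.10.10.100:9000/prestodemo/tpch002/nation/part-0.parquet
```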

To connect HPE Ezmeral Unified Analytics Software to an S3 directory in an external HPE Ezmeral Data Fabric Object Store:

  1. Sign in to HPE Ezmeral Unified Analytics Software.
  2. In the left navigation bar, select Data Engineering > Data Sources.
  3. On the Structured Data tab, click Add New.
  4. In the Hive tile, click Create Connection.
  5. In the drawer that opens, complete the required fields:
    TIP
    • If a field is not displayed in the drawer, use the search field under Optional Fields to find and add it.
    • No external Hive Metastore is required; HPE Ezmeral Unified Analytics Software internally parses and scans the files to get the data types and metadata.
    Field Name                   Example                      Notes
    Name                         demoparquet                  Unique name for the data source connection.
    Hive Metastore               Discovery                    System scans files to discover metadata and data types.
    Data Dir                     s3://prestodemo/tpch002      S3 directory where files are stored.
    File Type                    Parquet                      Select Parquet or CSV depending on the type of data in the S3 directory.
    Hive S3 AWS Access Key                                    The access key generated by the Data Fabric Object Store. Under Optional Fields, search for and select this field.
    Hive S3 AWS Secret Key                                    The secret key generated by the Data Fabric Object Store. Under Optional Fields, search for and select this field.
    Hive S3 Endpoint             https://10.10.10.100:9000    The Data Fabric Object Store connection URL. Under Optional Fields, search for and select this field.
    Hive S3 Path Style Access                                 Select the checkbox. Under Optional Fields, search for and select this field.
  6. Click Connect. The system displays the message, "Successfully added data source," and the new data source (demoparquet) tile appears on the Data Sources page.

  7. In the new data source tile (demoparquet), click Query using Data Catalog. You can see the demoparquet source listed.

  8. Select the schema (default) under the data source to view the data sets.
    NOTE
    Each subdirectory in the S3 directory displays as a table in HPE Ezmeral Unified Analytics Software, and each Parquet file in the subdirectory is a row in the table.
  9. Click on a table (nation) and then click the Data Preview tab to view the Parquet data.
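Beyond the Data Catalog UI, the discovered tables can also be queried programmatically through Presto. The following is a hedged sketch using the open-source presto-python-client (`prestodb` package), assuming the catalog name matches the data source created above (demoparquet); the host, port, and user are placeholders for your deployment's Presto coordinator:

```python
# Sketch of querying the discovered table with the open-source
# presto-python-client; the endpoint details are placeholders.
QUERY = "SELECT * FROM demoparquet.default.nation LIMIT 5"

def fetch_nation_rows(host: str, port: int, user: str):
    import prestodb  # imported lazily so the sketch loads without the package
    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user,
        catalog="demoparquet", schema="default",
    )
    cur = conn.cursor()
    cur.execute(QUERY)
    return cur.fetchall()

# Example call (requires a reachable Presto coordinator):
# rows = fetch_nation_rows("presto.example.com", 8080, "analyst")
```

Authentication details (TLS, tokens) vary by deployment; consult your administrator for the coordinator endpoint and credentials.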