EzPresto

Describes how to identify and debug issues for EzPresto.

Cannot create Iceberg connections with hadoop catalog type from the UI

HPE Ezmeral Unified Analytics Software supports Iceberg connections with the hadoop catalog type; however, you cannot create these connections through the HPE Ezmeral Unified Analytics Software UI. Instead, create the connection from the command line by running a curl command that posts a JSON configuration to the EzPresto backend API.

You can create Iceberg connections (with hadoop catalog type) to the following types of storage:
  • HPE Ezmeral Data Fabric Object Store
  • HPE Ezmeral Data Fabric File Store
  • Local or mounted file system that is locally accessible
To create an Iceberg connection with catalog type hadoop:
  1. Create a JSON configuration for your storage type, replacing all values in angle brackets (<>) with values for your environment. The following sections provide JSON configurations for each storage type:
    HPE Ezmeral Data Fabric Object Store
    {
        "catalogName": "<catalog_name>",
        "connectorName": "iceberg",
        "properties": {
            "iceberg.catalog.type": "hadoop",
            "iceberg.catalog.warehouse": "<S3 Warehouse Location>",
            "iceberg.catalog.cached-catalog-num": "10",
            "hive.s3.aws-access-key": "<S3 Access key>",
            "hive.s3.aws-secret-key": "<S3 Secret Key>",
            "hive.s3.endpoint": "<S3 End Point>",
            "hive.s3.path-style-access": true,
            "hive.s3.ssl.enabled": false
        }
    }
    HPE Ezmeral Data Fabric File Store
    {
      "catalogName": "<catalog_name>",
      "connectorName": "iceberg",
      "properties": {
        "iceberg.catalog.type": "hadoop",
        "hive.hdfs.authentication.type": "MAPRSASL",
        "df.cluster.details": "<DF Cluster Details>",
        "hive.hdfs.df.ticket":"<DF Cluster Ticker Details>",
        "iceberg.catalog.warehouse": "<MAPR FS Warehouse Location>"
      },
      "fileProperties": {
        "iceberg.hadoop.config.resources": [
          "PGNvbmZpZ3VyYXRpb24+CiAgICA8cHJvcGVydHk+CiAgICAgICAgPG5hbWU+ZnMubWFwcmZzLmltcGw8L25hbWU+CiAgICAgICAgPHZhbHVlPmNvbS5tYXByLmZzLk1hcFJGaWxlU3lzdGVtPC92YWx1ZT4KICAgIDwvcHJvcGVydHk+CjwvY29uZmlndXJhdGlvbj4="
        ]
      }
    }
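    The long string in iceberg.hadoop.config.resources is a base64-encoded Hadoop configuration file; the value shown above decodes to a minimal core-site snippet that maps the maprfs scheme to com.mapr.fs.MapRFileSystem. If you need to supply your own configuration, a minimal sketch of producing such a value (assuming your settings are in a local core-site.xml file):
      # Encode the Hadoop configuration file as a single base64 line (no line wrapping)
      base64 -w0 core-site.xml
      # To inspect an existing value, decode it
      echo '<base64 value>' | base64 -d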
    Local or mounted file system that is locally accessible
    {
        "catalogName": "<catalog_name>",
        "connectorName": "iceberg",
        "properties": {
            "iceberg.catalog.type": "hadoop",
            "iceberg.catalog.warehouse": "<Locally Mounted Warehouse Location>"
        }
    }
  2. Run the following curl command from any machine that can access the Unified Analytics cluster endpoint (https://<your-ua-cluster-domain>.com/v1/catalog):
    curl -u <username>:<password> --location '<EzPresto End Point>/v1/catalog' --header 'Content-Type: application/json' --insecure --data '<JSON DATA>'
    
    In the command:
    • Replace <username>:<password> with your Unified Analytics username and password.
    • Replace <EzPresto End Point> with the endpoint URL shown at Tools & Frameworks > Data Engineering > EzPresto.
    • Replace <JSON DATA> with the JSON configuration from step 1.
    IMPORTANT
    On a Unified Analytics 1.5.2 cluster, you must include a refresh token instead of a password. For example:
    curl -u <username>:<refresh_token> --location '<EzPresto End Point>/v1/catalog' --header 'Content-Type: application/json' --insecure --data '<JSON DATA>'
    To generate a refresh token, go to the following URL in an incognito browser:
    https://token-service.<your-ua-cluster-domain>/refresh-token-download
    When prompted, enter your Unified Analytics credentials. A refresh-token.txt file automatically downloads. This file contains the refresh token that you use when you run the curl command.

    You should now see the Iceberg connection in Unified Analytics by going to Data Engineering > Data Sources and clicking the tab that corresponds to the data source type, such as Structured Data.
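
    Instead of pasting the JSON inline, you can also save the configuration from step 1 to a file and pass it to curl with the --data @<file> syntax. A minimal sketch, assuming the configuration is saved as iceberg-catalog.json (a hypothetical file name):
    # Post the saved JSON configuration to the EzPresto catalog API
    curl -u <username>:<password> \
         --location '<EzPresto End Point>/v1/catalog' \
         --header 'Content-Type: application/json' \
         --insecure \
         --data @iceberg-catalog.json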

EzPresto installation fails due to mysql pod entering CrashLoopBackOff state

During EzPresto deployment, slow disk I/O can cause the mysql pod in EzPresto to enter a CrashLoopBackOff state, which causes the HPE Ezmeral Unified Analytics Software installation to fail.

When the mysql pod is deployed, a lifecycle hook expects the pod to be ready within thirty seconds. If the pod is not ready within that time, Kubernetes repeatedly restarts it, leaving the pod in a CrashLoopBackOff state.

To resolve this issue, complete the following steps:
  1. Stop the mysql pod:
    kubectl scale deployment ezpresto-dep-mysql --replicas=0 -n ezpresto 
  2. Edit the mysql deployment:
    kubectl edit deployment ezpresto-dep-mysql -n ezpresto 
  3. Remove the following lifecycle hook:
    lifecycle:
      postStart:
        exec:
          command:
            - "sh"
            - "-c"
            - >
              sleep 30 ;
              mysql -u root -p$MYSQL_ROOT_PASSWORD -e "GRANT ALL PRIVILEGES ON *.* TO '$MYSQL_USER'@'%' WITH GRANT OPTION";
  4. Delete the mysql pvc:
    kubectl delete pvc ezpresto-pvc-mysql -n ezpresto 
  5. Create a file named mysql.pvc and copy the following content into the file:
    apiVersion: v1 
    kind: PersistentVolumeClaim 
    metadata: 
      annotations: 
        meta.helm.sh/release-name: ezpresto 
        meta.helm.sh/release-namespace: ezpresto 
        volume.beta.kubernetes.io/storage-provisioner: com.mapr.csi-kdf 
        volume.kubernetes.io/storage-provisioner: com.mapr.csi-kdf 
      labels: 
        app.kubernetes.io/managed-by: Helm 
      name: ezpresto-pvc-mysql 
      namespace: ezpresto 
    spec: 
      accessModes: 
      - ReadWriteMany 
      resources: 
        requests: 
          storage: 5Gi 
      storageClassName: edf 
      volumeMode: Filesystem
  6. Create a mysql pvc:
    kubectl apply -f mysql.pvc -n ezpresto 
  7. Start the mysql pods:
    kubectl scale deployment ezpresto-dep-mysql --replicas=1 -n ezpresto 
  8. Restart the web service pods:
    kubectl rollout restart deployment ezpresto-dep-web -n ezpresto 
Once all pods in the ezpresto namespace are running, the installation is complete and you can use EzPresto.
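To confirm that the pods are up, you can watch the namespace until every pod reports a Running status; a minimal check:
  # Watch pod status in the ezpresto namespace; press Ctrl+C once all pods are Running
  kubectl get pods -n ezpresto --watch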

Trying to Access a Hive Directory Results in an Access Denied Error

Any schema created with impersonation returns an access denied error if the directory ownership is not set correctly for the impersonating user. To avoid access denied errors, correct the ownership and permissions on the directory before performing any operations:
hadoop fs [-chown [-R] [OWNER][:[GROUP]] PATH...]
hadoop fs [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
For example, SSH in to an HPE Ezmeral Data Fabric cluster node. If the mapr user ticket was used for Hive impersonation, use it for the following operations:
export MAPR_TICKETFILE_LOCATION=/home/bob123/mapruserticket
hadoop fs -chown bob123:ldap maprfs://user01/user/hive/warehouse/foo.db
hadoop fs -chmod 775 maprfs://user01/user/hive/warehouse/foo.db
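To confirm that the ownership and permissions were applied, you can list the directory entry itself; a minimal check against the same example path:
# Verify the new owner, group, and mode on the schema directory
hadoop fs -ls -d maprfs://user01/user/hive/warehouse/foo.db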

Cannot Add Iceberg as a Data Source when Catalog Type is Hadoop

Recent changes introduced by open source PrestoDB cause Iceberg data connections to fail in Unified Analytics when the Catalog Type is Hadoop.
Workaround for New Installation
To connect Unified Analytics to an Iceberg data source with Catalog Type set as Hadoop, complete the following steps:
  1. To update the EzPresto images, run the following kubectl commands:
    kubectl set image statefulset/ezpresto-sts-mst presto-coordinator=marketplace.us1.greenlake-hpe.com/ezua/gcr.io/mapr-252711/ezsql-test/presto-0.285-fy24-q2:0.0.61 --namespace=ezpresto
    
    kubectl set image statefulset/ezpresto-sts-wrk presto-worker=marketplace.us1.greenlake-hpe.com/ezua/gcr.io/mapr-252711/ezsql-test/presto-0.285-fy24-q2:0.0.61 --namespace=ezpresto
    
  2. Sign in to Unified Analytics and add the Iceberg data source with the Catalog Type set as Hadoop.
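  After updating the images, you can confirm that both StatefulSets have finished rolling out before adding the data source; a minimal check using the StatefulSet names from the commands above:
    # Wait for the coordinator and worker StatefulSets to finish their rolling update
    kubectl rollout status statefulset/ezpresto-sts-mst -n ezpresto
    kubectl rollout status statefulset/ezpresto-sts-wrk -n ezpresto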
Workaround for Upgrade
If you are upgrading Unified Analytics from version 1.3 to 1.4 and you have an existing Iceberg data source with Catalog Type set as Hadoop, complete the following steps:
  1. Sign in to Unified Analytics.
  2. Delete the Iceberg connection.
  3. Upgrade to Unified Analytics version 1.4.
  4. To update the EzPresto images, run the following kubectl commands:
    kubectl set image statefulset/ezpresto-sts-mst presto-coordinator=marketplace.us1.greenlake-hpe.com/ezua/gcr.io/mapr-252711/ezsql-test/presto-0.285-fy24-q2:0.0.61 --namespace=ezpresto
    
    kubectl set image statefulset/ezpresto-sts-wrk presto-worker=marketplace.us1.greenlake-hpe.com/ezua/gcr.io/mapr-252711/ezsql-test/presto-0.285-fy24-q2:0.0.61 --namespace=ezpresto
    
  5. Sign in to Unified Analytics and add the Iceberg data source with the Catalog Type set as Hadoop.
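  To confirm that the image update was applied, you can inspect the image currently configured on each StatefulSet; a minimal sketch (the StatefulSet and namespace names match the kubectl set image commands above):
    # Print the image configured for the coordinator and worker StatefulSets
    kubectl get statefulset ezpresto-sts-mst -n ezpresto -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'
    kubectl get statefulset ezpresto-sts-wrk -n ezpresto -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'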

Insufficient Memory

Currently, the maximum memory available to queries is based on the memory resources of a single worker node instead of total cluster memory (all worker nodes). As a result, queries may fail due to insufficient memory. To address this issue, modify the EzPresto configuration as described in the following steps:
  1. In the left navigation bar, go to Tools & Frameworks > Data Engineering > EzPresto.
  2. Click on the three dots and select Configure.
  3. In the window that appears, remove the entire cmnConfigMaps section and replace it with the following (a worked example of the resulting memory settings follows this procedure):
    cmnConfigMaps:
      # Configmaps common to both Presto Master and Worker
      logConfig:
        log.properties: |
          # Enable verbose logging from Presto
          #com.facebook.presto=DEBUG
     
      # Configmaps specific to Presto Master
      prestoMst:
        cmnPrestoCoordinatorConfig:
          config.properties: |
            http-server.http.port={{ tpl .Values.ezsqlPresto.locatorService.locatorSvcPort $ }}
            discovery.uri=http://{{ tpl .Values.ezsqlPresto.locatorService.fullname $ }}:{{ tpl .Values.ezsqlPresto.locatorService.locatorSvcPort $ }}
            coordinator=true
            node-scheduler.include-coordinator=false
            discovery-server.enabled=true
            catalog.config-dir = {{ .Values.ezsqlPresto.stsDeployment.volumeMount.mountPathCatalog }}
            catalog.disabled-connectors-for-dynamic-operation=drill,parquet,csv,salesforce,sharepoint,prestodb,raptor,kudu,redis,accumulo,elasticsearch,redshift,localfile,bigquery,prometheus,mongodb,pinot,druid,cassandra,kafka,atop,presto-thrift,ampool,hive-cache,memory,blackhole,tpch,tpcds,system,example-http,jmx
            generic-cache-enabled=true
            transparent-cache-enabled=false
            generic-cache-catalog-name=cache
            generic-cache-change-detection-interval=300
            catalog.config-dir.shared=true
            node.environment=production
            plugin.dir=/usr/lib/presto/plugin
            log.output-file=/data/presto/server.log
            log.levels-file=/usr/lib/presto/etc/log.properties
            query.max-history=1000
            query.max-stage-count=1000
            query.max-memory={{ mulf 0.6 ( tpl .Values.ezsqlPresto.configMapProp.wrk.jvmProp.maxHeapSize . ) ( .Values.ezsqlPresto.stsDeployment.wrk.replicaCount ) | floor }}MB
            query.max-total-memory={{ mulf 0.7 ( tpl .Values.ezsqlPresto.configMapProp.wrk.jvmProp.maxHeapSize . ) ( .Values.ezsqlPresto.stsDeployment.wrk.replicaCount ) | floor }}MB
            # query.max-memory-per-node={{ mulf 0.5 ( tpl .Values.ezsqlPresto.configMapProp.mst.jvmProp.maxHeapSize . ) | floor }}MB
            # query.max-total-memory-per-node={{ mulf 0.6 ( tpl .Values.ezsqlPresto.configMapProp.mst.jvmProp.maxHeapSize . ) | floor }}MB
            # memory.heap-headroom-per-node={{ mulf 0.3 ( tpl .Values.ezsqlPresto.configMapProp.mst.jvmProp.maxHeapSize . ) | floor }}MB
            experimental.spill-enabled=false
            experimental.spiller-spill-path=/tmp
            orm-database-url=jdbc:sqlite:/data/cache/metadata.db
            plugin.disabled-connectors=accumulo,atop,cassandra,example-http,kafka,kudu,localfile,memory,mongodb,pinot,presto-bigquery,prestodb,presto-druid,presto-elasticsearch,prometheus,raptor,redis,redshift
            log.max-size=100MB
            log.max-history=10
            discovery.http-client.max-requests-queued-per-destination=10000
            dynamic.http-client.max-requests-queued-per-destination=10000
            event.http-client.max-requests-queued-per-destination=10000
            exchange.http-client.max-requests-queued-per-destination=10000
            failure-detector.http-client.max-requests-queued-per-destination=10000
            memoryManager.http-client.max-requests-queued-per-destination=10000
            node-manager.http-client.max-requests-queued-per-destination=10000
            scheduler.http-client.max-requests-queued-per-destination=10000
            workerInfo.http-client.max-requests-queued-per-destination=10000
     
      # Configmaps specific to Presto Worker
      prestoWrk:
        prestoWorkerConfig:
          config.properties: |
            coordinator=false
            http-server.http.port={{ tpl .Values.ezsqlPresto.locatorService.locatorSvcPort $ }}
            discovery.uri=http://{{ tpl .Values.ezsqlPresto.locatorService.fullname $ }}:{{ tpl .Values.ezsqlPresto.locatorService.locatorSvcPort $ }}
            catalog.config-dir = {{ .Values.ezsqlPresto.stsDeployment.volumeMount.mountPathCatalog }}
            catalog.disabled-connectors-for-dynamic-operation=drill,parquet,csv,salesforce,sharepoint,prestodb,raptor,kudu,redis,accumulo,elasticsearch,redshift,localfile,bigquery,prometheus,mongodb,pinot,druid,cassandra,kafka,atop,presto-thrift,ampool,hive-cache,memory,blackhole,tpch,tpcds,system,example-http,jmx
            generic-cache-enabled=true
            transparent-cache-enabled=false
            generic-cache-catalog-name=cache
            catalog.config-dir.shared=true
            node.environment=production
            plugin.dir=/usr/lib/presto/plugin
            log.output-file=/data/presto/server.log
            log.levels-file=/usr/lib/presto/etc/log.properties
            query.max-memory={{ mulf 0.6 ( tpl .Values.ezsqlPresto.configMapProp.wrk.jvmProp.maxHeapSize . ) ( .Values.ezsqlPresto.stsDeployment.wrk.replicaCount ) | floor }}MB
            query.max-total-memory={{ mulf 0.7 ( tpl .Values.ezsqlPresto.configMapProp.wrk.jvmProp.maxHeapSize . ) ( .Values.ezsqlPresto.stsDeployment.wrk.replicaCount ) | floor }}MB
            query.max-memory-per-node={{ mulf 0.5 ( tpl .Values.ezsqlPresto.configMapProp.wrk.jvmProp.maxHeapSize . ) | floor }}MB
            query.max-total-memory-per-node={{ mulf 0.6 ( tpl .Values.ezsqlPresto.configMapProp.wrk.jvmProp.maxHeapSize . ) | floor }}MB
            memory.heap-headroom-per-node={{ mulf 0.2 ( tpl .Values.ezsqlPresto.configMapProp.wrk.jvmProp.maxHeapSize . ) | floor }}MB
            experimental.spill-enabled=false
            experimental.spiller-spill-path=/tmp
            orm-database-url=jdbc:sqlite:/data/cache/metadata.db
            plugin.disabled-connectors=accumulo,atop,cassandra,example-http,kafka,kudu,localfile,memory,mongodb,pinot,presto-bigquery,prestodb,presto-druid,presto-elasticsearch,prometheus,raptor,redis,redshift
            log.max-size=100MB
            log.max-history=10
            discovery.http-client.max-requests-queued-per-destination=10000
            event.http-client.max-requests-queued-per-destination=10000
            exchange.http-client.max-requests-queued-per-destination=10000
            node-manager.http-client.max-requests-queued-per-destination=10000
            workerInfo.http-client.max-requests-queued-per-destination=10000
  4. Click Configure to update the configuration on each of the Presto pods and restart them. This operation takes a few minutes.
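The template above sizes query.max-memory and query.max-total-memory from the worker heap size multiplied by the worker replica count, so the limits scale with the whole cluster rather than a single node. As a rough illustration, assuming a hypothetical worker max heap of 10000 MB and 3 worker replicas:
  # Hypothetical values: worker max heap = 10000 MB, worker replicas = 3
  WORKER_HEAP_MB=10000
  WORKER_REPLICAS=3
  echo "query.max-memory=$(( WORKER_HEAP_MB * WORKER_REPLICAS * 6 / 10 ))MB"        # 18000MB
  echo "query.max-total-memory=$(( WORKER_HEAP_MB * WORKER_REPLICAS * 7 / 10 ))MB"  # 21000MB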
If this workaround does not resolve the issue, contact HPE Support.

Failed Queries

If queries fail, go to the Presto UI and view the stack trace for the failed queries. You can also view the EzPresto log files.

You can access the Presto UI from the HPE Ezmeral Unified Analytics Software UI.

  1. In the left navigation bar, select Tools & Frameworks.
  2. Select the Data Engineering tab.
  3. In the EzPresto tile, click on the Endpoint URL.
  4. In the Presto UI, select the Failed state.
  5. Locate the query and click on the Query ID.
  6. Scroll down to the Error Information section to view the stack trace.

You can also view the logs in the shared directory.

  1. In the left navigation bar, select Data Engineering > Data Sources.
  2. On the Data Sources screen, click Browse.
  3. Select the following directories in the order shown:
    1. shared/
    2. logs/
    3. apps/
    4. app-core/
    5. ezpresto/
  4. Select the log directory for which you want to view EzPresto logs.
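If you have kubectl access to the cluster, you can also read the server logs directly from the coordinator and worker pods; a minimal sketch, assuming the pod and container names follow the ezpresto-sts-mst/ezpresto-sts-wrk StatefulSets and presto-coordinator/presto-worker containers referenced elsewhere in this topic:
  # Tail the coordinator log
  kubectl logs ezpresto-sts-mst-0 -c presto-coordinator -n ezpresto --tail=200
  # Tail a worker log
  kubectl logs ezpresto-sts-wrk-0 -c presto-worker -n ezpresto --tail=200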

Hive Data Source Connection Failure (S3-Based External Data Source)

The following sections describe issues that can cause Hive connection failures when you use Hive to connect to an external S3-based data source, such as HPE Ezmeral Data Fabric Object Store. A workaround is provided for each issue.
Files have 0 length
The folder that contains the CSV or Parquet files includes files with 0 length, for example, empty files or the marker files that Spark jobs generate (_SUCCESS).

Workaround: Remove the empty files.

CSV file with an empty line
A CSV file has an empty line either in the data or in the last line of the file.

Workaround: Remove the empty lines in the file.

S3 folder with incorrect MIME type
The S3 folder that contains the CSV and Parquet files was created through the HPE Ezmeral Data Fabric Object Store UI. In pre-1.3 versions of HPE Ezmeral Unified Analytics Software, EzPresto does not recognize folders created through the HPE Ezmeral Data Fabric Object Store UI because the S3 folder MIME type is different from the type set by the s3cmd tool.
Workaround: Use s3cmd to create a folder and upload files to a bucket in HPE Ezmeral Data Fabric Object Store, for example, s3://<bucket>/<folder1>/<folder2>/data.csv.
NOTE
You cannot put files directly in the Data Dir path that you specified when you created the Hive connection. You must create a folder within the Data Dir path that you specified and put files there. For example, if you entered s3://mytestbucket/ as the Data Dir, you must create a folder within that directory, such as s3://mytestbucket/data/ and put files there.
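A minimal sketch of uploading a file into a folder with s3cmd (assuming s3cmd is already configured with the Object Store endpoint and credentials; the bucket, folder, and file names are placeholders):
  # Upload a CSV into a folder inside the bucket; the folder prefix is created implicitly
  s3cmd put data.csv s3://mytestbucket/data/data.csv
  # Confirm the object landed under the folder prefix
  s3cmd ls s3://mytestbucket/data/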

Data Source Connection Failure (File-Based)

If a file system-based data connection fails, verify that the storage or file location starts with the appropriate scheme, for example maprfs://, hdfs://, or file:/.
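For example, the following are hypothetical well-formed locations for each scheme:
  maprfs:///apps/iceberg/warehouse                 (HPE Ezmeral Data Fabric File Store path)
  hdfs://namenode.example.com:8020/warehouse       (HDFS path with NameNode host and port)
  file:/mnt/shared/warehouse                       (locally mounted directory)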