Using Spark to Query EzPresto
Describes how to use Spark to query EzPresto.
Using Spark to Query EzPresto with the EzPresto Plugin
The EzPresto plugin retrieves and processes large datasets quickly and efficiently, making it suitable for large-scale data operations. The Spark images packaged with HPE Ezmeral Unified Analytics Software include the Presto client library that is required to connect Spark to EzPresto. When you connect Spark to EzPresto, you do not have to explicitly provide a username and password because the bundled client library automatically sets credentials and manages their renewal.
Spark runtimes do not have the EzPresto certificate in the truststore. When you connect Spark to EzPresto, you must set the ignore_ssl_check option to true.
# PySpark: read from EzPresto by using the EzPresto data source
dfReader = spark.read.format("EzPresto")
dfReader.option("presto_url", "https://ezpresto-sts-mst-0.ezpresto-svc-hdl.ezpresto.svc.cluster.local:8081")
dfReader.option("dal_url", "https://ezpresto-sts-mst-0.ezpresto-svc-hdl.ezpresto.svc.cluster.local:9090")
dfReader.option("ignore_ssl_check", "true")  # Spark runtimes lack the EzPresto certificate
dfReader.option("query", "select * from mysql.tpch_100gb_std_presto.lineitem")
df = dfReader.load()
df.show()
// Scala: read from EzPresto by using the EzPresto data source
val query = "select * from mysql.tpch_100gb_std_presto.lineitem"
val dfReader = spark.read.format("EzPresto")
dfReader.option("presto_url", "https://ezpresto-sts-mst-0.ezpresto-svc-hdl.ezpresto.svc.cluster.local:8081")
dfReader.option("dal_url", "https://ezpresto-sts-mst-0.ezpresto-svc-hdl.ezpresto.svc.cluster.local:9090")
dfReader.option("ignore_ssl_check", "true")  // Spark runtimes lack the EzPresto certificate
dfReader.option("query", query)
val df = dfReader.load()
df.show()
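The result of load() is an ordinary Spark DataFrame, so you can keep working with it through standard Spark APIs. The following is a minimal PySpark sketch based on the df returned by the example above; the view name lineitem is only illustrative:

# Inspect the schema returned by EzPresto
df.printSchema()

# Register the result as a temporary view and continue in Spark SQL
df.createOrReplaceTempView("lineitem")
spark.sql("SELECT COUNT(*) AS row_count FROM lineitem").show()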
Using Spark to Query EzPresto via the Legacy Method
Spark images packaged with HPE Ezmeral Unified Analytics Software include the Presto client library required to connect Spark to EzPresto.
Spark runtimes do not have the EzPresto certificate in the truststore; therefore, when you connect Spark to EzPresto, you must set the IgnoreSSLChecks option to true.
For open-source Spark use cases, the CA certificate must be available in the JVM truststore. You can set a custom path to the truststore with the JDBC SSLTrustStorePath option; the truststore must be in JKS format. A truststore-based sketch follows the examples below.
# PySpark: connect to EzPresto through the Presto JDBC driver
DOMAIN = "<your-unified-analytics-domain-name>"
user = "<username>"
catalog = "mysql"
schema = "tpch_100gb_std_presto"
uri = f"jdbc:presto://ezpresto.{DOMAIN}:443/{catalog}/{schema}"
query = "select * from mysql.tpch_100gb_std_presto.lineitem"

df = spark.read.format("jdbc") \
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \
    .option("url", uri) \
    .option("user", user) \
    .option("SSL", "true") \
    .option("IgnoreSSLChecks", "true") \
    .option("query", query) \
    .load()
df.show()
// Scala: connect to EzPresto through the Presto JDBC driver
val DOMAIN = "<your-unified-analytics-domain-name>"
val user = "<username>"
val catalog = "mysql"
val schema = "tpch_100gb_std_presto"
val uri = s"jdbc:presto://ezpresto.$DOMAIN:443/$catalog/$schema"
val query = "select * from mysql.tpch_100gb_std_presto.nation"

val df = spark.read.format("jdbc").
  option("driver", "com.facebook.presto.jdbc.PrestoDriver").
  option("url", uri).
  option("user", user).
  option("SSL", "true").
  option("IgnoreSSLChecks", "true").
  option("query", query).
  load()
df.show()
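For the open-source Spark case described above, a truststore-based connection could look like the following PySpark sketch. It reuses the uri, user, and query variables from the Python example; the truststore path, the SSLTrustStorePassword value, and the keytool import step are assumptions rather than values from this document:

# Hypothetical one-time step: import the EzPresto CA certificate into a JKS truststore, e.g.
#   keytool -importcert -alias ezpresto -file ezpresto-ca.pem \
#           -keystore /opt/certs/ezpresto-truststore.jks -storepass changeit

df = spark.read.format("jdbc") \
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \
    .option("url", uri) \
    .option("user", user) \
    .option("SSL", "true") \
    .option("SSLTrustStorePath", "/opt/certs/ezpresto-truststore.jks") \
    .option("SSLTrustStorePassword", "changeit") \
    .option("query", query) \
    .load()
df.show()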
Limitations
- Does not support write operations.
- Does not support queries that require ordered results, such as SORT BY or ORDER BY queries. The query does not fail, but the ordering is not preserved; sort the returned DataFrame in Spark instead (see the sketch after this list).
- You must always use aliases in SQL aggregations to ensure proper functionality. For example, replace COUNT() with COUNT() AS col_name, as shown in the sketch after this list.
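Both limitations can be handled on the Spark side. The following is a minimal PySpark sketch; df is the DataFrame loaded by the EzPresto plugin example earlier, and l_orderkey is only a hypothetical column name:

# ORDER BY in the pushed-down query is not preserved; sort in Spark instead.
# "l_orderkey" is a hypothetical column name; substitute a column from your table.
df.orderBy("l_orderkey").show()

# Aggregates must be aliased in the pushed-down SQL, for example COUNT(*) AS row_count.
dfReader = spark.read.format("EzPresto")
dfReader.option("presto_url", "https://ezpresto-sts-mst-0.ezpresto-svc-hdl.ezpresto.svc.cluster.local:8081")
dfReader.option("dal_url", "https://ezpresto-sts-mst-0.ezpresto-svc-hdl.ezpresto.svc.cluster.local:9090")
dfReader.option("ignore_ssl_check", "true")
dfReader.option("query", "SELECT COUNT(*) AS row_count FROM mysql.tpch_100gb_std_presto.lineitem")
dfReader.load().show()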