EzPresto
Describes the EzPresto SQL query engine and its features.
EzPresto in HPE Ezmeral Unified Analytics Software
EzPresto is an SQL query engine based on PrestoDB, the open-source, Linux Foundation massively parallel processing (MPP) query engine, and is optimized to run federated queries across various data sources. Enterprise BI applications, such as Tableau and Power BI, and data processing engines, such as Spark, can use EzPresto for fast query performance and prompt insights through federated data access.
You can easily connect EzPresto to multiple types of data sources from the Data Engineering space in HPE Ezmeral Unified Analytics Software by going to Data Engineering > Data Sources. Connections require a JDBC connection URL and user credentials.
Data sets available to the connected user are displayed in the Data Catalog, which is accessible by going to Data Engineering > Data Catalog. In the Data Catalog, you select the data sets you want to work with. You can query or cache the selected data sets.
When you opt to cache data sets, you can modify the data sets prior to caching them. For example, you can edit table and column names, remove columns, and create new schemas. Cached data sets (tables and views) are accessible in the Cached Assets space of HPE Ezmeral Unified Analytics Software. You can access cached assets by going to Data Engineering > Cached Assets.
When you opt to query data sets, you can run federated queries (queries across data sets in multiple data sources) from the Query Editor. You can access the Query Editor by going to Data Engineering > Query Editor. Querying cached data sets instead of the original sources can significantly improve query performance.
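To make the idea of a federated query concrete, the following minimal sketch builds a Presto-style SQL statement that joins tables from two different data sources. It assumes the usual Presto naming convention of catalog.schema.table; the catalog, schema, table, and column names are hypothetical, and the helper function is illustrative rather than part of EzPresto.

```python
def federated_join_sql(left, right, join_key, columns, limit=100):
    """Build a Presto-style query that joins tables from two catalogs.

    `left` and `right` are fully qualified names (catalog.schema.table).
    All names used with this helper are hypothetical examples.
    """
    for name in (left, right):
        if name.count(".") != 2:
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
    select_list = ", ".join(columns)
    return (
        f"SELECT {select_list}\n"
        f"FROM {left} AS l\n"
        f"JOIN {right} AS r ON l.{join_key} = r.{join_key}\n"
        f"LIMIT {limit}"
    )


sql = federated_join_sql(
    "mysql.sales.orders",    # hypothetical relational data source
    "hive.crm.customers",    # hypothetical data lake data source
    join_key="customer_id",
    columns=["l.order_id", "r.customer_name"],
)
print(sql)
```

A statement of this shape could be pasted into the Query Editor once both data sources are connected, with the names replaced by those shown in the Data Catalog.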
You can access the data in connected data sources from Superset and visualize the data that results from complex, federated queries. Superset is accessible in HPE Ezmeral Unified Analytics Software by going to BI Reporting > Dashboards, or by going to Tools & Frameworks, selecting the Data Engineering tab, and clicking Open in the Superset tile. See Superset. You can also monitor the state of queries and query details, including the query plan and resource usage, by going to Administration > EzSQL Cluster Monitoring.
EzPresto Key Features
- Data Source Connectivity
- EzPresto includes connectors for several data sources, including:
- HPE Ezmeral Data Fabric
- HDFS
- Data Lakes
- Hive Metastore (including managed HMS services such as AWS Glue)
- Object Stores
- Relational Databases
- NoSQL Databases
- Streaming Data Platforms
- Data Warehouses
- Built-In Data Catalog
- The built-in data catalog provides dynamic registration of new data sources. Data administrators can add new data sources as they become available without restarting any services. When a data administrator adds a new data source, the data catalog automatically refreshes so users, such as data analysts, can browse the new data sets and perform downstream activities, such as reporting and dashboarding.
- Role-Based Access Controls
- Role-based access controls isolate queries such that members (non-admin users) can only view, access, and cancel their own queries. Admin users have full access to all queries. For example, if a member runs a query that takes too long to complete or uses too many resources, any admin in HPE Ezmeral Unified Analytics Software can stop the query.
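The visibility rule described above can be sketched in a few lines. This is an illustrative model only, not the EzPresto implementation; the `Query` type, function names, and user names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    query_id: str
    owner: str


def visible_queries(user, is_admin, queries):
    """Admins see every query; members see only the queries they own."""
    if is_admin:
        return list(queries)
    return [q for q in queries if q.owner == user]


def can_cancel(user, is_admin, query):
    """A query can be cancelled by its owner or by any admin."""
    return is_admin or query.owner == user


queries = [Query("q1", "alice"), Query("q2", "bob")]
member_view = visible_queries("alice", False, queries)
admin_view = visible_queries("carol", True, queries)
```

Here the member `alice` sees only `q1`, while the admin `carol` sees both queries and can cancel either of them.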
- Optimized Federated Queries
- Access data across disparate data sources in a single, optimized query. Query optimizations for accelerated performance include:
- Predicate pushdown - EzPresto pushes filters in the WHERE clause down to the data source for processing to reduce the number of rows returned.
- Projection pushdown - EzPresto pushes projections (scanning of selected columns only) down to the data source for processing to reduce the amount of data returned.
- Dynamic filtering - EzPresto evaluates predicates on the right side of a join and pushes them to the left side of the join to reduce the number of rows scanned from the left table.
- Cost-based optimization - EzPresto uses table statistics to calculate the cost (resource usage) of various query plans and chooses the optimal plan (plan that uses the least resources) to run the query.
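The effect of the first three optimizations can be illustrated with a toy simulation. The sketch below is not how EzPresto is implemented; it models each data source as an in-memory list and counts how many cells cross from the source to the engine. The table contents, column names, and the `scan` and `cells` helpers are hypothetical, and cost-based optimization is not modeled.

```python
ORDERS = [  # hypothetical left-side table held by one data source
    {"order_id": 1, "customer_id": 10, "amount": 250, "note": "a"},
    {"order_id": 2, "customer_id": 11, "amount": 40, "note": "b"},
    {"order_id": 3, "customer_id": 12, "amount": 900, "note": "c"},
    {"order_id": 4, "customer_id": 10, "amount": 75, "note": "d"},
]
CUSTOMERS = [  # hypothetical right-side table held by another data source
    {"customer_id": 10, "region": "EMEA"},
    {"customer_id": 11, "region": "APAC"},
    {"customer_id": 12, "region": "EMEA"},
]


def scan(table, columns=None, predicate=None):
    """Simulate a source-side scan: filter rows first, then keep columns."""
    rows = [r for r in table if predicate is None or predicate(r)]
    if columns is not None:
        rows = [{c: r[c] for c in columns} for r in rows]
    return rows


def cells(rows):
    """Count the values transferred from the source to the engine."""
    return sum(len(r) for r in rows)


# No pushdown: the engine pulls every row and column, then filters locally.
full = scan(ORDERS)

# Predicate and projection pushdown: the source applies amount > 100
# and returns only the three columns the query needs.
pushed = scan(
    ORDERS,
    columns=["order_id", "customer_id", "amount"],
    predicate=lambda r: r["amount"] > 100,
)

# Dynamic filtering: evaluate the right-side predicate first, collect the
# matching join keys, and use them to restrict the left-side scan.
right = scan(CUSTOMERS, columns=["customer_id"],
             predicate=lambda r: r["region"] == "APAC")
join_keys = {r["customer_id"] for r in right}
left = scan(ORDERS, columns=["order_id", "customer_id"],
            predicate=lambda r: r["customer_id"] in join_keys)

print(cells(full), cells(pushed), left)
```

In this toy data set, pushing the filter and projection to the source cuts the transfer from 16 cells to 6, and the dynamic filter reduces the left-side scan to the single order that can match the join.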
- Distributed Caching
- EzPresto accelerates federated queries through distributed caching of commonly used data sets. EzPresto currently supports explicit caching, where you manually modify tables and select the data that you want cached for fast query access. You can use explicitly cached data for data modeling. EzPresto stores cached data in a data fabric volume. The cache expires based on the set TTL (time to live). See Connecting Data Sources and Caching Data.
- Explicit Caching
- You manually modify tables and select the data in the tables that you want stored in the cache for fast query access. You can use explicitly cached data for data modeling. You can set a TTL (time to live) for the cache.
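As a rough illustration of TTL-based expiry, the sketch below implements a small in-memory cache whose entries expire after a fixed number of seconds. It is a conceptual model only: EzPresto stores cached data in a data fabric volume, and the class name, method names, and injectable clock here are hypothetical.

```python
import time


class TTLCache:
    """Toy explicit cache: entries expire ttl_seconds after they are stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # name -> (expires_at, rows)

    def put(self, name, rows):
        """Explicitly cache the selected rows under a data set name."""
        self._entries[name] = (self._clock() + self._ttl, rows)

    def get(self, name):
        """Return cached rows, or None if the entry is missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        expires_at, rows = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # drop the expired entry
            return None
        return rows


now = [0.0]  # fake clock value, in seconds
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("orders_cached", [{"order_id": 1}])
fresh = cache.get("orders_cached")    # read within the TTL
now[0] = 61.0                         # advance past the TTL
expired = cache.get("orders_cached")  # entry has expired
```

The fake clock makes the expiry behavior deterministic; with the default `time.monotonic` clock the same class expires entries in real time.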
- Self-Service Data Access
- End-users can browse data sets they have access to and select the relevant data for their queries and analytical applications and workloads.
- Run-Anywhere Architecture
- EzPresto has a run-anywhere architecture; you can run EzPresto on-premises, at the edge, in the cloud, or in hybrid environments.
EzPresto Architecture
- Presto
- EzPresto uses a modified version of Presto as the query engine. Most of the modifications are in the query planning and optimizer areas, as well as support for different data sources, such as Teradata and Snowflake, and in-process caching based on Apache Geode, tuned for OLTP and OLAP access. The cache provides a tuple store with specialized columnar formats.
- WebService
- Provides the API.
- Web UI
- Provides the ability to access EzPresto in applications.
- Client Connections
- Provides the ability to connect to BI tools and external data sources via the JDBC client.
- KeyCloak
- KeyCloak provides the authentication mechanism and different authentication options, such as LDAP and JWT.