Caching Data

Describes data caching and provides the steps for caching data in HPE Ezmeral Unified Analytics Software.

Caching data reduces latency. You can pre-load frequently accessed data into the cache to improve the performance of queries on that data. Caching is especially useful when firewalls between the query engine and the data source add network latency.

When queries run against data or tables, the query engine automatically checks for cached data and uses it if present. Cache optimization works when queries reference remote tables. Queries issued against cached data do not require optimization.
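The cache-first behavior described above can be pictured with a minimal Python sketch. This is a conceptual model only, not the product's query engine; `read_table`, `cache`, and `fetch_remote` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

Row = dict


def read_table(
    name: str,
    cache: Dict[str, List[Row]],
    fetch_remote: Callable[[str], List[Row]],
) -> List[Row]:
    """Return cached rows when present; otherwise read the remote table."""
    if name in cache:
        # Cached data exists, so the remote source is never contacted.
        return cache[name]
    # No cached copy: fall back to the (slower) remote read.
    return fetch_remote(name)
```

In this model, a query against a table that is already cached never calls `fetch_remote`, which is where the latency saving comes from.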

The cache lasts for the duration of the TTL (time-to-live). The user who connects HPE Ezmeral Unified Analytics Software to a data source selects the caching option (Enable Local Snapshot Table) and sets the TTL for the cache. The TTL is set in minutes; the default is one day (1,440 minutes).
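Because the TTL is entered in minutes, it can help to convert longer durations before filling in the field. The following Python sketch shows the arithmetic; the helper names `ttl_minutes` and `expires_at` are hypothetical and are not part of the product.

```python
from datetime import datetime, timedelta

# One day expressed in minutes, matching the default TTL described above.
DEFAULT_TTL_MINUTES = 24 * 60


def ttl_minutes(days: int = 0, hours: int = 0, minutes: int = 0) -> int:
    """Convert a duration to whole minutes for the TTL field."""
    total = timedelta(days=days, hours=hours, minutes=minutes)
    return int(total.total_seconds() // 60)


def expires_at(cached_at: datetime, ttl: int = DEFAULT_TTL_MINUTES) -> datetime:
    """Return the moment a cache created at cached_at would expire."""
    return cached_at + timedelta(minutes=ttl)
```

For example, a one-week cache corresponds to `ttl_minutes(days=7)`, which is 10080 minutes.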

HPE Ezmeral Unified Analytics Software stores cached data in an HPE Ezmeral Data Fabric volume. You can view and access cached data in the HPE Ezmeral Unified Analytics Software UI by going to Data Engineering > Cached Assets in the left navigation bar.

When you cache data, you can modify the data sets before caching them. The following list describes some of the changes that you can make to a data set:
  • Edit the data set name
  • Remove columns from the data set
  • Edit column names
  • Change the schema or add a new schema
  • Apply a schema to the selected data sets
IMPORTANT
  • Cached data is available only to the user who cached it. Other users who sign in to HPE Ezmeral Unified Analytics Software cannot access the data that you cache.
  • If data in the underlying data sources changes, HPE Ezmeral Unified Analytics Software does not automatically update the cache. You must cache the data again to refresh the cache.
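
The two points above (per-user visibility and no automatic refresh) can be illustrated with a small Python model. This is a sketch under the assumption that a cache behaves like a per-user snapshot; `SnapshotCache` is a hypothetical class and does not reflect how the product stores data internally.

```python
from typing import Dict, List, Optional, Tuple


class SnapshotCache:
    """Toy model: each cached data set is a private, point-in-time copy."""

    def __init__(self) -> None:
        # Keyed by (user, data set name) so one user's cache is invisible to others.
        self._snapshots: Dict[Tuple[str, str], List[dict]] = {}

    def cache(self, user: str, dataset: str, source_rows: List[dict]) -> None:
        # Copy the rows so later changes in the source do not leak into the cache.
        self._snapshots[(user, dataset)] = [dict(row) for row in source_rows]

    def get(self, user: str, dataset: str) -> Optional[List[dict]]:
        # Returns None when this user has not cached the data set.
        return self._snapshots.get((user, dataset))
```

In this model, rows added to the source after `cache()` runs do not appear in `get()` until the same user calls `cache()` again, and a different user always gets `None`.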

How to Cache Data

HPE Ezmeral Unified Analytics Software must be connected to the data sources with the data sets that you want to cache. See Connecting Data Sources.

To cache data, complete the following steps:
  1. Sign in to HPE Ezmeral Unified Analytics Software.
  2. In the left navigation bar, select Data Engineering > Data Catalog.
  3. In the Connected Data Sources area, select the data sources with the data that you want to cache. The data sets available to you in the selected data sources appear in the All Datasets area.
  4. Optionally, search or filter the data sets to find the data set(s) that you want to cache.
  5. Click + Select for each of the data sets that you want to cache.
  6. Click Selected Datasets. The Selected Datasets drawer opens and displays the selected data sets.
  7. Click Cache Datasets. The Manage Datasets screen appears. Each data set that you selected appears on its own tab.
  8. Optionally, modify the data set(s).
    • Use the pencil icon to modify data set and column names.
    • Use the check boxes next to the column names to remove columns from the data set.
    • Use the Schema dropdown to change the schema or add a new schema.
    • If you have selected multiple data sets, use the connector icon next to the schema dropdown to apply the schema to all of the selected data sets.
  9. Click Cache Overview and compare the original data sets (Input Assets) to the modified data sets (Output Assets) to verify the changes.
  10. If the changes to a data set are incorrect, click the pencil icon to edit the data set.
  11. To cache the data set(s), click Save to cache. The system displays the following message:
    Successfully initiating cache
    If an error appears, correct the issue and continue.
  12. To view the cached data sets, go to Data Engineering > Cached Assets in the left navigation bar of the HPE Ezmeral Unified Analytics Software UI.
    NOTE
    Depending on the size of the data sets, it may take a minute or so for them to appear as cached assets.

Enable or Disable a Cache

You can enable or disable caching through the Enable Local Snapshot Table option when you create a data source connection. See Connecting Data Sources.

You cannot disable caching by setting the TTL to zero. If the TTL is set to zero, the cache expires immediately but still consumes resources.
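
One way to picture why a TTL of zero does not disable caching is the following Python sketch. It is a conceptual model, not the product's implementation; `TtlStore` and its methods are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


class TtlStore:
    """Toy model of a cache whose entries expire after a TTL in minutes."""

    def __init__(self) -> None:
        # name -> (rows, time cached, TTL in minutes)
        self._entries: Dict[str, Tuple[List[dict], datetime, int]] = {}

    def put(self, name: str, rows: List[dict], ttl_minutes: int, now: datetime) -> None:
        # The entry is written regardless of the TTL value, including zero.
        self._entries[name] = (list(rows), now, ttl_minutes)

    def get(self, name: str, now: datetime) -> Optional[List[dict]]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        rows, cached_at, ttl = entry
        if now >= cached_at + timedelta(minutes=ttl):
            return None  # expired: unusable for queries
        return rows

    def stored_count(self) -> int:
        # Counts entries that still occupy storage, expired or not.
        return len(self._entries)
```

In this model, an entry written with a TTL of zero is never returned by `get()`, yet `stored_count()` still includes it, which mirrors the statement that the cache expires immediately but still consumes resources.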