Configuring Data Fabric to Track User Behavior

Describes how to configure Data Fabric to track user behavior.

Introduction

When auditing is enabled in Data Fabric, files, streams, S3 objects, and tables can be audited for cluster administration operations, data access operations, or both. See Enabling and Disabling Auditing of Cluster Administration to enable auditing for cluster administration.

NOTE
When auditing is enabled, audit records are written to audit log files.

See Streaming Audit Logs for information on audit log streaming. See Enabling and Disabling Audit Streaming Using the CLI to enable audit streaming.

Data Fabric generates audit logs for the services related to the various Data Fabric components. These services include CLDB, MFS, S3, and auth; auth logs are authentication audit logs.

Auditing records user behavior and helps track anomalies or potential data security threats in Data Fabric. See Auditing in Data Fabric for information on auditing.

See Log files for information about the various log files generated by Data Fabric.

See Viewing Audit Logs for information on how to view audit logs in Data Fabric.

Data Fabric audit logs provide insight into the activity that has taken place on a cluster. The insight service is available as a distributed service on a cluster/fabric when run in production mode.

The audit logs are stored on the nodes on which the respective Data Fabric services run, which makes it cumbersome to correlate the various logs.

Data Fabric stores audit logs in files, and the audit logs can be directed to streams. However, it is not possible to run queries on stream data. Therefore, the insight service reads the stream data and appends the audit data to the respective Apache Iceberg tables.

HPE recommends that you install Spark and Zeppelin. With these services available, the insight service helps generate standard graphs that represent the activity taking place on Data Fabric.

NOTE
Data Fabric creates the relevant Iceberg tables for storage of insight data on successful installation of the mapr-hivemetastore package.
IMPORTANT
The insight service must be installed on all nodes of your cluster/fabric to effectively gather information about activities and events recorded in the audit log for each node. All the data gathered by the insight service is stored in a single set of Iceberg tables.

You can write applications that consume the audit log data stored on Iceberg to detect anomalies in user behavior.
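As an illustration, the following PySpark sketch flags users with an unusually high number of failed authentications. The table and column names (insight.default.auth_audit, event_time, status, user_name) are hypothetical; substitute the Iceberg table names documented for your mode, and run the code in Zeppelin or any Spark session that can resolve the insight tables (catalog configuration is sketched later in this topic).

    from pyspark.sql import SparkSession, functions as F

    # In Zeppelin a session already exists; getOrCreate() reuses it.
    spark = SparkSession.builder.getOrCreate()

    # Hypothetical table and column names; substitute the Iceberg
    # table names listed for trial or production mode.
    auth = spark.read.table("insight.default.auth_audit")

    # Flag users with many failed authentications in the last 24 hours.
    suspects = (auth
        .filter(F.col("event_time") > F.expr("current_timestamp() - INTERVAL 24 HOURS"))
        .filter(F.col("status") == "FAILURE")
        .groupBy("user_name")
        .agg(F.count("*").alias("failed_logins"))
        .filter(F.col("failed_logins") > 50))  # simple anomaly threshold

    suspects.show()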

The insight service can be installed from the mapr-insight package, which is available on the HPE website that hosts the Data Fabric packages.

The insight-related Iceberg tables are stored on Data Fabric and use Hive as the Iceberg catalog, through the Hive Metastore service. For production use, the Hive Metastore must be configured with an RDBMS; for trial purposes, it can use the default Derby database.

Insight-related Iceberg tables are placed in the default namespace of the Hive Metastore.
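For reference, this is one way to point a Spark session at Iceberg tables registered in the Hive Metastore. The catalog name (insight) and the metastore URI are assumptions, and the Iceberg Spark runtime JAR must be on the classpath; the insight tables themselves are created by Data Fabric, not by this code.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        .appName("insight-iceberg")
        # Standard Apache Iceberg catalog settings; "insight" is an
        # arbitrary catalog name and the thrift URI is an assumption.
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.insight", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.insight.type", "hive")
        .config("spark.sql.catalog.insight.uri", "thrift://<metastore-host>:9083")
        .getOrCreate())

    # The insight tables live in the Hive Metastore's default namespace.
    spark.sql("SHOW TABLES IN insight.default").show()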

Types of Audit Logs Copied for Analysis

The following audit logs are expanded using a variant of the expandaudit utility, which replaces identifiers such as UIDs and volume IDs with user-friendly values such as usernames and volume names, before the logs are committed to the Iceberg tables.

  • CLDB logs
  • MFS logs
  • authentication logs
  • S3 logs

There is a distinct Iceberg table for each of the four types of audit log streams. See Enabling Insight Gathering in Trial Mode and Enabling Insight Gathering in Production Mode for the Iceberg table names used in each mode in which insight gathering can operate.

Configure Data Fabric to Track User Behavior

The insight service can be enabled to gather insights at the cluster level, the audit log type level, and the node level.

If you do not want data from certain nodes to be copied to Iceberg, you can disable insight gathering on those nodes. By default, insight gathering is disabled. When the insight feature is enabled at the global level, audit logs of all types and from all nodes are committed to the Iceberg tables periodically. The insight commands let you enable or disable insight gathering by log type, or at the node level on a node-by-node basis.

IMPORTANT
Install and set up your cluster before installing the packages for the insight service and enabling the insight service.

You can customize insight gathering with the insight CLI command. For instance, you can turn off insight gathering on some nodes (although this is not recommended), or turn off insight gathering for certain audit components, such as S3, if there is heavy S3 traffic on your cluster/fabric.

Enable Insight Gathering

The insight service gathers audit data and commits it to the respective Iceberg table every five minutes, by default. Audit records are collected into a data file in batches of 1024 records before the data files are pushed to Apache Iceberg at the five-minute interval.
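Conceptually, the batching behaves like the following sketch. This is an illustration of the batch-and-flush pattern described above, with hypothetical write_data_file() and commit() helpers, not the actual service implementation.

    import time

    BATCH_SIZE = 1024          # records per data file (default)
    FLUSH_INTERVAL = 5 * 60    # seconds between Iceberg commits (default)

    def gather(stream, table):
        """Illustrative batch-and-flush loop; write_data_file() and
        commit() are hypothetical helpers, not a real API."""
        buffer, last_flush = [], time.time()
        for record in stream:                  # audit records from the stream
            buffer.append(record)
            if len(buffer) >= BATCH_SIZE:      # stage a 1024-record data file
                table.write_data_file(buffer)
                buffer = []
            if time.time() - last_flush >= FLUSH_INTERVAL:
                if buffer:
                    table.write_data_file(buffer)
                    buffer = []
                table.commit()                 # make staged files queryable
                last_flush = time.time()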

NOTE
See insight to change the default interval, data file buffer size settings, and other insight-related parameters. See insight cluster to enable insight gathering.
TIP
Purging of Apache Iceberg table records is performed periodically by one insight node. By default, the purge operation runs once every hour and removes records older than 5 days. See insight to change the default purge frequency. The retention period for the Iceberg tables can also be configured using the insight command.
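Expressed as Spark SQL against an Iceberg table, the default retention policy is roughly equivalent to the following. The table name is hypothetical, and the insight service runs the purge automatically, so you do not run this yourself.

    # Conceptual equivalent of the hourly purge (5-day retention),
    # using a Spark session configured as sketched earlier.
    spark.sql("""
        DELETE FROM insight.default.auth_audit
        WHERE event_time < current_timestamp() - INTERVAL 5 DAYS
    """)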

Insight gathering stops when the Hive Metastore is down; the insight service waits for the Hive Metastore service to be up and running before it commits records to the respective Apache Iceberg tables. When the Hive Metastore runs in high-availability mode, Data Fabric communicates any switchover of the Hive Metastore master to the insight services running on the individual nodes.