Configuring Data Fabric to Track User Behavior
Describes how to configure Data Fabric to track user behavior.
Introduction
When auditing is enabled in Data Fabric, files, streams, S3 objects, and tables can be audited for cluster administration and/or data access operations. See Enabling and Disabling Auditing of Cluster Administration to enable auditing for cluster administration.
See Streaming Audit Logs for information on audit log streaming. See Enabling and Disabling Audit Streaming Using the CLI to enable audit streaming.
Data Fabric generates audit logs for the services related to its various components. These services include CLDB, S3, MFS, and auth. Auth logs are authentication audit logs.
Auditing is useful to record user behavior and assists in tracking anomalies or potential data security threats with respect to Data Fabric. See Auditing in Data Fabric for information on auditing.
See Log files for information about the various log files generated by Data Fabric.
See Viewing Audit Logs for information on how to view audit logs in Data Fabric.
Data Fabric audit logs provide insights into the activity that has taken place in relation to a cluster. The insight service is available as a distributed service on a cluster/fabric, when run in production mode.
The audit logs are stored on nodes on which the respective Data Fabric service runs. This makes it cumbersome to establish a correlation between the various logs.
Data Fabric stores audit logs in files, and the audit logs can be directed to streams. However, it is not possible to run queries on stream data. Hence, the insight service picks up the stream data and adds the audit data to the respective Apache Iceberg tables.
HPE recommends that you have Spark and Zeppelin installed. With these services available, the insight service helps generate standard graphs that represent the activity taking place on Data Fabric
mapr-hivemetastore package. You can write applications that consume the audit log data stored in Iceberg to detect anomalies in user behavior.
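As a rough illustration of an application that looks for anomalies in audit data, the following sketch flags users with repeated failed operations. It works on plain Python dictionaries rather than on a real Iceberg table. The field names (user, operation, status), the sample values, and the rule that a non-zero status means failure are all assumptions made for this example, not the actual insight table schema.

```python
from collections import Counter

# Hypothetical audit records, shaped like rows an application might read
# from an insight Iceberg table. Field names and values are assumptions.
SAMPLE_RECORDS = [
    {"user": "alice", "operation": "READ", "status": 0},
    {"user": "bob", "operation": "DELETE", "status": 13},
    {"user": "bob", "operation": "DELETE", "status": 13},
    {"user": "bob", "operation": "WRITE", "status": 13},
    {"user": "carol", "operation": "READ", "status": 0},
]


def users_with_repeated_failures(records, threshold=3):
    """Return the sorted list of users whose failed-operation count
    reaches the threshold.

    A non-zero status is treated as a failure (an assumption).
    """
    failures = Counter(r["user"] for r in records if r["status"] != 0)
    return sorted(user for user, count in failures.items() if count >= threshold)


if __name__ == "__main__":
    print(users_with_repeated_failures(SAMPLE_RECORDS))
```

In practice the records would come from a query against the Iceberg tables (for example through Spark), and the threshold would be tuned to the normal activity level of the cluster.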
The insight service can be installed from the mapr-insight package available on the HPE website that hosts the Data Fabric packages.
The insight-related Iceberg tables are stored on Data Fabric and use Hive as the Iceberg catalog through the Hive Metastore service. Configure Hive Metastore with an RDBMS for production, or use the default database (Derby) for trial purposes.
Insight-related Iceberg tables are placed in the default namespace of Hive Metastore.
Types of Audit Logs Copied for Analysis
The following audit logs are expanded using a variant of the expand-audit utility so that they include user-friendly names (usernames, volume names, and so on) in place of identifiers such as uids and volids before the logs are committed to the Iceberg table.
- CLDB logs
- MFS logs
- authentication logs
- S3 logs
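The expansion step described above can be pictured with a minimal sketch. This is not the expand-audit utility itself. The lookup tables, the sample IDs, and the output field names (username, volumeName) are hypothetical and chosen only to show the idea of resolving raw IDs to readable names.

```python
# Hypothetical lookup tables; a real deployment would resolve these
# from the cluster rather than from hard-coded dictionaries.
UID_TO_USER = {5000: "alice"}
VOLID_TO_VOLUME = {68048396: "projects"}


def expand_record(record, uid_to_user, volid_to_volume):
    """Return a copy of the record with readable names added next to raw IDs.

    IDs that cannot be resolved are left as they are rather than guessed.
    """
    expanded = dict(record)
    if record.get("uid") in uid_to_user:
        expanded["username"] = uid_to_user[record["uid"]]
    if record.get("volid") in volid_to_volume:
        expanded["volumeName"] = volid_to_volume[record["volid"]]
    return expanded


if __name__ == "__main__":
    raw = {"operation": "LOOKUP", "uid": 5000, "volid": 68048396}
    print(expand_record(raw, UID_TO_USER, VOLID_TO_VOLUME))
```

Keeping the raw IDs alongside the names is a design choice of this sketch, so that a record can still be matched against unexpanded logs.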
There are four distinct Iceberg tables, one for each type of audit log stream. See Enabling Insight Gathering in Trial Mode and Enabling Insight Gathering in Production Mode for the Iceberg table names used in each mode in which insight gathering can operate.
Configure Data Fabric to Track User Behavior
The insight service can be enabled to gather insights at the cluster level, the log-type level, and the node level.
If you do not want data from certain nodes to be copied to Iceberg, you can disable insights for those nodes. By default, insights are disabled. When the insight feature is enabled at the global level, audit logs for all types and all nodes are committed to Iceberg tables periodically. The insight commands let you enable or disable insights by log type, or at the node level on a node-by-node basis.
You can customize the insights with the insight CLI command.
For instance, you can turn off insight gathering on some nodes (although this is not recommended), or you can turn off insight gathering of certain audit components such as S3 due to heavy S3 traffic on your cluster/fabric.
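One way to reason about the three levels is the small model below. It is only a sketch: the class and attribute names are hypothetical, it does not call the insight CLI, and the precedence rule (a record is gathered only when insights are enabled globally and neither its log type nor its node is disabled) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass, field


@dataclass
class InsightSettings:
    """Hypothetical in-memory model of insight settings (not real CLI state)."""

    # Insights are disabled by default, as stated in the text.
    enabled_globally: bool = False
    disabled_types: set = field(default_factory=set)
    disabled_nodes: set = field(default_factory=set)

    def is_gathered(self, log_type: str, node: str) -> bool:
        # Assumed precedence: the global switch must be on, and neither
        # the log type nor the node may be individually disabled.
        return (
            self.enabled_globally
            and log_type not in self.disabled_types
            and node not in self.disabled_nodes
        )


if __name__ == "__main__":
    settings = InsightSettings(enabled_globally=True, disabled_types={"s3"})
    print(settings.is_gathered("cldb", "node1"))  # cldb is not disabled
    print(settings.is_gathered("s3", "node1"))    # s3 is disabled by type
```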
Enable Insight Gathering
By default, the insight service gathers audit data and commits it to the respective Iceberg table every five minutes. Audit records are collected into data files in batches of 1024 records, and the data files are pushed to Apache Iceberg at the five-minute interval.
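The batching and commit cadence described above can be sketched as follows. This is a simplified model, not the insight service implementation: the class name is hypothetical, "data files" are plain lists, the commit target is an in-memory list, and flushing a partially filled batch when the interval elapses is an assumption.

```python
import time

BATCH_SIZE = 1024              # records per data file (default from the text)
COMMIT_INTERVAL_SECONDS = 300  # five minutes (default from the text)


class AuditBatcher:
    """Hypothetical model: group records into data files, commit on a timer."""

    def __init__(self, batch_size=BATCH_SIZE,
                 commit_interval=COMMIT_INTERVAL_SECONDS, clock=time.monotonic):
        self.batch_size = batch_size
        self.commit_interval = commit_interval
        self.clock = clock
        self.current = []     # records not yet closed into a data file
        self.data_files = []  # closed data files waiting for the next commit
        self.committed = []   # stand-in for the Iceberg table contents
        self.last_commit = clock()

    def add(self, record):
        self.current.append(record)
        if len(self.current) >= self.batch_size:
            self.data_files.append(self.current)
            self.current = []
        self.maybe_commit()

    def maybe_commit(self):
        if self.clock() - self.last_commit < self.commit_interval:
            return
        if self.current:
            # Assumption: a partial batch is flushed at commit time.
            self.data_files.append(self.current)
            self.current = []
        self.committed.extend(self.data_files)
        self.data_files = []
        self.last_commit = self.clock()
```

The clock is injectable so that the timing behavior can be exercised without waiting five minutes.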
Insight gathering stops when Hive Metastore is down. The insight service waits for the Hive Metastore service to be up and running before it commits records to the respective Apache Iceberg table. When Hive Metastore is running in high availability mode, Data Fabric communicates any switchover of the Hive Metastore master to the insight services running on the individual nodes.
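The wait-then-commit behavior can be illustrated with a simple polling loop. The function and its parameters are hypothetical. How the real insight service detects that Hive Metastore is available again is not described here, so the health check is passed in as a callable.

```python
import time


def commit_when_metastore_up(is_metastore_up, commit, poll_seconds=5.0,
                             max_attempts=None, sleep=time.sleep):
    """Poll until the metastore is reachable, then run the commit.

    Returns the number of polls that found the metastore down.
    Raises TimeoutError if max_attempts is set and reached.
    """
    waits = 0
    while not is_metastore_up():
        waits += 1
        if max_attempts is not None and waits >= max_attempts:
            raise TimeoutError("Hive Metastore did not come back up")
        sleep(poll_seconds)
    commit()
    return waits


if __name__ == "__main__":
    # Simulated metastore that is down for two checks, then up.
    states = iter([False, False, True])
    waits = commit_when_metastore_up(
        lambda: next(states),
        lambda: print("records committed"),
        poll_seconds=0.01,
    )
    print("polls while down:", waits)
```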