Configuring Data Fabric to Track User Behavior
Describes how to configure Data Fabric to track user behavior.
Introduction
When auditing is enabled in Data Fabric, files, streams, S3 objects, and tables can be audited for cluster administration and/or data access operations. See Enabling and Disabling Auditing of Cluster Administration to enable auditing for cluster administration.
See Streaming Audit Logs for information on audit log streaming. See Enabling and Disabling Audit Streaming Using the CLI to enable audit streaming.
Data Fabric generates audit logs for the services related to its components. These services include CLDB, S3, MFS, and auth. Auth logs are authentication audit logs.
Auditing records user behavior and helps track anomalies or potential data security threats in Data Fabric. See Auditing in Data Fabric for information on auditing.
See Log files for information about the various log files generated by Data Fabric.
See Viewing Audit Logs for information on how to view audit logs in Data Fabric.
Data Fabric audit logs provide insights into the activity that has taken place on a cluster. When run in production mode, the insight service is available as a distributed service on a cluster/fabric.
Audit logs are stored on the nodes where the respective Data Fabric service runs, which makes it cumbersome to correlate the various logs.
Data Fabric stores audit logs in files, and the audit logs can also be directed to streams. However, it is not possible to run queries on stream data. The insight service therefore reads the stream data and adds the audit data to the respective Apache Iceberg tables.
Tools like Apache Spark can be used to run queries on the audit log data stored on Apache Iceberg, and tools like Apache Zeppelin can be used to provide graphical insights. Apache Zeppelin can make use of customizable queries to generate dashboards.
When you wish to run the insight service in trial mode, enable audit streams so that the audit logs from the audit log files are also available on streams. The audit data can then be transferred to Apache Iceberg tables and analyzed further by using Apache Spark or Apache Zeppelin.
You can write applications that consume the audit log data stored on Iceberg to detect anomalies in user behavior.
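As a rough illustration of such an application, the following sketch flags users whose number of audited operations is far above that of their peers. It assumes the audit rows have already been read from an Iceberg table into a list of dictionaries. The field names `user` and `operation`, the sample data, and the threshold are hypothetical and are not taken from the Data Fabric audit schema.

```python
from collections import Counter
from statistics import mean, pstdev


def flag_unusual_users(records, threshold=2.0):
    """Return users whose operation count exceeds mean + threshold * stddev.

    `records` is a list of dicts with a hypothetical "user" key; the real
    audit schema may use different field names.
    """
    counts = Counter(r["user"] for r in records)
    if len(counts) < 2:
        return []  # not enough users to compare against each other
    avg = mean(counts.values())
    spread = pstdev(counts.values())
    if spread == 0:
        return []  # every user has the same count, nothing stands out
    return sorted(u for u, n in counts.items() if n > avg + threshold * spread)


# Hypothetical sample: five users with normal activity, one with a burst.
sample = (
    [{"user": "alice", "operation": "READ"}] * 5
    + [{"user": "bob", "operation": "READ"}] * 6
    + [{"user": "carol", "operation": "READ"}] * 4
    + [{"user": "dave", "operation": "READ"}] * 5
    + [{"user": "erin", "operation": "READ"}] * 5
    + [{"user": "mallory", "operation": "DELETE"}] * 60
)

print(flag_unusual_users(sample))  # prints ['mallory']
```

A simple count-based rule like this is only a starting point. A real application would likely look at time windows, operation types, and the resources accessed.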
The insight service can be installed from the mapr-insight package, which is available on the HPE website that hosts the MEP packages.
The insight service uses Hive Metastore to store and manage the Iceberg catalog. Hive Metastore must be accessible to the insight service for storing audit logs in Iceberg tables. You must have the mapr-hivemetastore package installed and configured on your cluster to be able to use the insight service.
Hive Metastore requires a relational database management system like MySQL in production setups. See Using MySQL for the Hive Metastore to use MySQL with Hive Metastore.
To set up MySQL to work with the Hive Metastore and Data Fabric, see Configuring a Remote MySQL Database for Hive Metastore.
Data Fabric supports other production grade databases like PostgreSQL. See Configuring Data Fabric for Hive Metastore for details.
Types of Audit Logs Copied for Analysis
Before the logs are committed to the Iceberg tables, the following audit logs are expanded by a variant of the expand-audit utility, which replaces identifiers such as uids and volids with user-friendly equivalents such as usernames and volume names.
- CLDB logs
- MFS logs
- authentication logs
- S3 logs
There are four distinct Iceberg tables, one for each type of audit log stream. See Enabling Insight Gathering in Trial Mode and Enabling Insight Gathering in Production Mode for the Iceberg table names used in each mode of insight gathering.
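The expansion step can be pictured with the short sketch below. It is not the expand-audit utility itself, only an illustration of the idea of resolving identifiers to names. The record layout and the field names `uid`, `volumeId`, `username`, and `volumeName` are hypothetical, as are the lookup tables.

```python
def expand_record(record, users, volumes):
    """Return a copy of an audit record with readable names added.

    `users` maps uid -> username and `volumes` maps volume id -> volume name.
    All field names here are hypothetical placeholders.
    """
    expanded = dict(record)  # leave the original record untouched
    if "uid" in record:
        # Fall back to the raw id when no name is known.
        expanded["username"] = users.get(record["uid"], str(record["uid"]))
    if "volumeId" in record:
        expanded["volumeName"] = volumes.get(
            record["volumeId"], str(record["volumeId"])
        )
    return expanded


# Hypothetical lookup tables and record.
users = {5000: "alice"}
volumes = {68048396: "projects"}
raw = {"operation": "CREATE", "uid": 5000, "volumeId": 68048396}

print(expand_record(raw, users, volumes))
```

Falling back to the raw identifier keeps a record usable even when a user or volume has since been removed.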
Configure Data Fabric to Track User Behavior
The insight service can be enabled to gather insights at the cluster level, the type level, and the node level.
By default, insights are disabled. When the insight feature is enabled at the global level, audit logs for all types and all nodes are committed to Iceberg tables periodically. If you do not want data from certain nodes to be copied to Iceberg, you can disable insights for those nodes. The insight commands are available to enable or disable insights based on the type of logs, or at the node level on a node-by-node basis.
You can use tools like Spark and Zeppelin to run queries on the Iceberg tables and generate the reports and charts you need to detect anomalies in user behavior related to data access operations and cluster administration.
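As a sketch of what such a query might look like, the helper below builds a Spark SQL statement that counts audit rows per user. The table name and the column name are parameters because the actual Iceberg table names and schema are not given here. The values used in the example (`audit_catalog.audit.mfs_logs`, `username`) are hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def operations_per_user_query(table, user_column="username", limit=20):
    """Build a SQL string that ranks users by number of audit rows.

    Only plain dotted identifiers are accepted, so the values can be
    interpolated into the statement safely.
    """
    for name in (table, user_column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a valid identifier: {name!r}")
    return (
        f"SELECT {user_column}, COUNT(*) AS operations "
        f"FROM {table} "
        f"GROUP BY {user_column} "
        f"ORDER BY operations DESC "
        f"LIMIT {int(limit)}"
    )


# Hypothetical table name; substitute the table used in your deployment.
query = operations_per_user_query("audit_catalog.audit.mfs_logs")
print(query)
```

In a PySpark session whose catalog is configured for Iceberg, the resulting string could be passed to `spark.sql(query)` and the result displayed with `.show()`. The same statement could also be pasted into a Zeppelin paragraph.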
You can customize the insights with the insight CLI command.
For instance, you can turn off insight gathering on some nodes (although this is not recommended), or you can turn off insight gathering of certain audit components such as S3 due to heavy S3 traffic on your cluster/fabric.
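The interaction of the three levels can be modeled roughly as follows. This is a conceptual sketch only and not the insight CLI. It assumes that a cluster-level switch must be on, and that a log type or node that has been turned off is then excluded. The class, its attributes, and the node names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InsightSettings:
    """Toy model of cluster-level, type-level, and node-level switches."""

    cluster_enabled: bool = False  # insights are disabled by default
    disabled_types: set = field(default_factory=set)
    disabled_nodes: set = field(default_factory=set)

    def is_collected(self, node, log_type):
        """Return True if logs of `log_type` from `node` would be gathered."""
        return (
            self.cluster_enabled
            and log_type not in self.disabled_types
            and node not in self.disabled_nodes
        )


# Enable insights cluster-wide, then exclude S3 logs and one node.
settings = InsightSettings(cluster_enabled=True)
settings.disabled_types.add("s3")
settings.disabled_nodes.add("node3")

print(settings.is_collected("node1", "cldb"))  # prints True
print(settings.is_collected("node1", "s3"))    # prints False
print(settings.is_collected("node3", "cldb"))  # prints False
```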
Enable Insight Gathering
See insight cluster to enable insight gathering.