Configuring Data Fabric to Track User Behavior

Describes how to configure Data Fabric to track user behavior.

When auditing is enabled in Data Fabric, files, streams, S3 objects, and tables can be audited for cluster administration and/or data access operations. See Enabling and Disabling Auditing of Cluster Administration to enable auditing for cluster administration.

NOTE
Audit logs are written to audit log files when auditing is enabled.

See Streaming Audit Logs for information on audit log streaming. See Enabling and Disabling Audit Streaming Using the CLI to enable audit streaming.

Data Fabric generates audit logs for the services related to various Data Fabric components. These services include CLDB, S3, MFS, and auth (auth logs are authentication audit logs).

Auditing records user behavior and assists in tracking anomalies or potential data security threats in Data Fabric. See Auditing in Data Fabric for information on auditing.

See Log Files for information about the various log files generated by Data Fabric.

See Viewing Audit Logs for information on how to view audit logs in Data Fabric.

Data Fabric audit logs provide insights into the activity that has taken place in relation to a cluster. The insight service is available as a distributed service on a cluster/fabric when run in production mode.

The audit logs are stored on the nodes on which the respective Data Fabric service runs, which makes it cumbersome to correlate the various logs.

Data Fabric stores audit logs in files, and the audit logs can also be directed to streams. However, it is not possible to run queries on stream data. Hence, the insight service consumes the stream data and commits the audit records to the respective Apache Iceberg tables.

Tools like Apache Spark can be used to run queries on the audit log data stored on Apache Iceberg, and tools like Apache Zeppelin can be used to provide graphical insights. Apache Zeppelin can make use of customizable queries to generate dashboards.
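As a concrete illustration, the following pure-Python sketch shows the kind of per-user aggregation you might express as a Spark SQL query against the Iceberg audit tables. The record fields (`user`, `op`) and the equivalent SQL shown in the docstring are assumptions for illustration; real audit records carry many more fields.

```python
# Illustrative only: a per-user aggregation, similar in spirit to a
# Spark SQL query such as:
#   SELECT user, COUNT(*) FROM audit_table GROUP BY user
# Field names are hypothetical stand-ins for real audit-record columns.
from collections import Counter

def ops_per_user(audit_records):
    """Count audit operations per user."""
    return dict(Counter(r["user"] for r in audit_records))

records = [
    {"user": "alice", "op": "GETATTR"},
    {"user": "bob", "op": "LOOKUP"},
    {"user": "alice", "op": "WRITE"},
]
print(ops_per_user(records))  # {'alice': 2, 'bob': 1}
```

In a real deployment, the same aggregation would run on the full Iceberg tables through Spark, with the results visualized in a Zeppelin dashboard.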

NOTE
Data Fabric creates the relevant Iceberg tables for storage of insight data on successful installation of the mapr-hivemetastore package.

To run the insight service in trial mode, enable audit streaming so that the audit logs from the audit log files are available on streams as well. The audit data can then be transferred to Apache Iceberg tables and analyzed further by using Apache Spark or Apache Zeppelin.

IMPORTANT
The insight service must be installed on all nodes of your cluster/fabric to effectively gather information about activities and events recorded in the audit log for each node. All the data gathered by the insight service is stored in a single set of Iceberg tables.

You can write applications that consume the audit log data stored on Iceberg to detect anomalies in user behavior.
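A minimal sketch of such an application follows. It flags users whose daily audit-event counts deviate sharply from their own historical baseline; the data shapes and the threshold rule are assumptions for illustration, and a real consumer would read these counts from the Iceberg tables.

```python
# Hypothetical anomaly-detection sketch: flag users whose event count
# today exceeds mean + threshold * stdev of their historical daily counts.
from statistics import mean, stdev

def flag_anomalies(history, today, threshold=3.0):
    """history: {user: [daily counts]}, today: {user: count}."""
    flagged = []
    for user, counts in history.items():
        if len(counts) < 2:
            continue  # not enough history to form a baseline
        mu, sigma = mean(counts), stdev(counts)
        # Floor sigma at 1.0 so near-constant baselines do not over-flag.
        if today.get(user, 0) > mu + threshold * max(sigma, 1.0):
            flagged.append(user)
    return flagged

history = {"alice": [10, 12, 11, 9], "bob": [100, 98, 102, 101]}
today = {"alice": 80, "bob": 103}
print(flag_anomalies(history, today))  # ['alice']
```

The same pattern scales to the full audit data when the counts are computed with Spark over the Iceberg tables.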

The insight service can be installed from the mapr-insight package, available on the HPE website that hosts the MEP packages.

The insight service uses Hive Metastore to store and manage the Iceberg catalog. Hive Metastore must be accessible to the insight service for storing audit logs in Iceberg tables. You must have the mapr-hivemetastore package installed and configured on your cluster to be able to use the insight service.

Hive Metastore requires a relational database management system like MySQL in production setups. See Using MySQL for the Hive Metastore to use MySQL with Hive Metastore.

To set up MySQL to work with the Hive Metastore and Data Fabric, see Configuring a Remote MySQL Database for Hive Metastore.

Data Fabric supports other production-grade databases, such as PostgreSQL. See Configuring Data Fabric for Hive Metastore for details.

Types of Audit Logs Copied for Analysis

The following audit logs are expanded using a variant of the expand-audit utility to replace identifiers such as UIDs and volume IDs with user-friendly names (usernames, volume names, and so on) before the logs are committed to Iceberg tables.

  • CLDB logs
  • MFS logs
  • Authentication logs
  • S3 logs

There are four distinct Iceberg tables, one for each type of audit log stream. See Enabling Insight Gathering in Trial Mode and Enabling Insight Gathering in Production Mode for the Iceberg table names for each of the modes in which insight gathering can be operated.

Configure Data Fabric to Track User Behavior

The insight service can be enabled to gather insights at the cluster level, the log-type level, and the node level.

If you do not want data from a node to be copied to Iceberg, you can disable insight gathering on that node. By default, insights are disabled. When the insight feature is enabled at the global level, audit logs for all types and all nodes are committed to Iceberg tables periodically. The insight commands can enable or disable insights by log type, or at the node level on a node-by-node basis.

IMPORTANT
Install and set up your fabric/cluster before installing the packages for the insight service and enabling the insight service.

You can use tools like Spark and Zeppelin to run queries on the Iceberg tables and generate the reports and charts you need to detect anomalies in user behavior related to data access operations and cluster administration.

You can customize the insights with the insight CLI command.

For instance, you can turn off insight gathering on some nodes (although this is not recommended), or turn off insight gathering for certain audit components, such as S3, due to heavy S3 traffic on your cluster/fabric.

Enable Insight Gathering

See insight cluster to enable insight gathering.