Configuring Data Fabric to Track User Behavior

Describes how to configure Data Fabric to track user behavior.

Introduction

When auditing is enabled in Data Fabric, files, streams, S3 objects, and tables can be audited for cluster administration operations, data access operations, or both. See Enabling and Disabling Auditing of Cluster Administration to enable auditing for cluster administration.

See Streaming Audit Logs for information on audit log streaming. See Enabling and Disabling Audit Streaming Using the CLI to enable audit streaming.

Data Fabric generates audit logs for the services related to the various Data Fabric components. These services include CLDB, S3, MFS, and auth; auth logs are authentication audit logs.

Auditing records user behavior and helps you track anomalies or potential data security threats in Data Fabric. See Auditing in Data Fabric for information on auditing.

See Log Files for information about the various log files generated by Data Fabric.

See Viewing Audit Logs for information on how to view audit logs in Data Fabric.

Data Fabric audit logs provide insights into the activity that has taken place in relation to a cluster.

The audit logs are stored on the nodes on which the respective services run, which makes it cumbersome to correlate the various logs.

Data Fabric stores audit logs in files, and the audit logs can be directed to streams. However, it is not possible to run queries on streams in Data Fabric, so the stream data is directed to, and stored in, Apache Iceberg tables. Tools such as Apache Spark and Apache Zeppelin can then operate on this data: Spark can run queries on the audit log data stored in Iceberg, while Zeppelin can provide graphical insights and use customizable queries to generate dashboards. A minimal query setup is sketched below.
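
For illustration, the following is a minimal PySpark sketch that reads an audit log table from Iceberg through the Hive Metastore catalog. The catalog settings are the standard Iceberg-on-Spark options; the database, table, and column names (audit.cldb_logs, username, operation) are placeholders, because the actual names depend on how the insight service creates the tables on your cluster.

```
# Minimal sketch: query audit logs stored in Iceberg from PySpark.
# "audit.cldb_logs" and its columns are placeholders; use the table
# names that the insight service creates on your cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("audit-log-queries")
    # Standard Iceberg-on-Spark settings: back the session catalog
    # with Iceberg and resolve tables through the Hive Metastore.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Count operations per user across the CLDB audit log table.
spark.sql("""
    SELECT username, operation, COUNT(*) AS op_count
    FROM audit.cldb_logs
    GROUP BY username, operation
    ORDER BY op_count DESC
""").show()
```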

You can write applications that consume the audit log data stored in Iceberg to detect anomalies in user behavior; one possible approach is sketched below.
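
As one hedged example, the sketch below flags users whose daily operation count rises more than three standard deviations above their own historical average. The table and column names (audit.cldb_logs, username, ts) are the same placeholder assumptions used in the earlier sketch.

```
# Sketch of a simple anomaly check on the audit log data in Iceberg.
# "audit.cldb_logs", "username", and "ts" are placeholder names.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Operations per user per day.
daily = (
    spark.table("audit.cldb_logs")
    .groupBy("username", F.to_date("ts").alias("day"))
    .count()
)

# Each user's historical mean and standard deviation.
stats = daily.groupBy("username").agg(
    F.avg("count").alias("mean"),
    F.stddev("count").alias("stddev"),
)

# Flag days more than three standard deviations above the user's mean.
anomalies = (
    daily.join(stats, "username")
    .where(F.col("count") > F.col("mean") + 3 * F.col("stddev"))
)
anomalies.show()
```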

The insight service can be installed from the mapr-insight package, which is available on the HPE website that hosts the MEP packages.

The insight service uses the Hive Metastore to store and manage the Iceberg catalog. The Hive Metastore must be accessible to the insight service so that audit logs can be stored in the Iceberg tables. You must have the mapr-hivemetastore package installed and configured on your cluster to use the insight service.

In production setups, the Hive Metastore requires a relational database management system such as MySQL. See Using MySQL for the Hive Metastore to use MySQL with the Hive Metastore.

To set up MySQL to work with the Hive Metastore and Data Fabric, see Configuring a Remote MySQL Database for Hive Metastore. A typical connection configuration is sketched below.
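
For reference, a MySQL-backed Hive Metastore is typically pointed at its database through the standard JDBC properties in hive-site.xml, as in the following sketch. The host name, database name, user, and password shown are placeholders for your environment.

```
<!-- Placeholder values: replace host, database, user, and password. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive-password</value>
</property>
```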

Data Fabric supports other production-grade databases such as PostgreSQL. See Configuring Data Fabric for Hive Metastore for details.

Types of Audit Logs Copied for Analysis

The following audit logs are expanded by using a variant of the expand-audit utility, which replaces identifiers such as UIDs and volume IDs with user-friendly values (usernames, volume names, and so on), before the logs are committed to the Iceberg tables.

  • CLDB logs
  • MFS logs
  • Authentication logs
  • S3 logs

There are four distinct Iceberg tables, one designated for each type of audit log stream.

Configure Data Fabric to Track User Behavior

The insight service can be enabled to gather insights at the cluster level, the log-type level, and the node level. The insight service runs on a single node only.

If you do not want data from particular nodes to be copied to Iceberg, you can disable audit log insights for those nodes. By default, insights are disabled. When the insight feature is enabled, audit logs for all types and all nodes are committed to the Iceberg tables periodically. Insight commands are available to enable or disable insights by log type or on a node-by-node basis.

IMPORTANT
Install and set up your cluster before you install the packages for the insight service and enable the service. After the cluster is set up, the logs are generated, and these logs can be converted to streams so that the stream data can be copied to Apache Iceberg tables.

You can use tools like Spark and Zeppelin to run queries on the Iceberg tables and generate the reports and charts that you need to detect anomalies in user behavior related to data access operations and cluster administration; an example aggregation suitable for charting follows.
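
For instance, a time-series aggregation such as the following sketch could back a Zeppelin chart of hourly S3 activity. As before, audit.s3_logs and its columns are placeholder names, not the actual tables created on your cluster.

```
# Sketch: hourly S3 operation counts, suitable for a Zeppelin
# time-series chart. "audit.s3_logs", "ts", and "operation" are
# placeholder names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    SELECT date_trunc('hour', ts) AS hour,
           operation,
           COUNT(*)               AS op_count
    FROM audit.s3_logs
    GROUP BY date_trunc('hour', ts), operation
    ORDER BY hour
""").show()
```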

You can customize the insights with the insight CLI command.