Collecting Volume Metrics

Describes how to enable and collect operational metrics for volumes.

You can collect the following volume metrics on file system after you enable metrics collection as described in Enabling Volume Metric Collection:

Read I/Os Number of reads
Write I/Os Number of writes
Read throughput Amount of data read
Write throughput Amount of data written
Read latency Time taken by read operations
Write latency Time taken by write operations

If you enable metrics collection on a volume, for each file system instance, on every node where the volume containers reside, metrics for the volume (for a day) are captured every 10 seconds and logged to files in a local volume, which is two way replicated. The metrics log file, Metrics.log-<date>-<n>.json, is available at /var/mapr/local/<hostname>/audit/<mfs-port> directory. Here:

  • mfs-port is the port on which the file system instance listens.
  • date is the record date in the format yyyy-mm-dd. A new file is created at the beginning of each day.
  • n is the iteration of the log file represented by 3 digits. A new file is created every time Warden is restarted on the node. For the first file, <n> is 001 and <n> is incremented every time warden restarts. For example: Metrics.log-2017-08-18-001.json and Metrics.log-2017-08-18-002.json When a new file is created, the old file is purged based on the CLDB audit log retention period, which is 30 days by default.

Each record in the file looks similar to the following:

{
 "ts":1503048590000,
  "vid":35211529,
  "RDT":0.0,
  "RDL":0.0,
  "RDO":0.0,
  "WRT":308085.8,
  "WRL":2434.0,
  "WRO":2192.0
}

Here:

  • ts — timestamp in milliseconds
  • vid — volume ID
  • RDT — read throughput in KB (cumulative for 10 seconds)
  • RDL — amount of time taken by read operations (average for 10 seconds)
  • RDO — number of read operations (cumulative for 10 seconds)
  • WRT — write throughput in KB (cumulative for 10 seconds)
  • WRL — amount of time taken by write operations (average for 10 seconds)
  • WRO — number of write operations (cumulative for 10 seconds)

The collectd service reads up to 16 MB of data every ten seconds from each file, then aggregates and writes one minute worth of data to OpenTSDB. When reading the file, the collectd service stores offsets (as to how much has been read) as extended attributes (trusted.disptachedOffset) on the file. In addition to the default tags assigned to each metric when collectd writes metrics to OpenTSDB, the following tags are assigned to volume metrics:

  • mapr.volmetrics.[read_|write_][throughput|latency|ops] — Displays the type of metric
  • volume_name — Displays the name of the volume

For more information on the default tags, see Metric Collection.

For each metrics file, Data Fabric also creates a file, Vollist_metrics.log-<date>-<n>, in the /var/mapr/local/<hostname>/audit/<mfs-port>/ directory. This file is purged based on the CLDB audit log retention period, which is 30 days by default. This file contains a comma separated list of volume name and volume ID (for volumes for which metrics are captured) and is associated with the Metrics.log-<date><n>.json file. For example, the record in the file looks similar to the following:

<volumeid>,<volume name>,<volumeid>,<volume name>,...

You can visualize the metrics in the dashboards on Grafana. See Metric Visualization for more information.