Spyglass on Streams

Release 6.0 of the HPE Ezmeral Data Fabric introduced Spyglass on Streams. When you install release 6.0 or later, Streams is the default mechanism through which metrics flow from the Collectd service to OpenTSDB. Moving metrics through streams secures the data and provides a mechanism to perform real-time data analytics.

NOTE
Currently, Spyglass on Streams is not available for logs. Fluentd continues to use the REST API to send logs to Elasticsearch for indexing.

The Flow of Metrics via Streams

The Collectd service collects node-level and service-level metrics from each node in the cluster. The Collectd service hashes metrics to a stream and writes the metrics into topics in that stream.

In release 6.1.0 and later, Collectd creates one stream per cluster: /var/mapr/mapr.monitoring/metricstreams/0. Topic names use the format <hostname>. For example: mfs81.qa.lab.
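
The stream path and topic format above can be combined into a full topic name. The following is a minimal sketch, assuming the usual Data Fabric convention of addressing a topic as the stream path, a colon, and the topic name. The helper name metrics_topic is hypothetical and is not part of any product API.

```python
# Stream path used by Collectd in release 6.1.0 and later (from the text above).
STREAM_PATH = "/var/mapr/mapr.monitoring/metricstreams/0"


def metrics_topic(hostname: str, stream_path: str = STREAM_PATH) -> str:
    """Return '<stream path>:<topic>' for a host (hypothetical helper).

    Assumes the topic name is exactly the hostname, as described above.
    """
    if not hostname or ":" in hostname:
        raise ValueError("hostname must be non-empty and must not contain ':'")
    return f"{stream_path}:{hostname}"


if __name__ == "__main__":
    print(metrics_topic("mfs81.qa.lab"))
```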

The Streams server distributes metrics to the available OpenTSDB nodes, and OpenTSDB consumes the metrics.
NOTE
Writing to an external OpenTSDB is not supported from release 6.0 onwards. In addition, inserting non-data-fabric data into the provided OpenTSDB is not supported. Any custom data added to the provided OpenTSDB will be removed by the purge script (tsdb_cluster_mgmt.sh) that runs periodically.
NOTE
Dedicated metrics volumes should be appropriately paired with a topology that has OpenTSDB nodes. Change topology accordingly if OpenTSDB begins using more than 10% of the file-system write capacity.

Determining How Many OpenTSDB Nodes to Install

Having multiple OpenTSDB nodes in the cluster distributes the workload. The number of partitions and OpenTSDB nodes determines the level of parallelism for consumption.

Each OpenTSDB node can consume one partition at a time. By default, metrics data is divided across 12 partitions in each topic, and optimal parallelism is reached when there is one OpenTSDB node available for each partition. See Parallelism When Consuming Messages. Note that the term “consumer” in that topic equates to an OpenTSDB node in Spyglass on Streams.
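
To see why the partition count caps the useful number of consumers, the following sketch spreads partitions across OpenTSDB nodes as evenly as possible. The even spread is an assumption made for illustration; the actual assignment is decided by the Streams server. The function name spread_partitions and the example values are hypothetical.

```python
from typing import List


def spread_partitions(partitions: int, consumers: int) -> List[int]:
    """Return how many partitions each consumer would own under an even spread.

    Illustrative model only; it does not reproduce the server's assignment logic.
    """
    if partitions < 0 or consumers < 1:
        raise ValueError("partitions must be >= 0 and consumers must be >= 1")
    base, extra = divmod(partitions, consumers)
    # The first `extra` consumers take one additional partition each.
    return [base + 1 if i < extra else base for i in range(consumers)]


if __name__ == "__main__":
    for consumers in (4, 6, 8):
        print(consumers, spread_partitions(6, consumers))
```

In this model, a consumer that owns more than one partition works through them one at a time, and any consumer beyond the partition count owns nothing and sits idle.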

A general guideline for the minimum number of OpenTSDB nodes is three for a 10-node cluster, plus one more for every 10x increase in cluster size, for example:
  • Three OpenTSDB nodes in a 10-node cluster
  • Four OpenTSDB nodes in a 100-node cluster
  • Five OpenTSDB nodes in a 1000-node cluster

If your cluster has 10 or more nodes, at least three OpenTSDB nodes should be available to consume metrics. Typically, for every 10x increase in nodes, you should add another OpenTSDB node. For example, if your cluster reaches a size of 100 nodes, have four OpenTSDB nodes available for consumption.
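
The guideline can be expressed as a small calculation. This is a sketch only: it assumes the extra node is added exactly when the cluster size reaches the next power of ten (100, 1000, and so on), which the guideline implies but does not state. The function name min_opentsdb_nodes is hypothetical, and clusters smaller than 10 nodes are not covered by the guideline.

```python
def min_opentsdb_nodes(cluster_nodes: int) -> int:
    """Suggested minimum OpenTSDB node count for a cluster of 10 or more nodes."""
    if cluster_nodes < 10:
        raise ValueError("the guideline only covers clusters of 10 or more nodes")
    nodes = 3          # baseline for a 10-node cluster
    threshold = 100    # next cluster size at which one more node is suggested
    while cluster_nodes >= threshold:
        nodes += 1
        threshold *= 10
    return nodes


if __name__ == "__main__":
    for size in (10, 100, 1000):
        print(size, min_opentsdb_nodes(size))
```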

These guidelines do not guarantee optimal performance. Evaluate the performance of the streams to determine if your cluster would benefit from additional OpenTSDB nodes.
NOTE
If all configured OpenTSDB nodes have been offline for several hours, you may notice an initial spike in memory and CPU usage by the OpenTSDB processes as they work through the backlog of metrics. You can reduce the number of AsyncHBase threads to lower the CPU and memory usage. By default, AsyncHBase runs 128 threads. To change the number of threads, add or modify the following property in the /opt/mapr/asynchbase/asynchbase-<version>/conf/asynchbase.conf file on the OpenTSDB nodes:
"fs.mapr.async.worker.threads=<value>"

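The "add or modify" step for the thread property in the note above can be scripted. The following sketch operates on the file contents as a string and assumes the file uses simple key=value lines with # comments. The helper name set_property and the value 16 are illustrative assumptions, not recommendations.

```python
def set_property(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with key=value replaced if present, appended otherwise."""
    lines = conf_text.splitlines()
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # leave comment lines untouched
        name, sep, _ = stripped.partition("=")
        if sep and name.strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = "fs.mapr.async.worker.threads=128\n"
    # 16 is an arbitrary example value.
    print(set_property(original, "fs.mapr.async.worker.threads", "16"), end="")
```

Working on a string keeps the sketch free of file-system side effects. Reading the real file and writing the result back is left to the administrator.
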
Increasing the Number of Streams

For release 6.1 and later, the default setting for the number of streams is one. Even if your cluster grows to 1000 nodes or more, you do not need to increase the number of streams. For release 6.0.x, increasing the number of streams is recommended as you add more nodes (see the release 6.0 documentation), but this practice is not required in release 6.1 and later.

Changing the Automatic Stream Cursor Commits

You can adjust the frequency of automatic stream cursor commits for OpenTSDB by modifying the tsd.streams.autocommit.interval property in opentsdb.conf. The unit is milliseconds. The default value is 60000, which is 60 seconds. For a system with heavy loads, consider increasing the value to about 5 minutes (300000).
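
Because the default of 60000 corresponds to 60 seconds, the property value is a count of milliseconds. The following sketch converts a desired interval in minutes to that value and renders a configuration line. The helper names are hypothetical, and the "key = value" layout is assumed from the usual opentsdb.conf style.

```python
def autocommit_interval_ms(minutes: float) -> int:
    """Convert an interval in minutes to milliseconds."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return int(round(minutes * 60 * 1000))


def autocommit_conf_line(minutes: float) -> str:
    """Render the property line for opentsdb.conf (assumed 'key = value' layout)."""
    return f"tsd.streams.autocommit.interval = {autocommit_interval_ms(minutes)}"


if __name__ == "__main__":
    print(autocommit_conf_line(5))
```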