HPE Ezmeral Data Fabric Monitoring Tips and Troubleshooting

Lists tips and troubleshooting steps for monitoring clusters.

Monitoring a Secure Cluster

After regenerating the HPE Ezmeral Data Fabric user ticket, service failures occur for collectd and OpenTSDB
If you delete or regenerate the HPE Ezmeral Data Fabric user ticket, the running collectd and OpenTSDB services fail. After you update the HPE Ezmeral Data Fabric user ticket, restart the collectd and OpenTSDB services.
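The restart step above can be scripted. The following is a minimal sketch, assuming that the services are registered under the names collectd and opentsdb and that maprcli is available on the node where you run the commands. The node name is a placeholder, and the function only builds the command lines so that you can review them before running anything.

```python
import shlex

# Assumed service names; confirm them on your cluster before use.
MONITORING_SERVICES = ("collectd", "opentsdb")


def restart_commands(nodes, services=MONITORING_SERVICES):
    """Build one 'maprcli node services' restart command per node and service."""
    commands = []
    for node in nodes:
        for service in services:
            commands.append([
                "maprcli", "node", "services",
                "-name", service,
                "-action", "restart",
                "-nodes", node,
            ])
    return commands


if __name__ == "__main__":
    # "node1.example.com" is a hypothetical host name.
    for command in restart_commands(["node1.example.com"]):
        # Print only; pass the list to subprocess.run() to execute it.
        print(shlex.join(command))
```

Printing the commands first keeps the sketch safe to run anywhere; executing them requires a valid ticket for the user who runs maprcli.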

Monitoring Logs

I notice a sudden increase in fluentd logs. What can I do?
A sudden increase in the size of the fluentd log file can indicate a feedback loop: fluentd writes an error to its own log file, and that log entry causes another error when fluentd tries to parse it. In this case, consider disabling the indexing of fluentd logs. See Configure Logs to Index.
I see "400 - Rejected by Elasticsearch" messages in the fluentd logs. What can I do?
Messages such as the following can accumulate in the fluentd log when a process does not produce logs with valid UTF-8 output:
2019-04-25 17:00:11 -0700 [warn]: #0 dump an error event: error_class=Fluent::Plugin::ElasticsearchErrorHandler::ElasticsearchError error="400 - Rejected by Elasticsearch" location=nil
after setting this option in es_config.conf
In a message such as the following, you might see invalid characters represented as a diamond with a question mark (�). The "service_name":"collectd" part of the message indicates that collectd is generating the invalid UTF-8 output:
[2019-04-30T19:06:29,495][DEBUG][o.e.a.b.TransportShardBulkAction] [mfs73] [mapr_monitoring-2019.05.01][4] failed
to execute bulk item (index) index {[mapr_monitoring-2019.05.01][mapr_monitoringv1][taQkcWoBCeW3tMAsn1cW], 
source[{"my_event_time":"2019-04-30 18:36:39","level":"info","message":"write_maprstreams plugin: Produced: 
Offset: 1247132; Size: 152; [{\"metric\":\"mapr.streams.produce_msgs\",\"value\":448,\"tags\":{\"fqdn\":\
"qa-node91.qa.lab\",\"clusterid\":\"6378079583755418855\",\"clustername\":\"my.cluster.com\"}}]
�\n","@timestamp":"2019-04-30T18:36:39.000000000-07:00","service_name":"collectd"}]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [message] of type [text]
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x5c
One workaround is to stop indexing the log that contains the invalid character by commenting out its entry in the fluentd.conf file. For more information, see Configure Logs to Index.
Another workaround is to fix the application that produces the error message. If the log file is produced by an application that you control, change the output of the log producing the invalid character.
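To find which log lines contain invalid UTF-8 before you apply either workaround, you can scan the raw bytes of a log file. The following is a minimal sketch, not part of the product. The function name is hypothetical, and the sample bytes are made up to resemble the "Invalid UTF-8 middle byte 0x5c" case shown above.

```python
def find_invalid_utf8_lines(raw: bytes):
    """Return (line_number, reason) for each line that is not valid UTF-8."""
    bad_lines = []
    for number, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.reason is Python's short description of the decoding failure.
            bad_lines.append((number, err.reason))
    return bad_lines


if __name__ == "__main__":
    # Illustrative input: 0xe2 starts a multi-byte sequence, but 0x5c
    # (a backslash) is not a valid continuation byte.
    sample = b"valid line\nbroken \xe2\x5c line\nanother valid line\n"
    for number, reason in find_invalid_utf8_lines(sample):
        print(f"line {number}: {reason}")
```

To check a real log, read it in binary mode, for example open(path, "rb").read(), so that Python does not attempt to decode the file before the check runs.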

Monitoring Metrics

Where should I store the Elasticsearch index?
Elasticsearch requires a lot of disk space. Also, when you upgrade Elasticsearch, the default index directory is removed along with the package update. Therefore, configure a separate filesystem for the index data. Do not store index data under the / or /var filesystem.
NOTE
If you store the Elasticsearch index on a locally hosted filesystem, you can still access logs when the HPE Ezmeral Data Fabric cluster is not available.
For more information about the Elasticsearch index and the default index directory, see Log Aggregation and Storage.
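One way to confirm that a candidate index directory is on its own filesystem is to compare device IDs. The following is a minimal sketch, assuming a Linux node. The function names are hypothetical, and the index directory you pass in is whatever path you configured.

```python
import os


def on_same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True when both paths report the same device ID."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def shared_filesystems(index_dir: str, avoid=("/", "/var")):
    """Return the entries of 'avoid' that share a filesystem with index_dir."""
    return [
        path for path in avoid
        if os.path.exists(path) and on_same_filesystem(index_dir, path)
    ]


if __name__ == "__main__":
    # Replace "/" with your configured index directory.
    overlaps = shared_filesystems("/")
    if overlaps:
        print("Index directory shares a filesystem with:", ", ".join(overlaps))
    else:
        print("Index directory is on a separate filesystem.")
```

An empty result means that the directory is on a filesystem separate from both / and /var.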
I see a "Bad Request" error message for my HPE Ezmeral Data Fabric Database metrics. What can I do?
If you have more than 1000 active tables in HPE Ezmeral Data Fabric Database and the HPE Ezmeral Data Fabric monitoring request size to OpenTSDB is more than 4 KB, you may see the following error message:
"Sorry but your request was rejected as being invalid.
The reason provided was: Chunked request not supported."
You can increase the maximum request size of OpenTSDB to as much as 64 KB by setting the following parameters in the opentsdb.conf file:
tsd.http.request.enable_chunked=true
tsd.http.request.max_chunk=65536
For more information, see the OpenTSDB configuration guide.
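To verify that the two parameters are present after you edit the file, you can parse opentsdb.conf as key=value pairs. The following is a minimal sketch. The function names are hypothetical, and the fallback values (chunking disabled, 4096 bytes) are assumptions chosen to match the 4 KB limit described above.

```python
def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def chunked_request_problems(props: dict, required_bytes: int) -> list:
    """Return a list of problems that would block requests of required_bytes."""
    problems = []
    # Assumed fallbacks when a key is missing: chunking off, 4096-byte limit.
    enabled = props.get("tsd.http.request.enable_chunked", "false").lower()
    if enabled != "true":
        problems.append("tsd.http.request.enable_chunked is not true")
    max_chunk = int(props.get("tsd.http.request.max_chunk", "4096"))
    if max_chunk < required_bytes:
        problems.append(
            f"tsd.http.request.max_chunk is {max_chunk}, below {required_bytes}"
        )
    return problems


if __name__ == "__main__":
    conf_text = (
        "tsd.http.request.enable_chunked=true\n"
        "tsd.http.request.max_chunk=65536\n"
    )
    print(chunked_request_problems(parse_properties(conf_text), 65536))
```

With the two settings shown above, the check reports no problems for requests of up to 65536 bytes.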

Installation and Configuration Errors

See Troubleshoot Monitoring Installation Errors.