Elasticsearch Issues

This article describes the following Elasticsearch troubleshooting procedures:

Elasticsearch Architecture
Querying Elasticsearch
Unstable Elasticsearch Service
Cleaning up Elasticsearch Indices

Elasticsearch Architecture

HPE Ezmeral Runtime Enterprise generates the following performance metrics at short time intervals from the Docker stats API and cgroup data:

Memory usage
CPU load
Network throughput in both Docker containers and on Worker hosts.

This information populates the HPE Ezmeral Runtime Enterprise Dashboard screens. If these screens display current data that is constantly refreshed, then Elasticsearch monitoring is functioning correctly.

The Metricbeat service runs in the epic-monitoring containers on each Controller, Kubernetes Worker, and EPIC Worker host. This service collects the metrics and forwards them to the Elasticsearch database. For deployments without Platform HA (single Controller host), the Elasticsearch database is a single-node cluster hosted only on the Controller host. For deployments with Platform HA enabled, Elasticsearch runs as a 3-node cluster across the Primary Controller, Shadow Controller, and Arbiter hosts. In this case, the Elasticsearch master is chosen using the standard master selection process and does not necessarily reside on the Primary Controller host.

The Elasticsearch service is containerized; however, the database and logs are stored in /var/lib/monitoring on the underlying physical host(s) that make up the Elasticsearch cluster. Verify that this directory has enough disk space on each host.
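The disk-space check can be scripted. The following is a minimal sketch, not a product tool: `check_monitoring_disk` is a hypothetical helper name, and the default path is the directory named above.

```shell
# Sketch: report free space for the filesystem that holds the monitoring data.
# check_monitoring_disk is a hypothetical helper; the default path is the
# directory named in this article.
check_monitoring_disk() {
    dir="${1:-/var/lib/monitoring}"
    if [ ! -d "$dir" ]; then
        echo "directory not found: $dir" >&2
        return 1
    fi
    # -P forces one output line per filesystem; -h prints human-readable sizes.
    df -P -h "$dir"
}
```

Run `check_monitoring_disk` with no argument on each host in the Elasticsearch cluster and review the available space and use percentage that it reports.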

If the Dashboard screens are not displaying current, updated data, then the monitoring stack is either degraded or has failed. To check this:

Open the Services tab of the Platform Administrator Dashboard screen (see Dashboard - Platform Administrator).
In the BlueData section of the screen, look for the two Monitoring columns.
- The left-hand column displays the status of the epic-monitoring collector service running on each host (Metricbeat).
- The right-hand column is for the monitoring database (Elasticsearch cluster). If Platform HA has been enabled, then three dots appear in this column (one for each node in the Elasticsearch cluster).

Querying Elasticsearch

The Elasticsearch service listens on port 9210 of its host nodes. (This differs from the default Elasticsearch port, 9200.) Authentication is provided by the SearchGuard service, so you must supply a username and password with every query. Execute the following commands on an HPE Ezmeral Runtime Enterprise host to obtain the username and password:

bdconfig --getvalue bdshared_elasticsearch_admin 
bdconfig --getvalue bdshared_elasticsearch_adminpass

You may now query the database to verify that the Elasticsearch service is listening.
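One way to issue such a query is sketched below. `es_query` is a hypothetical helper name. The sketch assumes that `bdconfig` and `curl` are available on the host and that the service answers over HTTPS on localhost, as in the verification command later in this article. If the host presents a certificate that curl cannot verify, you might need to add the `-k` option. That is an assumption about your environment, not a documented requirement.

```shell
# Sketch: query the Elasticsearch REST API with the credentials from bdconfig.
# es_query is a hypothetical helper, not part of the product.
es_query() {
    path="${1:-_cat/indices}"
    user="$(bdconfig --getvalue bdshared_elasticsearch_admin)"
    pass="$(bdconfig --getvalue bdshared_elasticsearch_adminpass)"
    # -s silences the curl progress meter so that only the response is printed.
    curl -s -u "${user}:${pass}" "https://localhost:9210/${path}"
}
```

For example, `es_query '_cluster/health?pretty'` requests the standard Elasticsearch cluster health summary, and `es_query` with no argument lists the indices.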

Unstable Elasticsearch Service

An insufficient Java memory heap can cause stability issues that may bring down the Monitoring Database service. These errors will appear as follows:

The graphs in the Dashboard screen will be empty, except that a spinning icon may appear where the graphs would be.
The MONITORING DATABASE section of the Services tab of the Platform Administrator Dashboard screen may display red dots.

To confirm the issue, check /var/lib/monitoring/logs/hpecp-monitoring.log for the following errors:

Caused by: java.lang.OutOfMemoryError: Java heap space
[2018-05-30T16:45:39,564][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[CJ_O7Il][fetch_shard_store][T#49]], exiting
java.lang.OutOfMemoryError: Java heap space
[2018-05-30T16:45:39,579][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[CJ_O7Il][bulk][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
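Rather than reading the log by eye, you can count the heap errors. The sketch below assumes the log path given above. `count_heap_errors` is a hypothetical helper name.

```shell
# Sketch: count Java heap-space errors in the monitoring log.
# count_heap_errors is a hypothetical helper; the default path is the log
# file named in this article.
count_heap_errors() {
    log="${1:-/var/lib/monitoring/logs/hpecp-monitoring.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read log: $log" >&2
        return 2
    fi
    # -c prints the number of matching lines; -F treats the pattern literally.
    # grep exits non-zero when the count is 0, so that status is masked here.
    grep -c -F 'java.lang.OutOfMemoryError: Java heap space' "$log" || true
}
```

A non-zero count suggests that the heap size is worth increasing as described in the following procedure.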

The following procedure will increase the size of the Java heap allocation in each of the monitoring containers housing the Elasticsearch service. You must perform this procedure on the following host(s):

When platform HA is not enabled: Controller host only.
When platform HA is enabled: Controller, Shadow Controller, and Arbiter hosts.

To increase the size of the Elasticsearch Java heap:

SSH into the physical host.
Find the ID of the monitoring container by executing the command # docker ps -a.

You will see a result similar to the following:
```
CONTAINER ID          IMAGE
2101ffa232f3          epic/monitoring:1.1
```
Access the monitoring container by executing the command # docker exec -it <container_id> bash
Modify the jvm.options file by expanding the Java memory heap to 4GB (near Line 22):
```
# vi /etc/elasticsearch/jvm.options
-Xms4g
-Xmx4g
```
Save the file and quit.
Press [CTRL]+[D] to detach from the container.

Restart Elasticsearch by executing the following command:

# /opt/bluedata/bundles/<epic install bin folder>/startscript.sh --action enable_monitoring

This procedure should resolve any Elasticsearch stability issues caused by the Java heap size.
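If you prefer not to edit jvm.options interactively, the heap lines can be rewritten with sed. This is a minimal sketch under stated assumptions: `set_heap_size` is a hypothetical helper, it must be defined and run inside the monitoring container, and it changes only lines that already begin with -Xms or -Xmx followed by a size.

```shell
# Sketch: set the -Xms and -Xmx lines of a jvm.options file to one size.
# set_heap_size is a hypothetical helper, not part of the product.
set_heap_size() {
    file="$1"
    size="${2:-4g}"
    tmp="$(mktemp)" || return 1
    # Rewrite only lines that already set -Xms or -Xmx; leave all others untouched.
    if ! sed -E "s/^-Xms[0-9]+[kKmMgG]?\$/-Xms${size}/; s/^-Xmx[0-9]+[kKmMgG]?\$/-Xmx${size}/" "$file" > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # Copy the content back in place so the file keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
    # Print the resulting heap settings for confirmation
    # (returns non-zero if the file contains no heap lines).
    grep -E '^-Xm[sx]' "$file"
}
```

Inside the container, `set_heap_size /etc/elasticsearch/jvm.options 4g` would apply the 4GB value used in this procedure. You must still restart Elasticsearch afterward.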

Cleaning up Elasticsearch Indices

This process applies if you need to clean up the Elasticsearch indices and restore the monitoring service by refreshing a crashed Elasticsearch instance, deleting the index data, and restarting monitoring. In this section, the generic terms host and hosts refer to the Controller and, if platform HA is enabled, the Shadow Controller and Arbiter.

NOTE This process will delete all of the metrics data stored in Elasticsearch.

Execute the following command on the Controller host and, if platform HA is enabled, the Shadow Controller and Arbiter hosts:
```
systemctl stop bds-monitoring
```
Remove the monitoring container on each host by executing the following command on each host:
```
docker rm -f epic-monitoring-<host-ip>
```
Check for dangling processes on each host:
```
ps -ef | grep 'elasticsearch\|filebeat\|metricbeat\|devcron\|supervisord'
```
The result should show only two processes running. For example:
```
root      4127  4053  0 Apr06 ?        00:00:01 /usr/bin/python2 /usr/bin/supervisord -c /etc/supervisord.conf
root     15218 13147  0 00:37 pts/2    00:00:00 grep --color=auto elasticsearch\|filebeat\|metricbeat\|devcron\|supervisord
```
If processes are still running, then reboot the host.
NOTE If platform HA is enabled, then be sure to reboot the Arbiter host first, then the Shadow Controller host, and then (once all HA services are back up) the Primary Controller host. You may want to execute the command bdconfig --hafailover on the current Primary Controller host to fail-back to the original primary/shadow configuration.
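The dangling-process check can also be expressed as a filter, so that empty output means nothing is left over. In the sketch below, `leftover_monitoring_procs` is a hypothetical helper. It reads `ps -ef` style output on standard input and, following the example result above, ignores the supervisord and grep lines.

```shell
# Sketch: print monitoring-related processes that are still running.
# leftover_monitoring_procs is a hypothetical helper; it reads `ps -ef`
# output on stdin and exits with status 0 only if leftovers are found.
leftover_monitoring_procs() {
    # Keep lines that mention a monitoring process, then drop the grep
    # commands themselves and the expected supervisord line.
    grep -E 'elasticsearch|filebeat|metricbeat|devcron' | grep -v -E 'grep|supervisord'
}
```

For example, `ps -ef | leftover_monitoring_procs` prints nothing when the host is clean. Any line it does print is a candidate dangling process.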
Delete the Elasticsearch indices directory on the Controller host and, if platform HA is enabled, the Shadow Controller and Arbiter hosts by executing the following command:
```
rm -rf /var/lib/monitoring/elasticsearch/nodes
```

On the Primary Controller host, restart monitoring by executing the following command:

# /opt/bluedata/bundles/<epic install bin folder>/startscript.sh --action enable_monitoring

Verify that Elasticsearch is running by querying the indices on the Controller host and, if platform HA is enabled, on the Shadow Controller and Arbiter hosts by executing the following command:

curl -u elastic:$(bdconfig --getvalue bdshared_elasticsearch_adminpass) https://localhost:9210/_cat/indices

The output should look like this:

green open metricbeat-6.6.1-2020.03.10    VUXBNUmnRg-56PyvrWgZEw 5 1   78030 0  36.3mb    18mb
green open nvidiagpubeat-6.5.5-2020.03.11 gVixDuwdS1q1mmTiNorGIQ 5 1     678 0 818.9kb 409.4kb
green open nvidiagpubeat-6.5.5-2020.03.10 kyxwozAYTqacasI50irMqQ 5 1      70 0   402kb   201kb
green open metricbeat-6.6.3-2020.03.10    B7TCuV7qRMuOffO7IgIhCA 5 1   86490 0  94.1mb  47.4mb
green open metricbeat-6.6.1-2020.03.11    Jm9yQe1gQTyXATMMU3Ez_w 5 1  739983 0 324.9mb 168.1mb
green open bdlogging_v1-6.6.1-2020.03.11  K7A2iEQyS2GtJH8u0n15Aw 5 1 6799977 0   7.5gb   3.6gb
green open metricbeat-6.6.3-2020.03.11    PlxsbXABQEiETLCG04ewLg 5 1  382342 0  84.6mb 193.5mb
green open searchguard U                 _dvWvihTHmqAR5Kxw3F6UVA 1 2       5 0 117.2kb  39.9kb
green open .kibana_1                      Vy567V6US-Sp-JHUi9AE7A 1 1       3 0  29.1kb  14.5kb
green open bdlogging_v1-6.6.1-2020.03.10  tAR-jf0eS-uOgWlMbWGINw 5 1 1656180 0   1.6gb 851.1mb
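
To avoid scanning a long listing by eye, the output can be filtered for indices that are not green. This is a minimal sketch: `non_green_indices` is a hypothetical helper, and it assumes the default `_cat/indices` column order shown above, with health in the first column and the index name in the third.

```shell
# Sketch: list indices whose health is not green.
# non_green_indices is a hypothetical helper; it reads `_cat/indices`
# output on stdin and exits with status 1 if any index is not green.
non_green_indices() {
    awk 'NF > 0 && $1 != "green" { print $3 " (" $1 ")"; bad = 1 }
         END { exit (bad ? 1 : 0) }'
}
```

Pipe the output of the curl command above into `non_green_indices`. No output and an exit status of 0 indicate that every listed index is green.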