Elasticsearch Issues
This article describes the following Elasticsearch troubleshooting topics:
- Elasticsearch Architecture
- Querying Elasticsearch
- Unstable Elasticsearch Service
- Cleaning up Elasticsearch Indices
Elasticsearch Architecture
HPE Ezmeral Runtime Enterprise generates the following performance metrics at short time intervals from the Docker stats API and cgroup data:
- Memory usage
- CPU load
- Network throughput, both in Docker containers and on Worker hosts
This information populates the HPE Ezmeral Runtime Enterprise Dashboard screens. If these screens display current data that is constantly refreshed, then Elasticsearch monitoring is functioning correctly.
The Metricbeat service runs in the epic-monitoring containers on each Controller, Kubernetes Worker, and EPIC Worker host. This service collects the metrics and forwards them to the Elasticsearch database. For deployments without Platform HA (single Controller host), the Elasticsearch database is a single-node cluster hosted only on the Controller host. For deployments with Platform HA enabled, Elasticsearch runs as a three-node cluster across the Primary Controller, Shadow Controller, and Arbiter hosts. In this case, the Elasticsearch master is chosen using the standard master selection process and does not necessarily reside on the Primary Controller host.
The Elasticsearch service is containerized; however, the database and logs are stored in /var/lib/monitoring on the underlying physical host(s) that make up the Elasticsearch cluster. Verify that this directory has enough disk space on each host.
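The disk-space check can be scripted. This is a minimal sketch; /var/lib/monitoring is the path named above, and the fallback to / is only so the sketch runs on machines that lack that directory:

```shell
# Check free space under the Elasticsearch data/log directory.
dir=/var/lib/monitoring
[ -d "$dir" ] || dir=/   # fallback only so this sketch runs anywhere

# df -P guarantees one output line per filesystem; field 4 is the
# available space in 1K blocks.
avail_kb=$(df -P "$dir" | awk 'NR==2 {print $4}')
echo "available on $dir: ${avail_kb} KB"
```

Run this on each host in the Elasticsearch cluster and confirm the available space comfortably exceeds your expected index growth.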
If the Dashboard screens are not displaying current, updated data, then the monitoring stack is either degraded or has failed. To check this:
- Open the Services tab of the Platform Administrator Dashboard screen (see Dashboard - Platform Administrator).
- In the BlueData section of the screen, look for the two Monitoring columns:
  - The left-hand column displays the status of the epic-monitoring collector service running on each host (Metricbeat).
  - The right-hand column displays the status of the monitoring database (Elasticsearch cluster). If Platform HA has been enabled, then three dots appear in this column (one for each node in the Elasticsearch cluster).
Querying Elasticsearch
The Elasticsearch service listens on port 9210 of its host nodes. (This differs from the default Elasticsearch port, 9200.) Authentication is provided by the SearchGuard service. You must therefore supply a username and password with all your queries. Execute the following commands to obtain the username and password on an HPE Ezmeral Runtime Enterprise host:
bdconfig --getvalue bdshared_elasticsearch_admin
bdconfig --getvalue bdshared_elasticsearch_adminpass
You may now query the database to verify that the Elasticsearch service is listening.
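For example, a basic liveness check queries the cluster health endpoint with those credentials. The sketch below only assembles and prints the command; the placeholder credentials stand in for the bdconfig lookups, which exist only on an HPE Ezmeral Runtime Enterprise host:

```shell
# On a real host, populate these from bdconfig instead:
#   user=$(bdconfig --getvalue bdshared_elasticsearch_admin)
#   pass=$(bdconfig --getvalue bdshared_elasticsearch_adminpass)
user="${ES_USER:-admin}"      # placeholder
pass="${ES_PASS:-changeme}"   # placeholder

# Port 9210, per this article. Add -k if your deployment uses a
# self-signed certificate that curl cannot verify.
cmd="curl -u ${user}:${pass} https://localhost:9210/_cluster/health?pretty"
echo "$cmd"
```

A healthy cluster reports a "status" of green (or yellow on a single-node cluster, where replicas cannot be assigned).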
Unstable Elasticsearch Service
An insufficient Java memory heap can cause stability issues that may bring down the Monitoring Database service. These errors will appear as follows:
- The graphs in the Dashboard screen will be empty, except that a spinning icon may appear where the graphs would be.
- The MONITORING DATABASE section of the Services tab of the Platform Administrator Dashboard screen may display red dots.
To confirm the issue, check /var/lib/monitoring/logs/hpecp-monitoring.log for the following errors:
Caused by: java.lang.OutOfMemoryError: Java heap space
[2018-05-30T16:45:39,564][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[CJ_O7Il][fetch_shard_store][T#49]], exiting
java.lang.OutOfMemoryError: Java heap space
[2018-05-30T16:45:39,579][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[CJ_O7Il][bulk][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
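A quick way to confirm and count these heap-exhaustion events is to grep the log. The sketch below writes a two-line sample log so it is self-contained; on a real host, point LOG at /var/lib/monitoring/logs/hpecp-monitoring.log instead:

```shell
# Self-contained sample log; on a host, use:
#   LOG=/var/lib/monitoring/logs/hpecp-monitoring.log
LOG=$(mktemp)
printf '%s\n' \
  'java.lang.OutOfMemoryError: Java heap space' \
  'some unrelated log line' > "$LOG"

# Count heap-exhaustion events.
oom_count=$(grep -c 'OutOfMemoryError' "$LOG")
echo "OutOfMemoryError occurrences: $oom_count"
rm -f "$LOG"
```

A nonzero count indicates the heap increase described below is warranted.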
The following procedure will increase the size of the Java heap allocation in each of the monitoring containers housing the Elasticsearch service. You must perform this procedure on the following host(s):
- When platform HA is not enabled: Controller host only.
- When platform HA is enabled: Controller, Shadow Controller, and Arbiter hosts.
To increase the size of the Elasticsearch Java heap:
- SSH into the physical host.
- Find the ID of the monitoring container by executing the command docker ps -a. You will see a result similar to the following:
CONTAINER ID   IMAGE
2101ffa232f3   epic/monitoring:1.1
- Access the monitoring container by executing the command docker exec -it <container_id> bash.
- Modify the jvm.options file by expanding the Java memory heap to 4GB (near line 22):
# vi /etc/elasticsearch/jvm.options
-Xms4g
-Xmx4g
- Save the file and quit.
- Press [CTRL]+[D] to detach from the container.
- Restart Elasticsearch by executing the following command:
# /opt/bluedata/bundles/<epic install bin folder>/startscript.sh --action enable_monitoring
This procedure should resolve any Elasticsearch stability issues caused by the Java heap size.
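To confirm the edit took effect, you can grep the heap flags back out of jvm.options. This sketch parses a sample file so it runs anywhere; inside the monitoring container, the real file is /etc/elasticsearch/jvm.options. (Elasticsearch recommends setting -Xms and -Xmx to the same value, as the 4g/4g pair above does.)

```shell
# Sample jvm.options fragment; inside the container, grep
# /etc/elasticsearch/jvm.options instead.
f=$(mktemp)
printf '%s\n' '-Xms4g' '-Xmx4g' '-XX:+UseConcMarkSweepGC' > "$f"

# Heap flags start with -Xms (initial) or -Xmx (maximum).
heap_flags=$(grep -E '^-Xm[sx]' "$f")
echo "$heap_flags"
rm -f "$f"
```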
Cleaning up Elasticsearch Indices
This process applies if you need to clean up Elasticsearch indices and restore the monitoring service by refreshing a crashed Elasticsearch instance, deleting the indices data, and restoring monitoring. In this section, the generic terms host and hosts refer to the Controller host and, if platform HA is enabled, the Shadow Controller and Arbiter hosts.
- Execute the following command on the Controller host and, if platform HA is enabled, the Shadow Controller and Arbiter hosts:
systemctl stop bds-monitoring
- Remove the monitoring container on each host by executing the following command:
docker rm -f epic-monitoring-<host-ip>
- Check for dangling processes on each host:
ps -ef | grep 'elasticsearch\|filebeat\|metricbeat\|devcron\|supervisord'
The result should show only two processes running. For example:
root  4127  4053  0 Apr06 ?     00:00:01 /usr/bin/python2 /usr/bin/supervisord -c /etc/supervisord.conf
root 15218 13147  0 00:37 pts/2 00:00:00 grep --color=auto elasticsearch\|filebeat\|metricbeat\|devcron\|supervisord
If processes are still running, then reboot the host.
NOTE: If platform HA is enabled, then be sure to reboot the Arbiter host first, then the Shadow Controller host, and then (once all HA services are back up) the Primary Controller host. You may want to execute the command bdconfig --hafailover on the current Primary Controller host to fail back to the original primary/shadow configuration.
- Delete the Elasticsearch indices directory on the Controller host and, if platform HA is enabled, the Shadow Controller and Arbiter hosts by executing the following command:
rm -rf /var/lib/monitoring/elasticsearch/nodes
- On the Primary Controller host, restart monitoring by executing the following command:
# /opt/bluedata/bundles/<epic install bin folder>/startscript.sh --action enable_monitoring
- Verify that Elasticsearch is running by querying the indices on the Controller host and, if platform HA is enabled, on the Shadow Controller and Arbiter hosts by executing the following command:
curl -u elastic:$(bdconfig --getvalue bdshared_elasticsearch_adminpass) https://localhost:9210/_cat/indices
The output should look like this:
green open metricbeat-6.6.1-2020.03.10 VUXBNUmnRg-56PyvrWgZEw 5 1 78030 0 36.3mb 18mb
green open nvidiagpubeat-6.5.5-2020.03.11 gVixDuwdS1q1mmTiNorGIQ 5 1 678 0 818.9kb 409.4kb
green open nvidiagpubeat-6.5.5-2020.03.10 kyxwozAYTqacasI50irMqQ 5 1 70 0 402kb 201kb
green open metricbeat-6.6.3-2020.03.10 B7TCuV7qRMuOffO7IgIhCA 5 1 86490 0 94.1mb 47.4mb
green open metricbeat-6.6.1-2020.03.11 Jm9yQe1gQTyXATMMU3Ez_w 5 1 739983 0 324.9mb 168.1mb
green open bdlogging_v1-6.6.1-2020.03.11 K7A2iEQyS2GtJH8u0n15Aw 5 1 6799977 0 7.5gb 3.6gb
green open metricbeat-6.6.3-2020.03.11 PlxsbXABQEiETLCG04ewLg 5 1 382342 0 84.6mb 193.5mb
green open searchguard _dvWvihTHmqAR5Kxw3F6UVA 1 2 5 0 117.2kb 39.9kb
green open .kibana_1 Vy567V6US-Sp-JHUi9AE7A 1 1 3 0 29.1kb 14.5kb
green open bdlogging_v1-6.6.1-2020.03.10 tAR-jf0eS-uOgWlMbWGINw 5 1 1656180 0 1.6gb 851.1mb
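Rather than eyeballing that output, you can assert that every index reports green health (the first column of _cat/indices). The sketch below parses a saved two-line sample so it runs anywhere; on a real host, pipe the live curl output in instead:

```shell
# Saved sample of _cat/indices output; the first field is the
# per-index health (green/yellow/red).
out=$(mktemp)
cat > "$out" <<'EOF'
green open metricbeat-6.6.1-2020.03.10 VUXBNUmnRg-56PyvrWgZEw 5 1 78030 0 36.3mb 18mb
green open .kibana_1 Vy567V6US-Sp-JHUi9AE7A 1 1 3 0 29.1kb 14.5kb
EOF

# Count indices whose health is anything other than green.
not_green=$(awk '$1 != "green"' "$out" | wc -l | tr -d ' ')
echo "indices not green: $not_green"
rm -f "$out"
```

A result of 0 indicates the restored cluster is fully healthy; any yellow or red entries warrant further investigation before considering the cleanup complete.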