ResourceManager High Availability

Provides an overview of how high availability for Resource Manager works.

The ResourceManager service tracks a cluster's resources and schedules YARN applications.

Configure high availability for the ResourceManager so that the failure of the ResourceManager service is not a single point of failure for the cluster. The high availability of ResourceManager is based on the cluster configuration for restart, recovery, and failover.

Restart

The restart settings are configured on the warden.conf file.

By default, the Warden attempts to restart a failed service three times. You can configure the frequency that Warden attempts to restart failed services before initializing failover in the warden.conf file.

Recovery

By default, ResourceManager recovery is enabled and it uses the FileSystemRMStateStore implementation to store the ResourceManager state in the file system.

When a ResourceManager restarts or fails over, the active ResourceManager can recover the state of the previously running ResourceManager. You can configure the ResourceManager to have no recovery. You can also configure the state store implementation that you want to use. For more information, see Recovery for the ResourceManager.

Failover

To configure failover, the cluster must have one or more nodes with the ResourceManager role.

You can select one of the following failover implementations when you use the configure.sh utility to configure each node:

  • Zero Configuration Failover - In this failover mechanism, Warden manages the ResourceManager failover. When the active ResourceManager fails, one of the standby ResourceManager nodes automatically loads the working state from the state store and continues providing services to the cluster. It can be configured with a fresh configure.sh without -RM property in command.

    Zero configuration failover is the default and recommended setting for the following reasons:
    • Only one ResourceManager process consumes cluster resources. With the manual or automatic failover option, the active and standby ResourceManagers consume cluster resources.
    • Warden initiates failover automatically. With the manual failover, you need to manually run the yarn rmadmin command for failover to occur.
    • Simplified clients connectivity. Clients identify the active ResourceManager with a single request to the Zookeeper. With the manual or automatic failover option, ResourceManager clients connect to each ResourceManager in a round-robin fashion until they locate the active ResourceManager; this results in delays when launching or querying jobs.
    • Consistent Configuration. All cluster nodes and clients can use the same yarn-site.xml configuration file. With manual or automatic failover, you must maintain a customized yarn-site.xml file for each node that runs the ResourceManager.

    For information, see Zero Configuration Failover for the ResourceManager. For information on enabling zero configuration failover, see Enabling Zero Configuration Failover for the ResourceManager

  • Manual or Automatic Failover - For information on the manual or automatic failover, see Manual or Automatic Failover for the ResourceManager.

    For information on changing to manual failover from automatic failover, see Manual Failover Administration.

    For information on configuration of automatic failover, see Configuring Automatic Failover for the ResourceManager.

    IMPORTANT
    The ResourceManager configuration properties can be set in yarn-site.xml if you wish to override any of the default values. See ResourceManager Failover Properties and ResourceManager Recovery Properties for property details.