Manual or Automatic Failover for the ResourceManager
With the manual or automatic failover mechanism, an active ResourceManager and one or more standby ResourceManager processes run in the cluster. The standby ResourceManager nodes run the ResourceManager process without loading the working state. When the active ResourceManager fails, one of the standby ResourceManager nodes can load the working state from the ZooKeeper and continue providing services to the cluster.
ResourceManager clients (HPE Ezmeral Data Fabric client nodes, ApplicationMaster processes, and NodeManager nodes) attempt connections to the ResourceManager nodes in a round-robin fashion until they hit an active ResourceManager node. If the active ResourceManager node is down, ResourceManager clients resume round-robin polling until an active ResourceManager node is detected.
For web requests, including REST API requests, standby ResourceManager nodes automatically redirect web requests to the active ResourceManager node.
Difference between Manual Failover and Automatic Failover
The difference between manual failover and automatic failover is how the transition from standby to active occurs for the ResourceManager process.
- With manual failover, you manually invoke the transition of the ResourceManager from standby to active with the yarn rmadmin command.
- With automatic failover, the ResourceManager processes have an embedded ZooKeeper-based ActiveStandbyElector, which chooses the active ResourceManager. This ActiveStandbyElector detects failures in the currently active ResourceManager and automatically transitions one of the standby ResourceManagers to an active state.
If you specify multiple ResourceManagers when you run configure.sh, automatic failover is configured. However, you can edit the yarn-site.xml file to enable manual failover instead.