Quorum Service Failure Tolerance

For zero-downtime rolling updates, a service must maintain a quorum while nodes are being added and removed. When a node is added, the total member count and the quorum size must both be increased; otherwise a network partition, failure, or extended delay during the rolling update can lead to a split brain. A complex service with many nodes may be almost continuously in a state of rolling updates to upgrade the OS or foundation software, apply security patches, or deploy application bug fixes and new features.
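
To make the split-brain risk concrete, here is a minimal sketch in plain Java (the member counts are illustrative) of what can happen when the member count grows but the quorum size does not:

    public class StaleQuorum {
        // Majority quorum: the smallest group that is more than half.
        static int quorumSize(int members) {
            return members / 2 + 1;
        }

        public static void main(String[] args) {
            // A cluster grows from 5 to 6 members during a rolling update,
            // but the quorum size is still derived from the old count.
            int staleQuorum = quorumSize(5);            // 3
            // A network partition now splits the 6 members 3 / 3.
            // Both halves satisfy the stale quorum of 3 and can accept
            // writes independently: a split brain. With the correct
            // quorum of quorumSize(6) == 4, neither half proceeds alone.
            System.out.println("stale quorum:   " + staleQuorum);
            System.out.println("correct quorum: " + quorumSize(6));
        }
    }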

Assume we want a service that can always tolerate two node failures before halting, for reliability and availability. From the table below, the minimum is five initial member nodes. With fewer than five nodes, two node failures force the service to halt in order to guarantee consistency.
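
    Member nodes    Quorum size    Node failures tolerated
    1               1              0
    2               2              0
    3               2              1
    4               3              1
    5               3              2
    6               4              2
    7               4              3

The table follows directly from majority-quorum arithmetic: the quorum is floor(members / 2) + 1, and whatever remains is the failure tolerance. A short sketch that reproduces it:

    public class QuorumTable {
        public static void main(String[] args) {
            for (int members = 1; members <= 7; members++) {
                int quorum = members / 2 + 1;        // strict majority
                int tolerated = members - quorum;    // failures survivable
                System.out.printf("%d members: quorum %d, tolerates %d%n",
                        members, quorum, tolerated);
            }
        }
    }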

When performing a rolling update, a new node is added. Once the new node is synchronized and brought into the quorum, the total member count increases by one. If the count was previously even, this also increases the failure tolerance by one; going from an odd count to an even one leaves the tolerance unchanged, because the quorum size grows instead. At this point the node being replaced is terminated, and the total member count is reduced so that the failure tolerance is accurately reflected again.
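
As a sketch of that sequence (the MembershipApi interface and its methods are hypothetical placeholders, standing in for whatever the coordination service actually provides):

    public class RollingReplace {
        // Hypothetical membership operations; real services expose their
        // own equivalents (e.g. ZooKeeper's reconfig, shown further below).
        interface MembershipApi {
            void add(String server);          // grow the ensemble by one
            void awaitSynced(String server);  // block until caught up and voting
            void remove(String server);       // shrink the ensemble by one
        }

        static void replaceMember(MembershipApi cluster,
                                  String oldServer, String newServer) {
            cluster.add(newServer);          // member count N -> N + 1
            cluster.awaitSynced(newServer);  // new node is now in the quorum
            cluster.remove(oldServer);       // member count N + 1 -> N
            // The quorum size must be recomputed at each of these steps,
            // or the failure tolerance the service reports will be wrong.
        }
    }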

ZooKeeper versions prior to 3.5.0-ALPHA do not support dynamic reconfiguration. On those versions the ZooKeeper configuration files must be updated manually and all nodes restarted after each membership change.
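
From 3.5.0 onward, the reconfig API removes that restriction. A hedged sketch using the Java client's ZooKeeperAdmin follows; the host names, ports, and server ids are illustrative, and the ensemble must be running with reconfigEnabled=true and the caller holding appropriate permissions:

    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    public class DynamicReconfig {
        public static void main(String[] args) throws Exception {
            // Connection handling is elided; production code should wait
            // for the client to reach the connected state first.
            ZooKeeperAdmin admin =
                    new ZooKeeperAdmin("zk1:2181", 15000, event -> { });

            // Add the replacement node (id 6), then remove the retiring
            // node (id 1), each as a quorum-committed config change.
            // -1 means "apply regardless of the current config version".
            admin.reconfigure("server.6=zk6:2888:3888;2181",
                    null, null, -1, new Stat());
            admin.reconfigure(null, "1", null, -1, new Stat());

            admin.close();
        }
    }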




