
Offline control plane recover (#10660)

* ignore_unreachable for etcd dir cleanup

ignore_errors only ignores errors that occur within the "file" module. However,
when the target node is offline, the playbook will still fail at this task
because the node is in an "unreachable" state. Setting "ignore_unreachable: true"
allows the playbook to bypass offline nodes and proceed with the recovery tasks
on the remaining online nodes.
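
To illustrate the difference, here is a minimal sketch of the pattern (a hedged example, not the role's actual task; the group name, path, and task name are placeholders):

```yaml
# Minimal sketch: remove a directory on every broken etcd node, tolerating
# both module errors and hosts that cannot be reached at all.
- hosts: broken_etcd            # hypothetical group of offline/broken nodes
  gather_facts: false           # avoid failing at fact gathering on offline hosts
  tasks:
    - name: Remove etcd data dir even if some nodes are offline
      ansible.builtin.file:
        path: /var/lib/etcd     # placeholder; the role uses its own variable
        state: absent
      ignore_errors: true       # swallows failures raised by the file module itself
      ignore_unreachable: true  # also skips hosts Ansible cannot connect to
```

Without ignore_unreachable, the play marks an offline host unreachable before the module ever runs, so ignore_errors alone never gets a chance to apply.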

* Re-arrange control plane recovery runbook steps

* Remove suggestion to manually update IP addresses

The suggestion was added in 48a182844c 4
years ago. But a new task added 2 years ago, in
ee0f1e9d58, automatically updates the API
server argument with the updated etcd node IP addresses, so this suggestion
is no longer needed.
Yuhao Zhang authored 10 months ago, committed by GitHub
commit 0e971a37aa
2 changed files with 5 additions and 6 deletions:

1. docs/recover-control-plane.md (10 changes)
2. roles/recover_control_plane/etcd/tasks/main.yml (1 change)

docs/recover-control-plane.md

@@ -3,11 +3,6 @@
 To recover from broken nodes in the control plane use the "recover\-control\-plane.yml" playbook.
-* Backup what you can
-* Provision new nodes to replace the broken ones
-* Place the surviving nodes of the control plane first in the "etcd" and "kube\_control\_plane" groups
-* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube\_control\_plane" groups
 Examples of what broken means in this context:
@@ -19,8 +14,12 @@ __Note that you need at least one functional node to be able to recover using th
 ## Runbook
+* Backup what you can
+* Provision new nodes to replace the broken ones
 * Move any broken etcd nodes into the "broken\_etcd" group, make sure the "etcd\_member\_name" variable is set.
 * Move any broken control plane nodes into the "broken\_kube\_control\_plane" group.
+* Place the surviving nodes of the control plane first in the "etcd" and "kube\_control\_plane" groups
+* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube\_control\_plane" groups
 Then run the playbook with ```--limit etcd,kube_control_plane``` and increase the number of ETCD retries by setting ```-e etcd_retries=10``` or something even larger. The amount of retries required is difficult to predict.
@@ -35,7 +34,6 @@ The playbook attempts to figure out it the etcd quorum is intact. If quorum is l
 ## Caveats
 * The playbook has only been tested with fairly small etcd databases.
-* If your new control plane nodes have new ip addresses you may have to change settings in various places.
 * There may be disruptions while running the playbook.
 * There are absolutely no guarantees.
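
For reference, the full runbook invocation described in the hunk above might look like the following; the inventory path is an assumption and will differ per cluster:

```shell
# Hypothetical example: recover the control plane, limiting the run to the
# affected groups and raising the etcd retry count as the docs suggest.
ansible-playbook -i inventory/mycluster/hosts.yml recover-control-plane.yml \
  --limit etcd,kube_control_plane -e etcd_retries=10
```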

roles/recover_control_plane/etcd/tasks/main.yml

@@ -39,6 +39,7 @@
   delegate_to: "{{ item }}"
   with_items: "{{ groups['broken_etcd'] }}"
   ignore_errors: true # noqa ignore-errors
+  ignore_unreachable: true
   when:
     - groups['broken_etcd']
     - has_quorum
