6333af64eedc14e85a50f004a11e0a71fbbae151
wiki/info/landscape/paris2024/olympic-failover.md
| ... | ... | @@ -0,0 +1,98 @@ |
| 1 | +# Failover Scenarios for [Olympic Setup Paris 2024](https://wiki.sapsailing.com/wiki/info/landscape/paris2024/olympic-setup) |
|
| 2 | + |
|
| 3 | +This page describes a couple of failure scenarios and appropriate mitigation approaches. In addition, open questions are documented. It refers to the setup described in detail [here](https://wiki.sapsailing.com/wiki/info/landscape/paris2024/olympic-setup). It is work in progress and far from complete.
|
| 4 | + |
|
| 5 | +## Hardware failure on primary Lenovo P1 with the Sailing Analytics Master |
|
| 6 | + |
|
| 7 | +### Scenario |
|
| 8 | + |
|
| 9 | +The Lenovo P1 device on which the SAP Sailing Analytics Primary Master is running fails and is no longer available. Reasons could include a hardware failure of the CPU, a software deadlock, etc.
|
| 10 | + |
|
| 11 | +The local secondary master on the second P1 device is still available, as are the cloud replicas. New data will arrive on the secondary master through TracAPI and through the WindBot connection. The secondary master writes its wind data to the ``security_service`` replica set.
|
| 12 | + |
|
| 13 | +The MongoDB replica set member running on the primary P1 may or may not still be available, but at this moment this is not too relevant, because in this scenario we assume that the primary master process is no longer working anyway.
|
| 14 | + |
|
| 15 | +### Mitigation |
|
| 16 | + |
|
| 17 | +The second P1 needs to switch roles to become the ("primary") master. Most importantly, the outbound replication channel has to point to the cloud RabbitMQ in eu-west-3, and the reverse port forward from ``paris-ssh.sapsailing.com:8888`` to ``sap-p1-1:8888`` needs to be changed so that it points to ``sap-p1-2:8888``. This requires the port forward from ``paris-ssh.sapsailing.com:8888`` to ``sap-p1-1:8888`` to be released, either by ``sap-p1-1`` having died entirely, or by explicitly switching this port forward off.
|
| 18 | + |
|
| 19 | +``` |
|
| 20 | + sap-p1-1: |
|
| 21 | + tunnels-no-master # this will release the -R 8888 port forward |
|
| 22 | + sap-p1-2: |
|
| 23 | + tunnels-master # this will make the secondary master write to the cloud RabbitMQ |
|
| 24 | + # to feed all cloud replicas and will map the reverse port forward |
|
| 25 | + # from paris-ssh.sapsailing.com:8888 to sap-p1-2:8888 |
|
| 26 | +``` |
|
| 27 | + |
|
| 28 | +SwissTiming needs to be informed about the switch immediately and can instantly connect to the "B" instance (secondary master). |
|
| 29 | + |
|
| 30 | +The ``sap-p1-1`` master can then be re-started, e.g. by re-launching only the Java process:
|
| 31 | + |
|
| 32 | +``` |
|
| 33 | + cd /home/sailing/servers/master |
|
| 34 | + ./stop |
|
| 35 | + ./start |
|
| 36 | +``` |
|
| 37 | + |
|
| 38 | +When the primary master has recovered, we could in theory switch back. We may, however, decide not to risk another interruption through a second switch-over and instead stay with ``sap-p1-2`` until the end of the day. To switch back, do this:
|
| 39 | + |
|
| 40 | +``` |
|
| 41 | + sap-p1-2: |
|
| 42 | + tunnels # this will make the secondary master write to the local RabbitMQ, |
|
| 43 | + # no longer feeding the cloud replicas, and stop the reverse port forward
|
| 44 | + # from paris-ssh.sapsailing.com:8888 to sap-p1-2:8888 |
|
| 45 | + sap-p1-1: |
|
| 46 | + tunnels # this will establish the -R 8888 port forward from paris-ssh.sapsailing.com:8888 |
|
| 47 | + # again to the primary master on sap-p1-1 |
|
| 48 | +``` |
|
| 49 | + |
|
| 50 | +### Results and procedure tested in Marseille 2023 |
|
| 51 | + |
|
| 52 | +The two laptops, ``sap-p1-1`` and ``sap-p1-2``, are running their primary and secondary master processes, respectively. I tested a primary master failure by running ``tunnels-no-master`` on ``sap-p1-1`` and ``tunnels-master`` on ``sap-p1-2``. Then I made a change on the secondary master (``sap-p1-2``) and found it replicated on the cloud replicas. I then took back the change, and that, too, replicated nicely to the cloud. Then I reverted the tunnels to their original form by invoking ``tunnels`` first on ``sap-p1-2`` to release the reverse port forward from ``8888``, then on ``sap-p1-1``. Changes made afterwards on the secondary master were no longer reflected in the cloud, as expected.
|
| 53 | + |
|
| 54 | +## Hardware failure on secondary Lenovo P1 with the Sailing Analytics Replica |
|
| 55 | + |
|
| 56 | +### Scenario |
|
| 57 | + |
|
| 58 | +The secondary Lenovo P1 experiences an unrecoverable hardware failure. The primary P1 is still available, and new data is safely stored in the database.
|
| 59 | + |
|
| 60 | +Cloud users are not affected. SwissTiming should be informed about the lack of availability of the "B" system. |
|
| 61 | + |
|
| 62 | +The MongoDB replica set member on the secondary P1 will no longer be available.
|
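| | +To verify what the replica set looks like after losing that member, a quick check from a surviving node helps (a sketch; host names and connection details depend on the actual setup):
|
| | +```
|
| | + mongosh --host sap-p1-1
|
| | + rs.status()   // lists each member's state; the lost member shows up as unreachable
|
| | +```
|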
| 63 | + |
|
| 64 | +### Mitigation |
|
| 65 | + |
|
| 66 | +Local/on-site users need to have priority. If they were served by the secondary P1 before, they need to switch to the primary P1.
|
| 67 | + |
|
| 68 | +The outbound replication channel on the primary P1 needs to switch to the RabbitMQ in ap-northeast-1. The cloud replicas need to be reconfigured to use that channel.
|
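| | +As a sketch of what this switch could look like (both the ap-northeast-1 host name, inferred from the ``rabbit-eu-west-3.sapsailing.com`` naming pattern, and the use of ``paris-ssh`` as the tunneling host are assumptions), the local port forward on 5673 would be re-established against the other region:
|
| | +```
|
| | + # hypothetical host name, following the eu-west-3 naming pattern
|
| | + ssh -N -L 5673:rabbit-ap-northeast-1.sapsailing.com:5672 paris-ssh.sapsailing.com
|
| | +```
|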
| 69 | + |
|
| 70 | +SSH tunnels won't need to change. |
|
| 71 | + |
|
| 72 | +### Open questions |
|
| 73 | + |
|
| 74 | +How do local/on-site users use the Sailing Analytics? Will they simply be served from tokyo2020.sapsailing.com? A couple of decisions depend on this question. |
|
| 75 | + |
|
| 76 | +What exactly needs to be done, and where, to change the replication and be sure that it will work without data loss? I would suggest testing at least one of the described scenarios in Medemblik and creating a runbook.
|
| 77 | + |
|
| 78 | +## Internet failure on Enoshima site |
|
| 79 | + |
|
| 80 | +### Scenario |
|
| 81 | + |
|
| 82 | +Internet connectivity is no longer available at the Enoshima site.
|
| 83 | + |
|
| 84 | +### Open questions |
|
| 85 | + |
|
| 86 | +How will local/on-site users be connected to the local P1s, assuming that the LAN is still working? |
|
| 87 | + |
|
| 88 | +Would we try to provide connectivity through mobile hotspots, given that autossh should reliably start working again once it can reach the target IPs? Or shall we leave this issue to SwissTiming/the organizers and stick to the local connections to the Sailing Analytics?
|
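| | +For reference, a typical autossh invocation for the reverse port forward would look like this (a sketch; the actual tunnel scripts on the P1s may use different options):
|
| | +```
|
| | + autossh -M 0 -N \
|
| | +   -o "ServerAliveInterval 15" -o "ServerAliveCountMax 3" \
|
| | +   -R 8888:localhost:8888 paris-ssh.sapsailing.com
|
| | +```
|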
| 89 | + |
|
| 90 | +## TracTrac in the Cloud? |
|
| 91 | + |
|
| 92 | +### Scenario |
|
| 93 | + |
|
| 94 | +On-site Internet goes down; does TracTrac have a full-fledged server running in the cloud that we could connect to from the cloud in order to at least keep serving the RHBs?
|
| 95 | + |
|
| 96 | +### Open questions |
|
| 97 | + |
|
| 98 | +How can the MongoDB in the cloud be re-configured dynamically to become primary even though it may have been started with priority 0? |
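|
| | +One standard way to do this (a sketch to be verified against the actual cluster; the member index is an assumption) is a forced reconfiguration from ``mongosh`` on the cloud member, which works even while no primary is reachable:
|
| | +```
|
| | + cfg = rs.conf()
|
| | + cfg.members[2].priority = 1      // index of the cloud member is an assumption
|
| | + rs.reconfig(cfg, {force: true})  // force is required while no primary is reachable
|
| | +```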
wiki/info/landscape/paris2024/olympic-setup.md
| ... | ... | @@ -270,12 +270,14 @@ On ``sap-p1-1`` an SSH connection to ``sap-p1-2`` is maintained, with the follow |
| 270 | 270 | |
| 271 | 271 | So the essential changes are that there are no more SSH connections into the cloud, and the port forward on each laptop's port 5673, which would point to ``rabbit-eu-west-3.sapsailing.com`` during regular operations, now points to ``sap-p1-2:5672`` where the RabbitMQ installation takes over from the cloud instance. |
| 272 | 272 | |
| 273 | -### Letsencrypt Certificate for paris2024.sapsailing.com, security-service.sapsailing.com and paris2024-master.sapsailing.com |
|
| 273 | +### Letsencrypt Certificate for paris2024.sapsailing.com, security-service.sapsailing.com, paris2024-master.sapsailing.com, and paris2024-secondary-master.sapsailing.com |
|
| 274 | 274 | |
| 275 | 275 | In order to allow us to access ``paris2024.sapsailing.com`` and ``security-service.sapsailing.com`` with any HTTPS port forwarding locally so that all ``JSESSION_GLOBAL`` etc. cookies with their ``Secure`` attribute are delivered properly, we need an SSL certificate. I've created one by doing |
| 276 | 276 | |
| 277 | 277 | ``` |
| 278 | 278 | /usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024.sapsailing.com |
| 279 | +/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024-master.sapsailing.com |
|
| 280 | +/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024-secondary-master.sapsailing.com |
|
| 279 | 281 | /usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d security-service.sapsailing.com |
| 280 | 282 | ``` |
| 281 | 283 | |
| ... | ... | @@ -291,7 +293,7 @@ server { |
| 291 | 293 | server_name paris2024.sapsailing.com; |
| 292 | 294 | ssl_certificate /etc/ssl/certs/paris2024.sapsailing.com.crt; |
| 293 | 295 | ssl_certificate_key /etc/ssl/private/paris2024.sapsailing.com.key; |
| 294 | - ssl_protocols TLSv1 TLSv1.1 TLSv1.2; |
|
| 296 | + ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3; |
|
| 295 | 297 | ssl_ciphers HIGH:!aNULL:!MD5; |
| 296 | 298 | |
| 297 | 299 | # set client body size to 100MB |
| ... | ... | @@ -360,6 +362,8 @@ The ``monitor-mongo-replica-set-delay`` looks as the result of calling ``rs.prin |
| 360 | 362 | |
| 361 | 363 | The ``monitor-disk-usage`` script checks the partition holding ``/var/lib/mongodb/``. Should it fill up to more than 90%, an alert will be sent using ``notify-operators``. |
| 362 | 364 | |
| 365 | +On ``sap-p1-2`` we run a script ``compare-secondary-to-primary-master`` every five minutes which basically does a ``compareServers -ael``, using the REST API to compare server contents. If the tool reports a difference, an e-mail notification is sent out to the list of operators.
|
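| | +The wiki names the script and the five-minute interval but not where it is installed, so the path in this crontab sketch is an assumption:
|
| | +```
|
| | + # hypothetical installation path
|
| | + */5 * * * * /home/sailing/bin/compare-secondary-to-primary-master
|
| | +```
|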
| 366 | + |
|
| 363 | 367 | ### Time Synchronizing |
| 364 | 368 | Set up the chronyd service on the desktop machine, in order to regularly connect via VPN and relay the time towards the two P1s. Added
| 365 | 369 | ``` |