wiki/info/landscape/paris2024/olympic-failover.md
... ...
@@ -0,0 +1,98 @@
+# Failover Scenarios for [Olympic Setup Paris 2024](https://wiki.sapsailing.com/wiki/info/landscape/paris2024/olympic-setup)
+
+This page describes a couple of failure scenarios and appropriate mitigation approaches. In addition, open questions are documented. It refers to the setup described in detail [here](https://wiki.sapsailing.com/wiki/info/landscape/paris2024/olympic-setup). It is work in progress and far from complete.
+
+## Hardware failure on primary Lenovo P1 with the Sailing Analytics Master
+
+### Scenario
+
+The Lenovo P1 device on which the SAP Sailing Analytics Primary Master is running fails and is no longer available. Reasons could include a hardware failure of the CPU, a software deadlock, etc.
+
+The local secondary master on the second P1 device is still available, as are the cloud replicas. New data will arrive on the secondary master through TracAPI and through the WindBot connection. The secondary master writes its wind data to the ``security_service`` replica set.
+
+The MongoDB replica set member running on the primary P1 may or may not still be available, but at this moment this is not too relevant, because in this scenario we assume that the primary master process is no longer working anyway.
+
+### Mitigation
+
+The second P1 needs to switch roles and become the ("primary") master. First, the outbound replication channel has to become the cloud RabbitMQ in eu-west-3, and the reverse port forward from ``paris-ssh.sapsailing.com:8888`` to ``sap-p1-1:8888`` needs to be changed so that it points to ``sap-p1-2:8888``. This requires the port forward from ``paris-ssh.sapsailing.com:8888`` to ``sap-p1-1:8888`` to be released, either by ``sap-p1-1`` having died entirely, or by explicitly switching this port forward off.
+
+```
+sap-p1-1:
+  tunnels-no-master # this will release the -R 8888 port forward
+sap-p1-2:
+  tunnels-master    # this will make the secondary master write to the cloud RabbitMQ
+                    # to feed all cloud replicas and will map the reverse port forward
+                    # from paris-ssh.sapsailing.com:8888 to sap-p1-2:8888
+```
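+
+The ``tunnels-master`` script itself is not reproduced here. For illustration only, a reverse port forward of the kind described above could be established manually with a plain SSH command along these lines (the ``sailing`` user name and the keep-alive option are assumptions, not taken from the actual script):
+
+```
+# Illustration, not the actual tunnels-master script: map connections to
+# paris-ssh.sapsailing.com:8888 back to port 8888 on this machine.
+ssh -N -o ServerAliveInterval=30 -R 8888:localhost:8888 sailing@paris-ssh.sapsailing.com
+```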
+
+SwissTiming needs to be informed about the switch immediately and can instantly connect to the "B" instance (secondary master).
+
+The ``sap-p1-1`` master can then be re-started, e.g., by only re-launching the Java process:
+
+```
+cd /home/sailing/servers/master
+./stop
+./start
+```
+
+When the primary master has recovered, we could in theory switch back. We may, however, just as well decide that no additional interruption shall be risked by another switch, and stay with ``sap-p1-2`` until the end of the day. To switch back, do this:
+
+```
+sap-p1-2:
+  tunnels # this will make the secondary master write to the local RabbitMQ,
+          # no longer feeding the cloud replicas, and will drop the reverse port
+          # forward from paris-ssh.sapsailing.com:8888 to sap-p1-2:8888
+sap-p1-1:
+  tunnels # this will re-establish the -R 8888 port forward from
+          # paris-ssh.sapsailing.com:8888 to the primary master on sap-p1-1
+```
+
+### Results and procedure tested in Marseille 2023
+
+The two laptops ``sap-p1-1`` and ``sap-p1-2`` are running the primary and secondary master processes, respectively. I tested a primary master failure by running ``tunnels-no-master`` on ``sap-p1-1``, and ``tunnels-master`` on ``sap-p1-2``. Then I made a change on the secondary master (``sap-p1-2``) and found it replicated on the cloud replicas. I then took back the change, and that, too, replicated nicely to the cloud. Then I reverted the tunnels to their original form by invoking ``tunnels`` first on ``sap-p1-2`` to release the reverse port forward from ``8888``, then on ``sap-p1-1``. Changes made afterwards on the secondary master were no longer reflected in the cloud, as expected.
+
+## Hardware failure on secondary Lenovo P1 with the Sailing Analytics Replica
+
+### Scenario
+
+The secondary Lenovo P1 experiences an unrecoverable hardware failure. The primary P1 is still available, and new data is safely stored in the database.
+
+Cloud users are not affected. SwissTiming should be informed about the lack of availability of the "B" system.
+
+The MongoDB replica set member on the secondary P1 will no longer be available.
+
+### Mitigation
+
+Local/on-site users need to have priority. If they were served by the secondary P1 before, they need to switch to the primary P1.
+
+The outbound replication channel on the primary P1 needs to switch to the cloud RabbitMQ in eu-west-3. The cloud replicas need to be reconfigured to use that channel.
+
+The SSH tunnels won't need to change.
+
+### Open questions
+
+How do local/on-site users use the Sailing Analytics? Will they simply be served from paris2024.sapsailing.com? A couple of decisions depend on this question.
+
+What exactly needs to be done, and where, to change the replication and be sure that it will work without data loss? I would suggest testing at least one of the described scenarios in Medemblik and creating a runbook.
+
+## Internet failure on the Marseille site
+
+### Scenario
+
+Internet connectivity is no longer available on-site in Marseille.
+
+### Open questions
+
+How will local/on-site users be connected to the local P1s, assuming that the LAN is still working?
+
+Would we try to provide connectivity through mobile hotspots, given that autossh should reliably start working again once it can reach the target IPs? Or shall we leave this issue to SwissTiming/the organizers and stick to the local connections to the Sailing Analytics?
+
+## TracTrac in the Cloud?
+
+### Scenario
+
+On-site Internet goes down. Does TracTrac have a full-fledged server running in the cloud that we could connect to from the cloud, to at least keep serving the RHBs?
+
+### Open questions
+
+How can the MongoDB replica set member in the cloud be re-configured dynamically to become primary even though it may have been started with priority 0?
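+
+A sketch, under assumptions, of how such a dynamic re-configuration could look in the MongoDB shell. The member index (``2``) is an assumption and must be checked against the actual replica set configuration; ``force: true`` is required while no primary is reachable:
+
+```
+mongosh --eval '
+  cfg = rs.conf();
+  cfg.members[2].priority = 1;     // the member that was started with priority 0
+  rs.reconfig(cfg, {force: true}); // force, since no primary may be reachable
+'
+```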
wiki/info/landscape/paris2024/olympic-setup.md
... ...
@@ -270,12 +270,14 @@ On ``sap-p1-1`` an SSH connection to ``sap-p1-2`` is maintained, with the follow

So the essential changes are that there are no more SSH connections into the cloud, and the port forward on each laptop's port 5673, which would point to ``rabbit-eu-west-3.sapsailing.com`` during regular operations, now points to ``sap-p1-2:5672``, where the RabbitMQ installation takes over from the cloud instance.

-### Letsencrypt Certificate for paris2024.sapsailing.com, security-service.sapsailing.com and paris2024-master.sapsailing.com
+### Letsencrypt Certificate for paris2024.sapsailing.com, security-service.sapsailing.com, paris2024-master.sapsailing.com, and paris2024-secondary-master.sapsailing.com

In order to allow us to access ``paris2024.sapsailing.com`` and ``security-service.sapsailing.com`` through any local HTTPS port forwarding, so that all ``JSESSION_GLOBAL`` etc. cookies with their ``Secure`` attribute are delivered properly, we need an SSL certificate. I've created one by doing
```
/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024.sapsailing.com
+/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024-master.sapsailing.com
+/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024-secondary-master.sapsailing.com
/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d security-service.sapsailing.com
```

... ...
@@ -291,7 +293,7 @@ server {
server_name paris2024.sapsailing.com;
ssl_certificate /etc/ssl/certs/paris2024.sapsailing.com.crt;
ssl_certificate_key /etc/ssl/private/paris2024.sapsailing.com.key;
- ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
+ ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;

# set client body size to 100MB
... ...
@@ -360,6 +362,8 @@ The ``monitor-mongo-replica-set-delay`` looks at the result of calling ``rs.prin

The ``monitor-disk-usage`` script checks the partition holding ``/var/lib/mongodb/``. Should it fill up to more than 90%, an alert will be sent using ``notify-operators``.

+On ``sap-p1-2`` we run a script ``compare-secondary-to-primary-master`` every five minutes which basically does a ``compareServers -ael``, using the REST API to compare server contents. If a difference is reported by the tool, an e-mail notification is sent to the list of operators.
+
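+For illustration, the five-minute schedule could be realized with a crontab entry like the following (the script path is an assumption, not taken from the actual setup):
+
+```
+*/5 * * * * /usr/local/bin/compare-secondary-to-primary-master
+```
+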
### Time Synchronizing
Set up a chronyd service on the desktop machine, in order to regularly connect via VPN and relay the time to the two P1s. Added
```