wiki/info/landscape/paris2024/olympic-failover.md
... ...
@@ -0,0 +1,98 @@
+# Failover Scenarios for [Olympic Setup Paris 2024](https://wiki.sapsailing.com/wiki/info/landscape/paris2024/olympic-setup)
+
+This page describes a couple of failure scenarios and appropriate mitigation approaches. In addition, open questions are documented. It refers to the setup described in detail [here](https://wiki.sapsailing.com/wiki/info/landscape/paris2024/olympic-setup). It is work in progress and far from complete.
+
+## Hardware failure on primary Lenovo P1 with the Sailing Analytics Master
+
+### Scenario
+
+The Lenovo P1 device on which the SAP Sailing Analytics Primary Master is running fails and is no longer available. Reasons could include a hardware failure of the CPU, a software deadlock, etc.
+
+The local secondary master on the second P1 device is still available, as are the cloud replicas. New data will arrive on the secondary master through TracAPI and through the WindBot connection. The secondary master writes its wind data to the ``security_service`` replica set.
+
+The MongoDB replica set member running on the primary P1 may or may not still be available, but at this moment this is not too relevant, because in this scenario we assume that the primary master process is no longer working anyway.
+
+### Mitigation
+
+The second P1 needs to switch roles and become the ("primary") master. First, the outbound replication channel has to become the cloud RabbitMQ in eu-west-3, and the reverse port forward from ``paris-ssh.sapsailing.com:8888`` to ``sap-p1-1:8888`` needs to be changed so that it points to ``sap-p1-2:8888``. This requires the port forward from ``paris-ssh.sapsailing.com:8888`` to ``sap-p1-1:8888`` to be released, either by ``sap-p1-1`` having died entirely, or by explicitly switching this port forward off.
+
+```
+sap-p1-1:
+  tunnels-no-master # this will release the -R 8888 port forward
+sap-p1-2:
+  tunnels-master    # this will make the secondary master write to the cloud RabbitMQ
+                    # to feed all cloud replicas and will map the reverse port forward
+                    # from paris-ssh.sapsailing.com:8888 to sap-p1-2:8888
+```
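+
+The ``tunnels-master`` script itself is not reproduced here. For illustration only, a reverse port forward of the kind described above could be established manually with a plain SSH command along these lines (the ``sailing`` user name and the keep-alive option are assumptions, not taken from the actual script):
+
+```
+# Illustration, not the actual tunnels-master script: map connections to
+# paris-ssh.sapsailing.com:8888 back to port 8888 on this machine.
+ssh -N -o ServerAliveInterval=30 -R 8888:localhost:8888 sailing@paris-ssh.sapsailing.com
+```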
+
+SwissTiming needs to be informed about the switch immediately and can instantly connect to the "B" instance (secondary master).
+
+The ``sap-p1-1`` master can then be re-started, e.g., by only re-launching the Java process:
+
+```
+cd /home/sailing/servers/master
+./stop
+./start
+```
+
+When the primary master has recovered, we could in theory switch back. We may, however, just as well decide that no additional interruption shall be risked by another switch, and stay with ``sap-p1-2`` until the end of the day. To switch back, do this:
+
+```
+sap-p1-2:
+  tunnels # this will make the secondary master write to the local RabbitMQ,
+          # no longer feeding the cloud replicas, and will drop the reverse port
+          # forward from paris-ssh.sapsailing.com:8888 to sap-p1-2:8888
+sap-p1-1:
+  tunnels # this will re-establish the -R 8888 port forward from
+          # paris-ssh.sapsailing.com:8888 to the primary master on sap-p1-1
+```
+
+### Results and procedure tested in Marseille 2023
+
+The two laptops ``sap-p1-1`` and ``sap-p1-2`` are running the primary and secondary master processes, respectively. I tested a primary master failure by running ``tunnels-no-master`` on ``sap-p1-1``, and ``tunnels-master`` on ``sap-p1-2``. Then I made a change on the secondary master (``sap-p1-2``) and found it replicated on the cloud replicas. I then took back the change, and that, too, replicated nicely to the cloud. Then I reverted the tunnels to their original form by invoking ``tunnels`` first on ``sap-p1-2`` to release the reverse port forward from ``8888``, then on ``sap-p1-1``. Changes made afterwards on the secondary master were no longer reflected in the cloud, as expected.
+
+## Hardware failure on secondary Lenovo P1 with the Sailing Analytics Replica
+
+### Scenario
+
+The secondary Lenovo P1 experiences an unrecoverable hardware failure. The primary P1 is still available, and new data is safely stored in the database.
+
+Cloud users are not affected. SwissTiming should be informed about the lack of availability of the "B" system.
+
+The MongoDB replica set member on the secondary P1 will no longer be available.
+
+### Mitigation
+
+Local/on-site users need to have priority. If they were served by the secondary P1 before, they need to switch to the primary P1.
+
+The outbound replication channel on the primary P1 needs to switch to the cloud RabbitMQ in eu-west-3. The cloud replicas need to be reconfigured to use that channel.
+
+The SSH tunnels won't need to change.
+
+### Open questions
+
+How do local/on-site users use the Sailing Analytics? Will they simply be served from paris2024.sapsailing.com? A couple of decisions depend on this question.
+
+What exactly needs to be done, and where, to change the replication and be sure that it will work without data loss? I would suggest testing at least one of the described scenarios in Medemblik and creating a runbook.
+
+## Internet failure on the Marseille site
+
+### Scenario
+
+Internet connectivity is no longer available on-site in Marseille.
+
+### Open questions
+
+How will local/on-site users be connected to the local P1s, assuming that the LAN is still working?
+
+Would we try to provide connectivity through mobile hotspots, given that autossh should reliably start working again once it can reach the target IPs? Or shall we leave this issue to SwissTiming/the organizers and stick to the local connections to the Sailing Analytics?
+
+## TracTrac in the Cloud?
+
+### Scenario
+
+On-site Internet goes down. Does TracTrac have a full-fledged server running in the cloud that we could connect to from the cloud, to at least keep serving the RHBs?
+
+### Open questions
+
+How can the MongoDB replica set member in the cloud be re-configured dynamically to become primary even though it may have been started with priority 0?
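+
+A sketch, under assumptions, of how such a dynamic re-configuration could look in the MongoDB shell. The member index (``2``) is an assumption and must be checked against the actual replica set configuration; ``force: true`` is required while no primary is reachable:
+
+```
+mongosh --eval '
+  cfg = rs.conf();
+  cfg.members[2].priority = 1;     // the member that was started with priority 0
+  rs.reconfig(cfg, {force: true}); // force, since no primary may be reachable
+'
+```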
wiki/info/landscape/paris2024/olympic-setup.md
... ...
@@ -270,12 +270,14 @@ On ``sap-p1-1`` an SSH connection to ``sap-p1-2`` is maintained, with the follow

So the essential changes are that there are no more SSH connections into the cloud, and the port forward on each laptop's port 5673, which would point to ``rabbit-eu-west-3.sapsailing.com`` during regular operations, now points to ``sap-p1-2:5672``, where the RabbitMQ installation takes over from the cloud instance.

-### Letsencrypt Certificate for paris2024.sapsailing.com, security-service.sapsailing.com and paris2024-master.sapsailing.com
+### Letsencrypt Certificate for paris2024.sapsailing.com, security-service.sapsailing.com, paris2024-master.sapsailing.com, and paris2024-secondary-master.sapsailing.com

In order to allow us to access ``paris2024.sapsailing.com`` and ``security-service.sapsailing.com`` through any local HTTPS port forwarding, so that all ``JSESSION_GLOBAL`` etc. cookies with their ``Secure`` attribute are delivered properly, we need an SSL certificate. I've created one by doing
```
/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024.sapsailing.com
+/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024-master.sapsailing.com
+/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d paris2024-secondary-master.sapsailing.com
/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d security-service.sapsailing.com
```

... ...
@@ -291,7 +293,7 @@ server {
server_name paris2024.sapsailing.com;
ssl_certificate /etc/ssl/certs/paris2024.sapsailing.com.crt;
ssl_certificate_key /etc/ssl/private/paris2024.sapsailing.com.key;
- ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
+ ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;

# set client body size to 100MB
... ...
@@ -360,6 +362,8 @@ The ``monitor-mongo-replica-set-delay`` looks at the result of calling ``rs.prin

The ``monitor-disk-usage`` script checks the partition holding ``/var/lib/mongodb/``. Should it fill up to more than 90%, an alert will be sent using ``notify-operators``.

+On ``sap-p1-2`` we run a script ``compare-secondary-to-primary-master`` every five minutes which basically does a ``compareServers -ael``, using the REST API to compare server contents. If a difference is reported by the tool, an e-mail notification is sent to the list of operators.
+
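+For illustration, the five-minute schedule could be realized with a crontab entry like the following (the script path is an assumption, not taken from the actual setup):
+
+```
+*/5 * * * * /usr/local/bin/compare-secondary-to-primary-master
+```
+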
### Time Synchronizing
Set up a chronyd service on the desktop machine, in order to regularly connect via VPN and relay the time to the two P1s. Added
```