home.md
... ...
@@ -40,7 +40,7 @@ SAP is at the center of today’s technology revolution, developing innovations
40 40
* [[Establishing support@sapsailing.com with AWS SES, SNS, and Lambda|wiki/info/landscape/support-email]]
41 41
* [[Creating an EC2 image for a MongoDB Replica Set from scratch|wiki/info/landscape/creating-ec2-mongodb-image-from-scratch]]
42 42
* [[Setting up dedicated S3 buckets|wiki/info/landscape/s3-bucket-setup]]
43
- * [[Large-Scale Set-Ups, e.g., Olympic Games|wiki/info/landscape/olympic-setup]]
43
+ * [[Large-Scale Set-Ups, e.g., Olympic Games|wiki/info/landscape/tokyo2020/olympic-setup]]
44 44
* [[Building and Deploying|wiki/info/landscape/building-and-deploying]]
45 45
* [[Data Mining Architecture|wiki/info/landscape/data-mining-architecture]]
46 46
* [[Typical Data Mining Scenarios|wiki/info/landscape/typical-data-mining-scenarios]]
wiki/howto/autossh.md
... ...
@@ -24,7 +24,7 @@ It can therefore be useful to use the tool `checkinstall` instead of `sudo make
24 24
It is beneficial to use `tmux` for the usage of `autossh` to create a terminal session, in which `autossh` can run without blocking the whole terminal session. Use `yum install tmux` for the installation.
25 25
26 26
Autossh itself relies on `ssh` and passes its command-line arguments on to `ssh`. There are a few exceptions, for example the -M argument, which specifies the monitoring and echo port needed for the operation of `autossh` (it uses the specified port and the port + 1). Autossh can also be configured via environment variables; refer to the `autossh` manual for further information (`man autossh`).
27
-It can be useful to tweak the `/etc/ssh_config` and `/etc/sshd_config` to assure a quick recovery from failovers (see [Olympic Setup](/wiki/info/landscape/olympic-setup.md#tunnels) for the configuration). A typical command for creating a ssh tunnel is the following:
27
+It can be useful to tweak the `/etc/ssh_config` and `/etc/sshd_config` to assure a quick recovery from failovers (see [Olympic Setup Tokyo 2020](/wiki/info/landscape/tokyo2020/olympic-setup.md#tunnels) and [Olympic Setup Paris 2024](/wiki/info/landscape/paris2024/olympic-setup.md#tunnels) for the configuration). A typical command for creating a ssh tunnel is the following:
28 28
29 29
```
30 30
autossh -M 20000 -N -L *:5672:localhost:5672 -i /root/.ssh/id_rsa <ip-address>
... ...
@@ -32,4 +32,4 @@ autossh -M 20000 -N -L *:5672:localhost:5672 -i /root/.ssh/id_rsa <ip-address>
32 32
33 33
+ -N specifies to not execute remote commands
34 34
+ -L specifies the connection to be forwarded
35
-+ -i specifies the identity file
... ...
\ No newline at end of file
0
++ -i specifies the identity file
wiki/info/landscape/amazon-ec2.md
... ...
@@ -4,7 +4,7 @@
4 4
5 5
## Quickstart
6 6
7
-Our default region in AWS EC2 is eu-west-1 (Ireland). Tests are currently run in the otherwise unused region eu-west-2 (London). Most regular operations can be handled through the AdminConsole's "Advanced / Landscape" tab. See, e.g., [https://security-service.sapsailing.com/gwt/AdminConsole.html#LandscapeManagementPlace:](https://security-service.sapsailing.com/gwt/AdminConsole.html#LandscapeManagementPlace:). Some operations occurring not so frequently still require more in-depth knowledge of steps, manual execution of commands on the command line and some basic Linux understanding. This also goes for [highest-scale set-ups requiring an AWS Global Accelerator with or without Geo-Blocking through AWS Web Application Firewall (WAF) with Web ACLs](https://wiki.sapsailing.com/wiki/info/landscape/olympic-setup#setup-for-the-olympic-summer-games-2020-2021-tokyo_aws-setup_global-accelerator).
7
+Our default region in AWS EC2 is eu-west-1 (Ireland). Tests are currently run in the otherwise unused region eu-west-2 (London). Most regular operations can be handled through the AdminConsole's "Advanced / Landscape" tab. See, e.g., [https://security-service.sapsailing.com/gwt/AdminConsole.html#LandscapeManagementPlace:](https://security-service.sapsailing.com/gwt/AdminConsole.html#LandscapeManagementPlace:). Some operations occurring not so frequently still require more in-depth knowledge of steps, manual execution of commands on the command line and some basic Linux understanding. This also goes for [highest-scale set-ups requiring an AWS Global Accelerator with or without Geo-Blocking through AWS Web Application Firewall (WAF) with Web ACLs](https://wiki.sapsailing.com/wiki/info/landscape/tokyo2020/olympic-setup#setup-for-the-olympic-summer-games-2020-2021-tokyo_aws-setup_global-accelerator).
8 8
9 9
## Important Servers, Hostnames
10 10
wiki/info/landscape/olympic-failover.md
... ...
@@ -1,121 +0,0 @@
1
-# Failover Scenarios for [Olympic Setup](https://wiki.sapsailing.com/wiki/info/landscape/olympic-setup)
2
-
3
-This page is meant to describe a couple of failure scenarios and appropriate mitigation approaches. In addition, open questions are documented. It refers to the setup described in detail [here](https://wiki.sapsailing.com/wiki/info/landscape/olympic-setup). It is work in progress and far from complete.
4
-
5
-## Hardware failure on primary Lenovo P1 with the Sailing Analytics Master
6
-
7
-### Scenario
8
-
9
-The Lenovo P1 device on which the SAP Sailing Analytics Master / Primary is running fails and is not available anymore. A possible reason could be a hardware failure, e.g., of the CPU.
10
-
11
-The local replica on the second P1 device is still available, as are the cloud replicas. Data from the past is still available to users and on-site consumers. However, no new data will be stored.
12
-
13
-The MongoDB replica set member running on the primary P1 will also no longer be available.
14
-
15
-### Mitigation
16
-
17
-The second P1 needs to switch its role to become the master. This means a new SAP Sailing Analytics Master process has to be started from scratch. The outbound replication channel has to be the cloud RabbitMQ in ap-northeast-1.
18
-
19
-SwissTiming needs to be informed once the master is ready.
20
-
21
-Cloud replicas need to be reconfigured to the new channel that the master uses as outbound.
22
-
23
-SSH tunnels won't need to change.
24
-
25
-The local replica has to be safely shut down, on-site users might experience some change, depending on how we decide with local routing.
26
-
27
-Alternatively, based on the Tokyo 2020 experience, we may consider running the second Lenovo P1 laptop also in "master/primary" mode as a "shadow" where all we need to focus on is initially connecting the TracTrac races and linking them properly to the leaderboard slots. From there on, administration other than adding or removing wind sources proved to be low effort and close to zero interaction. The official scores are transmitted after confirmation from TracTrac, and so are all start times, finish times, and penalties. This approach could reduce the time to fail over from the primary to the shadow system to a lot less than would be required for a re-start of the second Lenovo P1 in master/primary mode.
28
-
29
-### Open questions
30
-
31
-How exactly has the switch to happen?
32
-
33
-Do we re-use the port?
34
-
35
-What are the pros and cons?
36
-
37
-I would like to see a somewhat detailed run-book.
38
-
39
-### Results and procedure tested in Medemblik 2021
40
-
41
-Check master.conf in ``/home/sailing/servers/master`` on sap-p1-2 for the desired build and the correct setting of the new variable ``INSTALL_FROM_SCP_USER_AT_HOST_AND_PORT``. There is a tunnel listening on 22222 which forwards the traffic through tokyo-ssh to sapsailing.com:22.
42
-So a valid entry would be:
43
-```
44
-INSTALL_FROM_RELEASE=build-202106012325
45
-INSTALL_FROM_SCP_USER_AT_HOST_AND_PORT="trac@localhost:22222"
46
-```
47
-
48
-Now execute, this will download and extract the build:
49
-```
50
-cd /home/sailing/servers/master; rm env.sh; cat master.conf | ./refreshInstance.sh auto-install-from-stdin
51
-```
52
-
53
-Now we stop the replica, make sure you are user ``sailing``:
54
-```
55
-/home/sailing/servers/replica/stop
56
-```
57
-
58
-Wait for process to be stopped/killed.
59
-
60
-Start the correct tunnels script by executing:
61
-```
62
-sudo /usr/local/bin/tunnels-master
63
-```
64
-
65
-Start the master, make sure you are user ``sailing``!
66
-```
67
-/home/sailing/servers/master/start
68
-```
69
-
70
-Check sailing log:
71
-```
72
-tail -f /home/sailing/servers/master/logs/sailing0.log.0
73
-```
74
-
75
-## Hardware failure on secondary Lenovo P1 with the Sailing Analytics Replica
76
-
77
-### Scenario
78
-
79
-The secondary Lenovo P1 experiences an unrecoverable hardware failure. The primary P1 is still available, and new data is safely stored in the database.
80
-
81
-Cloud users are not able to see new data, as the replication channel is interrupted.
82
-
83
-For on-site/local users this depends on the decision which URL/IP they are provided with. If we use the local replica to serve them then in this scenario their screens will turn dark.
84
-
85
-The MongoDB replica set member will no longer be available.
86
-
87
-### Mitigation
88
-
89
-Local/on-site users need to have priority. If they were served by the secondary P1 before, they need to switch to the primary P1.
90
-
91
-The outbound replication channel on the primary P1 needs to switch to the RabbitMQ in ap-northeast-1. The cloud replicas need to be reconfigured to use that channel.
92
-
93
-SSH tunnels won't need to change.
94
-
95
-### Open questions
96
-
97
-How do local/on-site users use the Sailing Analytics? Will they simply be served from tokyo2020.sapsailing.com? A couple of decisions depend on this question.
98
-
99
-What exactly needs to be done, and where, to change the replication and be sure that it will work without data loss? I would suggest testing at least one of the described scenarios in Medemblik and creating a runbook.
100
-
101
-## Internet failure on Enoshima site
102
-
103
-### Scenario
104
-
105
-Internet connectivity is no longer available on site at Enoshima.
106
-
107
-### Open questions
108
-
109
-How will local/on-site users be connected to the local P1s, assuming that the LAN is still working?
110
-
111
-Would we try to provide connectivity through mobile hotspots, as autossh should reliably start working again once it reaches the target IPs? Shall we leave this issue to SwissTiming/organizers and stick to the local connections to the Sailing Analytics?
112
-
113
-## TracTrac in the Cloud?
114
-
115
-### Scenario
116
-
117
-On-site Internet goes down; does TracTrac have a full-fledged server running in the cloud that we could connect to from the cloud to at least keep serving the RHBs?
118
-
119
-### Open questions
120
-
121
-How can the MongoDB in the cloud be re-configured dynamically to become primary even though it may have been started with priority 0?
wiki/info/landscape/olympic-plan-for-paris-marseille-2024.md
... ...
@@ -1,149 +0,0 @@
1
-# Thoughts on Landscape Configuration for Paris 2024 / Marseille
2
-
3
-As a baseline we'll use the [Olympic Setup](/wiki/info/landscape/olympic-setup). The major change, though, would be that instead of running a local on-site master and a local on-site replica we would run two master instances locally on site where one is the "shadow" and the other one is the "production" master.
4
-
5
-We captured a set of scripts and configuration files in our Git repository at ``configuration/on-site-scripts``, in particular also separately for the two laptops, in ``configuration/on-site-scripts/sap-p1-1`` and ``configuration/on-site-scripts/sap-p1-2``.
6
-
7
-Many of these scripts and configuration files contain an explicit reference to the replica set name (and therefore sub-domain name, DB name, tag values, etc.) ``tokyo2020``. With the test event coming up in July 2023 and the Paris Olympic Summer Games 2024, we should consider making this a parameter of these scripts so it is easy to adjust. We will need different sub-domains for the test event and the Games, where the latter most likely will have ``paris2024.sapsailing.com`` as its domain name and hence ``paris2024`` as the replica set name.
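A minimal sketch of what such a parameterization could look like, assuming a hypothetical shared settings file and variable name (``event.conf`` and ``EVENT`` are illustrative, not the actual script contents):

```
# hypothetical configuration/on-site-scripts/event.conf
EVENT="paris2024"                        # replica set name, sub-domain, DB name, tag value
EVENT_DOMAIN="${EVENT}.sapsailing.com"

# scripts would then source this instead of hard-coding "tokyo2020", e.g.:
. "$(dirname "$0")/event.conf"
MONGODB_URI="mongodb://localhost:10201,localhost:10202,localhost:10203/?replicaSet=${EVENT}&retryWrites=true&readPreference=nearest"
```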
8
-
9
-## VPCs and VPC Peering
10
-
11
-From Tokyo2020 we still have the VPCs around in five regions (``eu-west-3``, ``us-west-1``, ``us-east-1``, ``ap-northeast-1``, and ``ap-southeast-2``). They were named ``Tokyo2020`` and our scripts currently depend on this. But VPCs can easily be renamed, and with that we may save a lot of work regarding re-peering those VPCs. We will, though need routes to the new "primary" VPC ``eu-west-3`` from everywhere because the ``paris-ssh.sapsailing.com`` jump host will be based there. Note the inconsistency in capitalization: for the VPC name and as part of instance names such as ``SL Tokyo2020 (Upgrade Replica)`` we use ``Tokyo2020``, for basically everything else it's ``tokyo2020`` (lowercase). When switching to a parameterized approach we should probably harmonize this and use the lowercase name consistently throughout.
12
-
13
-I've started with re-naming the VPCs and their routing tables from ``Tokyo2020`` to ``Paris2024``. I've also added VPC peering between Paris (``eu-west-3``) and California (``us-west-1``), Virginia (``us-east-1``), and Sydney (``ap-southeast-2``). The peering between Paris and Tokyo (``ap-northeast-1``) already existed because for Tokyo 2020, Paris hosted replicas that needed to access the jump host in the Tokyo region.
14
-
15
-I've also copied the "SAP Sailing Analytics 1.150" image to all five regions.
16
-
17
-## Master and Shadow Master
18
-
19
-We will use one laptop as production master, the other as "shadow master." The reason for not using a master and a local replica is that if the local master fails, re-starting later in the event can cause significant delays until all races have loaded and replicated again.
20
-
21
-Both laptops shall run their local RabbitMQ instance. Each of the two master processes can optionally write into its local RabbitMQ through an SSH tunnel, which may instead redirect to the cloud-based RabbitMQ while an Internet/Cloud connection is active.
22
-
23
-This will require setting up two MongoDB databases (not separate processes, just different DB names), e.g., "paris2024" and "paris2024-shadow". Note that for the shadow master this means that the DB name does not follow the typical naming convention where the ``SERVER_NAME`` property ("paris2024" for both the primary and the shadow master) is also used as the default MongoDB database name.
24
-
25
-Note: The shadow master must have at least one registered replica because otherwise it would not send any operations into the RabbitMQ replication channel. This can be a challenge for a shadow master that has never seen any replica. We could, for example, simulate a replica registration when the shadow master is still basically empty, using, e.g., a CURL request and then ignoring and later deleting the initial load queue on the local RabbitMQ.
26
-
27
-Furthermore, the shadow master must not send into the production RabbitMQ replication channel that is used by the production master instance while it is not in production itself, because it would duplicate the operations sent. Instead, the shadow master shall use a local RabbitMQ instance to which an SSH tunnel forwards.
28
-
29
-We will install a cron job that regularly performs a "compareServers" between production and shadow master. Any deviation shall be notified using the e-mail notification mechanism in place for all other alerts and monitoring activities, too.
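A sketch of such a cron entry, assuming a hypothetical wrapper script ``compare-production-and-shadow`` that fails when the compareServers check finds a deviation, and assuming ``notify-operators`` accepts the alert text as an argument:

```
# illustrative /etc/cron.d/compare-masters entry:
# every 15 minutes, compare production and shadow master; alert the operators on deviation
*/15 * * * * sailing /usr/local/bin/compare-production-and-shadow || /usr/local/bin/notify-operators "compareServers: production and shadow master deviate"
```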
30
-
31
-## Cloud RabbitMQ
32
-
33
-Instead of ``rabbit-ap-northeast-1.sapsailing.com`` we will use ``rabbit-eu-west-3.sapsailing.com`` pointing to the internal IP address of the RabbitMQ installation in ``eu-west-3`` that is used as the default for the on-site master processes as well as for all cloud replicas.
34
-
35
-## ALB and Target Group Set-Up
36
-
37
-Like for Tokyo2020, a separate ALB for the Paris2024 event will be set up in each of the regions supported. They will all be registered with the Global Accelerator to whose anycast IP addresses the DNS alias record for ``paris2024.sapsailing.com`` will point. Different from Tokyo2020, where we used a static "404 - Not Found" rule as the default rule for all of these ALBs, we can and should use an IP-based target group for the default rule's forwarding and should register the ``eu-west-1`` "Webserver" (Central Reverse Proxy)'s internal IP address in these target groups. This way, when archiving the event, cached DNS records can still resolve to the Global Accelerator, from there to the ALB(s), and from there, via these default rules, back to the central reverse proxy which then should know where to find the ``paris2024.sapsailing.com`` content in the archive.
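For illustration, creating and populating such an IP-based default target group could look roughly like this (target group name, VPC ID, and IP address are placeholders):

```
# create an IP-type target group in the event region
aws elbv2 create-target-group --region eu-west-3 \
    --name paris2024-default --protocol HTTP --port 80 \
    --target-type ip --vpc-id vpc-0123456789abcdef0

# register the eu-west-1 central reverse proxy's internal IP address;
# "AvailabilityZone=all" is needed for IP targets outside the target group's VPC
aws elbv2 register-targets --region eu-west-3 \
    --target-group-arn <target-group-arn> \
    --targets Id=10.0.0.10,Port=80,AvailabilityZone=all
```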
38
-
39
-Target group naming conventions have changed slightly since Tokyo2020: instead of ``S-ded-tokyo2020`` we will use only ``S-paris2024`` for the public target group containing all the cloud replicas.
40
-
41
-## Cloud Replica Set-Up
42
-
43
-Based on the cloud replica set-up for Tokyo2020 we can derive the following user data for Paris2024 cloud replicas:
44
-
45
-```
46
-INSTALL_FROM_RELEASE=build-.............
47
-SERVER_NAME=paris2024
48
-MONGODB_URI="mongodb://localhost/paris2024-replica?replicaSet=replica&retryWrites=true&readPreference=nearest"
49
-USE_ENVIRONMENT=live-replica-server
50
-REPLICATION_CHANNEL=paris2024-replica
51
-REPLICATION_HOST=rabbit-eu-west-3.sapsailing.com
52
-REPLICATE_MASTER_SERVLET_HOST=paris-ssh.internal.sapsailing.com
53
-REPLICATE_MASTER_SERVLET_PORT=8888
54
-REPLICATE_MASTER_EXCHANGE_NAME=paris2024
55
-REPLICATE_MASTER_QUEUE_HOST=rabbit-eu-west-3.sapsailing.com
56
-REPLICATE_MASTER_BEARER_TOKEN="***"
57
-ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
58
-```
59
-
60
-Make sure to align the ``INSTALL_FROM_RELEASE`` parameter to match up with the release used on site.
61
-
62
-## SSH Tunnels
63
-
64
-The baseline is again the Tokyo 2020 set-up, apart from the jump host's re-naming from ``tokyo-ssh.sapsailing.com`` to ``paris-ssh.sapsailing.com``. The tunnel scripts for ``sap-p1-2`` that assume ``sap-p1-2`` is (primary) master seem to be faulty: at least, they don't establish a reverse port forward for port 8888 which, however, seems necessary to let cloud replicas reach the on-site master. ``sap-p1-2`` becoming (primary) on-site master means that ``sap-p1-1`` has failed. This can be a problem with the application process but could even be a hardware issue where the entire machine has crashed and has become unavailable. Therefore, ``sap-p1-2`` must take over at least the application and become primary master, and this requires a reverse port forward like this: ``-R '*:8888:localhost:8888'`` (see the sketch after the port list below).
65
-
66
-The ports and their semantics:
67
-
68
-* 443: HTTPS port of security-service.sapsailing.com (or its local replacement through NGINX)
69
-* 5673: Outbound RabbitMQ to use by on-site master (regularly to RabbitMQ in eu-west-3, local replacement as fallback)
70
-* 5675: Inbound RabbitMQ (rabbit.internal.sapsailing.com) for replication from security-service.sapsailing.com (or local replacement)
71
-* 9443: NGINX HTTP port on sap-p1-1 (also reverse-forwarded from paris-ssh.sapsailing.com)
72
-* 9444: NGINX HTTP port on sap-p1-2 (also reverse-forwarded from paris-ssh.sapsailing.com)
73
-* 10201: MongoDB on sap-p1-1
74
-* 10202: MongoDB on sap-p1-2
75
-* 10203: MongoDB on paris-ssh.sapsailing.com
76
-* 15673: HTTP to RabbitMQ administration UI of the RabbitMQ server reached on port 5673
77
-* 15675: HTTP to RabbitMQ administration UI of the RabbitMQ server reached on port 5675
78
-* 22222: SSH access to sapsailing.com:22, e.g., for Git access through ``ssh://trac@localhost:22222/home/trac/git``
79
-* 22443: HTTPS access to sapsailing.com:443, e.g., for trying to download a release, although chances are slim this works without local ``/etc/hosts`` magic, e.g., for ``releases.sapsailing.com``
80
-
81
-``/etc/hosts`` must map ``security-service.sapsailing.com`` to ``localhost`` so that local port 443 can be forwarded to different targets based on needs.
82
-
83
-### Regular Operations
84
-
85
-* Three MongoDB nodes form the ``paris2024`` replica set: ``sap-p1-1:10201``, ``sap-p1-2:10202``, and ``paris-ssh.sapsailing.com:10203``, where SSH tunnels forward ports 10201..10203 such that everywhere on the three hosts involved the replica set can be addressed as ``mongodb://localhost:10201,localhost:10202,localhost:10203/?replicaSet=paris2024&retryWrites=true&readPreference=nearest``
86
-* ``sap-p1-1`` runs the ``paris2024`` production master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024``, replicating from ``security-service.sapsailing.com`` through SSH tunnel from local port 443 pointing to ``security-service.sapsailing.com`` (which actually forwards to the ALB hosting the rules for ``security-service.sapsailing.com`` and RabbitMQ ``rabbit.internal.sapsailing.com`` tunneled through port 5675, with the RabbitMQ admin UI tunneled through port 15675; outbound replication goes to local port 5673 which tunnels to ``rabbit-eu-west-3.sapsailing.com`` whose admin UI is reached through port 15673 which tunnels to ``rabbit-eu-west-3.sapsailing.com:15672``
87
-* ``sap-p1-2`` runs the ``paris2024`` shadow master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024-shadow``, replicating from ``security-service.sapsailing.com`` through SSH tunnel from local port 443 pointing to ``security-service.sapsailing.com`` (which actually forwards to the ALB hosting the rules for ``security-service.sapsailing.com`` and RabbitMQ ``rabbit.internal.sapsailing.com`` tunneled through port 5675, with the RabbitMQ admin UI tunneled through port 15675; outbound replication goes to local port 5673 which tunnels to the RabbitMQ running locally on ``sap-p1-2``, port 5672 whose admin UI is then reached through port 15673 which tunnels to ``sap-p1-2:15672``
88
-* The database ``mongodb://mongo0.internal.sapsailing.com,mongo1.internal.sapsailing.com/security_service?replicaSet=live`` is backed up on a regular basis (nightly) to the local MongoDB replica set ``paris2024`` DB named ``security_service`` which makes it visible especially in the two MongoDB replicas running on ``sap-p1-1`` and ``sap-p1-2``
89
-
90
-### Production Master Failure
91
-
92
-Situation: production master fails, e.g., because of a Java VM crash or a deadlock or user issues such as killing the wrong process...
93
-
94
-Approach: Switch to previous shadow master on ``sap-p1-2``, re-configuring all SSH tunnels accordingly; this includes the 8888 reverse forward from the cloud to the local on-site master, as well as the RabbitMQ forward which needs to switch from the local RabbitMQ running on the shadow master's host to the cloud-based RabbitMQ. Clients such as SwissTiming clients need to switch to the shadow master. To remedy gaps in replication due to the SSH tunnel switch we may want to circulate the replica instances, rolling over to a new set of replicas that fetch a new initial load. If ``sap-p1-1``'s operating system is still alive, its SSH tunnel especially for port 8888 reverse forwarding from ``paris-ssh.sapsailing.com`` must be terminated because otherwise ``sap-p1-2`` may not be able to establish its according reverse forward of port 8888.
95
-
96
-Here are the major changes:
97
-
98
-* ``sap-p1-2`` runs the ``paris2024`` shadow master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024-shadow``, replicating from ``security-service.sapsailing.com`` through SSH tunnel from local port 443 pointing to ``security-service.sapsailing.com`` (which actually forwards to the ALB hosting the rules for ``security-service.sapsailing.com`` and RabbitMQ ``rabbit.internal.sapsailing.com`` tunneled through port 5675, with the RabbitMQ admin UI tunneled through port 15675; *outbound replication goes to local port 5673 which tunnels to* ``rabbit-eu-west-3.sapsailing.com`` *whose admin UI is reached through port 15673 which tunnels to* ``rabbit-eu-west-3.sapsailing.com:15672``
99
-
100
-### Internet Failure
101
-
102
-While cloud replicas and hence the ALBs and Global Accelerator will remain reachable with the latest data snapshot at the time the connection is lost, we will then lose the following capabilities:
103
-
104
-* replicate the official ``security-service.sapsailing.com`` service, both, from an HTTP as well as a RabbitMQ perspective; ``rabbit.internal.sapsailing.com`` will then no longer be reachable from the on-site network
105
-* keep the cloud MongoDB instance on ``paris-ssh.sapsailing.com`` synchronized; it will fall behind
106
-* outbound replication to ``rabbit-eu-west-3.sapsailing.com`` and from there on to the cloud replicas in all regions supported will stop
107
-* inbound "reverse" replication from the cloud replicas to the on-site master through the reverse forward of ``paris-ssh.sapsailing.com:8888`` will stop working; the cloud replicas will start buffering the operations to send to their master and will keep re-trying in growing time intervals
108
-
109
-To recover with as little disruption as possible, switching to a local copy of the ``security-service`` and to a local RabbitMQ for "outbound" replication is required. Of course, no replicas will be listening on that local RabbitMQ, but in order to not stop working, the application server will need a RabbitMQ that can be reached on the outbound port 5673. This is achieved by switching the SSH tunnel such that port 5673 will then forward to a RabbitMQ running locally.
110
-
111
-We will then start ``sap-p1-1:/home/sailing/servers/security_service`` on port 8889 which will connect to the local MongoDB replica set still consisting of the two on-site nodes, using the database ``security_service`` that has been obtained as a copy of the ``live`` MongoDB replica set in our default region. This local security service uses the local RabbitMQ running on the same host for its outbound replication. On both on-site laptops the port 443 then needs to forward to the NGINX instance running locally as a reverse proxy for the local security service. On ``sap-p1-1`` this is port 9443, on ``sap-p1-2`` this is port 9444. Furthermore, the port forward from port 5675 and 15675 on both laptops then must point to the local RabbitMQ used outbound by the security service running locally. This will usually be the RabbitMQ running on ``sap-p1-1``, so ``sap-p1-1:5672``, or ``sap-p1-1:15672``, respectively, for the admin port.
112
-
113
-This makes for the following set-up:
114
-
115
-* Only two MongoDB nodes remain available on site from the ``paris2024`` replica set: ``sap-p1-1:10201`` and ``sap-p1-2:10202``, where SSH tunnels forward ports 10201..10203 such that everywhere on the three hosts involved the replica set can be addressed as ``mongodb://localhost:10201,localhost:10202,localhost:10203/?replicaSet=paris2024&retryWrites=true&readPreference=nearest``
116
-* ``sap-p1-1`` runs the ``paris2024`` production master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024``, replicating from ``security-service.sapsailing.com`` through SSH tunnel from local port 443 pointing to ``sap-p1-1:9443`` which is the port of the local NGINX acting as an SSL-offloading reverse proxy for the security service running locally on port 8889; port 5675 forwards to ``sap-p1-1:5672`` where the local RabbitMQ runs, with the local ``sap-p1-1`` RabbitMQ admin UI tunneled through port 15675; outbound replication goes to local port 5673 which then also tunnels to the local RabbitMQ on ``sap-p1-1:5672``, whose admin UI is reached through port 15673 which tunnels to ``sap-p1-1:15672``
117
-* ``sap-p1-2`` runs the ``paris2024`` shadow master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024-shadow``, replicating from ``security-service.sapsailing.com`` through SSH tunnel from local port 443 pointing to ``sap-p1-1:9443`` which is the reverse proxy for the security service running on ``sap-p1-1:8889``, and RabbitMQ tunneled through port 5675 to ``sap-p1-1:5672``, with the RabbitMQ admin UI tunneled through port 15675 to ``sap-p1-1:15672``; outbound replication still goes to local port 5673 which tunnels to the RabbitMQ running locally on ``sap-p1-2``, port 5672 whose admin UI is then reached through port 15673 which tunnels to ``sap-p1-2:15672`` which keeps the shadow master's outbound replication from interfering with the production master's outbound replication.
118
-
119
-### Internet Failure Using Shadow Master
120
-
121
-
122
-
123
-## Test Plan for Test Event Marseille July 2023
124
-
125
-### Test Internet Failure
126
-
127
-We shall emulate the lack of a working Internet connection and practice and test the procedures for switching to a local security-service.sapsailing.com installation as well as a local RabbitMQ standing in for the RabbitMQ deployed in the cloud.
128
-
129
-### Test Primary Master Hardware Failure
130
-
131
-This will require switching entirely to the shadow master. Depending on the state of the reverse port forward of the 8888 HTTP port from the cloud we may or may not have to try to terminate a hanging connection in order to be able to establish a new reverse port forward pointing from the cloud to the shadow master. The shadow master also then needs to use the cloud-based RabbitMQ instead of its local one. As a fine-tuning, we can practice the rolling re-sync of all cloud replicas which will likely have missed operations in the meantime.
132
-
133
-### Test Primary Master Java VM Failure
134
-
135
-This can be caused by a deadlock, a VM crash, a Full GC phase, massive performance degradation, or other faulty behavior. We then need to actively close the reverse SSH port forward from the cloud to the production master's 8888 HTTP port and, as a precaution, switch the RabbitMQ tunnel from the cloud-based to the local RabbitMQ instance, so that in case the production master "wakes up" again, e.g., after a Full GC, it does not start to interfere with the now active shadow master on the RabbitMQ fan-out exchange. On the shadow master we need to re-configure the SSH tunnels, particularly to target the cloud-based RabbitMQ and to have the reverse port forward on port 8888 now target the shadow master on site.
136
-
137
-### Test Primary Master Failures with no Internet Connection
138
-
139
-Combine the above scenarios: a failing production master (hardware or VM-only) will require different tunnel re-configurations, especially regarding the then local security-service.sapsailing.com environment which may need to move to the shadow laptop.
140
-
141
-## TODO Before / During On-Site Set-Up (Both, Test Event and OSG2024)
142
-
143
-* Set up Global Accelerator and have the already established DNS record ``paris2024.sapsailing.com`` (placeholder that points to the Dynamic ALB in the default region ``eu-west-1`` to effectively forward to the central reverse proxy and ultimately the archive server's landing page) become an alias pointing to this Global Accelerator
144
-* Set up logging buckets for ALBs in all supported regions
145
-* Set up ALBs in all supported regions, define their three rules (redirect for ``paris2024.sapsailing.com/`` path; forward to public target group for all other ``paris2024.sapsailing.com`` traffic; default rule forwarding to IP-based target group containing the ``eu-west-1`` central reverse proxy) and register them with the Global Accelerator
146
-* Add SSH public keys for password-less private keys of ``sap-p1-1`` and ``sap-p1-2`` to ``ec2-user@paris-ssh.sapsailing.com:.ssh/authorized_keys.org`` so that when the authorized_keys file is updated automatically, the on-site keys are still preserved.
147
-* Create LetsEncrypt certificates for the NGINX installations for paris2024.sapsailing.com and security-service.sapsailing.com and install to the two on-site laptops' NGINX environments
148
-* Ensure the MongoDB installations on both laptops use the ``paris2024`` replica set
149
-* Adjust Athena queries to include all ALB logging buckets from all regions
... ...
\ No newline at end of file
wiki/info/landscape/olympic-setup.md
... ...
@@ -1,727 +0,0 @@
1
-# Setup for the Olympic Summer Games 2020/2021 Tokyo
2
-
3
-[[_TOC_]]
4
-
5
-## Local Installation
6
-
7
-For the Olympic Summer Games 2020/2021 Tokyo we use a dedicated hardware set-up to accommodate the requirements on site. In particular, two Lenovo P1 laptops with equal hardware configuration (32GB RAM, Intel Core i9-9880H) will be established as server devices running various services in a way that lets us tolerate failures of either of the two devices with minimal downtime.
8
-
9
-### Installation Packages
10
-
11
-The two laptops run Mint Linux with a fairly modern 5.4 kernel. We keep both up to date with regular ``apt-get update && apt-get upgrade`` executions. Both have an up-to-date SAP JVM 8 (see [https://tools.hana.ondemand.com/#cloud](https://tools.hana.ondemand.com/#cloud)) installed under /opt/sapjvm_8. This is the runtime VM used to run the Java application server process.
12
-
13
-Furthermore, both laptops have a MongoDB 3.6 installation configured through ``/etc/apt/sources.list.d/mongodb-org-3.6.list`` containing the line ``deb http://repo.mongodb.org/apt/debian jessie/mongodb-org/3.6 main``. Their respective configuration can be found under ``/etc/mongod.conf``. The WiredTiger storage engine cache size should be limited; currently, the ``cacheSizeGB`` entry in ``/etc/mongod.conf`` shown in the Mongo Configuration section below does this.
14
-
15
-RabbitMQ is part of the distribution natively, in version 3.6.10-1. It runs on both laptops. Both, RabbitMQ and MongoDB are installed as systemd service units and are launched during the boot sequence. The latest GWT version (currently 2.9.0) is installed under ``/opt/gwt-2.9.0`` in case any development work would need to be done on these machines.
16
-
17
-Both machines have been configured to use 2GB of swap space at ``/swapfile``.
18
-
19
-### Mongo Configuration
20
-
21
-On both laptops, ``/etc/mongod.conf`` configures ``/var/lib/mongodb`` to be the storage directory and the in-memory cache size to be 2GB:
22
-
23
-```
24
-storage:
25
- dbPath: /var/lib/mongodb
26
- journal:
27
- enabled: true
28
- wiredTiger:
29
- engineConfig:
30
- cacheSizeGB: 2
31
-```
32
-
33
-The port is set to ``10201`` on ``sap-p1-1``:
34
-
35
-```
36
-# network interfaces
37
-net:
38
- port: 10201
39
- bindIp: 0.0.0.0
40
-```
41
-
42
-and to ``10202`` on ``sap-p1-2``:
43
-
44
-```
45
-# network interfaces
46
-net:
47
- port: 10202
48
- bindIp: 0.0.0.0
49
-```
50
-
51
-Furthermore, the replica set is configured to be ``tokyo2020`` on both:
52
-
53
-```
54
-replication:
55
- oplogSizeMB: 10000
56
- replSetName: tokyo2020
57
-```
58
-
59
-On both laptops we have a second MongoDB configuration ``/etc/mongod-security-service.conf`` which is used by the ``/lib/systemd/system/mongod-security-service.service`` unit which we created as a copy of ``/lib/systemd/system/mongod.service`` and adjusted the line
60
-
61
-```
62
-ExecStart=/usr/bin/mongod --config /etc/mongod-security-service.conf
63
-```
64
-
65
-This second database runs as a replica set ``security_service`` on the default port 27017 and is used as the target for a backup script for the ``security_service`` database. See below. We increased the priority of the ``sap-p1-1`` node from 1 to 2.
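Such a priority change can be applied with a replica set reconfiguration; a sketch, assuming ``sap-p1-1`` is ``members[0]`` of the configuration (verify with ``rs.conf()`` first) and that the command is run against the current primary:

```
# the security_service replica set runs on the default port 27017
mongo --port 27017 --eval '
  cfg = rs.conf();
  cfg.members[0].priority = 2;  // assumed to be the sap-p1-1 member
  rs.reconfig(cfg);
'
```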
66
-
67
-### User Accounts
68
-
69
-The essential user account on both laptops is ``sailing``. The account is intended to be used for running the Java VM that executes the SAP Sailing Analytics server software. The account is currently still protected by a password that our on-site team should know. On both laptops the ``sailing`` account has a password-less SSH key installed under ``/home/sailing/.ssh`` whose public key is contained in the ``authorized_keys`` file of ``tokyo-ssh.sapsailing.com`` as well as of the respective other P1 laptop. This way, all tunnels can easily be created once logged on to this ``sailing`` account.
70
-
71
-There are also still two personal accounts ``uhl`` and ``tim`` and an Eclipse development environment under ``/usr/local/eclipse``.
72
-
73
-### Hostnames
74
-
75
-DNS is available on site on the gateway host ``10.1.0.6``. This is essential for resolving ``www.igtimi.com``, the AWS SES SMTP server at ``email-smtp.eu-west-1.amazonaws.com`` and all e-mail address's domains for sendmail's domain verification. The DNS server is set for both, ``sap-p1-1`` and ``sap-p1-2``. It can be set from the command line using ``nmcli connection modify Wired\ connection\ 2 ipv4.dns "10.1.0.6"; nmcli connection down Wired\ connection\ 2; nmcli connection up Wired\ connection\ 2``. Currently, when testing in the SAP facilities with the SAP Guest WiFi, possibly changing IP addresses have to be updated in ``/etc/hosts``.
76
-
77
-The domain name has been set to ``sapsailing.com`` so that the fully-qualified host names are ``sap-p1-1.sapsailing.com`` and ``sap-p1-2.sapsailing.com`` respectively. Using this domain name is helpful later when it comes to the shared security realm established with the central ``security-service.sapsailing.com`` replica set.
78
-
79
-The hostname ``www.sapsailing.com`` is required by master instances when connected to the Internet in order to download polar data and wind estimation data from the archive server. Since direct access to ``www.sapsailing.com`` is blocked, we run this through the SSH tunnel to our jump host; in order to have matching certificates and appropriate hostname-based routing in the cloud for requests to ``www.sapsailing.com`` we alias this hostname in ``/etc/hosts`` to ``127.0.0.1`` (localhost).
80
-
81
-### IP Addresses and VPN
82
-
83
-Here are the IP addresses as indicated by SwissTiming:
84
-
85
-```
86
-Host Internal IP VPN IP
87
------------------------------------------------------------------------------------------
88
-TracTrac A (Linux) 10.1.1.104 10.8.0.128 STSP-SAL_client28
89
-TracTrac B (Linux) 10.1.1.105 10.8.0.129 STSP-SAL_client29
90
-SAP Analytics 1 Server A (Linux) 10.1.3.195 10.8.0.130 STSP-SAL_client30
91
-SAP Analytics 2 Server B (Linux) 10.1.3.197 10.8.0.131
92
-SAP Client Jan (Windows) 10.1.3.220 10.8.0.132
93
-SAP Client Alexandro (Windows) 10.1.3.221 10.8.0.133
94
-SAP Client Axel (Windows) 10.1.3.227 10.8.0.134
95
-TracTrac Dev Jorge (Linux) 10.1.3.228 10.8.0.135
96
-TracTrac Dev Chris (Linux) 10.1.3.233 10.8.0.136
97
-```
98
-
99
-The OpenVPN connection is set up with the GUI of the Linux Desktop. Therefore the management is done through Network Manager. Network Manager has a CLI, ``nmcli``. With that more properties of connections can be modified. The ``connection.secondaries`` property defines the UUID of a connection that will be established as soon as the initial connection is working. With ``nmcli connection show`` you will get the list of connections with the corresponding UUIDs. For the Medemblik Event the OpenVPN connection to the A server is bound to the wired interface that is used with
100
-
101
-```
102
-sudo nmcli connection modify <Wired Connection 2> +connection.secondaries <UUID-of-OpenVPN-A>
103
-```
104
-
105
-For the OpenVPN connections we have received two alternative configuration files together with keys and certificates for our server and work laptops, as well as the certificates for the OpenVPN server (``ca.crt``, ``dh.pem``, ``pfs.key``). The "A" configuration, e.g., provided in a file named ``st-soft-aws_A.ovpn``, looks like this:
106
-
107
-```
108
-client
109
-dev tun
110
-proto udp
111
-remote 3.122.96.235 1195
112
-ca ca.crt
113
-cert {name-of-the-certificate}.crt
114
-key {name-of-the-key}.key
115
-tls-version-min 1.2
116
-tls-cipher TLS-ECDHE-RSA-WITH-AES-128-GCM-SHA256:TLS-ECDHE-ECDSA-WITH-AES-128-GCM-SHA256:TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384:TLS-DHE-RSA-WITH-AES-256-CBC-SHA256
117
-cipher AES-256-CBC
118
-auth SHA512
119
-resolv-retry infinite
120
-auth-retry none
121
-nobind
122
-persist-key
123
-persist-tun
124
-ns-cert-type server
125
-comp-lzo
126
-verb 3
127
-tls-client
128
-tls-auth pfs.key
129
-```
130
-
131
-Here, ``{name-of-the-certificate}.crt`` and ``{name-of-the-key}.key`` need to be replaced by the names of the files corresponding with the host to connect to the OpenVPN. The "B" configuration only differs in the ``remote`` specification, using a different IP address for the OpenVPN server, namely ``52.59.130.167``. It is useful to copy the ``.ovpn`` file and the other ``.key`` and ``.crt`` files into one directory.
132
-
133
-Under Windows download the latest OpenVPN client from [https://openvpn.net/client-connect-vpn-for-windows/](https://openvpn.net/client-connect-vpn-for-windows/). After installation, use the ``.ovpn`` file, adjusted with your personalized key/certificate, to establish the connection.
134
-
135
-On Linux, go to the global settings through Gnome, node "Network" and press the "+" button next to VPN. Import the ``.ovpn`` file, then enable the OpenVPN connection by flicking the switch. The connection will show in the output of
136
-
137
-```
138
- nmcli connection show
139
-```
140
-
141
-The connection IDs will be shown, e.g., ``st-soft-aws_A``. Such a connection can be stopped and restarted from the command line using the following commands:
142
-
143
-```
144
- nmcli connection down st-soft-aws_A
145
- nmcli connection up st-soft-aws_A
146
-```
147
-
148
-### Tunnels
149
-
150
-On both laptops there is a script ``/usr/local/bin/tunnels`` which establishes SSH tunnels using the ``autossh`` tool. The ``autossh`` processes are forked into the background using the ``-f`` option. It seems important to then pass the port to use for sending heartbeats using the ``-M`` option. If this is omitted, in my experience only one of several ``autossh`` processes survives. However, we have also learned that using the ``-M`` option together with the "port" ``0`` can help to stabilize the connection: if ``-M`` is used with a real port, port collisions may result, and when re-connecting, the release of those heartbeat ports can become an issue; with ``-M 0`` it cannot. The ``-M 0`` option is particularly helpful when tunnelling to ``sapsailing.com`` which is provided through a network load balancer (NLB).
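A minimal example of the pattern described above (illustrative, not the literal content of the ``tunnels`` script):

```
# forked into the background with -f; -M 0 disables the separate heartbeat port and
# relies on ssh's own keep-alive settings (see the ssh_config/sshd_config tweaks below)
autossh -M 0 -f -N -L 22443:sapsailing.com:443 ec2-user@tokyo-ssh.sapsailing.com
```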
151
-
152
-During regular operations we assume that we have an Internet connection that allows us to reach our jump host ``tokyo-ssh.sapsailing.com`` through SSH, establishing various port forwards. We also expect TracTrac to have their primary server available. Furthermore, we assume both our laptops to be in service. ``sap-p1-1`` then runs the master server instance, ``sap-p1-2`` runs a local replica. The master on ``sap-p1-1`` replicates the central security service at ``security-service.sapsailing.com`` using the RabbitMQ installation on ``rabbit.internal.sapsailing.com`` in the AWS region `eu-west-1`. The port forwarding through `tokyo-ssh.sapsailing.com` (in `ap-northeast-1`) to the internal RabbitMQ address (in eu-west-1) works through VPC peering. The RabbitMQ instance used for outbound replication, both, into the cloud and for the on-site replica, is `rabbit-ap-northeast-1.sapsailing.com`. The replica on ``sap-p1-2`` obtains its replication stream from there, and for the HTTP connection for "reverse replication" it uses a direct connection to ``sap-p1-1``. The outside world, in particular all "S-ded-tokyo2020-m" master security groups in all regions supported, access the on-site master through a reverse port forward on our jump host ``tokyo-ssh.sapsailing.com:8888`` which under regular operations points to ``sap-p1-1:8888`` where the master process runs.
153
-
154
-On both laptops we establish a port forward from ``localhost:22443`` to ``sapsailing.com:443``. Together with the alias in ``/etc/hosts`` that aliases ``www.sapsailing.com`` to ``localhost``, requests to ``www.sapsailing.com:22443`` will end up on the archive server.
155
-
156
-On both laptops, we maintain SSH connections to ``localhost`` with port forwards to the current TracTrac production server for HTTP, live data, and stored data. In the test we did on 2021-05-25, those port numbers were 9081, 14001, and 14011, respectively, for the primary server, and 9082, 14002, and 14012, respectively, for the secondary server. In addition to these port forwards, an entry in ``/etc/hosts`` is required for the hostname that TracTrac will use on site for their server(s), pointing to ``127.0.0.1`` to let the Sailing Analytics process connect to localhost with the port forwards. Tests have shown that if the port forwards are changed during live operations, e.g., to point to the secondary instead of the primary TracTrac server, the TracAPI continues smoothly, which is a great way of handling such a fail-over process without necessarily having to re-start our master server or reconnect to all live races.
157
-
158
-Furthermore, for administrative SSH access from outside, we establish reverse port forwards from our jump host ``tokyo-ssh.sapsailing.com`` to the SSH ports on ``sap-p1-1`` (on port 18122) and ``sap-p1-2`` (on port 18222).
159
-
160
-Both laptops have a forward from ``localhost:22222`` to ``sapsailing.com:22`` through ``tokyo-ssh.sapsailing.com``, in order to be able to have a git remote ``ssh`` with the url ``ssh://trac@localhost:22222/home/trac/git``.
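With that forward in place, the remote can be added like this (the remote name is arbitrary):

```
git remote add onsite ssh://trac@localhost:22222/home/trac/git
git fetch onsite
```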
161
-
162
-The port forwards vary for exceptional situations, such as when the Internet connection is not available, or when ``sap-p1-1`` that regularly runs the master process fails and we need to make ``sap-p1-2`` the new master. See below for the details of the configurations for those scenarios.
163
-
164
-The tunnel configurations are established and configured using a set of scripts, each to be found under ``/usr/local/bin`` on each of the two laptops.
165
-
166
-#### ssh_config and sshd_config tweaks
167
-
168
-In order to recover quickly from failures we changed ``/etc/ssh/ssh_config`` on both of the P1s and added the following parameters:
169
-```
170
-ExitOnForwardFailure yes
171
-ConnectTimeout 10
172
-ServerAliveCountMax 3
173
-ServerAliveInterval 10
174
-```
175
-On the server side, on tokyo-ssh and on both P1s, the following parameters have been added to ``/etc/ssh/sshd_config``:
176
-```
177
-ClientAliveInterval 3
178
-ClientAliveCountMax 3
179
-```
180
-
181
-ExitOnForwardFailure forces ssh to exit if one of the port forwards fails. ConnectTimeout manages the time in seconds until an initial connection attempt fails. The AliveInterval settings (client and server) manage the interval in seconds after which ssh/sshd send client and server alive probes. The CountMax settings are the number of retries for those probes.
182
-
183
-The settings have been verified by executing a network change on both laptops; the ssh tunnel returns after a couple of seconds.
184
-
185
-#### Regular Operations: master on sap-p1-1, replica on sap-p1-2, with Internet / Cloud connection
186
-
187
-On sap-p1-1 two SSH connections are maintained, with the following default port forwards, assuming sap-p1-1 is the local master:
188
-
189
-* tokyo-ssh.sapsailing.com: 10203-->10203; 5763-->rabbit-ap-northeast-1.sapsailing.com:5762; 15763-->rabbit-ap-northeast-1.sapsailing.com:15672; 5675:rabbit.internal.sapsailing.com:5672; 15675:rabbit.internal.sapsailing.com:15672; 10201<--10201; 18122<--22; 443:security-service.sapsailing.com:443; 8888<--8888; 9443<--9443
190
-* sap-p1-2: 10202-->10202; 10201<--10201
191
-
192
-On sap-p1-2, the following SSH connections are maintained, assuming sap-p1-2 is the local replica:
193
-
194
-- tokyo-ssh.sapsailing.com: 10203-->10203; 5763-->rabbit-ap-northeast-1.sapsailing.com:5762; 15763-->rabbit-ap-northeast-1.sapsailing.com; 5675:rabbit.internal.sapsailing.com:5672; 15675:rabbit.internal.sapsailing.com:15672; 10202<--10202; 9444<--9443
195
-
196
-A useful set of entries in your personal ``~/.ssh/config`` file for "off-site" use may look like this:
197
-
198
-```
199
-Host tokyo
200
- Hostname tokyo-ssh.sapsailing.com
201
- User ec2-user
202
- ForwardAgent yes
203
- ForwardX11Trusted yes
204
- LocalForward 18122 localhost:18122
205
- LocalForward 18222 localhost:18222
206
- LocalForward 9443 localhost:9443
207
- LocalForward 9444 localhost:9444
208
-
209
-Host sap-p1-1
210
- Hostname localhost
211
- Port 18122
212
- User sailing
213
- ForwardAgent yes
214
- ForwardX11Trusted yes
215
-
216
-Host sap-p1-2
217
- Hostname localhost
218
- Port 18222
219
- User sailing
220
- ForwardAgent yes
221
- ForwardX11Trusted yes
222
-```
223
-
224
-It will allow you to log on to the "jump host" ``tokyo-ssh.sapsailing.com`` with the simple command ``ssh tokyo`` and will establish the port forwards that will then allow you to connect to the two laptops using ``ssh sap-p1-1`` and ``ssh sap-p1-2``, respectively. Of course, when on site and with the two laptops in direct reach you may adjust the host entries for ``sap-p1-1`` and ``sap-p1-2`` accordingly, and you may then wish to establish only an SSH connection to ``sap-p1-1`` which then does the port forwards for HTTPS ports 9443/9444. This could look like this:
225
-
226
-```
227
-Host sap-p1-1
228
- Hostname 10.1.3.195
229
- Port 22
230
- User sailing
231
- ForwardAgent yes
232
- ForwardX11Trusted yes
233
- LocalForward 9443 localhost:9443
234
- LocalForward 9444 10.1.3.197:9443
235
-
236
-Host sap-p1-2
237
- Hostname 10.1.3.197
238
- Port 22
239
- User sailing
240
- ForwardAgent yes
241
- ForwardX11Trusted yes
242
-```
243
-
244
-#### Operations with sap-p1-1 failing: master on sap-p1-2, with Internet / Cloud connection
245
-
246
-On sap-p1-1, if the operating system still runs and the failure affects only the Java process running the SAP Sailing Analytics, two SSH connections are maintained, with the following default port forwards, assuming sap-p1-1 is not running an SAP Sailing Analytics process currently:
247
-
248
-* tokyo-ssh.sapsailing.com: 10203-->10203; 5763-->rabbit-ap-northeast-1.sapsailing.com:5762; 15763-->rabbit-ap-northeast-1.sapsailing.com:15672; 5675:rabbit.internal.sapsailing.com:5672; 15675:rabbit.internal.sapsailing.com:15672; 10201<--10201; 18122<--22; 443:security-service.sapsailing.com:443
249
-* sap-p1-2: 10202-->10202; 10201<--10201
250
-
251
-On sap-p1-2 two SSH connections are maintained, with the following default port forwards, assuming sap-p1-2 is the local master:
252
-
253
-* tokyo-ssh.sapsailing.com: 10203-->10203; 5763-->rabbit-ap-northeast-1.sapsailing.com:5762; 15763-->rabbit-ap-northeast-1.sapsailing.com:15672; 5675:rabbit.internal.sapsailing.com:5672; 15675:rabbit.internal.sapsailing.com:15672; 10202<--10202; 18222<--22; 443:security-service.sapsailing.com:443; 8888<--8888
254
-* sap-p1-1 (if the operating system on sap-p1-1 still runs): 10202-->10202; 10201<--10201
255
-
256
-So the essential change is that the reverse forward from ``tokyo-ssh.sapsailing.com:8888`` now targets ``sap-p1-2:8888`` where we now assume the failover master to be running.
257
-
258
-#### Operations with Internet failing
259
-
260
-When the Internet connection fails, replicating the security service from ``security-service.sapsailing.com`` / ``rabbit.internal.sapsailing.com`` will no longer be possible. Neither will outbound replication to ``rabbit-ap-northeast-1.sapsailing.com`` be possible, and cloud replicas won't be able to reach the on-site master anymore through the ``tokyo-ssh.sapsailing.com:8888`` reverse port forward. This also has an effect on the local on-site replica which no longer will be able to reach ``rabbit-ap-northeast-1.sapsailing.com`` which provides the on-site replica with the operation stream under regular circumstances.
261
-
262
-There is little we can do against the lack of Internet connection regarding providing data to the cloud replicas and maintaining replication with ``security-service.sapsailing.com`` (we could theoretically try to work with local WiFi hotspots; but the key problem will be that TracTrac then neither has Internet connectivity for their on-site server, and we would have to radically change to a cloud-only set-up which is probably beyond what we'd be doing in this case). But we can ensure continued local operations with the replica on ``sap-p1-2`` now using a local on-site RabbitMQ installation between the two instances. For this, we replace the port forwards that during regular operations point to ``rabbit-ap-northeast-1.sapsailing.com`` by port forwards pointing to the RabbitMQ process on ``sap-p1-2``.
263
-
264
-On ``sap-p1-1`` an SSH connection to ``sap-p1-2`` is maintained, with the following port forwards:
265
-
266
-* sap-p1-2: 10202-->10202; 10201<--10201; 5763-->localhost:5672
267
-
268
-So the essential changes are that there are no more SSH connections into the cloud, and the port forward on each laptop's port 5673, which would point to ``rabbit-ap-northeast-1.sapsailing.com`` during regular operations, now points to ``sap-p1-2:5672`` where the RabbitMQ installation takes over from the cloud instance.
269
-
270
-### Letsencrypt Certificate for tokyo2020.sapsailing.com, security-service.sapsailing.com and tokyo2020-master.sapsailing.com
271
-
272
-In order to allow us to access ``tokyo2020.sapsailing.com`` and ``security-service.sapsailing.com`` with any HTTPS port forwarding locally so that all ``JSESSION_GLOBAL`` etc. cookies with their ``Secure`` attribute are delivered properly, we need an SSL certificate. I've created one by doing
273
-
274
-```
275
-/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d tokyo2020.sapsailing.com
276
-/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d security-service.sapsailing.com
277
-```
278
-
279
-as ``root`` on ``sapsailing.com``. The challenge displayed can be solved by creating an ALB rule for hostname header ``tokyo2020.sapsailing.com`` and the path as issued in the output of the ``certbot`` command, and as action specify a fixed response, response code 200, and pasting as text/plain the challenge data printed by the ``certbot`` command. Wait a few seconds, then confirm the Certbot prompt. The certificate will be issued and stored under ``/etc/letsencrypt/live/tokyo2020.sapsailing.com`` from where I copied it to ``/home/sailing/Downloads/letsencrypt`` on both laptops for later use with a local Apache httpd server. The certificate will expire on 2021-08-19, so after the Olympic Games, so we don't have to worry about renewing it.
280
-
281
-### Local NGINX Webserver Setup
282
-
283
-In order to be able to access the applications running on the local on-site laptops using HTTPS there is a web server on each of the two laptops, listening on port 9443 (HTTPS). The configuration for this is under ``/etc/nginx/sites-enabled/tokyo2020`` and looks like this:
284
-
285
-```
286
-server {
287
- listen 9443 ssl;
288
- server_name tokyo2020.sapsailing.com;
289
- ssl_certificate /etc/ssl/certs/tokyo2020.sapsailing.com.crt;
290
- ssl_certificate_key /etc/ssl/private/tokyo2020.sapsailing.com.key;
291
- ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
292
- ssl_ciphers HIGH:!aNULL:!MD5;
293
-
294
- location / {
295
- proxy_pass http://127.0.0.1:8888;
296
- }
297
-}
298
-```
299
-
300
-The "Let's Encrypt"-provided certificate is used for SSL termination. With tokyo2020.sapsailing.com aliased in ``/etc/hosts`` to the address of the current master server, this allows accessing ``https://tokyo2020.sapsailing.com:9443`` with all benefits of cookie / session authentication.
301
-
302
-Likewise, ``/etc/nginx/sites-enabled/security-service`` forwards to 127.0.0.1:8889 where a local copy of the security service may be deployed in case the Internet fails. In this case, the local port 443 must be forwarded to the NGINX port 9443 instead of security-service.sapsailing.com:443 through tokyo-ssh.sapsailing.com.
303
-
304
-On sap-p1-1, an NGINX server is currently listening for tokyo2020-master.sapsailing.com with the following configuration:
305
-
306
-```
307
-server {
308
- listen 9443 ssl;
309
- server_name tokyo2020-master.sapsailing.com;
310
- ssl_certificate /etc/ssl/private/tokyo2020-master.sapsailing.com.fullchain.pem;
311
- ssl_certificate_key /etc/ssl/private/tokyo2020-master.sapsailing.com.privkey.pem;
312
- ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
313
- ssl_ciphers HIGH:!aNULL:!MD5;
314
-
315
- location / {
316
- proxy_pass http://127.0.0.1:8888;
317
- }
318
-}
319
-```
320
-
321
-
322
-
323
-### Backup
324
-
325
-borgbackup is used to back up the ``/`` folder of each laptop to the respective other machine. The borg repository is located in ``/backup``.
326
-
327
-The backup from sap-p1-1 to sap-p1-2 runs at 01:00 each day, and the backup from sap-p1-2 to sap-p1-1 runs at 02:00 each day. Details about the configuration can be found in ``/root/borg-backup.sh`` on either machine. The log file for the backup runs is ``/var/log/backup.log``. The crontab file is in ``/root``.
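-
-As an illustration only (the real ``/root/borg-backup.sh`` may differ in options and excludes), such a nightly run could look roughly like this, pushing a backup of ``/`` over SSH into the peer's ``/backup`` repository:
-
-```
-#!/bin/bash
-# Sketch, not the actual /root/borg-backup.sh.
-PEER=sap-p1-2   # on sap-p1-2 this would be sap-p1-1
-borg create --stats --compression lz4 \
-    --exclude /backup --exclude /proc --exclude /sys --exclude /dev --exclude /run --exclude /tmp \
-    "ssh://root@${PEER}/backup::{hostname}-{now}" /
-# keep a limited history so /backup does not grow without bounds
-borg prune --keep-daily 7 "ssh://root@${PEER}/backup"
-```
-
-The crontab entries in ``/root`` would then be along the lines of ``0 1 * * * /root/borg-backup.sh >> /var/log/backup.log 2>&1`` (01:00 on sap-p1-1, 02:00 on sap-p1-2).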
328
-
329
-Both ``/backup`` folders have been mirrored to an S3 bucket called ``backup-sap-p1`` on June 14th.
330
-
331
-### Monitoring and e-Mail Alerting
332
-
333
-To be able to use ``sendmail`` to send notifications via e-mail, it needs to be installed and configured to use AWS SES as an SMTP relay:
334
-```
335
-sudo apt install sendmail
336
-```
337
-
338
-Follow the instructions on [https://docs.aws.amazon.com/ses/latest/DeveloperGuide/send-email-sendmail.html](https://docs.aws.amazon.com/ses/latest/DeveloperGuide/send-email-sendmail.html) with one exception: the content that needs to be added to ``sendmail.mc`` looks like this:
339
-```
340
-define(`SMART_HOST', `email-smtp.eu-west-1.amazonaws.com')dnl
341
-define(`RELAY_MAILER_ARGS', `TCP $h 587')dnl
342
-define(`confAUTH_MECHANISMS', `LOGIN PLAIN')dnl
343
-FEATURE(`authinfo', `hash -o /etc/mail/authinfo.db')dnl
344
-MASQUERADE_AS(`sapsailing.com')dnl
345
-FEATURE(masquerade_envelope)dnl
346
-FEATURE(masquerade_entire_domain)dnl
347
-```
348
-The authentication details can be fetched from the content of ``/root/mail.properties`` of any running sailing EC2 instance.
349
-
350
-Both laptops, ``sap-p1-1`` and ``sap-p1-2``, have monitoring scripts from the git folder ``configuration/on-site-scripts`` linked to ``/usr/local/bin``. These include in particular ``monitor-autossh-tunnels`` and ``monitor-mongo-replica-set-delay``, as well as a ``notify-operators`` script which contains the list of e-mail addresses to notify in case an alert occurs.
351
-
352
-The ``monitor-autossh-tunnels`` script checks all running ``autossh`` processes and looks for their corresponding ``ssh`` child processes. If any of them is missing, an alert is sent using ``notify-operators``.
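-
-In essence, the check can be sketched like this (not the literal script; the real one lives in ``configuration/on-site-scripts``, and the way ``notify-operators`` receives the message text is an assumption):
-
-```
-#!/bin/bash
-# Sketch: alert if any autossh process has lost its ssh child process (i.e., its tunnel died).
-for pid in $(pgrep -x autossh); do
-    if ! pgrep -x -P "$pid" ssh > /dev/null; then
-        # assumption: notify-operators takes the alert text as its argument
-        /usr/local/bin/notify-operators "autossh process $pid has no running ssh child"
-    fi
-done
-```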
353
-
354
-The ``monitor-mongo-replica-set-delay`` script looks at the result of calling ``rs.printSecondaryReplicationInfo()`` and logs it to ``/tmp/mongo-replica-set-delay``. The average of the last ten values is compared to a threshold (currently 3s), and an alert is sent using ``notify-operators`` if the threshold is exceeded.
355
-
356
-The ``monitor-disk-usage`` script checks the partition holding ``/var/lib/mongodb/``. Should it fill up to more than 90%, an alert will be sent using ``notify-operators``.
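-
-A minimal version of that check, again assuming ``notify-operators`` takes the alert text as its argument, could look like:
-
-```
-#!/bin/bash
-# Sketch: alert when the partition holding /var/lib/mongodb exceeds 90% usage.
-usage=$(df --output=pcent /var/lib/mongodb | tail -1 | tr -dc '0-9')
-if [ "$usage" -gt 90 ]; then
-    /usr/local/bin/notify-operators "Disk usage for /var/lib/mongodb is at ${usage}%"
-fi
-```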
357
-
358
-### Time Synchronization
359
-Set up the ``chronyd`` service on the desktop machine in order to regularly connect via VPN and relay the time towards the two P1s. Added
360
-```
361
-# Tokyo2020 configuration
362
-server 10.1.3.221 iburst
363
-```
364
-to ``/etc/chrony/chrony.conf`` on the clients.
365
-Added
366
-```
367
-# FOR TOKYO SERVER SETUP
368
-allow all
369
-local stratum 10
370
-```
371
-to the server's chrony configuration file and started the ``chronyd`` service.
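-
-Whether the P1s actually pick up the desktop machine as their time source can be verified with standard ``chronyc`` commands, e.g.:
-
-```
-chronyc sources -v   # on the clients, 10.1.3.221 should show up as a reachable source
-chronyc tracking     # shows the selected source and the current offset
-```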
372
-
373
-## AWS Setup
374
-
375
-Our primary AWS region for the event will be Tokyo (ap-northeast-1). There, we have reserved the elastic IP ``52.194.91.94`` to which we've mapped the Route53 hostname ``tokyo-ssh.sapsailing.com`` with a simple A-record. The host assigned to this IP/hostname is to be used as a "jump host" for SSH tunnels. It runs Amazon Linux with a login user named ``ec2-user`` who has ``sudo`` permission. In the root user's crontab we have hooked up the same set of scripts that, in our eu-west-1 production landscape, is responsible for obtaining and installing the landscape managers' SSH public keys into the login user's account, aligning the set of ``authorized_keys`` with those of the registered landscape managers (users with permission ``LANDSCAPE:MANAGE:AWS``). The ``authorized_keys.org`` file also contains the two public SSH keys of the ``sailing`` accounts on the two laptops, so each time the script produces a new ``authorized_keys`` file for the ``ec2-user``, the ``sailing`` keys for the laptop tunnels don't get lost.
376
-
377
-I added the EPEL repository like this:
378
-
379
-```
380
- yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
381
-```
382
-
383
-Our "favorite" Availability Zone (AZ) in ap-northeast-1 is "1d" / "ap-northeast-1d".
384
-
385
-The same host ``tokyo-ssh.sapsailing.com`` also runs a MongoDB 3.6 instance on port 10203.
386
-
387
-For RabbitMQ we run a separate host, based on AWS Ubuntu 20. It brings the ``rabbitmq-server`` package with it (version 3.8.2 on Erlang 22.2.7), and we'll install it with default settings, except for the following change: In the new file ``/etc/rabbitmq/rabbitmq.conf`` we enter the line
388
-
389
-```
390
- loopback_users = none
391
-```
392
-
393
-which allows clients from other hosts to connect (note how this works differently on different versions of RabbitMQ; the local laptops have to use a different syntax in their ``rabbitmq.config`` file). The security groups for the RabbitMQ server are configured such that only ``172.0.0.0/8`` addresses from our VPCs can connect.
394
-
395
-The RabbitMQ management plugin is enabled using ``rabbitmq-plugins enable rabbitmq_management`` for access from localhost. This again requires an SSH tunnel to the host. The host's default user is ``ubuntu``. The RabbitMQ management plugin is active on port 15672 and accessible only from localhost or through an SSH tunnel whose port forward ends at this host. RabbitMQ itself listens on the default port 5672. With this set-up, RabbitMQ traffic for this event remains independent of and undisturbed by any other RabbitMQ traffic from other servers in our default ``eu-west-1`` landscape, such as ``my.sapsailing.com``. The hostname pointing to the internal IP address of the RabbitMQ host is ``rabbit-ap-northeast-1.sapsailing.com``; its DNS record has a TTL of 60s.
396
-
397
-An autossh tunnel is established from ``tokyo-ssh.sapsailing.com`` to ``rabbit-ap-northeast-1.sapsailing.com`` which forwards port 15673 to port 15672, thus exposing the RabbitMQ web interface which otherwise only responds to localhost. This autossh tunnel is established by a systemd service that is described in ``/etc/systemd/system/autossh-port-forwards.service`` on ``tokyo-ssh.sapsailing.com``.
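-
-The unit is roughly of the following shape; this is a sketch, not the literal content of ``/etc/systemd/system/autossh-port-forwards.service``, and the exact ``autossh`` options may differ:
-
-```
-[Unit]
-Description=autossh port forward to the RabbitMQ management UI
-After=network-online.target
-
-[Service]
-Environment="AUTOSSH_GATETIME=0"
-ExecStart=/usr/bin/autossh -M 0 -N \
-    -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
-    -L 15673:localhost:15672 ubuntu@rabbit-ap-northeast-1.sapsailing.com
-Restart=always
-
-[Install]
-WantedBy=multi-user.target
-```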
398
-
399
-### Local setup of rabbitmq
400
-
401
-The above configuration also needs to be applied to the RabbitMQ installations on the P1s. The ``rabbitmq-server`` package there has version 3.6.10. In that version the config file is located at ``/etc/rabbitmq/rabbitmq.config``, and the corresponding entry is ``[{rabbit, [{loopback_users, []}]}].`` Further documentation for this version can be found here: [http://previous.rabbitmq.com/v3_6_x/configure.html](http://previous.rabbitmq.com/v3_6_x/configure.html)
402
-
403
-### Cross-Region VPC Peering
404
-
405
-The primary AWS region for the tokyo2020 replica set is ap-northeast-1 (Tokyo). In order to provide low latencies for the RHBs we'd like to add replicas in other regions as well. Since we do not want to expose the RabbitMQ running in ap-northeast-1 to the outside world, we plan to peer the VPCs of the other regions with the one in ap-northeast-1.
406
-
407
-The prerequisite for VPCs to be peered is that their CIDRs (such as 172.31.0.0/16) don't overlap. The default VPC in each region always uses the same CIDR (172.31.0.0/16), and hence in order to peer VPCs, all but one must be non-default VPCs. To avoid confusion when launching instances or setting up security groups it can be adequate, for those peering regions other than our default region ``eu-west-1``, to set up non-default VPCs with peering-capable CIDRs and remove the default VPC. This way users cannot accidentally launch instances or define security groups for any VPC other than the peered one.
408
-
409
-After the VPCs have been peered, each VPC's default routing table must be extended by a route to the peered VPC's CIDR using the peering connection.
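-
-With the AWS CLI, adding such a route could look like this (region, IDs and the peer CIDR are placeholders):
-
-```
-aws ec2 create-route --region ap-northeast-1 \
-    --route-table-id rtb-0123456789abcdef0 \
-    --destination-cidr-block 172.32.0.0/16 \
-    --vpc-peering-connection-id pcx-0123456789abcdef0
-```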
410
-
411
-With peering in place it is possible to reach instances in peered VPCs by their internal IPs. In particular, it is possible to connect to a RabbitMQ instance with the internal IP and port 5672 even if that RabbitMQ runs in a different region whose VPC is peered.
412
-
413
-### Global Accelerator
414
-
415
-We have created a Global Accelerator [Tokyo2020](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#AcceleratorDetails:AcceleratorArn=arn:aws:globalaccelerator::017363970217:accelerator/8ddd5afb-dd8d-4e8b-a22f-443a47240a94) which manages cross-region load balancing for us. There are two listeners: one for port 80 (HTTP) and one for port 443 (HTTPS). For each region an endpoint group must be created for both of the listeners, and the application load balancer (ALB) in that region has to be added as an endpoint.
416
-
417
-The Route53 entry ``tokyo2020.sapsailing.com`` now is an alias A record pointing to this global accelerator (``aca060e6eabf4ba3e.awsglobalaccelerator.com.``).
418
-
419
-### Geo-Blocking
420
-
421
-While for Tokyo 2020 this was not requested, for Paris 2024 we have heard rumors that it may be. If it is, the [AWS Web Application Firewall (WAF)](https://us-east-1.console.aws.amazon.com/wafv2/homev2/start) provides the solution. There, we can create so-called Web Access Control Lists (Web ACLs), which need to be created per region where an ALB is used.
422
-
423
-A Web ACL consists of a number of rules and has a default action (typically "Allow" or "Block") for those requests not matched by any rule. An ACL can be associated with one or more resources, in particular with Application Load Balancers (ALBs) deployed in the region.
424
-
425
-Rules, in turn, consist of statements that can be combined using logical operators. The rule type of interest for geo-blocking is "Originates from a country in" where one or more countries can be selected. When combined with an "Allow" or "Block" action, this results in the geo-blocking behavior desired.
426
-
427
-For requests blocked by the rule, the response code, response headers and message body to return to the client can be configured. We can use this, e.g., to configure a 301 re-direct to a static page that informs the user about the geo-blocking.
428
-
429
-### Application Load Balancers (ALBs) and Target Groups
430
-
431
-In each region supported, a dedicated load balancer for the Global Accelerator-based event setup has been set up (``Tokyo2020ALB`` or simply ``ALB``). A single target group with the usual settings (port 8888, health check on ``/gwt/status``, etc.) must exist: ``S-ded-tokyo2020`` (public).
432
-
433
-Note that no dedicated ``-m`` master target group is established. The reason is that the AWS Global Accelerator judges an ALB's health by looking at _all_ its target groups; should only a single target group not have a healthy target, the Global Accelerator considers the entire ALB unhealthy. With this, as soon as the on-site master server is unreachable, e.g., during an upgrade, all those ALBs would enter the "unhealthy" state from the Global Accelerator's perspective, and all public replicas which are still healthy would no longer receive traffic; the site would go "black." Therefore, we must ensure that the ALBs targeted by the Global Accelerator only have a single target group which only has the public replicas in that region as its targets.
434
-
435
-Each ALB has an HTTP and an HTTPS listener. The HTTP listener has only a single rule redirecting all traffic permanently (301) to the corresponding HTTPS request. The HTTPS listener has three rules: the ``/`` path for ``tokyo2020.sapsailing.com`` is re-directed to the Olympic event with ID ``25c65ff1-68b8-4734-a35f-75c9641e52f8``. All other traffic for ``tokyo2020.sapsailing.com`` goes to the public target group holding the regional replica(s). A default rule returns a 404 status with a static ``Not found`` text.
436
-
437
-## Landscape Architecture
438
-
439
-We have applied for a single SSH tunnel to IP address ``52.194.91.94`` which is our elastic IP for our SSH jump host in ap-northeast-1(d).
440
-
441
-The default production set-up is defined as follows:
442
-
443
-### MongoDB
444
-
445
-Three MongoDB nodes are intended to run during regular operations: sap-p1-1:10201, sap-p1-2:10202, and tokyo-ssh.sapsailing.com:10203. Since we have to work with SSH tunnels to keep things connected, we map everything using ``localhost`` ports such that both sap-p1-2 and tokyo-ssh see sap-p1-1:10201 as their localhost:10201, both sap-p1-1 and tokyo-ssh see sap-p1-2:10202 as their respective localhost:10202, and both sap-p1-1 and sap-p1-2 see tokyo-ssh:10203 as their localhost:10203. This way, the MongoDB URI can be specified as
446
-
447
-```
448
- mongodb://localhost:10201,localhost:10202,localhost:10203/tokyo2020?replicaSet=tokyo2020&retryWrites=true&readPreference=nearest
449
-```
450
-
451
-The cloud replica is not supposed to become primary, except for maybe in the unlikely event where operations would move entirely to the cloud. To achieve this, the cloud replica has priority 0 which can be configured like this:
452
-
453
-```
454
- tokyo2020:PRIMARY> cfg = rs.conf()
455
- // Then search for the member localhost:10203; let's assume it's in cfg.members[0]:
456
- cfg.members[0].priority=0
457
- rs.reconfig(cfg)
458
-```
459
-
460
-All cloud replicas shall use a MongoDB database name ``tokyo2020-replica``. In those regions where we don't have dedicated MongoDB support established (basically all but eu-west-1 currently), an image should be used that has a MongoDB server configured to use ``/home/sailing/mongo`` as its data directory and ``replica`` as its replica set name. See AMI SAP Sailing Analytics App HVM with MongoDB 1.137 (ami-05b6c7b1244f49d54) in ap-northeast-1 (already copied to the other peered regions except eu-west-1).
461
-
462
-One way to monitor the health and replication status of the replica set is running the following command:
463
-
464
-```
465
- watch 'echo "rs.printSecondaryReplicationInfo()" | \
466
- mongo "mongodb://localhost:10201/?replicaSet=tokyo2020&retryWrites=true&readPreference=nearest" | \
467
- grep "\(^source:\)\|\(syncedTo:\)\|\(behind the primary\)"'
468
-```
469
-
470
-It shows the replication state and in particular the delay of the replicas. In addition, a cron job exists for ``sailing@sap-p1-1`` which triggers ``/usr/local/bin/monitor-mongo-replica-set-delay`` every minute and which uses ``/usr/local/bin/notify-operators`` to send out an alert in case the average replication delay of the last ten read-outs exceeds a threshold (currently 3s).
471
-
472
-In order to have a local copy of the ``security_service`` database, a CRON job exists for user ``sailing`` on ``sap-p1-1`` which executes the ``/usr/local/bin/clone-security-service-db-safe-exit`` script (versioned in git under ``configuration/on-site-scripts/clone-security-service-db-safe-exit``) once per hour. See ``/home/sailing/crontab``. The script dumps ``security_service`` from the ``live`` replica set in ``eu-west-1`` to the ``/tmp/dump`` directory on ``ec2-user@tokyo-ssh.sapsailing.com`` and then sends the directory content as a ``tar.gz`` stream through SSH and restores it on the local ``mongodb://sap-p1-1:27017,sap-p1-2/security_service?replicaSet=security_service`` replica set, after copying an existing local ``security_service`` database to ``security_service_bak``. This way, even if the Internet connection dies during this cloning process, a valid copy still exists in the local ``tokyo2020`` replica set which can be copied back to ``security_service`` using the MongoDB shell command
473
-
474
-```
475
- db.copyDatabase("security_service_bak", "security_service")
476
-```
477
-
478
-### Master
479
-
480
-The master configuration is described in ``/home/sailing/servers/master/master.conf`` and can be used to produce a clean set-up like this:
481
-
482
-```
483
- rm env.sh; cat master.conf | ./refreshInstance.sh auto-install-from-stdin
484
-```
485
-
486
-If the laptops cannot reach ``https://releases.sapsailing.com`` due to connectivity constraints, releases and environments can be downloaded through other channels to ``sap-p1-1:/home/trac/releases``, and the variable ``INSTALL_FROM_SCP_USER_AT_HOST_AND_PORT`` can be set to ``sailing@sap-p1-1`` to fetch the release file and environment file from there by SCP. Alternatively, ``sap-p1-2:/home/trac/releases`` may be used for the same.
487
-
488
-This way, a clean new ``env.sh`` file will be produced from the config file, including the download and installation of a release. The ``master.conf`` file looks approximately like this:
489
-
490
-```
491
-INSTALL_FROM_RELEASE=build-202106012325
492
-SERVER_NAME=tokyo2020
493
-MONGODB_URI="mongodb://localhost:10201,localhost:10202,localhost:10203/${SERVER_NAME}?replicaSet=tokyo2020&retryWrites=true&readPreference=nearest"
494
-# RabbitMQ in eu-west-1 (rabbit.internal.sapsailing.com) is expected to be found through SSH tunnel on localhost:5675
495
-# Replication of shared services from central security-service.sapsailing.com through SSH tunnel 443:security-service.sapsailing.com:443
496
-# with a local /etc/hosts entry mapping security-service.sapsailing.com to 127.0.0.1
497
-REPLICATE_MASTER_QUEUE_HOST=localhost
498
-REPLICATE_MASTER_QUEUE_PORT=5675
499
-REPLICATE_MASTER_BEARER_TOKEN="***"
500
-# Outbound replication to RabbitMQ through SSH tunnel with port forward on port 5673, regularly to rabbit-ap-northeast-1.sapsailing.com
501
-# Can be re-mapped to the RabbitMQ running on sap-p1-2
502
-REPLICATION_HOST=localhost
503
-REPLICATION_PORT=5673
504
-USE_ENVIRONMENT=live-master-server
505
-ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
506
-```
507
-
508
-### Replicas
509
-
510
-The on-site replica on ``sap-p1-2`` can be configured with a ``replica.conf`` file in ``/home/sailing/servers/replica``, using
511
-
512
-```
513
- rm env.sh; cat replica.conf | ./refreshInstance.sh auto-install-from-stdin
514
-```
515
-
516
-The file looks like this:
517
-
518
-```
519
-# Regular operations; sap-p1-2 replicates sap-p1-1 using the rabbit-ap-northeast-1.sapsailing.com RabbitMQ in the cloud through SSH tunnel.
520
-# Outbound replication, though not expected to become active, goes to a local RabbitMQ
521
-INSTALL_FROM_RELEASE=build-202106012325
522
-SERVER_NAME=tokyo2020
523
-MONGODB_URI="mongodb://localhost:10201,localhost:10202,localhost:10203/${SERVER_NAME}-replica?replicaSet=tokyo2020&retryWrites=true&readPreference=nearest"
524
-# RabbitMQ in ap-northeast-1 is expected to be found locally on port 5673
525
-REPLICATE_MASTER_SERVLET_HOST=sap-p1-1
526
-REPLICATE_MASTER_SERVLET_PORT=8888
527
-REPLICATE_MASTER_QUEUE_HOST=localhost
528
-REPLICATE_MASTER_QUEUE_PORT=5673
529
-REPLICATE_MASTER_BEARER_TOKEN="***"
530
-# Outbound replication to RabbitMQ running locally on sap-p1-2
531
-REPLICATION_HOST=localhost
532
-REPLICATION_PORT=5672
533
-REPLICATION_CHANNEL=${SERVER_NAME}-replica
534
-USE_ENVIRONMENT=live-replica-server
535
-ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
536
-```
537
-
538
-Replicas in region ``eu-west-1`` can be launched using the following user data, making use of the established MongoDB live replica set in the region:
539
-
540
-```
541
-INSTALL_FROM_RELEASE=build-202106012325
542
-SERVER_NAME=tokyo2020
543
-MONGODB_URI="mongodb://mongo0.internal.sapsailing.com,mongo1.internal.sapsailing.com,dbserver.internal.sapsailing.com:10203/tokyo2020-replica?replicaSet=live&retryWrites=true&readPreference=nearest"
544
-USE_ENVIRONMENT=live-replica-server
545
-REPLICATION_CHANNEL=tokyo2020-replica
546
-REPLICATION_HOST=rabbit-ap-northeast-1.sapsailing.com
547
-REPLICATE_MASTER_SERVLET_HOST=tokyo-ssh.internal.sapsailing.com
548
-REPLICATE_MASTER_SERVLET_PORT=8888
549
-REPLICATE_MASTER_EXCHANGE_NAME=tokyo2020
550
-REPLICATE_MASTER_QUEUE_HOST=rabbit-ap-northeast-1.sapsailing.com
551
-REPLICATE_MASTER_BEARER_TOKEN="***"
552
-ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
553
-```
554
-
555
-(Adjust the release accordingly, of course.) (NOTE: During the first production days of the event we noticed that it was really a BAD IDEA to have all replicas use the same DB set-up, all writing to the MongoDB PRIMARY of the "live" replica set in eu-west-1. With tens of replicas running concurrently, this led to a massive backlog because MongoDB could not write fast enough. This gave rise to a new application server AMI which now has a MongoDB set-up included, using "replica" as the MongoDB replica set name. Each replica can hence write into its own MongoDB instance, isolated from all others and scaling linearly.)
556
-
557
-In other regions, an instance-local MongoDB shall instead be used for each replica so that the replicas don't interfere with each other or with other databases:
558
-
559
-```
560
-INSTALL_FROM_RELEASE=build-202106012325
561
-SERVER_NAME=tokyo2020
562
-MONGODB_URI="mongodb://localhost/tokyo2020-replica?replicaSet=replica&retryWrites=true&readPreference=nearest"
563
-USE_ENVIRONMENT=live-replica-server
564
-REPLICATION_CHANNEL=tokyo2020-replica
565
-REPLICATION_HOST=rabbit-ap-northeast-1.sapsailing.com
566
-REPLICATE_MASTER_SERVLET_HOST=tokyo-ssh.internal.sapsailing.com
567
-REPLICATE_MASTER_SERVLET_PORT=8888
568
-REPLICATE_MASTER_EXCHANGE_NAME=tokyo2020
569
-REPLICATE_MASTER_QUEUE_HOST=rabbit-ap-northeast-1.sapsailing.com
570
-REPLICATE_MASTER_BEARER_TOKEN="***"
571
-ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
572
-```
573
-
574
-### Application Servers
575
-
576
-``sap-p1-1`` normally is the master for the ``tokyo2020`` replica set. The application server directory is found under ``/home/sailing/servers/master``, and the master's HTTP port is 8888. It shall replicate the shared services, in particular ``SecurityServiceImpl``, from ``security-service.sapsailing.com``, like any normal server in our landscape, only that here we have to make sure we can target the default RabbitMQ in eu-west-1 and can see the ``security-service.sapsailing.com`` master directly or even better the load balancer.
577
-
578
-SSH local port forwards (configured with the ``-L`` option) that use hostnames instead of IP addresses for the remote host specification are resolved each time a new connection is established through this forward. If the DNS entry resolves to multiple IPs or if the DNS entry changes over time, later connection requests through the port forward will honor the new host name's DNS resolution.
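-
-For example, a forward like the following (``<jump-host>`` being a placeholder for the respective SSH jump host) has ``rabbit.internal.sapsailing.com`` resolved anew for every connection that comes in on the local port 5675:
-
-```
-ssh -N -L 5675:rabbit.internal.sapsailing.com:5672 <jump-host>
-```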
579
-
580
-Furthermore, there is a configuration under ``/home/sailing/servers/security_service`` which can be fired up with port 8889, using the local ``security_service`` database that a script ``/usr/local/bin/clone-security-service-db`` on the jump host ``tokyo-ssh.sapsailing.com`` updates on an hourly basis as long as an Internet connection is available. This can be used as a replacement of the official ``security-service.sapsailing.com`` service. Both laptops have an ``/etc/hosts`` entry mapping ``security-service.sapsailing.com`` to ``127.0.0.1`` and work with flexible SSH port forwards to decide whether the official Internet-based or the local copy of the security service shall be used.
581
-
582
-``sap-p1-2`` normally is a replica for the ``tokyo2020`` replica set, using the local RabbitMQ running on ``sap-p1-1``. Its outbound ``REPLICATION_CHANNEL`` will be ``tokyo2020-replica`` and uses the RabbitMQ running in ap-northeast-1, using an SSH port forward with local port 5673 for the ap-northeast-1 RabbitMQ (15673 for the web administration UI). A reverse port forward from ap-northeast-1 to the application port 8888 on ``sap-p1-2`` has to be established which replicas running in ap-northeast-1 will use to reach their master through HTTP. This way, adding more replicas on the AWS side in the cloud will not require any additional bandwidth between cloud and on-site network, except that the reverse HTTP channel, which uses only little traffic, will see additional traffic per replica whereas all outbound replication goes to the single exchange in the RabbitMQ node running in ap-northeast-1.
583
-
584
-## User Groups and Permissions
585
-
586
-The general public shall not be allowed during the live event to browse the event through ``tokyo2020.sapsailing.com``. Instead, they are required to go through any of the so-called "Rights-Holding Broadcaster" (RHB) web sites. There, a "widget" will be embedded into their web sites which works with our REST API to display links to the regattas and races, in particular the RaceBoard.html pages displaying the live and replay races.
587
-
588
-Moderators who need to comment on the races shall be given more elaborate permissions and shall be allowed to use the full-fledged functionality of ``tokyo2020.sapsailing.com``, in particular, browse through all aspects of the event, see flag statuses, postponements and so on.
589
-
590
-To achieve this effect, the ``tokyo2020-server`` group has the ``sailing_viewer`` role assigned for all users, and all objects except the top-level ``Event`` object are owned by that group. This way, everything but the event is publicly visible.
591
-
592
-The ``Event`` object is owned by ``tokyo2020-moderators``, and that group grants the ``sailing_viewer`` role only to its members, meaning only the members of that group are allowed to see the ``Event`` object.
593
-
594
-## Landscape Upgrade Procedure
595
-
596
-In the ``configuration/on-site-scripts`` folder we have prepared a number of scripts intended to be useful for local and cloud landscape management. TL;DR:
597
-```
598
- configuration/on-site-scripts/upgrade-landscape.sh -R {release-name} -b {replication-bearer-token}
599
-```
600
-will upgrade the entire landscape to the release ``{release-name}`` (e.g., build-202107210711). The ``{replication-bearer-token}`` must be provided such that the user authenticated by that token will have the permission to stop replication and to replicate the ``tokyo2020`` master.
601
-
602
-The script will proceed in the following steps:
603
- - patch ``*.conf`` files in ``sap-p1-1:servers/[master|security_service]`` and ``sap-p1-2:servers/[replica|master|security_service]`` so
604
- their ``INSTALL_FROM_RELEASE`` points to the new ``${RELEASE}``
605
- - Install new releases to ``sap-p1-1:servers/[master|security_service]`` and ``sap-p1-2:servers/[replica|master|security_service]``
606
- - Update all launch configurations and auto-scaling groups in the cloud (``update-launch-configuration.sh``)
607
- - Tell all replicas in the cloud to stop replicating (``stop-all-cloud-replicas.sh``)
608
- - Tell ``sap-p1-2`` to stop replicating
609
- - on ``sap-p1-1:servers/master`` run ``./stop; ./start`` to bring the master to the new release
610
- - wait until master is healthy
611
- - on ``sap-p1-2:servers/replica`` run ``./stop; ./start`` to bring up on-site replica again
612
- - launch upgraded cloud replicas and replace old replicas in target group (``launch-replicas-in-all-regions.sh``)
613
- - terminate all instances named "SL Tokyo2020 (auto-replica)"; this should cause the auto-scaling group to launch new instances as required
614
- - manually inspect the health of everything and terminate the "SL Tokyo2020 (Upgrade Replica)" instances when enough new instances
615
- named "SL Tokyo2020 (auto-replica)" are available
616
-
617
-The individual scripts will be described briefly in the following sub-sections. Many of them use as a common artifact the ``regions.txt`` file which contains the list of regions in which operations are executed. The ``eu-west-1`` region as our "legacy" or "primary" region requires special attention in some cases. In particular, it can use the ``live`` replica set for the replicas started in the region, also because the AMI used in this region is slightly different and in particular doesn't launch a MongoDB local replica set on each instance which the AMIs in all other regions supported do.
618
-
619
-### clone-security-service-db-safe-exit
620
-
621
-Creates a ``mongodump`` of "mongodb://mongo0.internal.sapsailing.com,mongo1.internal.sapsailing.com,dbserver.internal.sapsailing.com:10203/security_service?replicaSet=live&retryWrites=true&readPreference=nearest" on the ``tokyo-ssh.sapsailing.com`` host and packs it into a ``.tar.gz`` file. This archive is then transferred as the standard output of an SSH command to the host executing the script where it is unpacked into ``/tmp/dump``. The local "mongodb://localhost/security_service_bak?replicaSet=security_service&retryWrites=true&readPreference=nearest" backup copy is then dropped, the local ``security_service`` DB is moved to ``security_service_bak``, and the dump from ``/tmp/dump`` is then restored to ``security_service``. If this fails, the backup from ``security_service_bak`` is restored to ``security_service``, and there won't be a backup copy in ``security_service_bak`` anymore.
622
-
623
-The script is used as a CRON job for user ``sailing@sap-p1-1``.
624
-
625
-### get-replica-ips
626
-
627
-Lists the public IP addresses of all running replicas in the regions described in ``regions.txt`` on its standard output. Progress information will be sent to standard error. Example invocation:
628
-<pre>
629
- $ ./get-replica-ips
630
- Region: eu-west-1
631
- Region: ap-northeast-1
632
- Region: ap-southeast-2
633
- Region: us-west-1
634
- Region: us-east-1
635
- 34.245.148.130 18.183.234.161 3.26.60.130 13.52.238.81 18.232.169.1
636
-</pre>
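-
-Internally, per region, the lookup can be sketched roughly as follows with the AWS CLI; the instance ``Name`` tag filter is an assumption here, and the real script may identify replicas differently:
-<pre>
- aws ec2 describe-instances --region "$region" \
-     --filters "Name=tag:Name,Values=SL Tokyo2020*" "Name=instance-state-name,Values=running" \
-     --query "Reservations[].Instances[].PublicIpAddress" --output text
-</pre>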
637
-
638
-### launch-replicas-in-all-regions.sh
639
-
640
-Will launch, in the regions listed in ``regions.txt`` and with the release specified with ``-R``, as many new replicas as there are currently healthy auto-replicas registered with the ``S-ded-tokyo2020`` target group in the respective region (at least one). The new replicas will register at the master proxy ``tokyo-ssh.internal.sapsailing.com:8888`` and at the RabbitMQ at ``rabbit-ap-northeast-1.sapsailing.com:5672``; once healthy, they get added to the target group ``S-ded-tokyo2020`` in that region, and all auto-replicas registered before are removed from the target group.
641
-
642
-The script uses the ``launch-replicas-in-region.sh`` script for each region where replicas are to be launched.
643
-
644
-Example invocation:
645
-<pre>
646
- launch-replicas-in-all-regions.sh -R build-202107210711 -b 1234567890ABCDEFGH/+748397=
647
-</pre>
648
-
649
-Invoke without arguments to see documentation of the possible parameters.
650
-
651
-### launch-replicas-in-region.sh
652
-
653
-Will launch one or more (see ``-c``) new replicas in the AWS region specified with ``-g``, with the release specified with ``-R``. The new replicas will register at the master proxy ``tokyo-ssh.internal.sapsailing.com:8888`` and at the RabbitMQ at ``rabbit-ap-northeast-1.sapsailing.com:5672``; once healthy, they get added to the target group ``S-ded-tokyo2020`` in that region, and all auto-replicas registered before are removed from the target group. Specify ``-r`` and ``-p`` if you are launching in ``eu-west-1`` because it has a special non-default MongoDB environment.
654
-
655
-Example invocation:
656
-<pre>
657
- launch-replicas-in-region.sh -g us-east-1 -R build-202107210711 -b 1234567890ABCDEFGH/+748397=
658
-</pre>
659
-
660
-Invoke without arguments to see documentation of the possible parameters.
661
-
662
-### stop-all-cloud-replicas.sh
663
-
664
-Will tell all replicas in the cloud, in those regions described by the ``regions.txt`` file, to stop replicating. This works by invoking the ``get-replica-ips`` script and telling each replica found to stop replicating, using the ``stopReplicating.sh`` script in its ``/home/sailing/servers/tokyo2020`` directory and passing through the bearer token. Note: this will NOT stop replication on the local replica on ``sap-p1-2``!
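-
-Conceptually this boils down to a loop along the following lines (a sketch; the login user and the exact ``stopReplicating.sh`` arguments are assumptions):
-<pre>
- for ip in $(./get-replica-ips); do
-     ssh "sailing@${ip}" "/home/sailing/servers/tokyo2020/stopReplicating.sh -b '${BEARER_TOKEN}'"
- done
-</pre>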
665
-
666
-The script must be invoked with the bearer token needed to authenticate a user with replication permission for the ``tokyo2020`` application replica set.
667
-
668
-Example invocation:
669
-<pre>
670
- stop-all-cloud-replicas.sh -b 1234567890ABCDEFGH/+748397=
671
-</pre>
672
-
673
-Invoke without arguments to see documentation of the possible parameters.
674
-
675
-### update-launch-configuration.sh
676
-
677
-Will upgrade the auto-scaling group ``tokyo2020*`` (such as ``tokyo2020-auto-replicas``) in the regions from ``regions.txt`` with a new launch configuration that will be derived from the existing launch configuration named ``tokyo2020-*`` by copying it to ``tokyo2020-{RELEASE_NAME}`` while updating the ``INSTALL_FROM_RELEASE`` parameter in the user data to the ``{RELEASE_NAME}`` provided in the ``-R`` parameter, and optionally adjusting the AMI, key pair name and instance type if specified by the respective parameters. Note: this will NOT terminate any instances in the target group!
678
-
679
-Example invocation:
680
-<pre>
681
- update-launch-configuration.sh -R build-202107210711
682
-</pre>
683
-
684
-Invoke without arguments to see documentation of the possible parameters.
685
-
686
-### upgrade-landscape.sh
687
-
688
-See the introduction of this main section. Synopsis:
689
-<pre>
690
- ./upgrade-landscape.sh -R &lt;release-name&gt; -b &lt;replication-bearer-token&gt; \[-t &lt;instance-type&gt;\] \[-i &lt;ami-id&gt;\] \[-k &lt;key-pair-name&gt;\] \[-s\]<br>
691
- -b replication bearer token; mandatory
692
- -i Amazon Machine Image (AMI) ID to use to launch the instance; defaults to latest image tagged with image-type:sailing-analytics-server
693
- -k Key pair name, mapping to the --key-name parameter
694
- -R release name; must be provided to select the release, e.g., build-202106040947
695
- -t Instance type; defaults to
696
- -s Skip release download
697
-</pre>
698
-
699
-## Log File Analysis
700
-
701
-Athena table definitions and queries have been provided in region ``eu-west-3`` (Paris) where we hosted our EU part during the event after a difficult start in ``eu-west-1`` with the single MongoDB live replica set not scaling well for all the replicas that were required in the region.
702
-
703
-The key to the Athena set-up is to have a table definition per bucket, with a dedicated S3 bucket per region where ALB logs were recorded. An example of a query based on the many tables then looks like this:
704
-<pre>
705
- with union_table AS
706
- (select *
707
- from alb_logs_ap_northeast_1
708
- union all
709
- select *
710
- from alb_logs_ap_southeast_2
711
- union all
712
- select *
713
- from alb_logs_eu_west_3
714
- union all
715
- select *
716
- from alb_logs_us_east_1
717
- union all
718
- select *
719
- from alb_logs_us_west_1)
720
- select date_trunc('day', parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')), count(distinct concat(client_ip,user_agent))
721
- from union_table
722
- where (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
723
- between parse_datetime('2021-07-21-00:00:00','yyyy-MM-dd-HH:mm:ss')
724
- and parse_datetime('2021-08-08-02:00:00','yyyy-MM-dd-HH:mm:ss'))
725
- group by date_trunc('day', parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z'))
726
-</pre>
727
-It defines a ``union_table`` which unites all contents from all buckets scanned.
... ...
\ No newline at end of file
wiki/info/landscape/paris2024/olympic-plan-for-paris-marseille-2024.md
... ...
@@ -0,0 +1,149 @@
1
+# Thoughts on Landscape Configuration for Paris 2024 / Marseille
2
+
3
+As a baseline we'll use the [Olympic Setup From Tokyo 2020](/wiki/info/landscape/tokyo2020/olympic-setup). The major change, though, would be that instead of running a local on-site master and a local on-site replica we would run two master instances locally on site where one is the "shadow" and the other one is the "production" master.
4
+
5
+We captured a set of scripts and configuration files in our Git repository at ``configuration/on-site-scripts``, in particular also separately for the two laptops, in ``configuration/on-site-scripts/sap-p1-1`` and ``configuration/on-site-scripts/sap-p1-2``.
6
+
7
+Many of these scripts and configuration files contain an explicit reference to the replica set name (and therefore sub-domain name, DB name, tag values, etc.) ``tokyo2020``. With the test event coming up in July 2023 and the Paris Olympic Summer Games 2024, we should consider making this a parameter of these scripts so it is easy to adjust. We will need different sub-domains for the test event and the Games, where the latter will most likely have ``paris2024.sapsailing.com`` as its domain name and hence ``paris2024`` as the replica set name.
8
+
9
+## VPCs and VPC Peering
10
+
11
+From Tokyo2020 we still have the VPCs around in five regions (``eu-west-3``, ``us-west-1``, ``us-east-1``, ``ap-northeast-1``, and ``ap-southeast-2``). They were named ``Tokyo2020`` and our scripts currently depend on this. But VPCs can easily be renamed, and with that we may save a lot of work regarding re-peering those VPCs. We will, though, need routes to the new "primary" VPC ``eu-west-3`` from everywhere because the ``paris-ssh.sapsailing.com`` jump host will be based there. Note the inconsistency in capitalization: for the VPC name and as part of instance names such as ``SL Tokyo2020 (Upgrade Replica)`` we use ``Tokyo2020``, for basically everything else it's ``tokyo2020`` (lowercase). When switching to a parameterized approach we should probably harmonize this and use the lowercase name consistently throughout.
12
+
13
+I've started with re-naming the VPCs and their routing tables from ``Tokyo2020`` to ``Paris2024``. I've also added VPC peering between Paris (``eu-west-3``) and California (``us-west-1``), Virginia (``us-east-1``), and Sydney (``ap-southeast-2``). The peering between Paris and Tokyo (``ap-northeast-1``) already existed because for Tokyo 2020, Paris hosted replicas that needed to access the jump host in the Tokyo region.
14
+
15
+I've also copied the "SAP Sailing Analytics 1.150" image to all five regions.
16
+
17
+## Master and Shadow Master
18
+
19
+We will use one laptop as production master, the other as "shadow master." The reason for not using a master and a local replica is that if the local master fails, re-starting later in the event can cause significant delays until all races have loaded and replicated again.
20
+
21
+Both laptops shall run their local RabbitMQ instance. Each of the two master processes can optionally write into its local RabbitMQ through an SSH tunnel, which may instead be redirected to the cloud-based RabbitMQ while an active Internet/cloud connection is available.
22
+
23
+This will require setting up two MongoDB databases (not separate processes, just different DB names), e.g., "paris2024" and "paris2024-shadow". Note that for the shadow master this means that the DB name does not follow the typical naming convention where the ``SERVER_NAME`` property ("paris2024" for both the primary and the shadow master) is also used as the default MongoDB database name.
24
+
25
+Note: The shadow master must have at least one registered replica because otherwise it would not send any operations into the RabbitMQ replication channel. This can be a challenge for a shadow master that has never seen any replica. We could, for example, simulate a replica registration while the shadow master is still basically empty, using, e.g., a ``curl`` request, and then ignore and later delete the initial load queue on the local RabbitMQ.
26
+
27
+Furthermore, while it is not in production itself, the shadow master must not send into the production RabbitMQ replication channel used by the production master instance, because that would duplicate the operations sent. Instead, the shadow master shall use a local RabbitMQ instance to which an SSH tunnel forwards.
28
+
29
+We will install a cron job that regularly performs a "compareServers" between production and shadow master. Any deviation shall trigger a notification through the e-mail notification mechanism that is in place for all other alerts and monitoring activities, too.
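+
+In crontab terms this could be as simple as the following sketch; ``compare-production-and-shadow`` is a hypothetical wrapper name, and the interval is still to be decided. The wrapper would run the comparison and call the existing ``notify-operators`` mechanism whenever it detects a deviation:
+
+```
+# hypothetical wrapper name and interval -- to be replaced by the actual script
+0 * * * * /usr/local/bin/compare-production-and-shadow
+```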
30
+
31
+## Cloud RabbitMQ
32
+
33
+Instead of ``rabbit-ap-northeast-1.sapsailing.com`` we will use ``rabbit-eu-west-3.sapsailing.com`` pointing to the internal IP address of the RabbitMQ installation in ``eu-west-3`` that is used as the default for the on-site master processes as well as for all cloud replicas.
34
+
35
+## ALB and Target Group Set-Up
36
+
37
+Like for Tokyo2020, a separate ALB for the Paris2024 event will be set up in each of the regions supported. They will all be registered with the Global Accelerator to whose anycast IP addresses the DNS alias record for ``paris2024.sapsailing.com`` will point. Different from Tokyo2020, where we used a static "404 - Not Found" rule as the default rule for all of these ALBs, we can and should use an IP-based target group for the default rule's forwarding and should register the ``eu-west-1`` "Webserver" (Central Reverse Proxy)'s internal IP address in these target groups. This way, when archiving the event, cached DNS records can still resolve to the Global Accelerator and from there to the ALB(s) and from there, via these default rules, back to the central reverse proxy, which then should know where to find the ``paris2024.sapsailing.com`` content in the archive.
38
+
39
+Target group naming conventions have changed slightly since Tokyo2020: instead of ``S-ded-tokyo2020`` we will use only ``S-paris2024`` for the public target group containing all the cloud replicas.
40
+
41
+## Cloud Replica Set-Up
42
+
43
+Based on the cloud replica set-up for Tokyo2020 we can derive the following user data for Paris2024 cloud replicas:
44
+
45
+```
46
+INSTALL_FROM_RELEASE=build-.............
47
+SERVER_NAME=paris2024
48
+MONGODB_URI="mongodb://localhost/paris2024-replica?replicaSet=replica&retryWrites=true&readPreference=nearest"
49
+USE_ENVIRONMENT=live-replica-server
50
+REPLICATION_CHANNEL=paris2024-replica
51
+REPLICATION_HOST=rabbit-eu-west-3.sapsailing.com
52
+REPLICATE_MASTER_SERVLET_HOST=paris-ssh.internal.sapsailing.com
53
+REPLICATE_MASTER_SERVLET_PORT=8888
54
+REPLICATE_MASTER_EXCHANGE_NAME=paris2024
55
+REPLICATE_MASTER_QUEUE_HOST=rabbit-eu-west-3.sapsailing.com
56
+REPLICATE_MASTER_BEARER_TOKEN="***"
57
+ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
58
+```
59
+
60
+Make sure to align the ``INSTALL_FROM_RELEASE`` parameter to match up with the release used on site.
61
+
62
+## SSH Tunnels
63
+
64
+The baseline is again the Tokyo 2020 set-up, apart from the jump host's renaming from ``tokyo-ssh.sapsailing.com`` to ``paris-ssh.sapsailing.com``. The tunnel scripts for ``sap-p1-2`` that assume ``sap-p1-2`` is the (primary) master seem to be faulty. At least, they don't establish a reverse port forward for port 8888 which, however, seems necessary to let cloud replicas reach the on-site master. ``sap-p1-2`` becoming the (primary) on-site master means that ``sap-p1-1`` has failed. This can be a problem with the application process, but it could even be a hardware issue where the entire machine has crashed and has become unavailable. Therefore, ``sap-p1-2`` must take over at least the application and become the primary master, and this requires a reverse port forward like this: ``-R '*:8888:localhost:8888'``
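+
+A sketch of the corresponding tunnel invocation, assuming the jump host's ``sshd`` allows remote forwards to bind to all interfaces (``GatewayPorts``):
+
+```
+autossh -M 0 -N -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
+    -R '*:8888:localhost:8888' ec2-user@paris-ssh.sapsailing.com
+```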
65
+
66
+The ports and their semantics:
67
+
68
+* 443: HTTPS port of security-service.sapsailing.com (or its local replacement through NGINX)
69
+* 5673: Outbound RabbitMQ to use by on-site master (regularly to RabbitMQ in eu-west-3, local replacement as fallback)
70
+* 5675: Inbound RabbitMQ (rabbit.internal.sapsailing.com) for replication from security-service.sapsailing.com (or local replacement)
71
+* 9443: NGINX HTTPS port on sap-p1-1 (also reverse-forwarded from paris-ssh.sapsailing.com)
72
+* 9444: NGINX HTTPS port on sap-p1-2 (also reverse-forwarded from paris-ssh.sapsailing.com)
73
+* 10201: MongoDB on sap-p1-1
74
+* 10202: MongoDB on sap-p1-2
75
+* 10203: MongoDB on paris-ssh.sapsailing.com
76
+* 15673: HTTP to RabbitMQ administration UI of the RabbitMQ server reached on port 5673
77
+* 15675: HTTP to RabbitMQ administration UI of the RabbitMQ server reached on port 5675
78
+* 22222: SSH access to sapsailing.com:22, e.g., for Git access through ``ssh://trac@localhost:22222/home/trac/git``
79
+* 22443: HTTPS access to sapsailing.com:443, e.g., for trying to download a release, although chances are slim this works without local ``/etc/hosts`` magic, e.g., for ``releases.sapsailing.com``
80
+
81
+``/etc/hosts`` must map ``security-service.sapsailing.com`` to ``localhost`` so that local port 443 can be forwarded to different targets based on needs.
82
+
83
+### Regular Operations
84
+
85
+* Three MongoDB nodes form the ``paris2024`` replica set: ``sap-p1-1:10201``, ``sap-p1-2:10202``, and ``paris-ssh.sapsailing.com:10203``, where SSH tunnels forward ports 10201..10203 such that everywhere on the three hosts involved the replica set can be addressed as ``mongodb://localhost:10201,localhost:10202,localhost:10203/?replicaSet=paris2024&retryWrites=true&readPreference=nearest``
86
+* ``sap-p1-1`` runs the ``paris2024`` production master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024``, replicating from ``security-service.sapsailing.com`` through an SSH tunnel from local port 443 pointing to ``security-service.sapsailing.com`` (which actually forwards to the ALB hosting the rules for ``security-service.sapsailing.com``) and using RabbitMQ ``rabbit.internal.sapsailing.com`` tunneled through port 5675, with the RabbitMQ admin UI tunneled through port 15675; outbound replication goes to local port 5673, which tunnels to ``rabbit-eu-west-3.sapsailing.com``, whose admin UI is reached through port 15673, which tunnels to ``rabbit-eu-west-3.sapsailing.com:15672``
87
+* ``sap-p1-2`` runs the ``paris2024`` shadow master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024-shadow``, replicating from ``security-service.sapsailing.com`` through an SSH tunnel from local port 443 pointing to ``security-service.sapsailing.com`` (which actually forwards to the ALB hosting the rules for ``security-service.sapsailing.com``) and using RabbitMQ ``rabbit.internal.sapsailing.com`` tunneled through port 5675, with the RabbitMQ admin UI tunneled through port 15675; outbound replication goes to local port 5673, which tunnels to the RabbitMQ running locally on ``sap-p1-2`` on port 5672, whose admin UI is then reached through port 15673, which tunnels to ``sap-p1-2:15672``
88
+* The database ``mongodb://mongo0.internal.sapsailing.com,mongo1.internal.sapsailing.com/security_service?replicaSet=live`` is backed up on a regular (nightly) basis to a DB named ``security_service`` in the local MongoDB replica set ``paris2024``, which makes it visible especially in the two MongoDB replicas running on ``sap-p1-1`` and ``sap-p1-2``
89
+
90
+### Production Master Failure
91
+
92
+Situation: production master fails, e.g., because of a Java VM crash or a deadlock or user issues such as killing the wrong process...
93
+
94
+Approach: Switch to the previous shadow master on ``sap-p1-2``, re-configuring all SSH tunnels accordingly; this includes the 8888 reverse forward from the cloud to the local on-site master, as well as the RabbitMQ forward, which needs to switch from the local RabbitMQ running on the shadow master's host to the cloud-based RabbitMQ. Clients such as the SwissTiming clients need to switch to the shadow master. To remedy gaps in replication due to the SSH tunnel switch we may want to circulate the replica instances, rolling over to a new set of replicas that fetch a new initial load. If ``sap-p1-1``'s operating system is still alive, its SSH tunnel, especially for the port 8888 reverse forwarding from ``paris-ssh.sapsailing.com``, must be terminated because otherwise ``sap-p1-2`` may not be able to establish its own reverse forward of port 8888.
95
+
96
+Here are the major changes:
97
+
98
+* ``sap-p1-2`` runs the ``paris2024`` shadow master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024-shadow``, replicating from ``security-service.sapsailing.com`` through an SSH tunnel from local port 443 pointing to ``security-service.sapsailing.com`` (which actually forwards to the ALB hosting the rules for ``security-service.sapsailing.com``) and using RabbitMQ ``rabbit.internal.sapsailing.com`` tunneled through port 5675, with the RabbitMQ admin UI tunneled through port 15675; *outbound replication goes to local port 5673 which tunnels to* ``rabbit-eu-west-3.sapsailing.com`` *whose admin UI is reached through port 15673 which tunnels to* ``rabbit-eu-west-3.sapsailing.com:15672``
99
+
100
+### Internet Failure
101
+
102
+While cloud replicas and hence the ALBs and Global Accelerator will remain reachable with the latest data snapshot at the time the connection is lost, we will then lose the following capabilities:
103
+
104
+* replicate the official ``security-service.sapsailing.com`` service, both from an HTTP and from a RabbitMQ perspective; ``rabbit.internal.sapsailing.com`` will then no longer be reachable from the on-site network
105
+* keep the cloud MongoDB instance on ``paris-ssh.sapsailing.com`` synchronized; it will fall behind
106
+* outbound replication to ``rabbit-eu-west-3.sapsailing.com`` and from there on to the cloud replicas in all regions supported will stop
107
+* inbound "reverse" replication from the cloud replicas to the on-site master through the reverse forward of ``paris-ssh.sapsailing.com:8888`` will stop working; the cloud replicas will start buffering the operations to send to their master and will keep re-trying in growing time intervals
108
+
109
+To recover with as little disruption as possible, switching to a local copy of the ``security-service`` and to a local RabbitMQ for "outbound" replication is required. Of course, no replicas will be listening on that local RabbitMQ, but in order to not stop working, the application server will need a RabbitMQ that can be reached on the outbound port 5673. This is achieved by switching the SSH tunnel such that port 5673 will then forward to a RabbitMQ running locally.
110
+
111
+We will then start ``sap-p1-1:/home/sailing/servers/security_service`` on port 8889 which will connect to the local MongoDB replica set still consisting of the two on-site nodes, using the database ``security_service`` that has been obtained as a copy of the ``live`` MongoDB replica set in our default region. This local security service uses the local RabbitMQ running on the same host for its outbound replication. On both on-site laptops the port 443 then needs to forward to the NGINX instance running locally as a reverse proxy for the local security service. On ``sap-p1-1`` this is port 9443, on ``sap-p1-2`` this is port 9444. Furthermore, the port forward from port 5675 and 15675 on both laptops then must point to the local RabbitMQ used outbound by the security service running locally. This will usually be the RabbitMQ running on ``sap-p1-1``, so ``sap-p1-1:5672``, or ``sap-p1-1:15672``, respectively, for the admin port.
112
+
113
+This makes for the following set-up:
114
+
115
+* Only two MongoDB nodes remain available on site from the ``paris2024`` replica set: ``sap-p1-1:10201`` and ``sap-p1-2:10202``, where SSH tunnels forward ports 10201..10203 such that everywhere on the three hosts involved the replica set can be addressed as ``mongodb://localhost:10201,localhost:10202,localhost:10203/?replicaSet=paris2024&retryWrites=true&readPreference=nearest``
116
+* ``sap-p1-1`` runs the ``paris2024`` production master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024``, replicating from ``security-service.sapsailing.com`` through SSH tunnel from local port 443 pointing to ``sap-p1-1:9443`` which is the port of the local NGINX acting as an SSL-offloading reverse proxy for the security service running locally on port 8889; port 5675 forwards to ``sap-p1-1:5672`` where the local RabbitMQ runs, with the local ``sap-p1-1`` RabbitMQ admin UI tunneled through port 15675; outbound replication goes to local port 5673 which then also tunnels to the local RabbitMQ on ``sap-p1-1:5672``, whose admin UI is reached through port 15673 which tunnels to ``sap-p1-1:15672``
117
+* ``sap-p1-2`` runs the ``paris2024`` shadow master from ``/home/sailing/servers/paris2024`` against local database ``paris2024:paris2024-shadow``, replicating from ``security-service.sapsailing.com`` through SSH tunnel from local port 443 pointing to ``sap-p1-1:9443`` which is the reverse proxy for the security service running on ``sap-p1-1:8889``, and RabbitMQ tunneled through port 5675 to ``sap-p1-1:5672``, with the RabbitMQ admin UI tunneled through port 15675 to ``sap-p1-1:15672``; outbound replication still goes to local port 5673 which tunnels to the RabbitMQ running locally on ``sap-p1-2``, port 5672 whose admin UI is then reached through port 15673 which tunnels to ``sap-p1-2:15672`` which keeps the shadow master's outbound replication from interfering with the production master's outbound replication.
118
+
119
+### Internet Failure Using Shadow Master
120
+
121
+
122
+
123
+## Test Plan for Test Event Marseille July 2023
124
+
125
+### Test Internet Failure
126
+
127
+We shall emulate the lack of a working Internet connection and practice and test the procedures for switching to a local security-service.sapsailing.com installation as well as a local RabbitMQ standing in for the RabbitMQ deployed in the cloud.
128
+
129
+### Test Primary Master Hardware Failure
130
+
131
+This will require switching entirely to the shadow master. Depending on the state of the reverse port forward of the 8888 HTTP port from the cloud we may or may not have to try to terminate a hanging connection in order to be able to establish a new reverse port forward pointing from the cloud to the shadow master. The shadow master also then needs to use the cloud-based RabbitMQ instead of its local one. As a fine-tuning, we can practice the rolling re-sync of all cloud replicas which will likely have missed operations in the meantime.
132
+
133
+### Test Primary Master Java VM Failure
134
+
135
+This can be caused by a deadlock, a VM crash, a Full GC phase, massive performance degradation or other faulty behavior. We then need to actively close the reverse SSH port forward from the cloud to the production master's 8888 HTTP port and, as a precaution, switch the RabbitMQ tunnel from the cloud-based to the local RabbitMQ instance so that, in case the production master "wakes up" again, e.g., after a Full GC, it does not start to interfere with the now active shadow master on the RabbitMQ fan-out exchange. On the shadow master we need to re-configure the SSH tunnels, particularly to target the cloud-based RabbitMQ and to have the reverse port forward on port 8888 now target the shadow master on site.
136
+
137
+### Test Primary Master Failures with no Internet Connection
138
+
139
+Combine the above scenarios: a failing production master (hardware or VM-only) will require different tunnel re-configurations, especially regarding the then local security-service.sapsailing.com environment which may need to move to the shadow laptop.
140
+
141
+## TODO Before / During On-Site Set-Up (Both Test Event and OSG2024)
142
+
143
+* Set up Global Accelerator and have the already established DNS record ``paris2024.sapsailing.com`` (placeholder that points to the Dynamic ALB in the default region ``eu-west-1`` to effectively forward to the central reverse proxy and ultimately the archive server's landing page) become an alias pointing to this Global Accelerator
144
+* Set up logging buckets for ALBs in all supported regions
145
+* Set up ALBs in all supported regions, define their three rules (redirect for ``paris2024.sapsailing.com/`` path; forward to public target group for all other ``paris2024.sapsailing.com`` traffic; default rule forwarding to IP-based target group containing the ``eu-west-1`` central reverse proxy) and register them with the Global Accelerator
146
+* Add SSH public keys for password-less private keys of ``sap-p1-1`` and ``sap-p1-2`` to ``ec2-user@paris-ssh.sapsailing.com:.ssh/authorized_keys.org`` so that when the authorized_keys file is updated automatically, the on-site keys are still preserved.
147
+* Create LetsEncrypt certificates for the NGINX installations for paris2024.sapsailing.com and security-service.sapsailing.com and install to the two on-site laptops' NGINX environments
148
+* Ensure the MongoDB installations on both laptops use the ``paris2024`` replica set
149
+* Adjust Athena queries to include all ALB logging buckets from all regions
wiki/info/landscape/tokyo2020/olympic-failover.md
... ...
@@ -0,0 +1,121 @@
1
+# Failover Scenarios for [Olympic Setup Tokyo 2020](https://wiki.sapsailing.com/wiki/info/landscape/tokyo2020/olympic-setup)
2
+
3
+This page is meant to describe a couple of failure scenarios and appropriate mitigation approaches. In addition, open questions are documented. It refers to the setup described in detail [here](https://wiki.sapsailing.com/wiki/info/landscape/tokyo2020/olympic-setup). It is work in progress and far from complete.
4
+
5
+## Hardware failure on primary Lenovo P1 with the Sailing Analytics Master
6
+
7
+### Scenario
8
+
9
+The Lenovo P1 device on which the SAP Sailing Analytics Master / Primary is running fails and is not available anymore. A possible reason could be, e.g., a CPU hardware failure.
10
+
11
+The local replica on the second P1 device is still available, and so are the cloud replicas. Data from the past is still available to users and on-site consumers. However, no new data will be stored.
12
+
13
+The MongoDB replica set member running on the primary P1 will also no longer be available.
14
+
15
+### Mitigation
16
+
17
+The second P1 needs to switch roles and become the master. This means a new SAP Sailing Analytics Master process has to be started from scratch. The outbound replication channel has to be the cloud RabbitMQ in ap-northeast-1.
18
+
19
+SwissTiming needs to be informed once the master is ready.
20
+
21
+Cloud replicas need to be reconfigured to the new channel that the master uses as outbound.
22
+
23
+SSH tunnels won't need to change.
24
+
25
+The local replica has to be safely shut down; on-site users might experience some change, depending on how we decide to handle local routing.
26
+
27
+Alternatively, based on the Tokyo 2020 experience, we may consider running the second Lenovo P1 laptop also in "master/primary" mode as a "shadow" where all we need to focus on is initially connecting the TracTrac races and linking them properly to the leaderboard slots. From there on, administration other than adding or removing wind sources proved to be low effort with close to zero interaction. The official scores are transmitted after confirmation from TracTrac, and so are all start times, finish times, and penalties. This approach could reduce the time to fail over from the primary to the shadow system to a lot less than would be required for a re-start of the second Lenovo P1 in master/primary mode.
28
+
29
+### Open questions
30
+
31
+How exactly does the switch have to happen?
32
+
33
+Do we re-use the port?
34
+
35
+What are the pros and cons?
36
+
37
+I would like to see a somewhat detailed run-book.
38
+
39
+### Results and procedure tested in Medemblik 2021
40
+
41
+Check master.conf in ``/home/sailing/servers/master`` on sap-p1-2 for the desired build and for the correct setting of the new variable ``INSTALL_FROM_SCP_USER_AT_HOST_AND_PORT``. There is a tunnel listening on 22222 which forwards the traffic through tokyo-ssh to sapsailing.com:22.
42
+So a valid entry would be:
43
+```
44
+INSTALL_FROM_RELEASE=build-202106012325
45
+INSTALL_FROM_SCP_USER_AT_HOST_AND_PORT="trac@localhost:22222"
46
+```
47
+
48
+Now execute the following; this will download and extract the build:
49
+```
50
+cd /home/sailing/servers/master; rm env.sh; cat master.conf | ./refreshInstance.sh auto-install-from-stdin
51
+```
52
+
53
+Now we stop the replica; make sure you are user ``sailing``:
54
+```
55
+/home/sailing/servers/replica/stop
56
+```
57
+
58
+Wait for the process to be stopped/killed.
59
+
60
+Start the correct tunnels script by executing:
61
+```
62
+sudo /usr/local/bin/tunnels-master
63
+```
64
+
65
+Start the master; make sure you are user ``sailing``!
66
+```
67
+/home/sailing/servers/master/start
68
+```
69
+
70
+Check sailing log:
71
+```
72
+tail -f /home/sailing/servers/master/logs/sailing0.log.0
73
+```
74
+
75
+## Hardware failure on secondary Lenovo P1 with the Sailing Analytics Replica
76
+
77
+### Scenario
78
+
79
+The secondary Lenovo P1 experiences an unrecoverable hardware failure. The primary P1 is still available, and new data is safely stored in the database.
80
+
81
+Cloud users are not able to see new data, as the replication channel is interrupted.
82
+
83
+For on-site/local users this depends on the decision which URL/IP they are provided with. If we use the local replica to serve them, their screens will turn dark in this scenario.
84
+
85
+The MongoDB replica set member will no longer be available.
86
+
87
+### Mitigation
88
+
89
+Local/on-site users need to have priority. If they were served by the secondary P1 before, they need to switch to the primary P1.
90
+
91
+The outbound replication channel on the primary P1 needs to switch to the RabbitMQ in ap-northeast-1. The cloud replicas need to be reconfigured to use that channel.
92
+
93
+SSH tunnels won't need to change.
94
+
95
+### Open questions
96
+
97
+How do local/on-site users use the Sailing Analytics? Will they simply be served from tokyo2020.sapsailing.com? A couple of decisions depend on this question.
98
+
99
+What exactly needs to be done, and where, to change the replication and be sure that it will work without data loss? I would suggest testing at least one of the described scenarios in Medemblik and creating a runbook.
100
+
101
+## Internet failure on Enoshima site
102
+
103
+### Scenario
104
+
105
+Internet connectivity is no longer available at the Enoshima site.
106
+
107
+### Open questions
108
+
109
+How will local/on-site users be connected to the local P1s, assuming that the LAN is still working?
110
+
111
+Would we try to provide connectivity through mobile hotspots, given that ``autossh`` should reliably start working again once it reaches the target IPs? Shall we leave this issue to SwissTiming/the organizers and stick to the local connections to the Sailing Analytics?
112
+
113
+## TracTrac in the Cloud?
114
+
115
+### Scenario
116
+
117
+On-site Internet goes down; does TracTrac have a full-fledged server running in the cloud that we could connect to from the cloud to at least keep serving the RHBs?
118
+
119
+### Open questions
120
+
121
+How can the MongoDB in the cloud be re-configured dynamically to become primary even though it may have been started with priority 0?
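+
+One possible answer (an untested sketch, not a confirmed runbook step): connect directly to the cloud member's ``mongod``, drop the unreachable on-site members from the configuration, raise the survivor's priority, and force the reconfiguration from there. The member index used below is an assumption and must be verified with ``rs.conf()`` first.
+
+```
+# On tokyo-ssh.sapsailing.com, against the local mongod on port 10203:
+mongo --port 10203 <<'EOF'
+// Assumption: members[2] is the surviving cloud member that was configured with priority 0
+cfg = rs.conf()
+cfg.members = [cfg.members[2]]
+cfg.members[0].priority = 1
+rs.reconfig(cfg, {force: true})
+EOF
+```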
wiki/info/landscape/tokyo2020/olympic-setup.md
... ...
@@ -0,0 +1,727 @@
1
+# Setup for the Olympic Summer Games 2020/2021 Tokyo
2
+
3
+[[_TOC_]]
4
+
5
+## Local Installation
6
+
7
+For the Olympic Summer Games 2020/2021 Tokyo we use a dedicated hardware set-up to accommodate the requirements on site. In particular, two Lenovo P1 laptops with equal hardware configuration (32GB RAM, Intel Core i9-9880H) will be established as server devices running various services in a way that we can tolerate, with minimal downtimes, failures of either of the two devices.
8
+
9
+### Installation Packages
10
+
11
+The two laptops run Mint Linux with a fairly modern 5.4 kernel. We keep both up to date with regular ``apt-get update && apt-get upgrade`` executions. Both have an up-to-date SAP JVM 8 (see [https://tools.hana.ondemand.com/#cloud](https://tools.hana.ondemand.com/#cloud)) installed under /opt/sapjvm_8. This is the runtime VM used to run the Java application server process.
12
+
13
+Furthermore, both laptops have a MongoDB 3.6 installation configured through ``/etc/apt/sources.list.d/mongodb-org-3.6.list`` containing the line ``deb http://repo.mongodb.org/apt/debian jessie/mongodb-org/3.6 main``. Their respective configuration can be found under ``/etc/mongod.conf``. The WiredTiger storage engine cache size should be limited; currently, the ``cacheSizeGB`` entry in ``/etc/mongod.conf``, shown in the Mongo Configuration section below, does this.
14
+
15
+RabbitMQ is part of the distribution natively, in version 3.6.10-1. It runs on both laptops. Both, RabbitMQ and MongoDB are installed as systemd service units and are launched during the boot sequence. The latest GWT version (currently 2.9.0) is installed under ``/opt/gwt-2.9.0`` in case any development work would need to be done on these machines.
16
+
17
+Both machines have been configured to use 2GB of swap space at ``/swapfile``.
18
+
19
+### Mongo Configuration
20
+
21
+On both laptops, the ``/etc/mongod.conf`` file sets ``/var/lib/mongodb`` as the storage directory and limits the in-memory cache size to 2GB:
22
+
23
+```
24
+storage:
25
+ dbPath: /var/lib/mongodb
26
+ journal:
27
+ enabled: true
28
+ wiredTiger:
29
+ engineConfig:
30
+ cacheSizeGB: 2
31
+```
32
+
33
+The port is set to ``10201`` on ``sap-p1-1``:
34
+
35
+```
36
+# network interfaces
37
+net:
38
+ port: 10201
39
+ bindIp: 0.0.0.0
40
+```
41
+
42
+and to ``10202`` on ``sap-p1-2``:
43
+
44
+```
45
+# network interfaces
46
+net:
47
+ port: 10202
48
+ bindIp: 0.0.0.0
49
+```
50
+
51
+Furthermore, the replica set is configured to be ``tokyo2020`` on both:
52
+
53
+```
54
+replication:
55
+ oplogSizeMB: 10000
56
+ replSetName: tokyo2020
57
+```
58
+
59
+On both laptops we have a second MongoDB configuration ``/etc/mongod-security-service.conf`` which is used by the ``/lib/systemd/system/mongod-security-service.service`` unit. We created that unit as a copy of ``/lib/systemd/system/mongod.service`` and adjusted the line
60
+
61
+```
62
+ExecStart=/usr/bin/mongod --config /etc/mongod-security-service.conf
63
+```
64
+
65
+This second database runs as a replica set ``security_service`` on the default port 27017 and is used as the target for a backup script for the ``security_service`` database. See below. We increased the priority of the ``sap-p1-1`` node from 1 to 2.
66
+
67
+### User Accounts
68
+
69
+The essential user account on both laptops is ``sailing``. The account is intended to be used for running the Java VM that executes the SAP Sailing Analytics server software. The account is currently still protected by a password that our on-site team should know. On both laptops the ``sailing`` account has a password-less SSH key installed under ``/home/sailing/.ssh`` whose public key is contained in the ``authorized_keys`` file of ``tokyo-ssh.sapsailing.com`` as well as of the respective other P1 laptop. This way, all tunnels can easily be created once logged on to this ``sailing`` account.
70
+
71
+There are also still two personal accounts ``uhl`` and ``tim`` and an Eclipse development environment under ``/usr/local/eclipse``.
72
+
73
+### Hostnames
74
+
75
+DNS is available on site on the gateway host ``10.1.0.6``. This is essential for resolving ``www.igtimi.com``, the AWS SES SMTP server at ``email-smtp.eu-west-1.amazonaws.com`` and all e-mail addresses' domains for sendmail's domain verification. The DNS server is set for both, ``sap-p1-1`` and ``sap-p1-2``. It can be set from the command line using ``nmcli connection modify Wired\ connection\ 2 ipv4.dns "10.1.0.6"; nmcli connection down Wired\ connection\ 2; nmcli connection up Wired\ connection\ 2``. Currently, when testing in the SAP facilities with the SAP Guest WiFi, possibly changing IP addresses have to be updated in ``/etc/hosts``.
76
+
77
+The domain name has been set to ``sapsailing.com`` so that the fully-qualified host names are ``sap-p1-1.sapsailing.com`` and ``sap-p1-2.sapsailing.com`` respectively. Using this domain name is helpful later when it comes to the shared security realm established with the central ``security-service.sapsailing.com`` replica set.
78
+
79
+The hostname ``www.sapsailing.com`` is required by master instances when connected to the Internet in order to download polar data and wind estimation data from the archive server. Since direct access to ``www.sapsailing.com`` is blocked, we run this through the SSH tunnel to our jump host; in order to have matching certificates and appropriate hostname-based routing in the cloud for requests to ``www.sapsailing.com`` we alias this hostname in ``/etc/hosts`` to ``127.0.0.1`` (localhost).
80
+
81
+### IP Addresses and VPN
82
+
83
+Here are the IP addresses as indicated by SwissTiming:
84
+
85
+```
86
+Host Internal IP VPN IP
87
+-----------------------------------------------------------------------------------------
88
+TracTrac A (Linux) 10.1.1.104 10.8.0.128 STSP-SAL_client28
89
+TracTrac B (Linux) 10.1.1.105 10.8.0.129 STSP-SAL_client29
90
+SAP Analytics 1 Server A (Linux) 10.1.3.195 10.8.0.130 STSP-SAL_client30
91
+SAP Analytics 2 Server B (Linux) 10.1.3.197 10.8.0.131
92
+SAP Client Jan (Windows) 10.1.3.220 10.8.0.132
93
+SAP Client Alexandro (Windows) 10.1.3.221 10.8.0.133
94
+SAP Client Axel (Windows) 10.1.3.227 10.8.0.134
95
+TracTrac Dev Jorge (Linux) 10.1.3.228 10.8.0.135
96
+TracTrac Dev Chris (Linux) 10.1.3.233 10.8.0.136
97
+```
98
+
99
+The OpenVPN connection is set up with the GUI of the Linux desktop, so its management is done through Network Manager. Network Manager has a CLI, ``nmcli``, with which more properties of connections can be modified. The ``connection.secondaries`` property defines the UUID of a connection that will be established as soon as the initial connection is working. With ``nmcli connection show`` you will get the list of connections with the corresponding UUIDs. For the Medemblik event the OpenVPN connection to the A server is bound to the wired interface in use, with
100
+
101
+```
102
+sudo nmcli connection modify <Wired Connection 2> +connection.secondaries <UUID-of-OpenVPN-A>
103
+```
104
+
105
+For the OpenVPN connections we have received two alternative configuration files together with keys and certificates for our server and work laptops, as well as the certificates for the OpenVPN server (``ca.crt``, ``dh.pem``, ``pfs.key``). The "A" configuration, e.g., provided in a file named ``st-soft-aws_A.ovpn``, looks like this:
106
+
107
+```
108
+client
109
+dev tun
110
+proto udp
111
+remote 3.122.96.235 1195
112
+ca ca.crt
113
+cert {name-of-the-certificate}.crt
114
+key {name-of-the-key}.key
115
+tls-version-min 1.2
116
+tls-cipher TLS-ECDHE-RSA-WITH-AES-128-GCM-SHA256:TLS-ECDHE-ECDSA-WITH-AES-128-GCM-SHA256:TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384:TLS-DHE-RSA-WITH-AES-256-CBC-SHA256
117
+cipher AES-256-CBC
118
+auth SHA512
119
+resolv-retry infinite
120
+auth-retry none
121
+nobind
122
+persist-key
123
+persist-tun
124
+ns-cert-type server
125
+comp-lzo
126
+verb 3
127
+tls-client
128
+tls-auth pfs.key
129
+```
130
+
131
+Here, ``{name-of-the-certificate}.crt`` and ``{name-of-the-key}.key`` need to be replaced by the names of the files corresponding with the host to connect to the OpenVPN. The "B" configuration only differs in the ``remote`` specification, using a different IP address for the OpenVPN server, namely ``52.59.130.167``. It is useful to copy the ``.ovpn`` file and the other ``.key`` and ``.crt`` files into one directory.
132
+
133
+Under Windows download the latest OpenVPN client from [https://openvpn.net/client-connect-vpn-for-windows/](https://openvpn.net/client-connect-vpn-for-windows/). After installation, use the ``.ovpn`` file, adjusted with your personalized key/certificate, to establish the connection.
134
+
135
+On Linux, go to the global settings through Gnome, node "Network" and press the "+" button next to VPN. Import the ``.ovpn`` file, then enable the OpenVPN connection by flicking the switch. The connection will show in the output of
136
+
137
+```
138
+ nmcli connection show
139
+```
140
+
141
+The connection IDs will be shown, e.g., ``st-soft-aws_A``. Such a connection can be stopped and restarted from the command line using the following commands:
142
+
143
+```
144
+ nmcli connection down st-soft-aws_A
145
+ nmcli connection up st-soft-aws_A
146
+```
147
+
148
+### Tunnels
149
+
150
+On both laptops there is a script ``/usr/local/bin/tunnels`` which establishes SSH tunnels using the ``autossh`` tool. The ``autossh`` processes are forked into the background using the ``-f`` option. It seems important to then pass the port to use for sending heartbeats using the ``-M`` option; if this is omitted, in our experience only one of several ``autossh`` processes survives. However, we have also learned that using ``-M`` with the "port" ``0`` can help to stabilize the connection: with a real heartbeat port, port collisions may result, and upon re-connecting a heartbeat port that has not yet been released can block the tunnel, which ``-M 0`` avoids. The ``-M 0`` option is particularly helpful when tunnelling to ``sapsailing.com`` which is provided through a network load balancer (NLB).
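+
+For illustration, a single invocation in the style used by the ``tunnels`` script could look like this (a sketch only; the authoritative set of forwards is defined in ``/usr/local/bin/tunnels`` and in the scenario descriptions below):
+
+```
+autossh -f -M 0 -N \
+    -L 5673:rabbit-ap-northeast-1.sapsailing.com:5672 \
+    -R 8888:localhost:8888 \
+    ec2-user@tokyo-ssh.sapsailing.com
+```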
151
+
152
+During regular operations we assume that we have an Internet connection that allows us to reach our jump host ``tokyo-ssh.sapsailing.com`` through SSH, establishing various port forwards. We also expect TracTrac to have their primary server available. Furthermore, we assume both our laptops to be in service. ``sap-p1-1`` then runs the master server instance, ``sap-p1-2`` runs a local replica. The master on ``sap-p1-1`` replicates the central security service at ``security-service.sapsailing.com`` using the RabbitMQ installation on ``rabbit.internal.sapsailing.com`` in the AWS region `eu-west-1`. The port forwarding through `tokyo-ssh.sapsailing.com` (in `ap-northeast-1`) to the internal RabbitMQ address (in eu-west-1) works through VPC peering. The RabbitMQ instance used for outbound replication, both, into the cloud and for the on-site replica, is `rabbit-ap-northeast-1.sapsailing.com`. The replica on ``sap-p1-2`` obtains its replication stream from there, and for the HTTP connection for "reverse replication" it uses a direct connection to ``sap-p1-1``. The outside world, in particular all "S-ded-tokyo2020-m" master security groups in all regions supported, access the on-site master through a reverse port forward on our jump host ``tokyo-ssh.sapsailing.com:8888`` which under regular operations points to ``sap-p1-1:8888`` where the master process runs.
153
+
154
+On both laptops we establish a port forward from ``localhost:22443`` to ``sapsailing.com:443``. Together with the alias in ``/etc/hosts`` that aliases ``www.sapsailing.com`` to ``localhost``, requests to ``www.sapsailing.com:22443`` will end up on the archive server.
155
+
156
+On both laptops, we maintain SSH connections to ``localhost`` with port forwards to the current TracTrac production server for HTTP, live data, and stored data. In the test we did on 2021-05-25, those port numbers were 9081, 14001, and 14011, respectively, for the primary server, and 9082, 14002, and 14012, respectively, for the secondary server. In addition to these port forwards, an entry in ``/etc/hosts`` is required for the hostname that TracTrac will use on site for their server(s), pointing to ``127.0.0.1`` to let the Sailing Analytics process connect to localhost with the port forwards. Tests have shown that if the port forwards are changed during live operations, e.g., to point to the secondary instead of the primary TracTrac server, the TracAPI continues smoothly which is a great way of handling such a fail-over process without having to re-start our master server necessarily or reconnect to all live races.
157
+
158
+Furthermore, for administrative SSH access from outside, we establish reverse port forwards from our jump host ``tokyo-ssh.sapsailing.com`` to the SSH ports on ``sap-p1-1`` (on port 18122) and ``sap-p1-2`` (on port 18222).
159
+
160
+Both laptops have a forward from ``localhost:22222`` to ``sapsailing.com:22`` through ``tokyo-ssh.sapsailing.com``, in order to be able to have a git remote ``ssh`` with the url ``ssh://trac@localhost:22222/home/trac/git``.
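+
+Such a remote can be added, for example, with:
+
+```
+git remote add ssh ssh://trac@localhost:22222/home/trac/git
+git fetch ssh
+```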
161
+
162
+The port forwards vary for exceptional situations, such as when the Internet connection is not available, or when ``sap-p1-1`` that regularly runs the master process fails and we need to make ``sap-p1-2`` the new master. See below for the details of the configurations for those scenarios.
163
+
164
+The tunnel configurations are established and configured using a set of scripts, each to be found under ``/usr/local/bin`` on each of the two laptops.
165
+
166
+#### ssh_config and sshd_config tweaks
167
+
168
+In order to recover quickly from failures we changed ``/etc/ssh/ssh_config`` on both of the P1s and added the following parameters:
169
+```
170
+ExitOnForwardFailure yes
171
+ConnectTimeout 10
172
+ServerAliveCountMax 3
173
+ServerAliveInterval 10
174
+```
175
+For the server side on tokyo-ssh and on both P1s the following parameters have been added to ``/etc/ssh/sshd_config``:
176
+```
177
+ClientAliveInterval 3
178
+ClientAliveCountMax 3
179
+```
180
+
181
+ExitOnForwardFailure forces ssh to exit if one of the port forwards fails. ConnectTimeout sets the time in seconds until an initial connection attempt fails. The AliveInterval settings (client and server) define the interval in seconds after which ssh/sshd send client and server alive probes. The CountMax settings define how many of those probes may go unanswered before the connection is considered dead.
182
+
183
+The settings have been verified by executing a network change on both laptops; the SSH tunnel recovers after a couple of seconds.
184
+
185
+#### Regular Operations: master on sap-p1-1, replica on sap-p1-2, with Internet / Cloud connection
186
+
187
+On sap-p1-1 two SSH connections are maintained, with the following default port forwards, assuming sap-p1-1 is the local master:
188
+
189
+* tokyo-ssh.sapsailing.com: 10203-->10203; 5673-->rabbit-ap-northeast-1.sapsailing.com:5672; 15673-->rabbit-ap-northeast-1.sapsailing.com:15672; 5675:rabbit.internal.sapsailing.com:5672; 15675:rabbit.internal.sapsailing.com:15672; 10201<--10201; 18122<--22; 443:security-service.sapsailing.com:443; 8888<--8888; 9443<--9443
190
+* sap-p1-2: 10202-->10202; 10201<--10201
191
+
192
+On sap-p1-2, the following SSH connections are maintained, assuming sap-p1-2 is the local replica:
193
+
194
+* tokyo-ssh.sapsailing.com: 10203-->10203; 5673-->rabbit-ap-northeast-1.sapsailing.com:5672; 15673-->rabbit-ap-northeast-1.sapsailing.com:15672; 5675:rabbit.internal.sapsailing.com:5672; 15675:rabbit.internal.sapsailing.com:15672; 10202<--10202; 9444<--9443
195
+
196
+A useful set of entries in your personal ``~/.ssh/config`` file for "off-site" use may look like this:
197
+
198
+```
199
+Host tokyo
200
+ Hostname tokyo-ssh.sapsailing.com
201
+ User ec2-user
202
+ ForwardAgent yes
203
+ ForwardX11Trusted yes
204
+ LocalForward 18122 localhost:18122
205
+ LocalForward 18222 localhost:18222
206
+ LocalForward 9443 localhost:9443
207
+ LocalForward 9444 localhost:9444
208
+
209
+Host sap-p1-1
210
+ Hostname localhost
211
+ Port 18122
212
+ User sailing
213
+ ForwardAgent yes
214
+ ForwardX11Trusted yes
215
+
216
+Host sap-p1-2
217
+ Hostname localhost
218
+ Port 18222
219
+ User sailing
220
+ ForwardAgent yes
221
+ ForwardX11Trusted yes
222
+```
223
+
224
+It will allow you to log on to the "jump host" ``tokyo-ssh.sapsailing.com`` with the simple command ``ssh tokyo`` and will establish the port forwards that will then allow you to connect to the two laptops using ``ssh sap-p1-1`` and ``ssh sap-p1-2``, respectively. Of course, when on site and with the two laptops in direct reach you may adjust the host entries for ``sap-p1-1`` and ``sap-p1-2`` accordingly, and you may then wish to establish only an SSH connection to ``sap-p1-1`` which then provides the port forwards for HTTPS ports 9443/9444, for example:
225
+
226
+```
227
+Host sap-p1-1
228
+ Hostname 10.1.3.195
229
+ Port 22
230
+ User sailing
231
+ ForwardAgent yes
232
+ ForwardX11Trusted yes
233
+ LocalForward 9443 localhost:9443
234
+ LocalForward 9444 10.1.3.197:9443
235
+
236
+Host sap-p1-2
237
+ Hostname 10.1.3.197
238
+ Port 22
239
+ User sailing
240
+ ForwardAgent yes
241
+ ForwardX11Trusted yes
242
+```
243
+
244
+#### Operations with sap-p1-1 failing: master on sap-p1-2, with Internet / Cloud connection
245
+
246
+On sap-p1-1, if the operating system still runs and the failure affects only the Java process running the SAP Sailing Analytics, two SSH connections are maintained, with the following default port forwards, assuming sap-p1-1 is not running an SAP Sailing Analytics process currently:
247
+
248
+* tokyo-ssh.sapsailing.com: 10203-->10203; 5673-->rabbit-ap-northeast-1.sapsailing.com:5672; 15673-->rabbit-ap-northeast-1.sapsailing.com:15672; 5675:rabbit.internal.sapsailing.com:5672; 15675:rabbit.internal.sapsailing.com:15672; 10201<--10201; 18122<--22; 443:security-service.sapsailing.com:443
249
+* sap-p1-2: 10202-->10202; 10201<--10201
250
+
251
+On sap-p1-2 two SSH connections are maintained, with the following default port forwards, assuming sap-p1-2 is the local master:
252
+
253
+* tokyo-ssh.sapsailing.com: 10203-->10203; 5673-->rabbit-ap-northeast-1.sapsailing.com:5672; 15673-->rabbit-ap-northeast-1.sapsailing.com:15672; 5675:rabbit.internal.sapsailing.com:5672; 15675:rabbit.internal.sapsailing.com:15672; 10202<--10202; 18222<--22; 443:security-service.sapsailing.com:443; 8888<--8888
254
+* sap-p1-1 (if the operating system on sap-p1-1 still runs): 10202-->10202; 10201<--10201
255
+
256
+So the essential change is that the reverse forward from ``tokyo-ssh.sapsailing.com:8888`` now targets ``sap-p1-2:8888`` where we now assume the failover master to be running.
257
+
258
+#### Operations with Internet failing
259
+
260
+When the Internet connection fails, replicating the security service from ``security-service.sapsailing.com`` / ``rabbit.internal.sapsailing.com`` will no longer be possible. Neither will outbound replication to ``rabbit-ap-northeast-1.sapsailing.com`` be possible, and cloud replicas won't be able to reach the on-site master anymore through the ``tokyo-ssh.sapsailing.com:8888`` reverse port forward. This also has an effect on the local on-site replica which no longer will be able to reach ``rabbit-ap-northeast-1.sapsailing.com`` which provides the on-site replica with the operation stream under regular circumstances.
261
+
262
+There is little we can do against the lack of Internet connection regarding providing data to the cloud replicas and maintaining replication with ``security-service.sapsailing.com`` (we could theoretically try to work with local WiFi hotspots; but the key problem will be that TracTrac then neither has Internet connectivity for their on-site server, and we would have to radically change to a cloud-only set-up which is probably beyond what we'd be doing in this case). But we can ensure continued local operations with the replica on ``sap-p1-2`` now using a local on-site RabbitMQ installation between the two instances. For this, we replace the port forwards that during regular operations point to ``rabbit-ap-northeast-1.sapsailing.com`` by port forwards pointing to the RabbitMQ process on ``sap-p1-2``.
263
+
264
+On ``sap-p1-1`` an SSH connection to ``sap-p1-2`` is maintained, with the following port forwards:
265
+
266
+* sap-p1-2: 10202-->10202; 10201<--10201; 5673-->localhost:5672
267
+
268
+So the essential changes are that there are no more SSH connections into the cloud, and the port forward on each laptop's port 5673, which would point to ``rabbit-ap-northeast-1.sapsailing.com`` during regular operations, now points to ``sap-p1-2:5672`` where the RabbitMQ installation takes over from the cloud instance.
269
+
270
+### Letsencrypt Certificate for tokyo2020.sapsailing.com, security-service.sapsailing.com and tokyo2020-master.sapsailing.com
271
+
272
+In order to allow us to access ``tokyo2020.sapsailing.com`` and ``security-service.sapsailing.com`` with any HTTPS port forwarding locally so that all ``JSESSION_GLOBAL`` etc. cookies with their ``Secure`` attribute are delivered properly, we need an SSL certificate. I've created one by doing
273
+
274
+```
275
+/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d tokyo2020.sapsailing.com
276
+/usr/bin/sudo -u certbot docker run --rm -it --name certbot -v "/etc/letsencrypt:/etc/letsencrypt" -v "/var/lib/letsencrypt:/var/lib/letsencrypt" certbot/certbot certonly --manual -d security-service.sapsailing.com
277
+```
278
+
279
+as ``root`` on ``sapsailing.com``. The challenge displayed can be solved by creating an ALB rule for the hostname header ``tokyo2020.sapsailing.com`` and the path as issued in the output of the ``certbot`` command, and as action specifying a fixed response with response code 200, pasting the challenge data printed by the ``certbot`` command as text/plain. Wait a few seconds, then confirm the Certbot prompt. The certificate will be issued and stored under ``/etc/letsencrypt/live/tokyo2020.sapsailing.com`` from where I copied it to ``/home/sailing/Downloads/letsencrypt`` on both laptops for later use with a local Apache httpd server. The certificate will expire on 2021-08-19, which is after the Olympic Games, so we don't have to worry about renewing it.
280
+
281
+### Local NGINX Webserver Setup
282
+
283
+In order to be able to access the applications running on the local on-site laptops using HTTPS there is a web server on each of the two laptops, listening on port 9443 (HTTPS). The configuration for this is under ``/etc/nginx/sites-enabled/tokyo2020`` and looks like this:
284
+
285
+```
286
+server {
287
+ listen 9443 ssl;
288
+ server_name tokyo2020.sapsailing.com;
289
+ ssl_certificate /etc/ssl/certs/tokyo2020.sapsailing.com.crt;
290
+ ssl_certificate_key /etc/ssl/private/tokyo2020.sapsailing.com.key;
291
+ ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
292
+ ssl_ciphers HIGH:!aNULL:!MD5;
293
+
294
+ location / {
295
+ proxy_pass http://127.0.0.1:8888;
296
+ }
297
+}
298
+```
299
+
300
+The "Let's Encrypt"-provided certificate is used for SSL termination. With tokyo2020.sapsailing.com aliased in ``/etc/hosts`` to the address of the current master server, this allows accessing ``https://tokyo2020.sapsailing.com:9443`` with all benefits of cookie / session authentication.
301
+
302
+Likewise, ``/etc/nginx/sites-enabled/security-service`` forwards to 127.0.0.1:8889 where a local copy of the security service may be deployed in case the Internet fails. In this case, the local port 443 must be forwarded to the NGINX port 9443 instead of security-service.sapsailing.com:443 through tokyo-ssh.sapsailing.com.
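+
+The corresponding server block can be sketched as follows (the authoritative version is the file in ``/etc/nginx/sites-enabled/security-service``; the certificate paths are assumptions analogous to the ``tokyo2020`` block above):
+
+```
+server {
+    listen 9443 ssl;
+    server_name security-service.sapsailing.com;
+    ssl_certificate /etc/ssl/certs/security-service.sapsailing.com.crt;
+    ssl_certificate_key /etc/ssl/private/security-service.sapsailing.com.key;
+    ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
+    ssl_ciphers HIGH:!aNULL:!MD5;
+
+    location / {
+        proxy_pass http://127.0.0.1:8889;
+    }
+}
+```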
303
+
304
+On sap-p1-1 there is currently an NGINX server listening for ``tokyo2020-master.sapsailing.com`` with the following configuration:
305
+
306
+```
307
+server {
308
+ listen 9443 ssl;
309
+ server_name tokyo2020-master.sapsailing.com;
310
+ ssl_certificate /etc/ssl/private/tokyo2020-master.sapsailing.com.fullchain.pem;
311
+ ssl_certificate_key /etc/ssl/private/tokyo2020-master.sapsailing.com.privkey.pem;
312
+ ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
313
+ ssl_ciphers HIGH:!aNULL:!MD5;
314
+
315
+ location / {
316
+ proxy_pass http://127.0.0.1:8888;
317
+ }
318
+}
319
+```
320
+
321
+
322
+
323
+### Backup
324
+
325
+borgbackup is used to back up the ``/`` folder of each laptop to the respective other machine. The borg repository is located in the ``/backup`` folder.
326
+
327
+The backup from sap-p1-1 to sap-p1-2 runs at 01:00 each day, and the backup from sap-p1-2 to sap-p1-1 runs at 02:00 each day. Details about the configuration can be found in ``/root/borg-backup.sh`` on either machine. Log files for the backup run are in ``/var/log/backup.log``. Crontab file is in ``/root``.
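+
+The nightly invocation in ``/root/borg-backup.sh`` presumably looks roughly like the following sketch (archive name, SSH target user, and exclude list are assumptions; the script itself is authoritative):
+
+```
+borg create --stats --one-file-system \
+    ssh://sailing@sap-p1-2/backup::'sap-p1-1-{now:%Y-%m-%dT%H:%M}' \
+    / --exclude /backup >> /var/log/backup.log 2>&1
+```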
328
+
329
+Both ``/backup`` folders have been mirrored to an S3 bucket called ``backup-sap-p1`` on June 14th.
330
+
331
+### Monitoring and e-Mail Alerting
332
+
333
+To be able to use ``sendmail`` to send notifications via e-mail, it needs to be installed and configured to use AWS SES as an SMTP relay:
334
+```
335
+sudo apt install sendmail
336
+```
337
+
338
+Follow the instructions on [https://docs.aws.amazon.com/ses/latest/DeveloperGuide/send-email-sendmail.html](https://docs.aws.amazon.com/ses/latest/DeveloperGuide/send-email-sendmail.html) with one exception: the content that needs to be added to ``sendmail.mc`` looks like this:
339
+```
340
+define(`SMART_HOST', `email-smtp.eu-west-1.amazonaws.com')dnl
341
+define(`RELAY_MAILER_ARGS', `TCP $h 587')dnl
342
+define(`confAUTH_MECHANISMS', `LOGIN PLAIN')dnl
343
+FEATURE(`authinfo', `hash -o /etc/mail/authinfo.db')dnl
344
+MASQUERADE_AS(`sapsailing.com')dnl
345
+FEATURE(masquerade_envelope)dnl
346
+FEATURE(masquerade_entire_domain)dnl
347
+```
348
+The authentication details can be fetched from the content of ``/root/mail.properties`` of any running sailing EC2 instance.
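+
+Following those instructions, the SMTP credentials end up in ``/etc/mail/authinfo`` in a line of the following form (placeholders shown; use the real values from ``/root/mail.properties``), after which ``makemap hash /etc/mail/authinfo.db < /etc/mail/authinfo`` rebuilds the database referenced by the ``authinfo`` feature above:
+
+```
+AuthInfo:email-smtp.eu-west-1.amazonaws.com "U:root" "I:<ses-smtp-username>" "P:<ses-smtp-password>" "M:LOGIN"
+```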
349
+
350
+Both laptops, ``sap-p1-1`` and ``sap-p1-2`` have monitoring scripts from the git folder ``configuration/on-site-scripts`` linked to ``/usr/local/bin``. These in particular include ``monitor-autossh-tunnels`` and ``monitor-mongo-replica-set-delay`` as well as a ``notify-operators`` script which contains the list of e-mail addresses to notify in case an alert occurs.
351
+
352
+The ``monitor-autossh-tunnels`` script checks all running ``autossh`` processes and looks for their corresponding ``ssh`` child processes. If any of them is missing, an alert is sent using ``notify-operators``.
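+
+The core of that check can be sketched as follows (the real script is versioned in ``configuration/on-site-scripts``; the exact alert message is an assumption):
+
+```
+#!/bin/bash
+# For every running autossh process, verify that it currently has an ssh child process.
+for pid in $(pgrep -x autossh); do
+    if ! pgrep -x ssh -P "$pid" > /dev/null; then
+        notify-operators "autossh process $pid on $(hostname) has no ssh child process"
+    fi
+done
+```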
353
+
354
+The ``monitor-mongo-replica-set-delay`` script looks at the result of calling ``rs.printSecondaryReplicationInfo()`` and logs it to ``/tmp/mongo-replica-set-delay``. The average of the last ten values is compared to a threshold (currently 3s), and an alert is sent using ``notify-operators`` if the threshold is exceeded.
355
+
356
+The ``monitor-disk-usage`` script checks the partition holding ``/var/lib/mongodb/``. Should it fill up to more than 90%, an alert will be sent using ``notify-operators``.
357
+
358
+### Time Synchronizing
359
+We set up the chronyd service on a desktop machine which regularly connects via VPN and relays the time to the two P1s. We added
360
+```
361
+# Tokyo2020 configuration
362
+server 10.1.3.221 iburst
363
+```
364
+to ``/etc/chrony/chrony.conf`` on the clients.
365
+Added
366
+```
367
+# FOR TOKYO SERVER SETUP
368
+allow all
369
+local stratum 10
370
+```
371
+to the server's configuration file and started the ``chronyd`` service.
372
+
373
+## AWS Setup
374
+
375
+Our primary AWS region for the event will be Tokyo (ap-northeast-1). There, we have reserved the elastic IP ``52.194.91.94`` to which we've mapped the Route53 hostname ``tokyo-ssh.sapsailing.com`` with a simple A-record. The host assigned to the IP/hostname is to be used as a "jump host" for SSH tunnels. It runs Amazon Linux with a login-user named ``ec2-user``. The ``ec2-user`` has ``sudo`` permission. In the root user's crontab we have the same set of scripts hooked up that in our eu-west-1 production landscape is responsible for obtaining and installing the landscape manager's SSH public keys to the login user's account, aligning the set of ``authorized_keys`` with those of the registered landscape managers (users with permission ``LANDSCAPE:MANAGE:AWS``). The ``authorized_keys.org`` file also contains the two public SSH keys of the ``sailing`` accounts on the two laptops, so each time the script produces a new ``authorized_keys`` file for the ``ec2-user``, the ``sailing`` keys for the laptop tunnels don't get lost.
376
+
377
+I added the EPEL repository like this:
378
+
379
+```
380
+ yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
381
+```
382
+
383
+Our "favorite" Availability Zone (AZ) in ap-northeast-1 is "1d" / "ap-northeast-1d".
384
+
385
+The same host ``tokyo-ssh.sapsailing.com`` also runs a MongoDB 3.6 instance on port 10203.
386
+
387
+For RabbitMQ we run a separate host, based on AWS Ubuntu 20. It brings the ``rabbitmq-server`` package with it (version 3.8.2 on Erlang 22.2.7), and we'll install it with default settings, except for the following change: In the new file ``/etc/rabbitmq/rabbitmq.conf`` we enter the line
388
+
389
+```
390
+ loopback_users = none
391
+```
392
+
393
+which allows clients from other hosts to connect (note how this works differently on different versions of RabbitMQ; the local laptops have to use a different syntax in their ``rabbitmq.config`` file). The security groups for the RabbitMQ server are configured such that only ``172.0.0.0/8`` addresses from our VPCs can connect.
394
+
395
+The RabbitMQ management plugin is enabled using ``rabbitmq-plugins enable rabbitmq_management`` for access from localhost. Accessing it therefore again requires an SSH tunnel to the host. The host's default user is ``ubuntu``. The RabbitMQ management plugin is active on port 15672 and accessible only from localhost or an SSH tunnel with a port forward ending at this host. RabbitMQ itself listens on the default port 5672. With this set-up, RabbitMQ traffic for this event remains independent of and undisturbed by any other RabbitMQ traffic from other servers in our default ``eu-west-1`` landscape, such as ``my.sapsailing.com``. The hostname pointing to the internal IP address of the RabbitMQ host is ``rabbit-ap-northeast-1.sapsailing.com`` and has a TTL of 60s.
396
+
397
+An autossh tunnel is established from ``tokyo-ssh.sapsailing.com`` to ``rabbit-ap-northeast-1.sapsailing.com`` which forwards port 15673 to port 15672, thus exposing the RabbitMQ web interface which otherwise only responds to localhost. This autossh tunnel is established by a systemd service that is described in ``/etc/systemd/system/autossh-port-forwards.service`` on ``tokyo-ssh.sapsailing.com``.
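+
+That unit presumably looks roughly like the following sketch (the authoritative version is the file on ``tokyo-ssh.sapsailing.com``):
+
+```
+[Unit]
+Description=autossh port forward 15673 -> rabbit-ap-northeast-1.sapsailing.com:15672
+After=network-online.target
+
+[Service]
+# AUTOSSH_GATETIME=0 keeps autossh retrying even if the very first connection attempt fails
+Environment=AUTOSSH_GATETIME=0
+ExecStart=/usr/bin/autossh -M 0 -N -L 15673:localhost:15672 ubuntu@rabbit-ap-northeast-1.sapsailing.com
+Restart=always
+
+[Install]
+WantedBy=multi-user.target
+```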
398
+
399
+### Local setup of rabbitmq
400
+
401
+The above configuration also needs to be set on the RabbitMQ installations of the P1s. The rabbitmq-server package there has version 3.6.10. In that version the config file is located in ``/etc/rabbitmq/rabbitmq.config``, and the corresponding entry is ``[{rabbit, [{loopback_users, []}]}].`` Further documentation for this version can be found here: [http://previous.rabbitmq.com/v3_6_x/configure.html](http://previous.rabbitmq.com/v3_6_x/configure.html)
402
+
403
+### Cross-Region VPC Peering
404
+
405
+The primary AWS region for the tokyo2020 replica set is ap-northeast-1 (Tokyo). In order to provide low latencies for the RHBs we'd like to add replicas in other regions as well. Since we do not want to expose the RabbitMQ running in ap-northeast-1 to the outside world, we plan to peer the VPCs of other regions with the one in ap-northeast-1.
406
+
407
+The pre-requisite for VPCs to get peered is that their CIDRs (such as 172.31.0.0/16) don't overlap. The default VPC in each region always uses the same CIDR (172.31.0.0/16), and hence in order to peer VPCs all but one must be non-default VPC. To avoid confusion when launching instances or setting up security groups it can be adequate for those peering regions other than our default region ``eu-west-1`` to set up non-default VPCs with peering-capable CIDRs and remove the default VPC. This way users cannot accidentally launch instances or define security groups for any VPC other than the peered one.
408
+
409
+After having peered the VPCs, each VPC's default routing table must be extended by a route to the peered VPC's CIDR using the peering connection.
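+
+With the AWS CLI, adding such a route takes the following form (route table ID, peering connection ID, and region are placeholders):
+
+```
+aws ec2 create-route --region ap-northeast-1 \
+    --route-table-id rtb-0123456789abcdef0 \
+    --destination-cidr-block 172.31.0.0/16 \
+    --vpc-peering-connection-id pcx-0123456789abcdef0
+```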
410
+
411
+With peering in place it is possible to reach instances in peered VPCs by their internal IPs. In particular, it is possible to connect to a RabbitMQ instance with the internal IP and port 5672 even if that RabbitMQ runs in a different region whose VPC is peered.
412
+
413
+### Global Accelerator
414
+
415
+We have created a Global Accelerator [Tokyo2020](https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#AcceleratorDetails:AcceleratorArn=arn:aws:globalaccelerator::017363970217:accelerator/8ddd5afb-dd8d-4e8b-a22f-443a47240a94) which manages cross-region load balancing for us. There are two listeners: one for port 80 (HTTP) and one for port 443 (HTTPS). For each region an endpoint group must be created for both of the listeners, and the application load balancer (ALB) in that region has to be added as an endpoint.
416
+
417
+The Route53 entry ``tokyo2020.sapsailing.com`` now is an alias A record pointing to this global accelerator (``aca060e6eabf4ba3e.awsglobalaccelerator.com.``).
418
+
419
+### Geo-Blocking
420
+
421
+While for Tokyo 2020 this was not requested, for Paris 2024 we have heard rumors that it may be. If it is, the [AWS Web Application Firewall (WAF)](https://us-east-1.console.aws.amazon.com/wafv2/homev2/start) provides the solution. There, we can create so-called Web Access Control Lists (Web ACLs), which need to be created per region where an ALB is used.
422
+
423
+A Web ACL consists of a number of rules and has a default action (typically "Allow" or "Block") for those requests not matched by any rule. An ACL can be associated with one or more resources, in particular with Application Load Balancers (ALBs) deployed in the region.
424
+
425
+Rules, in turn, consist of statements that can be combined using logical operators. The rule type of interest for geo-blocking is "Originates from a country in" where one or more countries can be selected. When combined with an "Allow" or "Block" action, this results in the geo-blocking behavior desired.
426
+
427
+For requests blocked by the rule, the response code, response headers and message body to return to the client can be configured. We can use this, e.g., to configure a 301 re-direct to a static page that informs the user about the geo-blocking.
428
+
429
+### Application Load Balancers (ALBs) and Target Groups
430
+
431
+In each region supported, a dedicated load balancer for the Global Accelerator-based event setup has been set up (``Tokyo2020ALB`` or simply ``ALB``). A single target group with the usual settings (port 8888, health check on ``/gwt/status``, etc.) must exist: ``S-ded-tokyo2020`` (public).
432
+
433
+Note that no dedicated ``-m`` master target group is established. The reason is that the AWS Global Accelerator judges an ALB's health by looking at _all_ its target groups; should only a single target group not have a healthy target, the Global Accelerator considers the entire ALB unhealthy. With this, as soon as the on-site master server is unreachable, e.g., during an upgrade, all those ALBs would enter the "unhealthy" state from the Global Accelerator's perspective, and all public replicas which are still healthy would no longer receive traffic; the site would go "black." Therefore, we must ensure that the ALBs targeted by the Global Accelerator only have a single target group which only has the public replicas in that region as its targets.
434
+
435
+Each ALB has an HTTP and an HTTPS listener. The HTTP listener has only a single rule redirecting all traffic permanently (301) to the corresponding HTTPS request. The HTTPS listener has three rules: the ``/`` path for ``tokyo2020.sapsailing.com`` is re-directed to the Olympic event with ID ``25c65ff1-68b8-4734-a35f-75c9641e52f8``. All other traffic for ``tokyo2020.sapsailing.com`` goes to the public target group holding the regional replica(s). A default rule returns a 404 status with a static ``Not found`` text.
436
+
437
+## Landscape Architecture
438
+
439
+We have applied for a single SSH tunnel to IP address ``52.194.91.94`` which is our elastic IP for our SSH jump host in ap-northeast-1(d).
440
+
441
+The default production set-up is defined as follows:
442
+
443
+### MongoDB
444
+
445
+Three MongoDB nodes are intended to run during regular operations: sap-p1-1:10201, sap-p1-2:10202, and tokyo-ssh.sapsailing.com:10203. Since we have to work with SSH tunnels to keep things connected, we map everything using ``localhost`` ports such that both, sap-p1-2 and tokyo-ssh see sap-p1-1:10201 as their localhost:10201, and that both, sap-p1-1 and tokyo-ssh see sap-p1-2:10202 as their respective localhost:10202. Both, sap-p1-1 and sap-p1-2 see tokyo-ssh:10203 as their localhost:10203. This way, the MongoDB URI can be specified as
446
+
447
+```
448
+ mongodb://localhost:10201,localhost:10202,localhost:10203/tokyo2020?replicaSet=tokyo2020&retryWrites=true&readPreference=nearest
449
+```
450
+
451
+The cloud replica is not supposed to become primary, except perhaps in the unlikely event that operations move entirely to the cloud. To achieve this, the cloud replica has priority 0 which can be configured like this:
452
+
453
+```
454
+ tokyo2020:PRIMARY> cfg = rs.conf()
455
+ # Then search for the member localhost:10203; let's assume, it's in cfg.members[0] :
456
+ cfg.members[0].priority=0
457
+ rs.reconfig(cfg)
458
+```
459
+
460
+All cloud replicas shall use a MongoDB database name ``tokyo2020-replica``. In those regions where we don't have dedicated MongoDB support established (basically all but eu-west-1 currently), an image should be used that has a MongoDB server configured to use ``/home/sailing/mongo`` as its data directory and ``replica`` as its replica set name. See AMI SAP Sailing Analytics App HVM with MongoDB 1.137 (ami-05b6c7b1244f49d54) in ap-northeast-1 (already copied to the other peered regions except eu-west-1).
461
+
462
+One way to monitor the health and replication status of the replica set is running the following command:
463
+
464
+```
465
+ watch 'echo "rs.printSecondaryReplicationInfo()" | \
466
+ mongo "mongodb://localhost:10201/?replicaSet=tokyo2020&retryWrites=true&readPreference=nearest" | \
467
+ grep "\(^source:\)\|\(syncedTo:\)\|\(behind the primary\)"'
468
+```
469
+
470
+It shows the replication state and in particular the delay of the replicas. A cron job exists for ``sailing@sap-p1-1`` which triggers ``/usr/local/bin/monitor-mongo-replica-set-delay`` every minute; that script uses ``/usr/local/bin/notify-operators`` to send out an alert in case the average replication delay over the last ten read-outs exceeds a threshold (currently 3s).
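+
+The corresponding entry in ``/home/sailing/crontab`` is of the following form (a sketch; the actual line may differ slightly):
+
+```
+* * * * * /usr/local/bin/monitor-mongo-replica-set-delay
+```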
471
+
472
+In order to have a local copy of the ``security_service`` database, a CRON job exists for user ``sailing`` on ``sap-p1-1`` which executes the ``/usr/local/bin/clone-security-service-db-safe-exit`` script (versioned in git under ``configuration/on-site-scripts/clone-security-service-db-safe-exit``) once per hour. See ``/home/sailing/crontab``. The script dumps ``security_service`` from the ``live`` replica set in ``eu-west-1`` to the ``/tmp/dump`` directory on ``ec2-user@tokyo-ssh.sapsailing.com``, then sends the directory content as a ``tar.gz`` stream through SSH and restores it on the local ``mongodb://sap-p1-1:27017,sap-p1-2/security_service?replicaSet=security_service`` replica set, after copying an existing local ``security_service`` database to ``security_service_bak``. This way, even if the Internet connection dies during this cloning process, a valid copy still exists in the local ``security_service`` replica set which can be copied back to ``security_service`` using the MongoDB shell command
473
+
474
+```
475
+ db.copyDatabase("security_service_bak", "security_service")
476
+```
477
+
478
+### Master
479
+
480
+The master configuration is described in ``/home/sailing/servers/master/master.conf`` and can be used to produce a clean set-up like this:
481
+
482
+```
483
+ rm env.sh; cat master.conf | ./refreshInstance.sh auto-install-from-stdin
484
+```
485
+
486
+If the laptops cannot reach ``https://releases.sapsailing.com`` due to connectivity constraints, releases and environments can be downloaded through other channels to ``sap-p1-1:/home/trac/releases``, and the variable ``INSTALL_FROM_SCP_USER_AT_HOST_AND_PORT`` can be set to ``sailing@sap-p1-1`` to fetch the release file and environment file from there by SCP. Alternatively, ``sap-p1-2:/home/trac/releases`` may be used for the same.
487
+
488
+This way, a clean new ``env.sh`` file will be produced from the config file, including the download and installation of a release. The ``master.conf`` file looks approximately like this:
489
+
490
+```
491
+INSTALL_FROM_RELEASE=build-202106012325
492
+SERVER_NAME=tokyo2020
493
+MONGODB_URI="mongodb://localhost:10201,localhost:10202,localhost:10203/${SERVER_NAME}?replicaSet=tokyo2020&retryWrites=true&readPreference=nearest"
494
+# RabbitMQ in eu-west-1 (rabbit.internal.sapsailing.com) is expected to be found through SSH tunnel on localhost:5675
495
+# Replication of shared services from central security-service.sapsailing.com through SSH tunnel 443:security-service.sapsailing.com:443
496
+# with a local /etc/hosts entry mapping security-service.sapsailing.com to 127.0.0.1
497
+REPLICATE_MASTER_QUEUE_HOST=localhost
498
+REPLICATE_MASTER_QUEUE_PORT=5675
499
+REPLICATE_MASTER_BEARER_TOKEN="***"
500
+# Outbound replication to RabbitMQ through SSH tunnel with port forward on port 5673, regularly to rabbit-ap-northeast-1.sapsailing.com
501
+# Can be re-mapped to the RabbitMQ running on sap-p1-2
502
+REPLICATION_HOST=localhost
503
+REPLICATION_PORT=5673
504
+USE_ENVIRONMENT=live-master-server
505
+ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
506
+```
507
+
508
+### Replicas
509
+
510
+The on-site replica on ``sap-p1-2`` can be configured with a ``replica.conf`` file in ``/home/sailing/servers/replica``, using
511
+
512
+```
513
+ rm env.sh; cat replica.conf | ./refreshInstance.sh auto-install-from-stdin
514
+```
515
+
516
+The file looks like this:
517
+
518
+```
519
+# Regular operations; sap-p1-2 replicates sap-p1-1 using the rabbit-ap-northeast-1.sapsailing.com RabbitMQ in the cloud through SSH tunnel.
520
+# Outbound replication, though not expected to become active, goes to a local RabbitMQ
521
+INSTALL_FROM_RELEASE=build-202106012325
522
+SERVER_NAME=tokyo2020
523
+MONGODB_URI="mongodb://localhost:10201,localhost:10202,localhost:10203/${SERVER_NAME}-replica?replicaSet=tokyo2020&retryWrites=true&readPreference=nearest"
524
+# RabbitMQ in ap-northeast-1 is expected to be found locally on port 5673
525
+REPLICATE_MASTER_SERVLET_HOST=sap-p1-1
526
+REPLICATE_MASTER_SERVLET_PORT=8888
527
+REPLICATE_MASTER_QUEUE_HOST=localhost
528
+REPLICATE_MASTER_QUEUE_PORT=5673
529
+REPLICATE_MASTER_BEARER_TOKEN="***"
530
+# Outbound replication to RabbitMQ running locally on sap-p1-2
531
+REPLICATION_HOST=localhost
532
+REPLICATION_PORT=5672
533
+REPLICATION_CHANNEL=${SERVER_NAME}-replica
534
+USE_ENVIRONMENT=live-replica-server
535
+ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
536
+```
537
+
538
+Replicas in region ``eu-west-1`` can be launched using the following user data, making use of the established MongoDB live replica set in the region:
539
+
540
+```
541
+INSTALL_FROM_RELEASE=build-202106012325
542
+SERVER_NAME=tokyo2020
543
+MONGODB_URI="mongodb://mongo0.internal.sapsailing.com,mongo1.internal.sapsailing.com,dbserver.internal.sapsailing.com:10203/tokyo2020-replica?replicaSet=live&retryWrites=true&readPreference=nearest"
544
+USE_ENVIRONMENT=live-replica-server
545
+REPLICATION_CHANNEL=tokyo2020-replica
546
+REPLICATION_HOST=rabbit-ap-northeast-1.sapsailing.com
547
+REPLICATE_MASTER_SERVLET_HOST=tokyo-ssh.internal.sapsailing.com
548
+REPLICATE_MASTER_SERVLET_PORT=8888
549
+REPLICATE_MASTER_EXCHANGE_NAME=tokyo2020
550
+REPLICATE_MASTER_QUEUE_HOST=rabbit-ap-northeast-1.sapsailing.com
551
+REPLICATE_MASTER_BEARER_TOKEN="***"
552
+ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
553
+```
554
+
555
+(Adjust the release accordingly, of course). (NOTE: During the first production days of the event we noticed that it was really a BAD IDEA to have all replicas use the same DB set-up, all writing to the MongoDB PRIMARY of the "live" replica set in eu-west-1. With tens of replicas running concurrently, this led to a massive block-up based on MongoDB not writing fast enough. This gave rise to a new application server AMI which now has a MongoDB set-up included, using "replica" as the MongoDB replica set name. Now, each replica hence can write into its own MongoDB instance, isolated from all others and scaling linearly.)
556
+
557
+In other regions, instead an instance-local MongoDB shall be used for each replica, not interfering with each other or with other databases:
558
+
559
+```
560
+INSTALL_FROM_RELEASE=build-202106012325
561
+SERVER_NAME=tokyo2020
562
+MONGODB_URI="mongodb://localhost/tokyo2020-replica?replicaSet=replica&retryWrites=true&readPreference=nearest"
563
+USE_ENVIRONMENT=live-replica-server
564
+REPLICATION_CHANNEL=tokyo2020-replica
565
+REPLICATION_HOST=rabbit-ap-northeast-1.sapsailing.com
566
+REPLICATE_MASTER_SERVLET_HOST=tokyo-ssh.internal.sapsailing.com
567
+REPLICATE_MASTER_SERVLET_PORT=8888
568
+REPLICATE_MASTER_EXCHANGE_NAME=tokyo2020
569
+REPLICATE_MASTER_QUEUE_HOST=rabbit-ap-northeast-1.sapsailing.com
570
+REPLICATE_MASTER_BEARER_TOKEN="***"
571
+ADDITIONAL_JAVA_ARGS="${ADDITIONAL_JAVA_ARGS} -Dcom.sap.sse.debranding=true"
572
+```
573
+
574
+### Application Servers
575
+
576
+``sap-p1-1`` normally is the master for the ``tokyo2020`` replica set. The application server directory is found under ``/home/sailing/servers/master``, and the master's HTTP port is 8888. It shall replicate the shared services, in particular ``SecurityServiceImpl``, from ``security-service.sapsailing.com``, like any normal server in our landscape, only that here we have to make sure we can target the default RabbitMQ in eu-west-1 and can see the ``security-service.sapsailing.com`` master directly or even better the load balancer.
577
+
578
+SSH local port forwards (configured with the ``-L`` option) that use hostnames instead of IP addresses for the remote host specification are resolved each time a new connection is established through this forward. If the DNS entry resolves to multiple IPs or if the DNS entry changes over time, later connection requests through the port forward will honor the new host name's DNS resolution.
579
+
580
+Furthermore, there is a configuration under ``/home/sailing/servers/security_service`` which can be fired up with port 8889, using the local ``security_service`` database that a script ``/usr/local/bin/clone-security-service-db`` on the jump host ``tokyo-ssh.sapsailing.com`` updates on an hourly basis as long as an Internet connection is available. This can be used as a replacement of the official ``security-service.sapsailing.com`` service. Both laptops have an ``/etc/hosts`` entry mapping ``security-service.sapsailing.com`` to ``127.0.0.1`` and work with flexible SSH port forwards to decide whether the official Internet-based or the local copy of the security service shall be used.
581
+
582
+``sap-p1-2`` normally is a replica for the ``tokyo2020`` replica set, using the local RabbitMQ running on ``sap-p1-1``. Its outbound ``REPLICATION_CHANNEL`` will be ``tokyo2020-replica`` and uses the RabbitMQ running in ap-northeast-1, using an SSH port forward with local port 5673 for the ap-northeast-1 RabbitMQ (15673 for the web administration UI). A reverse port forward from ap-northeast-1 to the application port 8888 on ``sap-p1-2`` has to be established which replicas running in ap-northeast-1 will use to reach their master through HTTP. This way, adding more replicas on the AWS side in the cloud will not require any additional bandwidth between cloud and on-site network, except that the reverse HTTP channel, which uses only little traffic, will see additional traffic per replica whereas all outbound replication goes to the single exchange in the RabbitMQ node running in ap-northeast-1.
583
+
584
+## User Groups and Permissions
585
+
586
+The general public shall not be allowed during the live event to browse the event through ``tokyo2020.sapsailing.com``. Instead, they are required to go through any of the so-called "Rights-Holding Broadcaster" (RHB) web sites. There, a "widget" will be embedded into their web sites which works with our REST API to display links to the regattas and races, in particular the RaceBoard.html pages displaying the live and replay races.
587
+
588
+Moderators who need to comment on the races shall be given more elaborate permissions and shall be allowed to use the full-fledged functionality of ``tokyo2020.sapsailing.com``, in particular, browse through all aspects of the event, see flag statuses, postponements and so on.
589
+
590
+To achieve this effect, the ``tokyo2020-server`` group has the ``sailing_viewer`` role assigned for all users, and all objects, except for the top-level ``Event`` object, are owned by that group. This way, everything but the event is publicly visible.
591
+
592
+The ``Event`` object is owned by ``tokyo2020-moderators``, and that group grants the ``sailing_viewer`` role only to its members, meaning only the members of that group are allowed to see the ``Event`` object.
593
+
594
+## Landscape Upgrade Procedure
595
+
596
+In the ``configuration/on-site-scripts`` directory we have prepared a number of scripts intended to be useful for local and cloud landscape management. TL;DR:
597
+```
598
+ configuration/on-site-scripts/upgrade-landscape.sh -R {release-name} -b {replication-bearer-token}
599
+```
600
+will upgrade the entire landscape to the release ``{release-name}`` (e.g., build-202107210711). The ``{replication-bearer-token}`` must be provided such that the user authenticated by that token will have the permission to stop replication and to replicate the ``tokyo2020`` master.
601
+
602
+The script will proceed in the following steps:
603
+ - patch ``*.conf`` files in ``sap-p1-1:servers/[master|security_service]`` and ``sap-p1-2:servers/[replica|master|security_service]`` so
604
+ their ``INSTALL_FROM_RELEASE`` points to the new ``${RELEASE}``
605
+ - Install new releases to ``sap-p1-1:servers/[master|security_service]`` and ``sap-p1-2:servers/[replica|master|security_service]``
606
+ - Update all launch configurations and auto-scaling groups in the cloud (``update-launch-configuration.sh``)
607
+ - Tell all replicas in the cloud to stop replicating (``stop-all-cloud-replicas.sh``)
608
+ - Tell ``sap-p1-2`` to stop replicating
609
+ - on ``sap-p1-1:servers/master`` run ``./stop; ./start`` to bring the master to the new release
610
+ - wait until master is healthy
611
+ - on ``sap-p1-2:servers/replica`` run ``./stop; ./start`` to bring up on-site replica again
612
+ - launch upgraded cloud replicas and replace old replicas in target group (``launch-replicas-in-all-regions.sh``)
613
+ - terminate all instances named "SL Tokyo2020 (auto-replica)"; this should cause the auto-scaling group to launch new instances as required
614
+ - manually inspect the health of everything and terminate the "SL Tokyo2020 (Upgrade Replica)" instances when enough new instances
615
+ named "SL Tokyo2020 (auto-replica)" are available
616
+
617
+The individual scripts will be described briefly in the following sub-sections. Many of them use the ``regions.txt`` file as a common artifact; it contains the list of regions in which operations are executed. The ``eu-west-1`` region, being our "legacy" or "primary" region, requires special attention in some cases. In particular, it can use the ``live`` replica set for the replicas started in that region, also because the AMI used there is slightly different: unlike the AMIs in all other supported regions, it doesn't launch a local MongoDB replica set on each instance.
618
+
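+Assuming ``regions.txt`` simply lists one region name per line (the exact format is defined by the scripts that read it), it could look like this:
+
+```
+eu-west-1
+ap-northeast-1
+ap-southeast-2
+us-west-1
+us-east-1
+```
+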
619
+### clone-security-service-db-safe-exit
620
+
621
+Creates a ``mongodump`` of "mongodb://mongo0.internal.sapsailing.com,mongo1.internal.sapsailing.com,dbserver.internal.sapsailing.com:10203/security_service?replicaSet=live&retryWrites=true&readPreference=nearest" on the ``tokyo-ssh.sapsailing.com`` host and packs it into a ``.tar.gz`` file. This archive is then transferred as the standard output of an SSH command to the host executing the script, where it is unpacked into ``/tmp/dump``. The local "mongodb://localhost/security_service_bak?replicaSet=security_service&retryWrites=true&readPreference=nearest" backup copy is then dropped, the local ``security_service`` DB is moved to ``security_service_bak``, and the dump from ``/tmp/dump`` is then restored to ``security_service``. If this fails, the backup from ``security_service_bak`` is restored to ``security_service``, and there won't be a backup copy in ``security_service_bak`` anymore.
622
+
623
+The script is used as a CRON job for user ``sailing@sap-p1-1``.
624
+
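+A condensed sketch of this flow (the real script adds the error handling described above; paths, the exact ``mongodump``/``mongorestore`` invocations, and the cron schedule shown here are assumptions):
+
+```
+# 1. Dump security_service on the jump host and stream the result here as a tarball:
+ssh tokyo-ssh.sapsailing.com 'cd /tmp && rm -rf dump && \
+    mongodump --uri="mongodb://mongo0.internal.sapsailing.com,mongo1.internal.sapsailing.com,dbserver.internal.sapsailing.com:10203/security_service?replicaSet=live" && \
+    tar czf - dump' > /tmp/security_service-dump.tar.gz
+
+# 2. Unpack to /tmp/dump, drop the old backup, keep the current DB as security_service_bak
+#    (e.g., via dump/restore with --nsFrom/--nsTo), then restore the fresh dump:
+rm -rf /tmp/dump && tar xzf /tmp/security_service-dump.tar.gz -C /tmp
+mongo "mongodb://localhost/security_service_bak?replicaSet=security_service" --eval 'db.dropDatabase()'
+mongorestore --uri="mongodb://localhost/?replicaSet=security_service" --drop --nsInclude='security_service.*' /tmp/dump
+
+# 3. Run hourly, e.g. from the crontab of sailing@sap-p1-1 (script path assumed):
+#    0 * * * * /home/sailing/configuration/on-site-scripts/clone-security-service-db-safe-exit
+```
+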
625
+### get-replica-ips
626
+
627
+Lists on standard output the public IP addresses of all running replicas in the regions listed in ``regions.txt``. Progress information is written to standard error. Example invocation:
628
+<pre>
629
+ $ ./get-replica-ips
630
+ Region: eu-west-1
631
+ Region: ap-northeast-1
632
+ Region: ap-southeast-2
633
+ Region: us-west-1
634
+ Region: us-east-1
635
+ 34.245.148.130 18.183.234.161 3.26.60.130 13.52.238.81 18.232.169.1
636
+</pre>
637
+
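+Internally, the per-region lookup could be sketched like this (the instance tag filter is an assumption based on the instance names used in this setup; the real script may select replicas differently):
+
+```
+for region in $(cat regions.txt); do
+  echo "Region: ${region}" >&2
+  aws ec2 describe-instances --region "${region}" \
+    --filters "Name=instance-state-name,Values=running" "Name=tag:Name,Values=SL Tokyo2020*" \
+    --query 'Reservations[].Instances[].PublicIpAddress' --output text
+done
+```
+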
638
+### launch-replicas-in-all-regions.sh
639
+
640
+Will launch, in each region listed in ``regions.txt``, as many new replicas with the release specified with ``-R`` as there are currently healthy auto-replicas registered with the ``S-ded-tokyo2020`` target group in that region (at least one). The new replicas register at the master proxy ``tokyo-ssh.internal.sapsailing.com:8888`` and at the RabbitMQ at ``rabbit-ap-northeast-1.sapsailing.com:5672``; once healthy, they are added to the ``S-ded-tokyo2020`` target group in their region, and all auto-replicas registered before are removed from that target group.
641
+
642
+The script uses the ``launch-replicas-in-region.sh`` script for each region where replicas are to be launched.
643
+
644
+Example invocation:
645
+<pre>
646
+ launch-replicas-in-all-regions.sh -R build-202107210711 -b 1234567890ABCDEFGH/+748397=
647
+</pre>
648
+
649
+Invoke without arguments to see documentation of the possible parameters.
650
+
651
+### launch-replicas-in-region.sh
652
+
653
+Will launch one or more (see ``-c``) new replicas with the release specified with ``-R`` in the AWS region specified with ``-g``. The new replicas register at the master proxy ``tokyo-ssh.internal.sapsailing.com:8888`` and at the RabbitMQ at ``rabbit-ap-northeast-1.sapsailing.com:5672``; once healthy, they are added to the ``S-ded-tokyo2020`` target group in that region, and all auto-replicas registered before are removed from that target group. Specify ``-r`` and ``-p`` if you are launching in ``eu-west-1`` because it has a special, non-default MongoDB environment.
654
+
655
+Example invocation:
656
+<pre>
657
+ launch-replicas-in-region.sh -g us-east-1 -R build-202107210711 -b 1234567890ABCDEFGH/+748397=
658
+</pre>
659
+
660
+Invoke without arguments to see documentation of the possible parameters.
661
+
662
+### stop-all-cloud-replicas.sh
663
+
664
+Will tell all replicas in the cloud, in the regions described by the ``regions.txt`` file, to stop replicating. This works by invoking the ``get-replica-ips`` script and then telling each replica to stop replicating, using the ``stopReplicating.sh`` script in its ``/home/sailing/servers/tokyo2020`` directory and passing through the bearer token. Note: this will NOT stop replication on the local replica on ``sap-p1-2``!
665
+
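+In essence, the script runs a loop along these lines (the SSH user and the exact way the bearer token is passed to ``stopReplicating.sh`` are assumptions):
+
+```
+BEARER_TOKEN="..."   # value passed with -b
+for ip in $(./get-replica-ips); do
+  ssh "sailing@${ip}" "cd /home/sailing/servers/tokyo2020 && ./stopReplicating.sh '${BEARER_TOKEN}'"
+done
+```
+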
666
+The script must be invoked with the bearer token needed to authenticate a user with replication permission for the ``tokyo2020`` application replica set.
667
+
668
+Example invocation:
669
+<pre>
670
+ stop-all-cloud-replicas.sh -b 1234567890ABCDEFGH/+748397=
671
+</pre>
672
+
673
+Invoke without arguments to see documentation of the possible parameters.
674
+
675
+### update-launch-configuration.sh
676
+
677
+Will upgrade the auto-scaling groups matching ``tokyo2020*`` (such as ``tokyo2020-auto-replicas``) in the regions from ``regions.txt`` with a new launch configuration. The new launch configuration is derived from the existing launch configuration named ``tokyo2020-*`` by copying it to ``tokyo2020-{RELEASE_NAME}``, updating the ``INSTALL_FROM_RELEASE`` parameter in the user data to the ``{RELEASE_NAME}`` provided in the ``-R`` parameter, and optionally adjusting the AMI, key pair name and instance type if specified by the respective parameters. Note: this will NOT terminate any instances in the target group!
678
+
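+Per region, this conceptually boils down to AWS CLI calls like the following (a sketch only; the real script also derives the user data from the existing launch configuration and handles the optional AMI, key pair, and instance type parameters):
+
+```
+region=us-east-1                  # one of the regions from regions.txt
+RELEASE_NAME=build-202107210711   # value of the -R parameter
+# Create a release-specific copy of the tokyo2020-* launch configuration
+# (INSTALL_FROM_RELEASE already patched into patched-user-data.txt), then switch the ASG over:
+aws autoscaling create-launch-configuration --region "${region}" \
+  --launch-configuration-name "tokyo2020-${RELEASE_NAME}" \
+  --image-id "${AMI_ID}" --instance-type "${INSTANCE_TYPE}" --key-name "${KEY_NAME}" \
+  --user-data file://patched-user-data.txt
+aws autoscaling update-auto-scaling-group --region "${region}" \
+  --auto-scaling-group-name tokyo2020-auto-replicas \
+  --launch-configuration-name "tokyo2020-${RELEASE_NAME}"
+```
+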
679
+Example invocation:
680
+<pre>
681
+ update-launch-configuration.sh -R build-202107210711
682
+</pre>
683
+
684
+Invoke without arguments to see documentation of the possible parameters.
685
+
686
+### upgrade-landscape.sh
687
+
688
+See the introduction of this main section. Synopsis:
689
+<pre>
690
+ ./upgrade-landscape.sh -R &lt;release-name&gt; -b &lt;replication-bearer-token&gt; [-t &lt;instance-type&gt;] [-i &lt;ami-id&gt;] [-k &lt;key-pair-name&gt;] [-s]
691
+ -b replication bearer token; mandatory
692
+ -i Amazon Machine Image (AMI) ID to use to launch the instance; defaults to latest image tagged with image-type:sailing-analytics-server
693
+ -k Key pair name, mapping to the --key-name parameter
694
+ -R release name; must be provided to select the release, e.g., build-202106040947
695
+ -t Instance type; defaults to
696
+ -s Skip release download
697
+</pre>
698
+
699
+## Log File Analysis
700
+
701
+Athena table definitions and queries have been provided in region ``eu-west-3`` (Paris), where we hosted our EU part during the event after a difficult start in ``eu-west-1``, where the single MongoDB ``live`` replica set did not scale well for all the replicas that were required in the region.
702
+
703
+The key to the Athena set-up is to have a table definition per bucket, with a dedicated S3 bucket per region in which ALB logs were recorded. An example of a query based on the many tables then looks like this:
704
+<pre>
705
+ with union_table AS
706
+ (select *
707
+ from alb_logs_ap_northeast_1
708
+ union all
709
+ select *
710
+ from alb_logs_ap_southeast_2
711
+ union all
712
+ select *
713
+ from alb_logs_eu_west_3
714
+ union all
715
+ select *
716
+ from alb_logs_us_east_1
717
+ union all
718
+ select *
719
+ from alb_logs_us_west_1)
720
+ select date_trunc('day', parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')), count(distinct concat(client_ip,user_agent))
721
+ from union_table
722
+ where (parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z')
723
+ between parse_datetime('2021-07-21-00:00:00','yyyy-MM-dd-HH:mm:ss')
724
+ and parse_datetime('2021-08-08-02:00:00','yyyy-MM-dd-HH:mm:ss'))
725
+ group by date_trunc('day', parse_datetime(time,'yyyy-MM-dd''T''HH:mm:ss.SSSSSS''Z'))
726
+</pre>
727
+It defines a ``union_table`` which unites all contents from all buckets scanned.
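+
+If the query is to be run outside the Athena console, the AWS CLI can be used along these lines (the query file name, Athena database, and results bucket are placeholders):
+
+```
+# Submit the union query stored in union-query.sql to Athena in eu-west-3:
+aws athena start-query-execution --region eu-west-3 \
+  --query-string file://union-query.sql \
+  --query-execution-context Database=<athena-database> \
+  --result-configuration OutputLocation=s3://<results-bucket>/athena/
+```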
... ...
\ No newline at end of file