wiki/howto/windestimation.md
... ...
@@ -0,0 +1,71 @@
1
+# Training of internal Wind Estimation models
2
+
3
+This document describes how to generate the Machine Learning (ML) models that wind estimation uses internally. It is highly recommended to work through this howto step by step, in the order of its sections.
4
+
5
+## Overview
6
+In total, there are the following three categories of ML models used by wind estimation:
7
+1. **Maneuver Classifiers**
8
+2. **Regressors** of TWD delta standard deviation for the dimension **duration**
9
+3. **Regressors** of TWD delta standard deviation for the dimension **distance**
10
+
11
+Each model category is composed of multiple models, each of which targets a specific context. The context of a maneuver classifier is determined by the following attributes:
12
+* Maneuver features
13
+ * Polar features enabled: yes/no
14
+ * Mark features enabled: yes/no
15
+ * Scaled speed features enabled: yes/no
16
+* Boat class filter for the data on which the classifier is trained: either a specific boat class or all boat classes
17
+
18
+The context of a regressor model is the input interval it is responsible for, e.g. [0 seconds; 62 seconds) for duration, or [80 meters; 1368 meters) for distance.
19
+
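The interval-based context of the regressors can be pictured with a small sketch. The enum below is hypothetical and is not the project's actual class: only the bounds [0; 62) come from the example above, the second range and all names are assumptions made for illustration.

```java
import java.util.Optional;

public class IntervalContextSketch {

    /** Hypothetical half-open ranges [fromInclusive; toExclusive) in seconds. */
    public enum DurationRange {
        SHORT(0, 62),     // bounds taken from the example in the text
        LONG(62, 3600);   // assumed bounds, for illustration only

        private final double fromInclusive;
        private final double toExclusive;

        DurationRange(double fromInclusive, double toExclusive) {
            this.fromInclusive = fromInclusive;
            this.toExclusive = toExclusive;
        }

        /** The lower bound belongs to the range, the upper bound does not. */
        public boolean contains(double seconds) {
            return seconds >= fromInclusive && seconds < toExclusive;
        }

        /** Finds the range (and thus the regressor model) responsible for a value. */
        public static Optional<DurationRange> forValue(double seconds) {
            for (DurationRange range : values()) {
                if (range.contains(seconds)) {
                    return Optional.of(range);
                }
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // 62 is excluded from SHORT and included in LONG
        System.out.println(DurationRange.forValue(61.9));
        System.out.println(DurationRange.forValue(62));
    }
}
```

Because the intervals are half-open, adjacent ranges can share a boundary value without overlapping, so every input maps to at most one model.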
20
+Each of the ML model categories must be trained individually. The common workflow looks as follows:
21
+1. Get the training data from the REST API of sapsailing.com
22
+2. Preprocess data
23
+3. Train the model category
24
+
25
+For each step, the appropriate Java classes must be executed via *Run with...->Java Application*. All referenced classes are located in the *com.sap.sailing.windestimation.lab* Java project. Each class execution must finish without uncaught exceptions before you proceed to the next instruction. After model training, all trained models can be found in *./trained_wind_estimation_models*, which is normally */path/to/workspace/com.sap.sailing.windestimation/trained_wind_estimation_models* if you start the training classes in Eclipse via *Run with...->Java Application*.
26
+
27
+The details of the training process for each model category are described in the following sections.
28
+
29
+## Prerequisites
30
+To complete the training process successfully, make sure that the following prerequisites are met:
31
+* A complete onboarding setup for SAP Sailing Analytics development
32
+* MongoDB is up and running (same MongoDB instance as required in onboarding howto)
33
+* At least 100 GB of free space on the partition on which MongoDB operates
34
+* At least 16 GB of RAM (for in-memory preprocessing of wind data for regressors)
35
+* A graphical MongoDB client installed, such as MongoDB Compass (Community version)
36
+
37
+## Get the training data from sapsailing.com
38
+The following steps import all the required data from sapsailing.com into the local MongoDB. These steps are a prerequisite for training all ML model categories:
39
+1. Run *com.sap.sailing.windestimation.data.importer.ManeuverAndWindImporter*
40
+2. Run *com.sap.sailing.windestimation.data.importer.PolarDataImporter*
41
+
42
+## Maneuver classifiers training
43
+1. Run *com.sap.sailing.windestimation.model.classifier.maneuver.ManeuverClassifierTrainer*
44
+2. Optionally run *com.sap.sailing.windestimation.model.classifier.maneuver.ManeuverClassifierScoring* to print the performance of the trained classifiers and to verify maneuver classification scoring
45
+
46
+Step 1 alone both preprocesses the maneuver data and trains the maneuver classifiers for each supported context.
47
+
48
+## Duration-based TWD delta standard deviation regressor
49
+
50
+1. Run *com.sap.sailing.windestimation.data.importer.DurationBasedTwdTransitionImporter*
51
+2. Run *com.sap.sailing.windestimation.data.importer.AggregatedDurationBasedTwdTransitionImporter* with at least 10 GB JVM memory. For this, set the following VM arguments in your run config: ``-Xms10g -Xmx10g``.
52
+3. Run *com.sap.sailing.windestimation.datavisualization.AggregatedDurationDimensionPlot* to visualize the wind data. A Swing-based GUI window must open with two charts. The first is an XY chart whose x-axis represents **seconds** and whose y-axis represents TWD delta-based series measures (e.g. standard deviation or mean). Below it, a histogram of the data points of the XY chart is shown. You can zoom in and out of each chart by dragging the mouse. Be aware that the zoom levels of the two charts are currently not synchronized.
53
+4. Open your graphical MongoDB client and connect to the *windEstimation* database hosted by your local MongoDB. Open the collection named *aggregatedDurationTwdTransition*. Within the collection you will see all the instances/data points visualized in the previous step. The total number of points must not exceed 100.
54
+5. Delete all the instances within the collection which do not make sense. Use the data visualization tool from step 3 to identify such instances. Pay special attention to the instances at the beginning and at the end. Some of the instances are not representative because they are backed by only a small number of supporting instances, which is visible in the histogram. Restart the data visualization tool as often as needed to visualize the changed data.
55
+6. Open the source code of the class *com.sap.sailing.windestimation.model.regressor.twdtransition.DurationBasedTwdTransitionRegressorModelMetadata*. It is recommended to read the Javadoc of the class. Scroll down to the definition of the inner enum *DurationValueRange*. The enum defines the intervals; a separate regressor model is trained for each of them. Adjust the intervals so that each regressor model can learn its part of the data curve with minimal error. Make sure that at least 2 data points are available within each interval.
56
+7. Run *com.sap.sailing.windestimation.model.regressor.twdtransition.DurationBasedTwdTransitionStdRegressorTrainer*
57
+8. Verify the trained regressor functions. They are printed in the console output of the previous step. For instance, you can visualize the polynomials with https://www.wolframalpha.com/
58
+
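As a complement to plotting, the printed polynomial can also be evaluated at a few x values and compared against the XY chart from step 3. The following is a minimal sketch, assuming the trainer prints coefficients in ascending order of power; the coefficient values are made up for illustration and the class name is hypothetical.

```java
public class PolynomialCheckSketch {

    /** Evaluates c[0] + c[1]*x + c[2]*x^2 + ... using Horner's scheme. */
    public static double evaluate(double[] coefficients, double x) {
        double result = 0.0;
        for (int i = coefficients.length - 1; i >= 0; i--) {
            result = result * x + coefficients[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical coefficients; replace them with the values from the console output
        double[] coefficients = {2.0, 0.5, -0.001};
        for (double seconds : new double[] {0, 10, 30, 60}) {
            System.out.println("x=" + seconds + " s -> predicted std=" + evaluate(coefficients, seconds));
        }
    }
}
```

If the predicted values deviate strongly from the chart, or become negative inside an interval, the interval boundaries from step 6 are a likely place to look.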
59
+## Distance-based TWD delta standard deviation regressor
60
+
61
+The steps of this section are similar to the steps of the previous section. It is recommended to work through the previous section first, because the steps they have in common are described here in less detail and with fewer hints.
62
+
63
+1. Run *com.sap.sailing.windestimation.data.importer.DistanceBasedTwdTransitionImporter*
64
+2. Run *com.sap.sailing.windestimation.data.importer.AggregatedDistanceBasedTwdTransitionImporter* with at least 10 GB JVM memory.
65
+3. Run *com.sap.sailing.windestimation.datavisualization.AggregatedDistanceDimensionPlot* to visualize the wind data. Here, the x-axis of the XY chart represents **meters**.
66
+4. Open your graphical MongoDB client and connect to the *windEstimation* database hosted by your local MongoDB. Open the collection *aggregatedDistanceTwdTransition*. Within the collection you will see all the instances/data points visualized in the previous step. The total number of points must not exceed 100.
67
+5. Delete all the instances within the collection which do not make sense.
68
+6. Open the source code of the class *com.sap.sailing.windestimation.model.regressor.twdtransition.DistanceBasedTwdTransitionRegressorModelMetadata*. Scroll down to the definition of the inner enum *DistanceValueRange*. The enum defines the intervals; a separate regressor model is trained for each of them. Adjust the intervals so that each regressor model can learn its part of the data curve with minimal error.
69
+7. Open the source code of the class *com.sap.sailing.windestimation.model.regressor.twdtransition.DistanceBasedTwdTransitionStdRegressorTrainer* and scroll down to the *getTrainingDataForDistance()* method. The method returns the training datasets which will be used for the model training. Adjust the datasets so that at least two datasets intersect with [fromInclusive; toExclusive) of each interval specified in the previous step.
70
+8. Run *com.sap.sailing.windestimation.model.regressor.twdtransition.DistanceBasedTwdTransitionStdRegressorTrainer*
71
+9. Verify the trained regressor functions. They are printed in the console output of the previous step.
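The requirement that at least two datasets intersect each interval can be checked mechanically before starting the trainer. The sketch below is an illustration only: it assumes that both intervals and datasets can be described as half-open ranges [from; to) in meters, which may not match what *getTrainingDataForDistance()* actually returns. Apart from [80; 1368), all values and names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalCoverageSketch {

    /** True if the half-open ranges [aFrom; aTo) and [bFrom; bTo) share at least one value. */
    public static boolean intersects(double aFrom, double aTo, double bFrom, double bTo) {
        return aFrom < bTo && bFrom < aTo;
    }

    /** Returns the indices of all intervals intersected by fewer than minDatasets datasets. */
    public static List<Integer> underCovered(double[][] intervals, double[][] datasets, int minDatasets) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < intervals.length; i++) {
            int count = 0;
            for (double[] dataset : datasets) {
                if (intersects(intervals[i][0], intervals[i][1], dataset[0], dataset[1])) {
                    count++;
                }
            }
            if (count < minDatasets) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Each row is {fromInclusive, toExclusive} in meters; values are assumptions
        double[][] intervals = {{0, 80}, {80, 1368}, {1368, 5000}};
        double[][] datasets = {{0, 100}, {50, 1500}, {1000, 2000}};
        System.out.println("Intervals needing more datasets: " + underCovered(intervals, datasets, 2));
    }
}
```

An empty list means every interval is backed by at least two datasets; any index that is reported points at an interval whose bounds or datasets should be adjusted before training.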
... ...
\ No newline at end of file