Predictions-on-chip: model-based training and automated deployment of machine learning models at runtime

The design of gas turbines is a challenging area of cyber-physical systems where complex model-based simulations across multiple disciplines (e.g., performance, aerothermal) drive the design process. As a result, a continuously increasing amount of data is derived during system design. Finding new insights in such data by exploiting various machine learning (ML) techniques is a promising industrial trend since better predictions based on real data result in substantial product quality improvements and cost reduction. This paper presents a method that generates data from multi-paradigm simulation tools, develops and trains ML models for prediction, and deploys such prediction models into an active control system operating at runtime with limited computational power. We explore the replacement of existing traditional prediction modules with ML counterparts with different architectures. We validate the effectiveness of various ML models in the context of three (real) gas turbine bearings using over 150,000 data points for training, validation, and testing. We introduce code generation techniques for automated deployment of neural network models to industrial off-the-shelf programmable logic controllers.


Introduction
A cyber-physical system (CPS) needs to (i) autonomously perceive its operational context, (ii) adapt to changes in an open, heterogeneous and distributed environment with a massive number of nodes, (iii) dynamically acquire available resources and aggregate services in order to make real-time decisions, and (iv) continuously provide critical services in a safe, resilient and trustworthy way [8,23].
Gas turbine development is a challenging area of CPSs where complex model-based physical simulations across multiple disciplines (e.g., performance, aerothermal, secondary air system) drive the engineering process at design time to predict relevant system parameters. However, some of these predictions need to be carried out as part of the real engine control system, where the programmable logic controller (PLC) hardware has very limited computing capabilities. Since design-time simulations are computationally very expensive, dedicated runtime predictor programs need to be developed and maintained, which may significantly lack in precision compared to their design-time counterparts. In this paper, we seek to leverage advanced machine learning (ML) techniques for such runtime predictor programs.

Research challenges The design of safety-critical CPSs, such as gas turbines, frequently relies on systems engineering principles facilitating the use of well-defined components interacting with each other via precise interfaces. These components are then deployed to a dedicated hardware platform.
While in the last decade, advances in ML techniques have revolutionized various industrial domains, their use in safety-critical applications is still limited. In such domains, a certification process may prescribe an extra level of scrutiny, but existing quality assurance techniques for ML fail to justifiably ensure safe operation. In fact, incremental recertification is a major industrial driver which aims to scrutinize only those system components which are potentially affected by a change. As a result, substantial certification costs could be saved. However, when the change of a component involves the use of ML techniques, very few guidelines are available for engineers for certification.
Therefore, in the current paper, we investigate three research questions related to introducing ML components on the system level as a replacement of existing components designed with traditional engineering methods in the context of gas turbine design as a representative safety-critical CPS.

RQ1 How effective is it to replace an individual (traditional) prediction component or a chain of components (module) with ML-driven counterparts?

RQ2 How effective are different combinations of ML prediction components for engineering models of gas turbines?

RQ3 How to automate the deployment of a trained ML model to a CPS hardware platform to improve maintainability?
Objectives and contributions In this paper, we aim to exploit ML techniques to enhance the precision and automate the development and deployment of runtime prediction programs. For this purpose, we propose a novel family of predictors-on-chip by (1) training various ML models using design-time simulation data and then (2) automatically deploying the trained ML models to the production environment by automated code generation. Design-time physical simulations (and potentially existing field data) provide high-quality data for the training of an ML model, and the trained model is deployed without further alterations to the runtime system to reduce (software) engineering efforts.
The main novel contributions of the paper can be summarized as follows: 1. We present a novel industry application of existing ML techniques in the context of gas turbine design where the penetration of ML techniques is still sparse.
2. Given the safety-critical nature of gas turbines as a CPS, we propose various deviations from common ML best practices:
(a) We use only 20% of data for training and validation and 80% for testing to ensure generalized behavior.
(b) In addition to traditional error metrics, which evaluate the average behavior of ML components, we also investigate worst-case behavior and the direction of error (e.g., when under-estimating a parameter can be more problematic than over-estimating it).
(c) We incorporate deployment constraints imposed by the target hardware platform as a design consideration for the ML component.
3. We carry out an extensive experimental evaluation of various ML techniques taking a system-level perspective in the context of gas turbine design for bearing load prediction with key insights.
- ML-driven components consistently provide better bearing load prediction than existing (traditional) predictors. However, the errors of individual components may accumulate in a component chain, thus violating compositionality principles (RQ1).
- Replacing multiple components with a combination of ML prediction components improves upon existing traditional prediction modules, but replacing a chain of traditional components with a single ML component may be even more beneficial (RQ2).

4. We combine ML and code generation to automate the deployment of ML components to programmable logic controllers (PLCs) (RQ3).
The paper primarily focuses on gas turbine design as a motivating CPS scenario of high industrial relevance, and our contributions are evaluated in this context. However, we believe that many of our core ideas could be adapted to other CPS domains where high-fidelity runtime predictions are necessitated in a real-time execution environment.

Structure of the paper After an overview of related work (Sect. 2), we discuss in Sect. 3 how ML components can be used as functional prediction components in a CPS. Furthermore, we summarize core ML concepts used in the paper. We provide a brief overview of some engineering challenges of gas turbines (Sect. 4). Then, we propose a high-level architecture (Sect. 5) for integrating ML techniques and components into the design of gas turbines. Subsequently, we present the prediction module problem definition in Sect. 6. We propose load prediction module architectures in Sect. 7 along with how to use ML best practices to train various ML predictors using existing multi-disciplinary simulators of gas turbine design as data sources. We evaluate and compare such ML-based runtime predictors with existing runtime predictors as individual components as well as in prediction chains (where the output of one predictor serves as the input of another predictor) (Sect. 8). Next, we propose an automated code generation-based technique to deploy an ML model to a real controller chip with limited computing resources used in the real-time control system of existing gas turbines (Sect. 9). We discuss threats to validity (Sect. 10) and finally, Sect. 11 concludes our paper.

Related work
While machine learning, systems engineering, and runtime CPS research areas are quite mature, the intersection of these disciplines remains at an early stage of research. We categorize related literature into several main areas: artificial intelligence (AI) techniques in gas turbine design, parameter prediction, digital twins, model-driven engineering and ML, and code generation for PLCs.

Load and bearing prediction Many applications of ML and AI techniques exist to various parameter prediction problems in a larger engineering context (excluding gas turbine design). For example, there exists a wealth of literature for load prediction in buildings. Unsupervised and supervised deep learning techniques for cooling load prediction are evaluated in [11]. The authors of [43] explored 12 different prediction algorithms for building load prediction. A neural network steam load predictive model was developed in [19], while a support vector machine algorithm was used in [24] to predict hourly building cooling load and evaluated in comparison with back-propagation neural networks. Suspended sediment load in river water was predicted in [21] using support vector machines and neural networks.
There also exists literature which relates to bearings in various contexts. Several ML algorithms trained on high-resolution bearing fault simulations are evaluated in [38]. Support vector machine and relevance vector machine classifiers are created in [44] for fault diagnosis and classification based on testing rig data. Feature models for classification of bearing faults are created in [34].
Our paper exploits similar ML techniques for gas turbine design and expands upon them by (1) evaluating various system-level architectures, (2) assessing design decisions on ML algorithms and feature selection, and (3) proposing an automated technique to deploy a trained ML model to a dedicated (resource-constrained) production platform.

AI in gas turbine design There exists significant literature concerning ML and AI techniques applied to gas turbine design. For example, a hybrid AI and numerical approximation methodology was used to produce new turbine designs in [13], and areas of gas turbine design where AI could be applied are described in [30]. Additionally, there exist works on machine learning for runtime prediction: using neural networks for internal cycle parameter prediction [9,15], fault detection [29], engine sensor and component fault and health diagnosis [16], and operating parameter prediction [22]. A number of works also study neural networks for prognosis [12,17] and propose architectures such as physics-informed recurrent network cells [27].
Our work can be regarded as a novel case study for intelligent gas turbine design with the novel aspects of (1) contextualizing our work to systems engineering and CPSs, (2) studying the impact of introducing ML-based predictor modules into an existing system-level architecture, and (3) directly deploying ML models into the control system.

ML in MDE/digital twins The intersection of model-driven engineering (MDE) and ML is a rapidly growing research area. Within MDE literature, there are works applying ML for metamodel classification [28], model transformation [7], artificial intelligence for requirements-aware runtime models [3], and code generation [32]. Additionally, some publications focus on domain-specific languages for machine learning and code generation [6,20].
Digital twins are a sprouting research area at the intersection of model-driven software and systems engineering, telecommunication and artificial intelligence [5,26]. A digital twin prediction model for tool wear and condition is presented in [33]. A framework for developing ML models from a digital twin perspective is proposed and validated on a robot interacting with external parts in [2]. The authors of [25] developed a digital twin framework and applied it to fault diagnosis and prediction.
Our work presents new perspectives and problems in these intersecting research areas. We present a methodology for applying MDE principles to the automated deployment of ML algorithms. Existing works have focused on using ML for MDE and on generating high-level framework code implementations of algorithms (TensorFlow, PyTorch, etc.). Meanwhile, we develop a code generator for machine learning on off-the-shelf, low-level PLCs, where hardware limitations imposed by the industrial environment preclude porting standard frameworks or libraries. Likewise, our work presents the deployment of internal system behavior prediction and monitoring within the low-level control system rather than to external (more computationally powerful) machines maintaining digital twins. We present the efficacy and possible risks of our methodology.

Code generation for PLCs Literature exists related to programmable logic controller code generation. This includes control loop code generation from defined state automata [35,40], ontologies [39,41], and formal specification languages [10]. [14] presents model predictive control on a PLC.
Our work extends previous PLC code generation capabilities to include ML such as neural networks in a manner which enables MIMO model predictive control despite the limited hardware capabilities.

Runtime predictions in cyber-physical systems
Most cyber-physical systems have a well-defined system architecture using functional components where the role of each component is to (periodically) calculate a function as output when a given set of inputs is provided (see Fig. 1). Model-based systems engineering aims to determine necessary components, define input and output requirements for each component, and connect all components together. Design decisions for components and component interactions result from domain knowledge and experience.

Definition 1 (Functional component)
A functional component Comp = (InP, OutP, Fun) consists of a set of input ports InP and output ports OutP in order to provide a certain functionality Fun.

Definition 2 (Functional module) A functional module Mod = (InP, OutP, Comp, Link) (aka compound component) consists of input ports InP and output ports OutP, a set of (simple or compound) components Comp with links Link where each l ∈ Link connects (1) an output port of component c_i to an input port of component c_j, (2) an input port of module Mod to an input port of internal component c_i, and (3) an output port of internal component c_j to an output port of module Mod.
Before deployment to production, such designs must undergo thorough design review and certification, which involves (multi-disciplinary) simulations to investigate and predict various characteristics (e.g., thermal, stress, reliability, performance). In the case of high-fidelity models, such simulations provide precise estimates, but they need to be executed on powerful server farms, and a single simulation run may still take hours to complete.

Definition 3 (Prediction module)
A prediction module Pred = f(In, Mod) : Out created in accordance with functional module Mod operates at runtime by calculating output data Out for each output port Mod.OutP when input data In is provided to each input port Mod.InP of the respective functional module.
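As an illustration, Definitions 1–3 can be sketched as plain data structures. This is a minimal Python sketch under simplifying assumptions; the class names, the toy components, and the sequential linking are illustrative and not part of the paper's tooling.

```python
class FunctionalComponent:
    """Comp = (InP, OutP, Fun): maps values on input ports to output ports."""
    def __init__(self, in_ports, out_ports, fun):
        self.in_ports, self.out_ports, self.fun = in_ports, out_ports, fun

    def evaluate(self, inputs):
        # inputs: dict port -> value; returns dict port -> value
        return self.fun(inputs)

# Two toy components whose chain forms a functional module;
# a prediction module would expose the same ports with a learned Fun.
double = FunctionalComponent(["x"], ["y"], lambda inp: {"y": 2 * inp["x"]})
add_one = FunctionalComponent(["y"], ["z"], lambda inp: {"z": inp["y"] + 1})

def chain(components, inputs):
    """Link components in sequence: outputs of c_i feed the inputs of c_(i+1)."""
    data = dict(inputs)
    for c in components:
        data.update(c.evaluate({p: data[p] for p in c.in_ports}))
    return data

result = chain([double, add_one], {"x": 3})  # x -> 2x -> 2x + 1
```

This also foreshadows the prediction chains evaluated later, where the output of one predictor serves as the input of another predictor.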
Many of such predictions need to be deployed as part of the CPS in operation executing on dedicated target hardware within the control loop. However, due to high computational complexity, most existing design-time simulators cannot be deployed to a runtime environment, thus custom prediction components need to be developed. As a consequence, the precision of such runtime prediction components may be lagging significantly behind their design-time counterparts.
In this paper, we aim to exploit various ML techniques to train and deploy prediction components to CPSs in operation. Our approach uses design-time simulation results for training various ML models, which are then deployable to the target hardware thanks to their reduced memory footprint and small number of calculations. While at design time, the prediction of an ML module may be less precise than a prediction obtained by using simulators, at runtime, the prediction of an ML module can be more accurate than existing traditional prediction modules.

Core machine learning concepts
Machine learning (ML) [4] is the research discipline focusing on algorithms which learn to make inferences and predictions from data without explicit logical instructions. Machine learning usually relies heavily on statistical, probabilistic, and derivative-based algorithms.
A major subclass of ML approaches are supervised ML techniques, which will be used exclusively in the current paper. Supervised algorithms require both the input and the corresponding target output to be provided and seek to learn a function which transforms the input to the target output. Applicable tasks can be either classification (outputs belong to a defined discrete set of prediction classes) or regression (outputs are predicted along continuous variable spaces).

ML architecture There exist a multitude of ML algorithms with vastly different underlying techniques. However, there exist some general concepts relevant to the majority of them.
In a supervised learning context, all algorithms seek to minimize a loss function. Some algorithms are derivative-based (learn by gradient adjustment) and thereby require the loss function to be differentiable. The actual learning in most supervised algorithms occurs by repeatedly adjusting model parameters, such as weights or probabilities, in order to minimize the loss function following some optimization strategy. Examples of supervised ML algorithms include neural networks, logistic regression, support vector machines, Bayesian approaches, etc.

Hyperparameters Many algorithms have extra parameters, typically designated as hyperparameters, which cannot be estimated from the data and require manual adjustment during the training process. A sample hyperparameter can be the number of neurons or layers in a neural network.
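As a minimal illustration of derivative-based supervised learning, the following sketch fits a one-parameter regression model y = w·x by gradient descent on a squared loss. The data, learning rate, and iteration count are illustrative assumptions chosen only to make the convergence visible.

```python
# Toy supervised regression: repeatedly adjust the weight w to reduce a
# differentiable (mean squared error) loss via its gradient.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]        # generated by the target function y = 2x

w, lr = 0.0, 0.01                 # model parameter and learning rate
for _ in range(500):              # optimization loop
    # d/dw of mean((w*x - y)^2) over the data set
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad                # gradient step on the loss surface

loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

After training, w approaches 2.0 and the loss approaches zero; real architectures such as neural networks apply the same principle to many parameters at once.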
Data sets Machine learning heavily relies on the availability of high quality data, which is typically separated into three independent sets: training, validation, and test. The training set is used (at design-time) to train or fit ML algorithms. However, as these algorithms can be sensitive to perturbations in data, models must be tested on additional independent data sets. The effect of tuning hyperparameters is evaluated on the validation set. The best tuning of an algorithm, as determined on the validation set, is then tested on a final independent test set.
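The three-way split can be sketched as follows. The fractions below mirror the paper's deliberately small training share (20% for training plus validation, 80% held out for testing); the exact 15/5 breakdown between training and validation, and all function names, are assumptions for illustration.

```python
import random

def split_dataset(data, train_frac=0.15, val_frac=0.05, seed=42):
    """Shuffle and split a data set into train/validation/test partitions."""
    rng = random.Random(seed)          # fixed seed for reproducible splits
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # large independent hold-out set
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
```

Hyperparameters are tuned against `val`; `test` is touched only once, to report the final performance of the best validated model.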
Algorithms may not perform equally on each data set. An algorithm suffers from overfitting when it achieves high performance on a training set, but not on the test set, which means that the algorithm does not properly generalize. An algorithm suffers from underfitting when it does not achieve sufficient performance on any of its data sets.
Specific attention may be required to decrease the number of inputs for an algorithm, as high dimensionality may result in a prohibitively large amount of data or training time.

Pipeline of machine learning
The engineering of an ML algorithm follows a general pipeline.

Categorizing machine learning models
The machine learning community often refers to machine learning algorithms as models. To avoid ambiguity, we will refer to them as ML models. From a systems engineering perspective, models may exist at both design time and runtime. To place ML models in a systems engineering context, ML models will also be differentiated accordingly.
- Pre-ML models: A pre-ML model is characterized by the creation of a problem definition and datasets. These are required for ML as a starting point.
- Design-time ML models: Such models are used by engineers while the system is still under development, dominantly for training and validation purposes, which are carried out using powerful dedicated hardware (e.g., GPUs).
  - Training ML model: A training model includes training data, an ML architecture, and a specific combination of hyperparameters. Typically, a wide range of training models are trained as part of the tuning step.
  - Validated ML model: A validated model has undergone significant training and tuning; thus, it represents the best-performing training model for a given architecture on the validation set. Its performance is typically verified on an independent test set. Such predictors (for each ML architecture) are considered trained and finalized.
-Runtime ML Models: Runtime models are deployed on the real system in operation as runtime predictors. As such, they need to operate on the target computing platform (e.g., on a controller chip) to process real inputs and provide real-time output (as part of a controller loop).
  - Deployment ML model: This model evolves from a validated model which has been optimized to fulfill production requirements (memory limits, computation time, etc.). It is ready for production testing.
  - Production ML model: This model is a deployment model actively running in production on the designated target hardware.
Runtime ML models can also be separated into two types based on the exploitation of runtime data:

- Adaptive: An adaptive model participates in online learning during production. This model must remain trainable and must maintain relevant learning settings for its problem. This may involve keeping some of its training model configurations, but removing others. Note, the

Engineering gas turbines: an overview
Our approach and methodology arise from engineering challenges faced by Siemens Power and Gas. First, we provide an overview of the engineering context.

Engineering context
Gas turbines (Fig. 2) are a common type of internal combustion engine used to generate power. Due to a multitude of complex physics interactions (fluid interactions and combustion), thorough design validation is required to ascertain that the control system properly regulates engine behavior to prevent failures. An event which can trigger a damaging engine failure is an overload or underload of a bearing. As a result, significant engineering efforts are spent at design time to ensure the engine maintains an appropriate load on the bearings at all times.

Design time Some engine models integrate an active bearing load control component managed by the control system. For such a component, the control system expects the current load on the bearing and the target load as input. Unfortunately, no sensors exist to measure the actual bearing load, thus it must be estimated from other sensor information and engine parameters. In current engineering practice, engineers need to manually create predictors which estimate the bearing load, which is a very time-consuming process. Due to the high costs of manufacturing and engine testing, the majority of design validation is performed by running engineering simulations (mainly physics-based equation solvers for simulated conditions on physics engineering models). Such simulations help engineers define safety envelopes, i.e., parameters within which the engine can safely run.
Engineers run simulation tools and manually extract relevant data from the results, which is then used to determine correlated parameters that form the basis of generated prediction functions. These functions are then integrated into load prediction modules in the control system.

Runtime At runtime, the engine system runs the control system code. Sensor readings are forwarded to the load prediction module, which uses the engineer-developed correlation functions to predict the current bearing load. Using this prediction, the control system controls actuators to adjust the active bearing load control component to shift the current bearing load toward a defined target load.
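The runtime loop above can be sketched as follows. This is a hypothetical illustration: the correlation function, the sensor names, the proportional control law, and the gain are all assumptions standing in for the real engineer-developed functions and control logic.

```python
def predict_load(sensors):
    # Stand-in for the engineer-derived correlation function
    # (or, in our approach, an ML predictor with the same interface).
    return 0.5 * sensors["pressure"] + 0.1 * sensors["speed"]

def control_step(sensors, target_load, actuator, gain=0.2):
    """One control-loop iteration: predict the load, then nudge the
    actuator proportionally to the remaining load error."""
    current = predict_load(sensors)
    error = target_load - current
    return actuator + gain * error

actuator = 0.0
sensors = {"pressure": 10.0, "speed": 30.0}   # predicted load = 8.0
for _ in range(3):                             # three control cycles
    actuator = control_step(sensors, target_load=10.0, actuator=actuator)
```

In the real system this loop runs periodically on the PLC, with fresh sensor readings each cycle; here the readings are held constant so the actuator simply accumulates three identical correction steps.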

Simulation of physical engineering models
The design of gas turbines is carried out by many teams across several disciplines, which include whole engine, secondary air system, aerothermal, thermo-mechanical, etc. Teams develop models for each system component within their discipline and, at one stage, each team's designs must be unified for the entire engine as a whole.
Engineering models are developed within engineering tools and are heavily used during the design and validation of a model. Models are changed to test new design ideas and to simulate how they perform. Physical simulations are run on models and involve solving physics equations (flow, force, etc.) across the model. The majority of the engineering simulations use iterative physics solvers which attempt to converge each parameter in the model to a set threshold. For a given model, a convergent solution cannot always be found by the simulation solver. Simulations are evaluated across the range of engine operating conditions.

Engine disciplines, of course, interact and depend upon each other; thus, periodically, the models must be converged. Due to model interdependence, each discipline model contains input parameters from other models. Such parameters remain unchanged until changes are approved by the other disciplines. To converge the models, simulations are run in a loop until model parameters cease changing: each small parameter change within one model's simulation run affects the inputs of other models, which, in turn, affect the original model. This is illustrated in Fig. 3. In the scope of this work, we incorporate three relevant gas turbine engineering disciplines, namely performance, aerothermal, and secondary air system.
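The cross-discipline convergence loop is essentially a fixed-point iteration: re-run each model on the latest outputs of the others until no parameter changes by more than a tolerance. The two "models" below are toy linear stand-ins for real simulators, and the tolerance is an illustrative assumption.

```python
def model_a(b_out):
    return 0.5 * b_out + 1.0      # discipline A's output depends on B's output

def model_b(a_out):
    return 0.5 * a_out + 1.0      # discipline B's output depends on A's output

a, b = 0.0, 0.0                    # initial parameter guesses
for iteration in range(100):
    a_new, b_new = model_a(b), model_b(a)
    # Converged: no parameter moved more than the tolerance this round.
    if abs(a_new - a) < 1e-9 and abs(b_new - b) < 1e-9:
        break
    a, b = a_new, b_new
```

For these contractive toy models the loop settles at a = b = 2; real discipline models are far more expensive per iteration, and convergence is not guaranteed, which is why this loop is only run periodically.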

Performance discipline
Modeling and analysis of whole engine behavior is the basis of engine design and enables decisions about which components engineers should focus on to best improve engine operation. Whole engine modeling is the focus of the performance discipline and primarily uses a 1D thermodynamic model, meaning flow direction is limited to one spatial dimension.
This model maintains a static directed graph, with each node relating to an engine component, station, or even on-engine sensor. Edges connect various engine components together and define flow paths. The model requires the abstraction of components (blades, compressors, etc.) into simple factors (efficiency, capacity, etc.) which form the attributes of the graph nodes.
Definition 4 (Directed attributed graph) A directed and attributed graph DAG = (N, E, A, src, trg, attr) has a set of nodes N and a set of edges E connected to source and target nodes with respective functions src : E → N and trg : E → N. Extra attribute values can be assigned to nodes and edges using a predefined set of attributes A via the function attr.

Definition 5 (Engineering performance model) An engineering performance model is a directed attributed graph EPM = (N, E, A, src, trg, attr) with defined engine components and stations as nodes N, flow paths between components as edges E with sources src and targets trg, and attributes attr such as pressures, flows, and efficiencies.

The engineering performance model allows an engineer to run the engine under substantially different operating conditions, using limits (for example CO, turbine temperature, turbine speed, NOx, etc.) and other settings. During simulation, an iterative convergence algorithm is executed to find the behavior at each operating point (temperatures, pressures, flows, rotation speeds, etc.). These determined behaviors are highly important for control loops during the runtime of the engine, as they provide a theoretical baseline for what should be occurring across the whole engine. However, as performance simulations are computationally too complex to run within the engine control system, some of the behavior calculations are approximated using runtime predictors.
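Definitions 4–5 can be sketched with plain dictionaries. This is a minimal illustrative sketch; the component names and attribute values below are invented examples, not taken from a real performance model.

```python
class DirectedAttributedGraph:
    """DAG = (N, E, A, src, trg, attr) stored as sets and dictionaries."""
    def __init__(self):
        self.nodes, self.edges = set(), {}
        self.attrs = {}                       # (element, attribute) -> value

    def add_node(self, n, **attributes):
        self.nodes.add(n)
        for a, v in attributes.items():
            self.attrs[(n, a)] = v

    def add_edge(self, e, src, trg):
        self.edges[e] = (src, trg)            # realizes the src/trg functions

    def src(self, e): return self.edges[e][0]
    def trg(self, e): return self.edges[e][1]
    def attr(self, element, a): return self.attrs[(element, a)]

# A tiny "engineering performance model": a compressor feeding a turbine.
epm = DirectedAttributedGraph()
epm.add_node("compressor", efficiency=0.88)
epm.add_node("turbine", efficiency=0.91)
epm.add_edge("flow1", "compressor", "turbine")   # flow path edge
```

A real performance model attaches many more attributes (pressures, flows, capacities) and runs an iterative solver over this structure at each operating point.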
In this paper, we use the performance model to create an array of operating points (varying temperature, humidity, engine power output, etc.), which will generate the on-engine sensor readings for each point. Then, these values will serve as the input to a ML-based predictor of bearing loads.

Aerothermal discipline
Engines are designed to rotate turbine blades as efficiently and quickly as possible to generate power. The aerothermal discipline aims to design compressor and turbine blades.
Designing the airfoil of a turbine blade and its cross sections involves the use of 2D and 3D models to balance the aerodynamic properties of the blades with lift. Such simulations involve solving Bernoulli equations to determine stresses, lift, efficiencies, and the required amounts of flow, temperatures, and pressures for the turbine and compressor blades to function properly. Aerothermal flow requirements are used as input to run secondary air system models under correct assumptions.

Secondary air system discipline
Gas turbines endure high temperatures and pressures as a result of compression and combustion. Thereby, efficient cooling is paramount to maintain proper engine function, structural integrity, and significant engine life. The secondary air system bleeds air from the main intake path and guides it through secondary passages to cool parts of the engine, primarily discs, as well as to balance pressures. These passages are lined with seals and restrictions to meter the flows required in the cooling process.
The vast majority of secondary air system design time is spent working on a 1D model, which is represented as a directed graph. Each graph node (with a set of attributes) corresponds to a component (e.g., inlet, seal, pump) within the secondary air system and graph edges form connections (paths) between components. A simulation run involves setting the attributes of the engineering SAS model (e.g., pressures and temperatures from the combustion process), and iteratively solving the physics flow equations for the remaining node attributes until convergence. Pressures, temperatures, and loads are calculated on the different components within the graph.
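The SAS-style iterative solve can be illustrated on a tiny path graph: boundary node attributes (here, pressures) are fixed, and interior nodes are relaxed until the largest update falls below a threshold. Real solvers iterate full flow equations over the whole graph; the simple averaging update, node names, and tolerance below are assumptions for illustration only.

```python
# Toy 1D SAS graph: inlet -> seal -> restrictor -> outlet, with fixed
# boundary pressures and interior pressures solved by relaxation.
pressures = {"inlet": 10.0, "seal": 0.0, "restrictor": 0.0, "outlet": 2.0}
path = ["inlet", "seal", "restrictor", "outlet"]   # edges along the path
fixed = {"inlet", "outlet"}                        # boundary conditions

for _ in range(1000):
    max_change = 0.0
    for i, node in enumerate(path):
        if node in fixed:
            continue
        # Relaxation update: average of the two neighboring nodes.
        new = 0.5 * (pressures[path[i - 1]] + pressures[path[i + 1]])
        max_change = max(max_change, abs(new - pressures[node]))
        pressures[node] = new
    if max_change < 1e-9:                          # convergence threshold
        break
```

For this toy network the converged interior pressures interpolate linearly between the boundaries; each full SAS run repeats such an iteration over a much larger graph, which is why it is too expensive to execute inside the control system.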
The SAS system must behave correctly at runtime; otherwise, the engine will suffer large stresses and high levels of heat. Because each SAS simulation run is a computationally complex operation, certain SAS predictors must exist at runtime to help the control system determine proper behavior and prevent safety-critical operation errors.
In this paper, we propose to create the data necessary to train predictors for runtime bearing load prediction by running the secondary air system model simulations across a vast number of operating conditions and storing the outputs. The solver-calculated pressures and loads are used to calculate bearing loads.

Objectives
In order to decrease engineering efforts and better maintain proper bearing loads, the challenge is (1) to improve predictions, and (2) to automate the process of developing and deploying predictor components.
These high-level challenges trigger three main technical challenges in a systems engineering context:

1. Improve the quality of runtime predictions: Better runtime predictions of bearing loads result in improved maintenance of target loads, improved engine life, and prevention of safety-critical failures.
2. Automate the design process of predictors: An automated design process saves engineers significant time which can be better spent on other tasks. An additional benefit is that more data and simulations can be run in the background, yielding better predictors and evaluations.
3. Automate the deployment of runtime predictors to dedicated hardware: The hardware on which the control system of such engines is developed is a programmable logic controller (PLC). PLCs are primarily designed to be robust, consistent, and capable of functioning in any environment. As such, these controllers tend to be weak by modern computation standards: they are single-threaded, they disallow dynamic allocation, and they do not optimize provided code. This forces the deployed predictor to be computationally efficient.
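These hardware constraints shape what deployable predictor code must look like: fixed-size arrays, plain loops, and no runtime allocation. The sketch below mimics (in Python, for readability) the style that generated PLC structured text must follow for a tiny neural-network forward pass; the weights and layer sizes are illustrative placeholders, not a trained model.

```python
# All storage is preallocated at fixed sizes, mirroring a PLC target where
# dynamic allocation is unavailable.
W1 = [[0.5, -0.2], [0.3, 0.8]]    # 2x2 hidden-layer weights
B1 = [0.1, -0.1]                   # hidden-layer biases
W2 = [[1.0, 1.0]]                  # 1x2 output-layer weights
B2 = [0.0]                         # output-layer bias
hidden = [0.0, 0.0]                # preallocated activation buffer
output = [0.0]                     # preallocated output buffer

def forward(x0, x1):
    """One forward pass using only scalar arithmetic and fixed loops."""
    for j in range(2):             # hidden layer: affine transform + ReLU
        s = W1[j][0] * x0 + W1[j][1] * x1 + B1[j]
        hidden[j] = s if s > 0.0 else 0.0
    for k in range(1):             # linear output layer
        output[k] = W2[k][0] * hidden[0] + W2[k][1] * hidden[1] + B2[k]
    return output[0]

y = forward(1.0, 2.0)
```

Because every operation count and memory cell is known at code-generation time, such a predictor fits within the deterministic cycle budget of a single-threaded PLC.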

System architecture for load prediction
This section presents an architectural overview of the proposed bearing load prediction system for gas turbines (illustrated in the form of a SysML block diagram in Fig. 4). The architecture incorporates three stages of development: design time, deployment, and production. It details the process starting from running engineering design tools to physically running a real-time engine bearing load predictor in the control system. The design phase begins after there exist approved versions of performance, aerothermal, and secondary air system design models. These models enable simulations of each of the corresponding subsystems. A mixture of data from all three subsystems is needed to develop the predictor in the Predictor Creator module. The simulations of the various systems are connected as a chain. The chain begins with a performance model being run at an operating condition. Performance outputs are required for aerothermal simulations to be run, and the outputs of both are required by the secondary air system model. A Tool Runner was developed to automate and launch each of the analysis tools across a desired set of operating conditions and to organize simulation output data in a way which maintains consistency between each of the performance, aerothermal, and secondary air system outputs. This data organization is crucial as it enables the data querying used for data distillation. The Distiller filters out important parameters and combines pieces of data to create data sets for machine learning. The distilled data sets are used in the Predictor Creator, in which an ML predictor is developed (i.e., a neural network in our case). The resulting predictor is deployed onto a PLC in the deployment phase. The predictor is saved in a customized JSON format which includes the model type, connections, and weights within the network. The predictor is provided to the Code Generator module which contains a Source Code Generator and Configuration Generator.
The former creates PLC source code for the runtime predictor including instructions for activation functions and proper execution order of instructions. The latter packages this source code together as well as generates the configuration requirements such as data structures, and metadata such as versioning. Using these configurations and source code, the predictor is uploaded onto the PLC as a routine.
In production, on-engine Sensors are read periodically and their values are delivered to the Load Predictor. The Load Predictor uses these input values and updates the current bearing load predictions. These predictions are used by the Active Load Controller to calculate how to adjust the actuator to increase or decrease the bearing load to the target value from the prediction. A change induced by the actuators causes a change in the bearing load. This loop repeats indefinitely until engine shutdown.

Industrial benefits The adoption of this architecture and the technical details presented in the following sections have yielded several notable benefits for Siemens Energy:

1. Improved predictions: ML-based prediction modules have demonstrated improvement upon existing traditional prediction modules w.r.t. pre-defined metrics. For example, mean absolute error was reduced by up to 60x by using ML-based predictions (see Sect. 8).
2. Process time saving: The process of automating data generation and component creation periodically saves 20+ engineering days. As this is a repeatedly recurring process, it translates into significant cost reduction.
3. Versioning: Due to savings in process time, if engineering models are updated, new, improved versions of prediction modules can be deployed.
4. Improved control software: Thanks to automated code generation, the process of writing, debugging, and testing prediction modules is significantly simplified, simultaneously improving product quality and decreasing costs.
5. Multi-platform support: Code generation enables a single ML model artifact to be deployed to multiple production platforms (e.g., different-vendor PLCs).

Prediction module problem definition
Below, we define the requirements, evaluation and deployment criteria for the prediction module, and present an existing (traditional) solution for the problem.

Requirements for the load prediction module
The load prediction module must fulfill certain design requirements (related to inputs, outputs and computational cost of operations) as well as specific deployment constraints in order to operate in the existing engine architecture.
- Inputs: Inputs are limited to any subset of the available on-engine sensors. These inputs can be considered valid (other mechanisms in the engine test sensor validity), so the sensor readings will not be woefully unreliable or inaccurate.
- Output: The output of a given load prediction module is a predicted bearing load provided in the standard units employed in the control system. Prediction will take place on three (real) bearings referred to as Bearing 1, Bearing 2, and Bearing 3.
- Computation budget: Computation is limited to at most L computational operations (e.g., addition, multiplication, subtraction, reads, and writes). This computation limit is imposed by constraints of the target hardware platform on which the predictor is deployed, to guarantee in-time prediction (see Sect. 9 for further details).
The prediction module must also be capable of being deployed to a 32-bit programmable logic controller (PLC) while adhering to these constraints.

Evaluation criteria
Our experimental evaluation is based on normalized bearing load prediction error (where smaller is better). Each prediction model is evaluated on the following four metrics and compared with the existing traditional prediction module presented in Sect. 6. MAE and RMSE are standard ML evaluation criteria and help understand the average performance for a given predictor. MO and MU are motivated by the engineering context where worst case scenarios are highly relevant. Bearing failures can be safety-critical when a bearing is overloaded (result of underprediction) or underloaded (result of overprediction). Thus, it is imperative to evaluate how well a prediction module behaves at the extremes. As such, we intentionally deviate from ML best practices to adapt them to a safety-critical engineering context.
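The formulas for the four metrics are not restated in this excerpt. Under standard definitions, and reading MO and MU as maximum overprediction and maximum underprediction respectively (our assumption based on the surrounding discussion), they would take the form:

```latex
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2},

\mathrm{MO} = \max_{i}\left(\hat{y}_i - y_i\right), \qquad
\mathrm{MU} = \max_{i}\left(y_i - \hat{y}_i\right).
```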

Existing load prediction module
The existing (baseline) load predictor module is a sequential composition of two prediction components (Fig. 5). Our baseline prediction architecture represents the current engineering best practice at Siemens for analyzing gas turbines.
- The first component is a correlation map which correlates (a subset of) on-engine sensor readings to some unmeasured internal engine parameters such as pressure. These internal parameters exist as attributes in the performance discipline's 2D thermo-mechanical model.
- The second component is a correlation map from these internal engine parameters to the bearing loads. The bearing loads were determined via calculations obtained from the 2D secondary air system model.
Each prediction component was developed individually with performance discipline engineers developing the first, and secondary air system discipline engineers developing the second. These two components were then combined to form the load predictor module. The two components, for clarity, are henceforth referred to as performance component and SAS component.
The internal engine parameters (output of the performance component and input to the SAS component) have traditionally been used in the prediction chain due to the strong physics interlinkage between them and the bearing loads.

ML-based load prediction architectures
This section proposes several ML-based load prediction module architectures together with the underlying ML models.

Proposed prediction module architectures
We propose a number of purely ML-based and hybrid architectures for the prediction module (visualized in Fig. 6). We propose these architectures to study the replacement of modules and components in the existing context of gas turbine design. These architectures cover all permutations of the prediction chain in the existing module. Each architecture will be instantiated with two sets of starting sensor inputs designated as limited and all. The limited set includes only sensors used as input to the existing traditional prediction module. The all set will enable the prediction module to access all available on-engine sensors. Modules with limited sensors will be tagged as (lim) and modules with access to all sensors as (all).
For each architecture, we experiment with two subclasses of ML techniques using Bayesian ridge regression and feed-forward neural networks, described in Sect. 7.2. In the 2ML architecture, we study the effect of different combinations of underlying ML models.

Background on ML architectures
Two ML architectures are explored in this paper: (1) Bayesian ridge regression and (2) feed-forward neural networks (NNs).
- Bayesian ridge regression was selected due to its low computation requirements (which is beneficial for deployment to PLCs) and its effectiveness in various application contexts (e.g., the Bayesian approach fits existing data well).
- Neural networks have been demonstrated to successfully learn complex functions from a variety of data sets. Extremely large neural networks and deep learning techniques are not explored due to the hardware limitations of the underlying PLC controller.
While we also studied and experimented with other ML architectures, we restrict our presentation to these two techniques in the paper to limit the number of combinations of hybrid components.
We provide formal definitions for the core ML models used in the context of the paper.

Definition 8 (ML model) A machine learning (ML) model ML = (x, ŷ, Data, Arch, HP) consists of a vector of inputs x, a vector of outputs ŷ, a (training) data set Data with pairs of associated input and output values (x_i^(d), y_i^(d)), an ML architecture Arch, and a set of hyperparameters HP.
During prediction, the model applies a function f_ML to the input to predict the output, i.e., ŷ = f_ML(x). The function is determined during training by minimizing a given loss function based on training data. An architecture Arch implicitly includes a loss function loss(ŷ, y) = val which calculates a metric value val, measuring how different a set of predictions ŷ is from its corresponding set of actual values y. Common loss function metrics include mean square error and cross entropy loss [4].

Bayesian ridge regression
Bayesian ridge regression is a linear ML model, which predicts only a single value as output.

Definition 9 (Linear ML Model) A linear ML model LIN = (x, ŷ, Data, Arch, HP) has an architecture Arch with a (single) output prediction defined as a weighted sum (ŷ = f_LIN(x) := Σ_i w_i · x_i) of the input variables.
Weights w_i within a linear model are determined by an algorithm which minimizes a defined loss function on training data. Due to the simplicity of linear models, they are fast to train and compute, and they have low storage needs for parameters (one weight value for each input value). An additional benefit of their simplicity is that they rarely overfit: they have very few parameters and must learn to fit the overall trend of the data. On the other hand, they may not learn to handle outliers well.
There are many techniques for determining suitable weights. Bayesian ridge regression determines a linear ML model using Bayesian inference. This linear approach is more robust to outliers, and it can also provide a confidence level in its predictions as a measure of variance from previous data.

Definition 10 (Bayesian Ridge Regression [42]) Bayesian ridge regression BR = (x, ŷ, Data, Arch, HP) is a linear ML model whose architecture relies on Bayesian inference to determine the weights under two assumptions:

1. The prediction has a probability density which is normally distributed, p(ŷ|x, w, α) = N(y|μ, σ²), where μ = wᵀx and σ² = α is some internal parameter.
2. A prior on the weights is normally distributed with a mean of 0, i.e., p(w|λ) = N(w|0, λ⁻¹ I_p), where λ is an internal parameter related to precision.

Formally, the output of the Bayesian ridge regression model is a (normal) distribution, which is then turned into a single value prediction ŷ by taking its mean. An iterative learning algorithm jointly estimates the parameters w, α, and λ, where α and λ are estimated by maximizing the log marginal likelihood.

Neural networks
Neural networks are composed of artificial neurons which apply an activation function to their inputs to get the output.

Definition 11 (Artificial Neuron) An artificial neuron an = (x, y, Bias, f_a) consists of a vector of inputs x, a single output y, a bias term Bias, and an activation function f_a. A neuron applies (activates) the activation function to the sum of its inputs and adds its bias term to generate its output, i.e., y = f_a(Σ_i x_i) + Bias. Activation functions used in the paper include the rectified linear unit, ReLU(x) = max(0, x), and linear, LINEAR(x) = c · x, where c is some constant.
Neural networks are composed of layers of neurons where signals are sent from one layer to another via connections. The first layer is the input layer, the last layer is the output layer, while internal layers are called as hidden layers.
Definition 12 (Artificial Neuron Connection) An artificial neuron connection anc = (src^(i), trg^(j), w) leads from a source neuron src^(i) (in layer i) to a target neuron trg^(j) (in a subsequent layer j) with a given weight w. When neuron src^(i) activates, the neuron connection multiplies the output of src^(i) by its weight w and sets the k-th input element of trg^(j) to the product (trg^(j).x_k = w · src^(i).y).

Definition 13 (Artificial Neural Network)
An artificial neural network ann = (x, ŷ, AN, ANC) contains a vector of inputs x, a vector of outputs ŷ, a set of artificial neurons AN arranged in n layers (where each neuron an^(i) ∈ AN belongs to exactly one layer 1 ≤ i ≤ n), and a set of artificial neuron connections ANC. The input x is composed of (the single) input of neurons in layer 1, i.e., x_i = an_i^(1).x, while the output ŷ is composed of (the single) output of neurons in layer n, i.e., ŷ_j = an_j^(n).y.

Neural networks learn from data by adjusting the weights of their connections. Such adjustments are performed by an optimization algorithm (optimizer) aiming to minimize a given loss function. Many such optimizers are derivative-based, which imposes further assumptions on the loss and activation functions.
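A minimal pure-Python sketch of the forward pass implied by Definitions 11, 12, and 13 might look as follows. Note that, following Definition 11 literally, the bias is added after the activation; the data layout (one tuple per neuron) is our own illustrative choice, not the paper's implementation.

```python
def relu(x):
    """Rectified linear unit activation."""
    return max(0.0, x)

def forward(layers, inputs):
    """Forward pass per Definitions 11-13.

    Each layer is a list of neurons; each neuron is a (weights, bias,
    activation) tuple whose weights index the previous layer's outputs.
    """
    outputs = list(inputs)
    for layer in layers:
        nxt = []
        for weights, bias, act in layer:
            # Connections multiply the previous outputs by weights (Def. 12);
            # the neuron sums its inputs, applies the activation function,
            # and adds its bias term (Def. 11).
            s = sum(w * o for w, o in zip(weights, outputs))
            nxt.append(act(s) + bias)
        outputs = nxt
    return outputs

# A tiny 2-input network: one ReLU hidden layer, one linear output neuron.
hidden = [([1.0, -1.0], 0.0, relu), ([0.5, 0.5], 1.0, relu)]
output = [([2.0, 1.0], 0.0, lambda x: x)]
prediction = forward([hidden, output], [3.0, 1.0])
```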

Bayesian ridge regression
We use the implementation of Bayesian ridge regression in the Scikit-Learn library [36]. This implementation uses four hyperparameters (α_1, α_2, λ_1, λ_2) to provide Gamma distribution priors over the α and λ parameters. We set α_1 = α_2 = λ_1 = λ_2 = 10^-6 to maintain non-informative priors at the start. Training occurs by fitting to the training data set.
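A minimal sketch of this setup with Scikit-Learn could look as follows; the synthetic data is ours, and only the four prior hyperparameter values come from the text.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic near-linear data standing in for the distilled engine data set.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=200)

# Non-informative Gamma priors over alpha and lambda, as in the paper.
model = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6, lambda_1=1e-6, lambda_2=1e-6)
model.fit(X, y)

# Bayesian ridge also yields a per-prediction standard deviation,
# usable as a confidence measure.
pred, std = model.predict(np.array([[5.0]]), return_std=True)
```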

Neural networks
Architecture setup We use the TensorFlow framework [1] to develop NNs. A similar architecture (Fig. 7) is used for all cases: all neurons within hidden layers use the ReLU activation function, and output neurons have linear activation functions. ReLU enables nonlinear learning and the linear activation allows for a wide range of bearing load predictions.
Each NN uses the mean square error loss function and the Adam optimizer [18] to optimize connection weights. Mean square error was chosen because it is differentiable and penalizes large outliers. The Adam optimizer is a standard optimizer used in ML; it has been found to be an effective optimization algorithm for most problems.

Training Each neural network is trained for 1000 epochs (with data shuffled at the end of each epoch and a batch size of 32) over the training data set. Early stopping is not used as no overfitting was experienced [31].

Tuning hyperparameters Given the computational constraint L imposed to obtain deployable ML models (see Sect. 6.1), we classify neural networks into two complexity categories. Simple neural networks are allowed to perform at most L/2 computational operations, while complex neural networks are given a constraint of L computational operations.
While enforcing these computational constraints on the number of operations, we tune the number of neurons, the number of neurons per layer, and the number of neuron layers as hyperparameters. The best performing combination of hyperparameters for each simple and complex NN (as measured on the validation set) is then evaluated on the test set.
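One plausible way to check a candidate network against the L (or L/2) budget is to count its operations layer by layer. The accounting below (one multiplication per weight, one addition per weight including the bias, one activation call per neuron) is our assumption; the paper does not specify its exact operation model.

```python
def count_ops(layer_sizes):
    """Estimate the computational operations of a dense feed-forward network.

    layer_sizes lists the neuron counts per layer, input layer first.
    Per neuron: n_in multiplications, n_in additions (accumulating the
    weighted sum into the bias), and one activation call.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_out * (2 * n_in + 1)
    return total

# e.g., 4 sensor inputs -> 8 hidden neurons -> 1 load output
budget_used = count_ops([4, 8, 1])
```

During hyperparameter tuning, any layer configuration whose `count_ops` result exceeds the budget would simply be excluded from the search space.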

Data and feature engineering
Our process includes generating data, distilling the data into a dataset, feature engineering, and splitting data into training, validation, and test sets.

Data generation
Data is generated from real engineering model simulations. The appropriate engineering models for each performance, aerothermal, and secondary air system are provided by engineers from each discipline.
A tool runner (script) launches engineering performance, aerothermal, and secondary air system model simulations sequentially. The engine operating conditions are defined for each performance run (engine power, altitude, ambient temperature, etc.). The performance outputs are then fed, together with aerothermal parameters, as inputs to the secondary air system model simulator.
The tool runner runs simulations across a vast number of operating conditions to generate more than 150,000 engineering data points (while traditional practice relies on less than 200 points). The simulation results are organized and saved into a storage location that can be queried.
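The chained execution can be sketched as follows, with stub functions standing in for the real discipline simulators. All function names, fields, and numeric outputs here are hypothetical placeholders; only the chaining order (performance, then aerothermal, then SAS) comes from the text.

```python
def run_performance(cond):
    """Stub for the performance simulator."""
    return {"cond": cond, "perf_out": cond["power"] * 0.9}

def run_aerothermal(perf):
    """Stub for the aerothermal simulator; requires performance outputs."""
    return {"aero_out": perf["perf_out"] + 1.0}

def run_sas(perf, aero):
    """Stub for the secondary air system simulator; requires both."""
    return {"bearing_params": perf["perf_out"] + aero["aero_out"]}

def tool_runner(operating_conditions):
    """Run the simulation chain across operating conditions, keeping the
    outputs of all three disciplines linked per condition."""
    records = []
    for cond in operating_conditions:
        perf = run_performance(cond)      # chain starts with performance
        aero = run_aerothermal(perf)      # aerothermal needs performance outputs
        sas = run_sas(perf, aero)         # SAS needs both
        records.append({"condition": cond, "performance": perf,
                        "aerothermal": aero, "sas": sas})
    return records
```

Keeping the three outputs together per record is what later enables the distiller to query consistent data across disciplines.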

Data distillation
The data distiller links data across each of the engineering discipline simulations into a common dataset for training ML models. The distillation process extracts on-engine sensor values and internal-engine parameters from performance simulations, and parameters necessary to calculate bearing loads from the SAS simulations. If a simulation run terminates with an error (e.g., convergence failure), then this run is excluded from the distilled data set.

Feature engineering
Due to the rigorous requirements of our engineering context, we have clearly defined inputs (sensors) and outputs. As such, real feature engineering is limited to determining bearing load values from SAS simulation data. This involves a sequence of arithmetic calculations on a number of distilled SAS parameters.

Data sets
We split the data set into training, validation, and test sets. We randomly select 10% of data for training, 10% for validation, and 80% for test.
We deviate from the common ML practice of selecting a high percentage of data for training and a much smaller percentage for test due to the safety-critical nature of the prediction problem. We must ascertain that our prediction module generalizes well to unseen cases. In our engineering context, our prediction module needs to learn an underlying physics interaction or equation. As the module is learning to approximate solutions to a series of physics equations (only the variable values change), an excessive number of data points should not be required for training.

To experimentally justify the 10/10/80 data split, Fig. 8 presents RMSE results for 1ML architectures, i.e., for a single (individual) prediction component. There are minimal differences between training, validation, and test set scores, and only in one case is there a larger RMSE result on the test set than encountered in training or validation. MO and MU metric values increase in the test set as compared to training, but with a larger sample of data more outliers are expected. The increases in MO and MU values are within a reasonable range. While we limit our presentation here to only 1ML architectures for simplicity, all ML models tested exhibit minimal differences in RMSE between data sets.
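The 10/10/80 split can be sketched as a generic shuffle-and-slice; the paper does not specify its exact splitting code, so the function below is a minimal illustration.

```python
import random

def split_10_10_80(data, seed=0):
    """Randomly split data into 10% training, 10% validation, 80% test."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_train = len(data) // 10
    n_val = len(data) // 10
    train = [data[i] for i in idx[:n_train]]
    val = [data[i] for i in idx[n_train:n_train + n_val]]
    test = [data[i] for i in idx[n_train + n_val:]]
    return train, val, test
```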

Experimental evaluation
In order to evaluate RQ1 and RQ2, this section presents results for each of the 1ML, HSML, HPML, 2ML architectures presented in Sect. 7.1 evaluated on MO, MU, MAE, and RMSE metrics as described in Sect. 6.2. Evaluations are conducted on the test set described in Sect. 7.4.4 consisting of 120,000 data points for each architecture. Moreover, each architecture implementation is compared with the existing traditional prediction module as a baseline.
Results (normalized) are presented for three bearings: Bearing 1, Bearing 2, and Bearing 3. Extra details are provided for Bearing 1, and an overview for all three. These results represent predictions in real Siemens engines; precise values are therefore withheld and all results are normalized.

1ML: single ML model
First, we create ML models which predict the Bearing 1 load from the on-engine sensors (both lim and all) as a single AI component. We explore the performance of each ML model architecture and the effect of the different input sets of sensors.
We compare six prediction modules with lim and all sensors for each of the linear, neural network (simple), and neural network (complex) architectures described in Sect. 7.1. Figure 9 presents the evaluated results.
Observation 1 suggests that ML-augmented prediction modules can improve on traditional techniques, but they may not always perform better for every metric. Observation 2 highlights that certain ML models can exhibit poor performance, but this was not a common case. Observation 3 reveals that the ML module benefits from access to more sensor data. Observation 4 is most easily observable when comparing neural network (simple) with neural network (complex). While neural network (complex) performed better with respect to MAE (improvement of up to 60x) and RMSE, it performed worse in terms of MO and MU. This is important to note when deciding on a deployment model to use in production.

HSML: hybrid model with ML SAS component
In this section, we present results for the HSML modules. We begin by evaluating the SAS components individually with respect to the existing traditional SAS component and then present the evaluation of the entire hybrid module. Figure 10 presents results for SAS components evaluated independently (not the full prediction module). When analyzing these results, there are several key observations. Observation 1 suggests there exists a propagation of error across components in the existing traditional module. Errors in the performance component propagate further with inaccurate predictions in the SAS component. Observation 2 presents further evidence that ML-augmented components can improve upon existing techniques. Observation 3 shows that components can be biased toward a specific metric; clearly, the existing traditional SAS component is biased toward underprediction. Observation 4 shows that even more complex models (beyond the imposed computation limit) could yield even better performance metrics. Observation 5 suggests that all on-engine sensors could provide more relevant information for prediction.

We have several ML model components which could serve as replacements for the existing SAS component. Each of these ML model components performs better in regard to MAE and RMSE. We now replace the existing traditional SAS component with each of these new ML model components in the existing module.
The existing performance module is kept unchanged and the existing SAS component is swapped for each trained ML model component in Fig. 10. Thus, the traditional performance component predicts the internal parameters from the limited sensor input and a SAS component predicts the Bearing 1 load. Figure 11 presents results.
Observations 1 and 2 show that replacing existing components with ML-augmented components can improve module performance, but improvement may be limited by other components. Observation 3 is particularly important as it is a negative result. We observed that improvement in a single component in a system could actually decrease the overall performance of the system, thus violating compositionality for predictor components. It is therefore paramount to re-execute integration and system-level testing to investigate the system behavior in its entirety before deploying "improved" components.

HPML: hybrid model with ML performance component
We present the results of a prediction module with the HPML architecture. We do not show independent ML performance component evaluation as it is difficult to present concisely (multi-parameter) and provides the reader little value. Figure 12 presents the results of HPML architectures. Observations 1, 3, and 4 highlight that ML-augmented components cannot guarantee improved performance in all defined metrics. Observation 2 supports the hypothesis of compounding propagated error between components. With an improved performance component, the overall module MAE and RMSE approached the MAE and RMSE of the existing SAS component individually.

2ML: dual ML model
In Sects. 8.2 and 8.3, we developed a number of ML-augmented SAS and performance components and evaluated hybrid models mixing existing components with the new ML-augmented components. In this section, we study the 2ML architecture by connecting two ML model components to form the prediction module.
We present a subset of the available permutations which we believe provides the most value to the reader (more permutation results are presented in Sect. 8.5). In this section, we compare all permutations arising from linear (Lim.) and NN (All, Complex) for the performance component and all three of linear, NN (Simple), and NN (Complex) for the SAS component. Each of these is presented in relation to the existing module in Fig. 13.
From Observations 1 and 2, we show that different ML model components have different sensitivities to input parameter noise. There clearly exist strong interaction effects when component models are replaced. Unfortunately, these effects are not easy to predict. Components must be evaluated in integration, as a whole. Observation 3 shows that with accurate (MAE, RMSE) prediction components, the propagation of error is quite low.

Summary
In the previous subsections, we presented results for Bearing 1 prediction modules incorporating the 1ML, HSML, HPML, and 2ML architectures. Now, we summarize our results for each architecture for Bearing 1, Bearing 2, and Bearing 3. We omit results pertaining to limited sensor input (except for the existing traditional module) as modules limited to these sensors achieve much worse performance.
Results for Bearings 1, 2, and 3 are presented in tournament-style tables in Fig. 14. Pairings were done in sequential order of Existing, 1ML, HSML, HPML, and 2ML module architectures. Note: certain modules may have advanced further (or even won) in the tournament if paired with different modules; important observations can be found by comparing non-competing modules. To help the reader navigate through the table, only the better metric result for each "competition" is coloured.
Observations 1 and 3 suggest that ML modules can be effective replacements for existing prediction modules. Best performing 1ML modules (in terms of MAE) achieved significant 60x, 22x, and 17x MAE reduction for Bearings 1, 2, and 3, respectively. Observation 2 suggests that integrated one-component modules may be a recommended option for prediction modules to decrease error propagation as well as computation costs. As a final remark, no ML model trained for each prediction task suffered from overfitting.

RQ1: How effective is it to replace an individual (traditional) prediction component or a chain of components (module) with ML-driven counterparts?
-Experimental results for predicting bearing loads in a gas turbine engine suggest that ML-driven prediction components can be more effective than traditional prediction components. -Results also show the errors of individual components may propagate along a chain of prediction components thus degrading end-to-end prediction performance. As such, the principle of compositionality is violated. -As a consequence, special attention is needed for the replacement of individual components within a chain as an "improved" component may degrade the overall prediction of an entire module (component chain).

Deployment
Next, we present an automated technique for creating deployment models from validated models (with a focus on neural networks due to their complexity) and deploying them into production in the engine control system running on a programmable logic controller (PLC) hardware. We propose to use code generation techniques that treat ML models as simple deployment model artifacts.

Hardware
The hardware on which the control system of such engines is running is a Programmable Logic Controller (PLC). PLCs are primarily designed to be robust, capable of functioning in any environment, and consistent. As such, these controllers are single-threaded and do not support dynamic memory allocation or code optimization. Real-time programs loaded into a PLC controller are looped continuously in time intervals until the controller is turned off. Each program is given a time frame (e.g., 10 ms) within which it must complete. It is common practice to assign different priorities and allowed times to different real-time programs. Programs with greater priority may preempt or interrupt lower priority programs, execute, and then relinquish control. This is illustrated in Fig. 15.
Given that the controller has a slow processor and the control system has strict, hard real-time requirements, any deployment model must be computationally efficient to be able to complete its prediction in time. Deployment to PLCs is currently limited to non-adaptive models.

ML model format
A feed-forward neural network model is simply a mathematical function, composed of smaller mathematical operations occurring in layers. As such, it is possible to encode, or represent, the model as a sequence of operations in a standardized format. This is the basis for how TensorFlow, Keras, PyTorch, MATLAB, etc. allow one to save a model and reload it later. However, each has its own standard and, for the purposes of a PLC, encapsulates extraneous information, such as training-time parameters and behavior. Thus, we developed a new model representation for storing validated models for PLC deployment to prevent vendor lock-in.
Our neural network ML model uses the JavaScript Object Notation (JSON) format. The JSON file maintains a list of layers (feed-forward neural networks can only receive inputs from previous layers). For each layer in the model, the following attributes are maintained:

- Name: The given name for a layer will be used to refer to layer computations in the generated PLC code.
- Activation function: The generated PLC code will call appropriate activation function instructions; the activation function is assumed to be identical within a layer. If multiple activation functions are desired in a layer, the layer can be split into two parallel layers.
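A hypothetical instance of such a JSON model file might look as follows. The exact field names beyond those listed above (e.g., `model_type`, `version`, `biases`) are illustrative guesses; the text only states that the format records the model type, connections, weights, layer names, and activation functions.

```json
{
  "model_type": "feed_forward_nn",
  "version": "1.0",
  "layers": [
    {
      "name": "hidden1",
      "activation": "ReLU",
      "weights": [[0.5, -1.2], [0.3, 0.8]],
      "biases": [0.1, -0.4]
    },
    {
      "name": "output",
      "activation": "Linear",
      "weights": [[1.5, 0.7]],
      "biases": [0.0]
    }
  ]
}
```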

Source code generation
To deploy the JSON encoded neural network onto a PLC, we use code generation techniques. Given that we have a unified encoding of a sequence of mathematical operations, we need to correctly unfold the sequence, and apply the proper operations at the right time.
PLC code can be composed of routines and functions. Routines define an execution order of instructions and allow for jumps to other routines. Upon completion of a routine which was jumped to, the caller routine continues execution. Functions are reusable pieces of code which can be called in a routine with provided variables.
The prediction model deployed to the PLC uses the following generated logic. As feed-forward neural networks can only receive input from previous layers, the activation of layers occurs sequentially, where each layer has its own routine. Each neuron in a layer activates, and then the next layer's routine is run. This process is repeated until each defined layer in the model is generated.
A sample "main" routine which executes the core logic of a prediction program is presented in Listing 2.

Activation functions
Activation functions are generally simple mathematical operations that take one input and return one output. Each neuron in a neural network requires an activation function. The code generator contains templates for standard, named activation functions (ReLU, Linear, etc.). As with the rest of the generated code, templates for each PLC programming language are derived from previously developed code.
Definition 15 (PLC activation function) A generated PLC activation function PAF = (In, Out, fn, Name) defines the input memory location In, the output location Out, the function code fn applied to the input to produce the output, and the activation function name Name.
We use a naming convention to name each activation function as NN_ActivationFunctionName (e.g., NN_Linear) for consistency and to create a library of (template) functions for the PLC.
The code generator extracts all types of activation functions used within the NN from the validated ML model and then extracts the corresponding activation function PLC code from its template library. Each activation function is generated as its own independent function and is used as such. When a neuron "activates", it calls the relevant function from the defined function set. An example of what a ReLU activation function template would look like is presented in Listing 3.
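This extraction step can be sketched as follows. The template bodies below are written in an illustrative structured-text style (the exact vendor syntax is an assumption); only the `NN_` naming convention comes from the paper:

```python
# Hypothetical template library of activation functions in a generic
# structured-text style. Function names follow the paper's NN_ naming
# convention; the bodies are illustrative, not a specific vendor's syntax.
TEMPLATES = {
    "ReLU": (
        "FUNCTION NN_ReLU : REAL\n"
        "VAR_INPUT In : REAL; END_VAR\n"
        "IF In > 0.0 THEN NN_ReLU := In; ELSE NN_ReLU := 0.0; END_IF;\n"
        "END_FUNCTION"
    ),
    "Linear": (
        "FUNCTION NN_Linear : REAL\n"
        "VAR_INPUT In : REAL; END_VAR\n"
        "NN_Linear := In;\n"
        "END_FUNCTION"
    ),
}

def emit_activation_functions(layers):
    """Collect the distinct activations used by the model and return the
    generated PLC function code, one independent function per activation."""
    used = {layer["activation"] for layer in layers}
    return {name: TEMPLATES[name] for name in used}

# Two hidden layers share ReLU, so only one NN_ReLU function is emitted.
layers = [{"activation": "ReLU"}, {"activation": "ReLU"},
          {"activation": "Linear"}]
generated = emit_activation_functions(layers)
```

Emitting each activation as a single shared function keeps the generated program small even when many neurons use the same activation.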

Layer generation
Each neuron layer has two generated routines derived from the validated ML model described in Sect. 9.2. Each layer contains references to defined arrays following the naming convention OUTPUT_LayerName. Each array element serves as a storage location for a neuron; it captures the neuron's input value and, post-activation, holds the output value.
Determining the output for each neuron occurs in two stages: (1) first, we compute the input (routine referred to as APPLY_WEIGHTS_BIAS_LayerName) and then (2) we apply the activation function to get the output (routine referred to as APPLY_ACTIVATION_LayerName). The input is determined by adding the neuron's bias term to the sum of products between weights and the corresponding neuron outputs from previous layers (referenced by their corresponding OUTPUT_PrevLayer names, see Listing 4). This input is stored in the array, after which the activation function is applied (Listing 5) and overwrites the array value.
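The two-stage computation performed by these generated routines can be stated as a short reference implementation. This is a behavioral sketch of what the generated PLC routines compute, not the generated code itself:

```python
def apply_weights_bias(prev_outputs, weights, biases):
    """Stage 1 (APPLY_WEIGHTS_BIAS): each neuron input is its bias plus
    the sum of products of its weights with the previous layer's outputs."""
    return [b + sum(w * o for w, o in zip(ws, prev_outputs))
            for ws, b in zip(weights, biases)]

def apply_activation(inputs, fn):
    """Stage 2 (APPLY_ACTIVATION): overwrite each stored input value with
    the activated output value."""
    return [fn(x) for x in inputs]

def relu(x):
    return max(0.0, x)

# One neuron with weights [0.5, -1.0] and bias 0.25 over two inputs:
prev = [1.0, 2.0]
inputs = apply_weights_bias(prev, [[0.5, -1.0]], [0.25])   # [-1.25]
outputs = apply_activation(inputs, relu)                   # [0.0]
```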

Configuration generation
PLC code must be packaged in a way that enables deployment into the control system. We define a configuration file which contains named routines with auto-generated source code, initialization of all data structures, and metadata. Routines Routines are composed of generated source code. For consistency of code generation, each routine follows a defined naming convention.
- APPLY_WEIGHTS_BIAS_LayerName for routines that compute neuron inputs.
- APPLY_NORMALIZATION_LayerName for routines that apply normalization.
- APPLY_ACTIVATION_LayerName for routines that apply the activation function for each neuron in a layer.
Data structures Each layer in a NN needs three defined data structures: an array for weights, an array for biases, and an array for storing layer neuron outputs. Each array is composed of floating point numbers. For optimization purposes, the weight and bias arrays are defined as constants.
Metadata The generated configuration file defines several key pieces of data such as the author, the generation date, and the program name.
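Putting the three parts together, a configuration artifact could be assembled as in the sketch below. The dictionary field names and layout are assumptions for illustration; only the routine and array naming conventions come from the paper:

```python
from datetime import date

def generate_config(program_name, layer_names, author):
    """Assemble a deployment configuration: named routines, per-layer
    data structures (constant weight/bias arrays plus an output array),
    and metadata. Field names here are illustrative assumptions."""
    return {
        "metadata": {
            "author": author,
            "generated": date.today().isoformat(),
            "program": program_name,
        },
        # Two routines per layer, in feed-forward execution order.
        "routines": [r for name in layer_names
                     for r in (f"APPLY_WEIGHTS_BIAS_{name}",
                               f"APPLY_ACTIVATION_{name}")],
        # Weights and biases are constants; outputs are writable arrays.
        "arrays": {name: {"weights": "CONSTANT",
                          "biases": "CONSTANT",
                          "outputs": f"OUTPUT_{name}"}
                   for name in layer_names},
    }

config = generate_config("bearing_load_predictor",
                         ["HIDDEN_1", "OUTPUT"], "codegen")
```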

Code optimization
In the context of the control system, the code of the deployed predictors needs to run in real time; thus, its efficiency is important. However, via a series of experiments, we determined that not all PLCs have effective compilers.
Optimizing the code syntax (oftentimes to the detriment of readability) can have a profound effect on the execution time of the code. One such optimization, revealed via experimentation, is to avoid separating variable operations into multiple lines. Due to ineffective compilation, the PLC appears to perform extra, unnecessary read and write operations. One can achieve 25%+ faster execution times by assigning a variable (var := ...) only once and keeping all operations on one line of code, even though that line may grow to 1000+ characters if a layer is composed of many neurons. Obviously, code performance improvements are more important than readability for such a prediction module, especially since the code is auto-generated and should not be touched by control systems engineers.
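The single-assignment optimization can be sketched as a small emitter that folds a neuron's entire input computation into one statement. The emitted assignment syntax is illustrative structured-text style; the array names follow the paper's OUTPUT_LayerName convention:

```python
def emit_neuron_input_single_line(target, bias, weights, prev_refs):
    """Emit the whole neuron-input computation as ONE assignment, so the
    target variable is written exactly once. Splitting this across several
    statements was observed (on our PLCs) to add redundant reads/writes."""
    terms = [f"{w} * {ref}" for w, ref in zip(weights, prev_refs)]
    return f"{target} := {bias} + " + " + ".join(terms) + ";"

line = emit_neuron_input_single_line(
    "OUTPUT_HIDDEN[0]", 0.1, [0.5, -1.2],
    ["OUTPUT_INPUT[0]", "OUTPUT_INPUT[1]"])
print(line)
```

For a wide layer the same call simply produces a longer single line, trading readability for the measured 25%+ speedup.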

Physical deployment
Unfortunately, physical deployment cannot be fully automated in the context of gas turbines. The process of physically deploying the prediction module to the control system involves control system engineers importing the generated code and configurations, properly connecting the on-engine sensors to the module, and uploading the control system code onto the physical PLC.

Deployment constraints for ML architectures
A real-time program must be guaranteed to complete within its allocated execution time. The available time is imposed by the PLC hardware, which may define restrictions and constraints for the underlying ML architectures. Even if a given ML architecture performs particularly well during training and validation, it cannot be used for runtime predictions if the architecture cannot be deployed to the PLC. As such, our goal was to identify deployment constraints for ML modules that can be enforced at design time to enable their deployment to the production environment.
Through a series of experimental tests on physical PLC hardware, we determined how much time certain floating-point, integer, read, and write operations require. This knowledge helps us place constraints on the size of the neural network training model (Sect. 6.1).
The majority of NN computation is spent on the multiplication of neuron output values by their respective weights. Thus, the total computation time of a NN model is dominated by the number of existing connections. Using our experimentally determined hardware computation times, we can put a conservative constraint on the number of connections in a neural network training model to ensure that the deployed module will satisfy timeliness related requirements.
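Since each connection costs roughly one multiply, one add, and one memory access, the constraint can be computed directly from the measured per-operation timings. The timing values and the safety factor below are illustrative assumptions, not the paper's measured numbers:

```python
def max_connections(cycle_budget_us, t_mul_us, t_add_us, t_rw_us,
                    safety_factor=0.5):
    """Conservative upper bound on the number of NN connections that fits
    in the PLC's real-time budget. Each connection is costed as one
    multiply + one add + one read/write; the safety factor reserves
    headroom for activation functions and the rest of the control logic.
    All timing arguments are assumed to be measured on the target PLC."""
    per_connection_us = t_mul_us + t_add_us + t_rw_us
    return int(cycle_budget_us * safety_factor / per_connection_us)

# Example: a 10 ms cycle budget with illustrative per-operation timings
# (0.8 us multiply, 0.4 us add, 0.3 us read/write).
limit = max_connections(10_000, 0.8, 0.4, 0.3)
```

A trained architecture whose total connection count exceeds this bound would be rejected at design time, before any deployment attempt.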

Overview
In this section, we used code generation to automate the creation of optimized ML deployment models for specific production hardware, addressing RQ3: How can the deployment of a trained ML model to a CPS hardware platform be automated to improve maintainability?
-Automation of trained ML model deployment can be accomplished via code generation techniques to develop source and configuration files required for a PLC hardware platform. ML models and relevant metadata can be represented within template artifacts. Given code generation and configuration scripts, complex ML models can be maintained (upgraded, replaced, etc.) without significant engineer involvement.

Threats to validity
Construct validity To limit threats to construct validity from a design perspective, our test metrics directly relate to inputs and outputs of the existing prediction module and the correctness of simulation data was validated by Siemens engineers. While we exclusively rely upon simulation data (instead of real field data) for training, this is unavoidable in our engineering context as bearing loads cannot be measured directly.
In addition, the output of our predictors has been validated by using independent system-level runtime simulators regularly used in engine design. During those tests, no warnings or errors were reported. Internal validity We mitigate threats to internal validity by using implementations of ML techniques from trusted and popular ML libraries (Scikit-Learn and TensorFlow) and following machine learning best practices. Likewise, we compare predicted results to actual design time results and validate on 120k unique simulation runs in the test set for each of three bearings of a real gas turbine engine. For validation, each of these three bearings underwent black-box and system-level testing. Additionally, to increase the level of confidence in predictions, we provide metrics evaluating worst case over and under-prediction in addition to standard error metrics used in ML.
External validity While we carried out an extensive evaluation of ML-based predictors in the context of gas turbine design, we do not claim that similar results and findings would be obtained for other CPSs. In particular:
- The data sets within this paper arise from specialized, non-chaotic physics simulations. For other data sets, other ML techniques may be more effective, or ML may not outperform existing traditional predictors.
- Our deployment optimizations are specific to the PLC hardware and compilers which were used, and thus may not generalize to other hardware platforms.
- While in our context it is frequently more effective to replace earlier prediction components in a chain, this may not always be the case (e.g., due to information loss, such as a lossy channel between components).
On the other hand, our key negative finding that the principle of compositionality can be violated within chains of prediction components should generalize (by definition). We do not see any inherently special characteristics of our system which would limit this system-level finding to the context of designing prediction components for gas turbines. As such, developing ML-based predictors which exhibit compositional behavior is a major open challenge.

Conclusions
In this paper, we addressed the problem of applying and deploying machine learning predictors to gas turbine design from a systems engineering perspective. Given the safety-critical nature of gas turbines and the interdependence of existing subsystems developed by coordinated efforts of many multidisciplinary engineering teams working on individual components and modules, architectural changes in the system are difficult and rare.
For this purpose, we proposed and evaluated four architectures (1ML, HSML, HPML, 2ML) for potentially replacing existing bearing load prediction modules with ML-driven counterparts. Despite using only 10% of data for training, for each of these architectures and for both Bayesian ridge regression and NNs, the models generalized and exhibited very minimal differences in performance between training, validation, and test sets.
Additionally, we showcased how an ML model can be automatically deployed and integrated into an off-the-shelf PLC hardware platform. We provided various insights into deployment as well as source code optimization. For example, we showed how to determine and incorporate platform and computation requirements when training ML models to ensure that the deployed ML predictors can actually run on the designated platform.

Conceptual contributions
- We demonstrated the efficacy of applying ML for prediction purposes within a gas turbine CPS control system. With ML, we managed to reduce mean absolute error (vs. traditional methods) by up to 60x and decrease worst-case over- and under-predictions.
- We evaluated the effects of replacing existing system components with ML-driven counterparts. In some cases, prediction module performance dropped even when a component was replaced with a seemingly better (when evaluated as an individual black box) ML-driven counterpart. Integration testing is key, as interaction between components is difficult to predict. Thus, as a key negative finding, we observed the violation of compositionality in components of prediction chains, which poses a major barrier for incremental re-certification.
- We proposed and validated a methodology for automating the deployment of ML models into a low-level control system of gas turbines. By using code generation from a well-defined ML model template, we were able to deploy neural networks and linear models onto a PLC.
Engineering/industrial impact
- Thanks to the automated deployment of prediction modules, over 20 engineering workdays are saved each time an update to a prediction module is required.
- Within our deployment framework, we incorporated versioning to distinguish between prediction modules. This improved version comparisons and made it easy to revert to previous versions if necessary.
- By using code generation, we showed how a prediction module can be deployed to multiple supported PLC platforms from one ML model artifact. This reduces the number of software defects, thus increasing software quality across multiple supported hardware platforms.
The work presented in this paper will be continued at Siemens Energy. It will be applied to other subsystems and deployed in the field in new and revised engines. Further studies, especially in other CPS domains, would help validate opportunities of automated code generation for deployment of ML in existing systems. This would provide other unique case studies with different data sets and hardware platforms which would greatly decrease existing threats to external validity. We believe researching methods which improve compositionality for prediction components would be highly beneficial to the industry at large.