Reliability Prediction Case Study: Industrial Control System

From SDQ-Wiki

This page describes a case study, conducted at ABB Corporate Research (Germany), that applied the PCM (Palladio Component Model) reliability prediction approach to a process control system from ABB. Data collection for the PCM model was conducted in 2010 (see ISSRE'11 paper) as part of the EU project Q-ImPrESS. The new PCM reliability modeling constructs were subsequently used to construct a full PCM model (see IEEE-TSE'12 paper). Finally, the model was extended to reflect different product variants (see QoSA'11 paper).

Industrial automation domain

Figure 1: Industrial Control System

Industrial automation deals with the use of control systems and information technology to reduce manual work in the production of goods and services. While industrial automation systems originated in manufacturing, they are now used in many other application scenarios. Industrial control systems are used, for example, for power generation, traffic management, water management, pulp and paper handling, printing, metal handling, oil refining, chemical processes, pharmaceutical manufacturing, or carrier ships.

Data collection

We constructed a PCM-based reliability model for a process control system developed by ABB. The system implementation consists of several million lines of C++ code. On a high abstraction level, the core of the system consists of eight subsystems, which we treated as software components. These components can be flexibly deployed on multiple servers depending on the system capacity required by customers.

To construct the PCM model, we collected the following input data (details in Koziolek2010).

  • System topology: reconstructed and abstractly defined based on existing architectural documentation, stakeholder interviews, and the evaluation of a running instance of the system.
  • Component failure probabilities: determined by applying the Littlewood/Verrall software reliability growth model (SRGM) recommended in IEEE 1633-2008 to failure report data from the system's bug tracking system.
  • Hardware MTTF/MTTR: determined based on a field study by Schroeder and Gibson.
  • SEFFs and component transition probabilities: derived from an instrumented version of the system, which logged subsystem transitions for two typical usage scenarios over two days.
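
To illustrate how two of these inputs can be turned into model parameters, the following Python sketch estimates transition probabilities as relative frequencies of logged subsystem transitions and computes a steady-state hardware availability from MTTF and MTTR. This is a minimal sketch under our own assumptions: the function names, the log format, and all numbers are hypothetical and are not taken from the case study, and the relative-frequency estimate and the formula MTTF / (MTTF + MTTR) are standard textbook choices rather than a description of the exact procedure used here.

```python
from collections import Counter, defaultdict


def transition_probabilities(log):
    """Estimate P(target | source) as relative frequencies.

    `log` is an iterable of (source, target) pairs, one per logged
    subsystem transition (hypothetical format, assumed for this sketch).
    """
    counts = defaultdict(Counter)
    for source, target in log:
        counts[source][target] += 1
    return {
        source: {
            target: n / sum(targets.values())
            for target, n in targets.items()
        }
        for source, targets in counts.items()
    }


def steady_state_availability(mttf_hours, mttr_hours):
    """Standard steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical transition log; component names are placeholders.
log = [("C1", "C2"), ("C1", "C2"), ("C1", "C3"), ("C2", "C4")]
probs = transition_probabilities(log)

# Hypothetical hardware figures, not values from the field study.
availability = steady_state_availability(mttf_hours=1000.0, mttr_hours=10.0)
```

In this sketch, `probs["C1"]` holds the estimated branching probabilities for transitions leaving C1, which is the kind of value that annotates a branch in a SEFF.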

Industrial Control System Model

Figure 2 depicts a possible configuration of the system with three servers. The names of the components and their failure probabilities have been obfuscated for confidentiality reasons. We modelled four of the most important usage scenarios, which are executed in parallel during system execution. Each usage scenario triggers a different control and data flow through the system.

Figure 2: Industrial Control System Model (Overview)

The current model of the system resides on a high abstraction level (8 subsystems for several million lines of code). While this enables determining the most critical subsystem for the system reliability, we would need a lower abstraction level to make detailed recommendations on how to improve the system. A lower abstraction level, however, requires other data collection methods for internal action failure probabilities. We will investigate this issue in future work.

Prediction Results

Figure 3 shows the sensitivity of the system reliability to varying internal action failure probabilities. We omitted the concrete specification of the system reliability (y-axis) for confidentiality reasons. The system reliability is most sensitive to InternalAction1 (from component C1), because the respective component is used with a high probability in usage scenarios 3 and 4. The system reliability is least sensitive to InternalAction4 (from component C3), because this component is used only in rare cases.

Figure 3: Sensitivity to Software Failures
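
The reasoning behind this sensitivity ranking can be reproduced with a strongly simplified model. The Python sketch below is an illustration under our own assumptions, not the PCM solver: it treats each usage scenario as invoking each internal action independently with a fixed probability, and all usage probabilities, failure probabilities, and the profile are hypothetical. It perturbs one failure probability at a time and compares the resulting drop in system reliability.

```python
def scenario_reliability(usage, fail_probs):
    """Success probability of one scenario.

    `usage` maps an internal action to the probability that the scenario
    invokes it (simplifying assumption: independent, at most one call).
    """
    reliability = 1.0
    for action, p_use in usage.items():
        reliability *= 1.0 - p_use * fail_probs[action]
    return reliability


def system_reliability(profile, scenarios, fail_probs):
    """Usage-profile-weighted sum of the scenario reliabilities."""
    return sum(
        p * scenario_reliability(scenarios[name], fail_probs)
        for name, p in profile.items()
    )


def reliability_drop(action, delta, profile, scenarios, fail_probs):
    """Loss in system reliability when one failure probability grows by delta."""
    perturbed = {**fail_probs, action: fail_probs[action] + delta}
    baseline = system_reliability(profile, scenarios, fail_probs)
    return baseline - system_reliability(profile, scenarios, perturbed)


# Hypothetical inputs: InternalAction1 is used often, InternalAction4 rarely.
scenarios = {
    "Scenario3": {"InternalAction1": 0.9, "InternalAction4": 0.05},
    "Scenario4": {"InternalAction1": 0.8, "InternalAction4": 0.01},
}
profile = {"Scenario3": 0.5, "Scenario4": 0.5}
fail_probs = {"InternalAction1": 1e-4, "InternalAction4": 1e-4}

drop_action1 = reliability_drop("InternalAction1", 1e-4, profile, scenarios, fail_probs)
drop_action4 = reliability_drop("InternalAction4", 1e-4, profile, scenarios, fail_probs)
```

With these assumed inputs, the same increase in failure probability costs more reliability for the frequently used action than for the rarely used one, which mirrors the qualitative pattern in Figure 3.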

Figure 4 shows the system reliability for five different system-level usage profiles (i.e., probability distributions over the system-level usage scenarios). These profiles result from adjusting the usage scenario probabilities within the range of typical customer behaviours. The system reliability is again obfuscated. Note that the origin of the y-axis is not 0.0, as the graph is zoomed in to highlight the differences. The maximum difference in the system reliability is 0.2 percent, which is significant because it results in a number of customer-perceived errors.

Figure 4: System Reliabilities for different Usage Profiles
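
A usage profile comparison of this kind can be sketched as a weighted sum: if each usage scenario has its own success probability, the system reliability under a profile is the probability-weighted average of those values. The Python sketch below assumes exactly that; the scenario reliabilities and the two profiles are hypothetical placeholders and do not correspond to the obfuscated values in Figure 4.

```python
def system_reliability(profile, scenario_reliabilities):
    """Weighted sum of scenario reliabilities under one usage profile."""
    if abs(sum(profile.values()) - 1.0) > 1e-9:
        raise ValueError("usage profile probabilities must sum to 1")
    return sum(p * scenario_reliabilities[s] for s, p in profile.items())


# Hypothetical per-scenario reliabilities (not case-study values).
scenario_reliabilities = {"S1": 0.999, "S2": 0.998, "S3": 0.996, "S4": 0.995}

# Two hypothetical usage profiles over the four usage scenarios.
profiles = {
    "ProfileA": {"S1": 0.4, "S2": 0.3, "S3": 0.2, "S4": 0.1},
    "ProfileB": {"S1": 0.1, "S2": 0.2, "S3": 0.3, "S4": 0.4},
}

results = {
    name: system_reliability(profile, scenario_reliabilities)
    for name, profile in profiles.items()
}

# Largest difference in system reliability across the profiles.
spread = max(results.values()) - min(results.values())
```

Shifting probability mass toward the less reliable scenarios lowers the predicted system reliability, so `spread` gives a rough measure of how much the prediction depends on customer behaviour.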