PCM-based Reliability Modelling

From SDQ-Wiki

The Palladio Component Model (PCM) provides a design-oriented modelling language for component-based software architectures. Architectural specifications created in terms of PCM instances can be evaluated with respect to quality attributes such as performance and reliability.

This page describes the reliability-specific concepts of the PCM modelling language, which allow for expressing failure potentials and capabilities for failure recovery as part of a modelled software architecture. These concepts have been developed in the context of an approach to PCM-based reliability modelling and prediction. The description assumes that the reader is familiar with general architecture modelling as done with PCM (for further information, see the PCM tutorials). A simple example model is used to illustrate the presented concepts and can be downloaded here.

Failure Types

As a main prediction result, PCM-based reliability prediction determines the probability P(SUCCESS|U) of a successful (i.e. failure-free) run through a PCM UsageScenario U. The counterpart P(FAILURE|U) = 1 - P(SUCCESS|U) determines the overall failure probability of the scenario. However, in order to get a more detailed picture of the reliability impacts of different existing failure potentials, the approach allows for differentiating multiple FailureTypes. Through this concept, the potential points of failure (PPOF) and points of recovery (POR) in the modelled control flow can be associated with type information to distinguish different kinds of failures, and the analysis correspondingly differentiates the overall P(FAILURE|U) result.

Figure 1: Repository Specification with Failure Types

Figure 1 shows an example of the specification of FailureTypes, which are contained in a PCM Repository model. The figure shows a "BasicRepository" with two components, "BusinessLogic" and "DatabaseAccess". By default, each created PCM Repository references two further pre-defined Repositories, "PrimitiveTypes" and "FailureTypes", as well as a ResourceRepository "Palladio". The abstract meta-model class FailureType has concrete subclasses according to the software, hardware and network failure dimensions:

  • SoftwareInducedFailureType: Denotes a failure occurrence due to a flawed software implementation; it is associated with an ID and a name. Modellers can freely distinguish SoftwareInducedFailureTypes as they wish. Examples of possible type differentiations include the failure-causing software layers (e.g. application-level, middleware, operating system), the failure effects (e.g. wrong computation result, synchronisation error) and the criticality of failures (e.g. minor, critical, catastrophic). In the example of Figure 1, four specified SoftwareInducedFailureTypes represent the specific failure situations "RequestBufferOverflowFailure", "UserCommandInterpretationFailure", "CertificateProcessingFailure" and "AuthenticationProtocolFailure", which are expected to potentially occur during service execution of the system under study.
  • HardwareInducedFailureType: Denotes a failure occurrence due to service execution trying to access an unavailable hardware resource; includes a name, ID and a reference to a ProcessingResourceType. A PCM instance should contain at most one HardwareInducedFailureType per ProcessingResourceType. As Figure 1 shows, the pre-defined "FailureTypes" Repository already contains HardwareInducedFailureTypes for the "CPU", "HDD" and "DELAY" ProcessingResourceTypes. A HardwareInducedFailureType only needs to be specified if any modelled system-external failure potentials or failure recovery capabilities refer to the corresponding ProcessingResourceType.
  • NetworkInducedFailureType: Denotes a failure occurrence due to an unsuccessful attempt to transmit a service invocation message or return message over a LinkingResource; includes a name, ID, and a reference to a CommunicationLinkResourceType. The pre-defined "FailureTypes" Repository already contains a NetworkInducedFailureType for the "LAN" CommunicationLinkResourceType. A PCM instance should contain at most one NetworkInducedFailureType per CommunicationLinkResourceType, and it is only required for the specification of system-external failure potentials or failure recovery capabilities.

Software Failure Potentials

Software failure potentials refer to flaws in the software implementation of an IT system, which may lead to failure occurrences during service execution. Although the exact nature of these flaws is typically unknown for a concrete system under study, modellers can express a certain "expected failure potential" through independent per-visit failure probabilities annotated to InternalActions, which thereby become PPOFs in the architectural control flow.

Figure 2: Behavioural Specification with Failure Potentials and Recovery Capabilities

Figure 2 shows a behavioural specification which includes software failure potentials, such as the one annotated to the InternalAction "ParseUserRequest" through two InternalFailureOccurrenceDescriptions. Each InternalFailureOccurrenceDescription references a SoftwareInducedFailureType specified in the PCM Repository and includes a failureProbability value between 0 and 1. Within an InternalAction, all included InternalFailureOccurrenceDescriptions must refer to different SoftwareInducedFailureTypes, and the sum of all failureProbabilities must not exceed 1. In the example, the specification expresses that whenever the InternalAction "ParseUserRequest" is visited during service execution, it may (i) lead to a failure occurrence of type "RequestBufferOverflowFailure" with probability 0.00000005, or (ii) result in a "UserCommandInterpretationFailure" with probability 0.0000001, or (iii) be completed without any failure occurrence with probability 1 - 0.00000005 - 0.0000001 (assuming that the required "CPU" resource is perfectly available). This semantics implies that at most one failure can occur while executing the InternalAction.
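The rules for one InternalAction can be illustrated with a minimal Python sketch. It is not part of the PCM tooling: the function name `internal_action_outcomes` and the representation of InternalFailureOccurrenceDescriptions as `(failure type name, probability)` pairs are assumptions made for illustration, and hardware availability is taken to be perfect, as in the example above.

```python
def internal_action_outcomes(failure_descriptions):
    """Return a dict mapping each outcome of one visit to its probability.

    failure_descriptions is a list of (failure_type_name, failure_probability)
    pairs, a hypothetical stand-in for InternalFailureOccurrenceDescriptions.
    """
    seen = set()
    total = 0.0
    for name, probability in failure_descriptions:
        # All descriptions of one InternalAction must refer to different types.
        if name in seen:
            raise ValueError(f"duplicate failure type: {name}")
        # Each failureProbability must lie between 0 and 1.
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"probability out of range for {name}")
        seen.add(name)
        total += probability
    # The sum of all failureProbabilities must not exceed 1.
    if total > 1.0:
        raise ValueError("failure probabilities sum to more than 1")
    outcomes = dict(failure_descriptions)
    # At most one failure occurs per visit, so the rest is the success case.
    outcomes["SUCCESS"] = 1.0 - total
    return outcomes


# Values taken from the "ParseUserRequest" example.
parse_user_request = internal_action_outcomes([
    ("RequestBufferOverflowFailure", 0.00000005),
    ("UserCommandInterpretationFailure", 0.0000001),
])
```

The resulting dictionary contains one entry per failure type plus a `"SUCCESS"` entry holding the remaining probability mass.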

Hardware Failure Potentials

Hardware failure potentials refer to the limited availability of hardware resources, which may lead service execution into failure. In the PCM semantics, any hardware resource is either available or unavailable at any point in time (switching back and forth between these two states), and any attempt to access a currently unavailable hardware resource during service execution leads to a hardware-induced failure occurrence. The specification of hardware failure potentials comprises two aspects, namely (i) the availability of hardware resources, and (ii) the usage of resources by the service execution.

Figure 3: Resource Environment Specification with Reliability Annotations

Figure 3 illustrates the first aspect. Each ProcessingResourceSpecification in a PCM ResourceEnvironment model is annotated with a Mean-Time-To-Failure (MTTF) and a Mean-Time-To-Repair (MTTR) value. Both values must be positive and use the same time unit (e.g., hours). As an exception to this rule, perfect resource availability is expressed by setting both the MTTF and the MTTR to 0. In the example, two ResourceContainers "ApplicationServer" and "DatabaseServer" each contain one ProcessingResourceSpecification referring to the "CPU" ProcessingResourceType. The MTTF values are set to 105120 hours (i.e. 12 years), and the MTTR values are set to 2 hours. The PCM-based reliability prediction uses these values to calculate the steady-state availability of the resource as A = MTTF / (MTTF + MTTR). This value determines the probability that the resource is available when being accessed during service execution.
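As a minimal sketch, the availability calculation including the special case for perfect availability could look as follows in Python. The function name `steady_state_availability` is an assumption for illustration and does not refer to any PCM API.

```python
def steady_state_availability(mttf, mttr):
    """Return A = MTTF / (MTTF + MTTR); both values use the same time unit."""
    # Exception described above: MTTF = MTTR = 0 denotes perfect availability.
    if mttf == 0 and mttr == 0:
        return 1.0
    # Otherwise both values must be positive.
    if mttf <= 0 or mttr <= 0:
        raise ValueError("MTTF and MTTR must both be positive (or both be 0)")
    return mttf / (mttf + mttr)


# Values of the "CPU" resources in the example (hours).
cpu_availability = steady_state_availability(105120, 2)
cpu_unavailability = 1.0 - cpu_availability  # roughly 0.000019
```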

The usage of hardware resources during service execution is modelled by InternalActions via their ParametricResourceDemands, as shown in Figure 2. In the example, the "ParseUserRequest" InternalAction contains a ParametricResourceDemand that references the "CPU" ProcessingResourceType. Assuming that the SEFF-owning "BusinessLogic" BasicComponent is allocated to the "ApplicationServer" ResourceContainer, the execution of "ParseUserRequest" may fail with probability 1 - 105120 / (105120 + 2) = 0.000019 due to unavailability of the underlying resource (note that the actual amount of requested "CPU" work units is not taken into account for reliability prediction). Hence, InternalActions may be subject to hardware-induced failure occurrences as well as software-induced failure occurrences. If an InternalAction requests multiple ProcessingResourceTypes, each of the corresponding resources must be available for the InternalAction to succeed.
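The requirement that all requested resources be available can be sketched as a product of availabilities. This sketch assumes that the resources fail independently of each other, which the description above does not state explicitly; the function name and the second availability value in the usage example are hypothetical.

```python
from math import prod


def hardware_failure_probability(availabilities):
    """Probability that at least one requested resource is unavailable.

    availabilities: steady-state availabilities of all resources requested
    by one InternalAction. Assumes independent resource failures.
    """
    for a in availabilities:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must lie between 0 and 1")
    # The action succeeds (hardware-wise) only if every resource is available.
    return 1.0 - prod(availabilities)


# "ParseUserRequest" requests only the "CPU" of the "ApplicationServer".
cpu_only = hardware_failure_probability([105120 / (105120 + 2)])

# Hypothetical action that additionally requests a resource with A = 0.9999.
cpu_and_other = hardware_failure_probability([105120 / (105120 + 2), 0.9999])
```

With a single requested resource, the result reduces to 1 - A, which matches the value of about 0.000019 from the example.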

A further extension of the approach takes into account the fact that, for some resources (such as CPUs), unavailability typically makes the whole surrounding ResourceContainer inoperable. A boolean attribute requiredByContainer of the ProcessingResourceSpecification indicates whether the availability of the represented resource is essential for the operability of the container. If set to true, service execution experiences a hardware-induced failure occurrence whenever it enters a ResourceDemandingSEFF of a component allocated to the container while the resource is unavailable. Hence, ExternalCallActions become PPOFs with respect to hardware failure potentials if they constitute service invocations to remote components hosted on potentially inoperable ResourceContainers.

Network Failure Potentials

Besides the software and hardware failure dimensions, network communication constitutes a further potential source of failure. The PCM allows for expressing network failure potentials from a high-level point of view, without going into the details of involved network components and protocols. More concretely, a CommunicationLinkResourceSpecification referenced by a LinkingResource within a PCM ResourceEnvironment can be annotated with a non-negative failureProbability value, denoting the probability that an attempt to transfer a message over the link is unsuccessful and leads the service execution into failure.

As an example, Figure 3 shows the "LANConnection" LinkingResource associated with a failureProbability of 0.001 through its included CommunicationLinkResourceSpecification. Whenever a service invocation message or return message between two components goes over the "LANConnection", a network-induced failure occurs with probability 0.001. Therefore, ExternalCallActions such as the one shown in Figure 2 become PPOFs (besides the InternalActions) if the callee is a remote component.
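For a remote call, both the invocation message and the return message cross the link. A minimal sketch of the resulting success probability follows, assuming that the two transmissions fail independently; this independence, the function name and the `messages` parameter are assumptions for illustration.

```python
def remote_call_network_success(link_failure_probability, messages=2):
    """Probability that all messages of one remote call are transmitted.

    messages defaults to 2: one invocation message and one return message.
    Assumes that each transmission fails independently of the others.
    """
    if not 0.0 <= link_failure_probability <= 1.0:
        raise ValueError("failureProbability must lie between 0 and 1")
    return (1.0 - link_failure_probability) ** messages


# "LANConnection" from Figure 3 with failureProbability 0.001.
lan_success = remote_call_network_success(0.001)
lan_failure = 1.0 - lan_success
```

Under these assumptions, a call over "LANConnection" succeeds with probability 0.999 * 0.999, i.e. slightly less than 0.999.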

System-external Failure Potentials

PCM modelling allows for expressing the usage of system-external services in order to provide the system's own services. As an example, Figures 4 and 5 show a PCM System model "BasicSystem" with a system-level OperationRequiredRole "SR_IPublicUserAuthentication" that provides service to the "R_IPublicUserAuthentication" OperationRequiredRole of the "BusinessLogic" BasicComponent encapsulated in the "AS_BusinessLogic" AssemblyContext. Hence, ExternalCallActions within "BusinessLogic" targeted at "R_IPublicUserAuthentication" are not served within the "BasicSystem" but by an external provider.

Figure 4: System Specification with Required System-external Services

Figure 5: Specification of System-external Failure Potentials

Generally, the invocation of system-external service operations constitutes a possible source of failure for the system's service execution. Modellers can express the corresponding failure potentials as shown in Figure 5. The System contains QoSAnnotations, which in turn can contain SpecifiedReliabilityAnnotations. A SpecifiedReliabilityAnnotation references a Role (in the example: "SR_IPublicUserAuthentication") and a Signature ("authenticateUser") to unambiguously determine a certain system-external service operation. Furthermore, it contains a list of ExternalFailureOccurrenceDescriptions, where each element in the list includes a non-negative failureProbability value (0.0000001) and references a FailureType ("AuthenticationProtocolFailure"). The specification of ExternalFailureOccurrenceDescriptions adheres to the same rules as the specification of InternalFailureOccurrenceDescriptions for InternalActions. However, ExternalFailureOccurrenceDescriptions are more general in that they allow for referencing not only SoftwareInducedFailureTypes, but also HardwareInducedFailureTypes and NetworkInducedFailureTypes. Through the specification of system-external failure potentials, ExternalCallActions become PPOFs if they represent invocations of system-external service operations.
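The structure of such an annotation can be sketched with plain Python data classes. The class names mirror the meta-model names for readability, but they are illustrative stand-ins and not the PCM API; the lookup function and the treatment of unannotated operations as failure-free are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExternalFailureOccurrenceDescription:
    failure_type: str          # name of the referenced FailureType
    failure_probability: float


@dataclass
class SpecifiedReliabilityAnnotation:
    role: str                  # system-level required role
    signature: str             # operation signature
    failures: list = field(default_factory=list)


def external_call_success(annotations, role, signature):
    """Success probability of one call to a system-external operation."""
    for annotation in annotations:
        # Role and signature together identify the external operation.
        if annotation.role == role and annotation.signature == signature:
            return 1.0 - sum(f.failure_probability for f in annotation.failures)
    # Assumption: operations without an annotation never fail.
    return 1.0


# Annotation from Figure 5.
annotations = [
    SpecifiedReliabilityAnnotation(
        role="SR_IPublicUserAuthentication",
        signature="authenticateUser",
        failures=[
            ExternalFailureOccurrenceDescription(
                "AuthenticationProtocolFailure", 0.0000001
            )
        ],
    )
]
success = external_call_success(
    annotations, "SR_IPublicUserAuthentication", "authenticateUser"
)
```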

Failure Recovery

The specification of failure recovery capabilities in a system's architecture complements the modelling of failure potentials: even though InternalActions and ExternalCallActions constitute PPOFs in the modelled control flow, failures occurring at those PPOFs are not necessarily perceived as such by the system's users. The system may be able to autonomously compensate for the failure occurrences. To this end, modellers can include RecoveryActions in their behavioural specifications. An example is shown by Figure 2. A RecoveryAction includes a set of inner RecoveryActionBehaviours. One of the behaviours is marked as being the primaryBehaviour and executed first upon entering the RecoveryAction. Other behaviours handle failure occurrences of the primary behaviour, and their failure occurrences may in turn be handled by further behaviours. Overall, the RecoveryActionBehaviours span a tree within the RecoveryAction, where each behaviour includes a set of failure-handling child behaviours. Each child specifies a set of FailureTypes which it can handle (all kinds of FailureTypes related to software, hardware and network are allowed). For each parent behaviour, the sets of handled FailureTypes of all child behaviours must be disjoint, so that a child behaviour can always be unambiguously determined if a parent behaviour fails. The RecoveryAction is left if either one behaviour completes failure-free, or if a failure occurrence within one behaviour is not handled by any of its child behaviours.

In Figure 2, the task of user authentication is performed in a fault-tolerant way. The RecoveryAction "FaultTolerantUserAuthentication" includes the "InternalAuthentication" as its primary RecoveryActionBehaviour, expressing that internal authentication is attempted first. This behaviour may fail with a "CertificateProcessingFailure" during execution of the included InternalAction "AuthenticateUser". In this case, execution proceeds with the "ExternalAuthentication" RecoveryActionBehaviour, which includes an ExternalCallAction for user authentication with the help of the system-external service operation "IPublicUserAuthentication.authenticateUser". As Figure 5 shows, the invocation of this operation can result in an "AuthenticationProtocolFailure", which is not further handled by the "FaultTolerantUserAuthentication" RecoveryAction. If, on the other hand, the "InternalAuthentication" succeeds, the RecoveryAction is left without executing the "ExternalAuthentication" alternative. In conclusion, RecoveryActions may handle local failure occurrences and constitute the PORs in the modelled control flow.
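The tree semantics of a RecoveryAction can be sketched as a small recursive computation. This is an illustration under simplifying assumptions, not the algorithm of the reliability solver: each behaviour is reduced to a map from failure type to failure probability, hardware and network failures are ignored, and the class `Behaviour`, the function `outcome_distribution` and the probability 0.0001 for "CertificateProcessingFailure" are hypothetical (the text above gives no value for it).

```python
from dataclasses import dataclass, field


@dataclass
class Behaviour:
    name: str
    # failure type -> probability that this behaviour fails with that type
    failure_probs: dict
    # failure type -> child behaviour handling it; using the failure type
    # as key keeps the handled sets of all children disjoint
    handlers: dict = field(default_factory=dict)


def outcome_distribution(behaviour):
    """Return outcome -> probability after recovery within the tree."""
    result = {"SUCCESS": 1.0 - sum(behaviour.failure_probs.values())}
    for failure_type, p in behaviour.failure_probs.items():
        child = behaviour.handlers.get(failure_type)
        if child is None:
            # Unhandled failure: the RecoveryAction is left with this failure.
            result[failure_type] = result.get(failure_type, 0.0) + p
        else:
            # Handled failure: continue with the child behaviour.
            for outcome, q in outcome_distribution(child).items():
                result[outcome] = result.get(outcome, 0.0) + p * q
    return result


external = Behaviour(
    "ExternalAuthentication",
    {"AuthenticationProtocolFailure": 0.0000001},  # value from Figure 5
)
internal = Behaviour(
    "InternalAuthentication",
    {"CertificateProcessingFailure": 0.0001},      # hypothetical value
    handlers={"CertificateProcessingFailure": external},
)
distribution = outcome_distribution(internal)
```

In the resulting distribution, "CertificateProcessingFailure" no longer appears because it is always handled, while "AuthenticationProtocolFailure" remains visible with the product of both failure probabilities.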

Open Issues

Some open issues currently remain in the implementation of the approach, which should be taken into account during reliability modelling:

  • Some model elements are required during model creation even though they are irrelevant for reliability prediction (e.g., system workloads, resource demand sizes, resource processing speeds). A reliability-specific model completeness and consistency check is currently still missing (see PALLADIO-107).
  • Stochastic expressions in the model are subject to limitations: certain types of expressions are not supported by the reliability solver.
  • The consideration of variable scopes in the reliability solver is limited. For example, the solver does not consider existing stochastic dependencies between two consecutive branches that depend on the same input parameter property.
  • The latest extensions of the PCM may include model elements which are not supported for reliability modelling and prediction. If a modelled PCM instance contains such elements, the reliability solver aborts its analysis instead of running to completion.