Reliability Improvements
This page collects known architectural tactics and model improvements for optimizing the reliability of software architectures.
Reliability Improvements Collection
Reference Name | Type of Improvement | Short Description | Reference | How to model in PCM |
---|---|---|---|---|
High-reliability components | Reliability (Software) | Invest more effort in the implementation and testing of components to achieve higher component reliability | --- | Decrease internal action failure probabilities |
High-availability hardware | Availability (Hardware) | Spend more money on buying better hardware | --- | Increase physical resource MTTF |
High-reliability network | Reliability (Network) | Spend more money on more reliable / higher capacity network links | --- | Decrease communication link failure probability |
Change component deployment | Topological | Change the deployment of components to servers such that "reliability-sensitive" components are located on servers with high availability. | --- | Change PCM allocation model |
Change component assembly | Topological | If possible, change assembly of components such that services are provided by the least "reliability-sensitive" components | --- | Change PCM system model |
Optimize external services (NOT YET SUPPORTED) | Reliability (External Services) | Spend more money on system-external services with higher reliability. | --- | Increase reliability of system-external services |
Redundant hardware (also: fault-tolerant hardware, fail over) | Availability (Hardware) | Spend more money for usage of redundant physical resources (e.g., RAID arrays, redundant CPUs, redundant servers) | Rozanski2005 | Assume for the unavailability of n-times redundant resources in steady state: U(n) = U(1)^n. Under the assumption MTTF(n) + MTTR(n) = MTTF(1) + MTTR(1), it follows: MTTR(n) = MTTR(1)^n / (MTTF(1) + MTTR(1))^(n-1) and MTTF(n) = MTTF(1) + MTTR(1) - MTTR(n). |
Heartbeat (also: ping/echo) | Availability (Hardware) | Spend additional money on a monitoring component / system that periodically tests the availability of physical resources. If a resource turns out to be unavailable, an immediate repair action can be taken. | --- | Decrease the MTTR of the monitored resource in steady state to MTTR = M/2 + R with (average) check interval M and (average) repair time R. Possibly decrease the processing speed of the monitored resource if the monitoring puts load on the resource. |
Design diversity / n-version programming | Reliability (Software) | Realize one algorithm in n different ways. Let each computation request be handled by all versions simultaneously. Apply a voting algorithm that collects all results and applies a certain strategy to choose one of the results (e.g., majority voting). Higher costs arise from designing n algorithms and from the n-times computational load at run-time. | --- | Think of an internal action as being executed n times. Assume a failure probability for each version; assume a certain voting strategy. Calculate the overall failure probability depending on the individual failure probabilities and the chance that the voting algorithm makes the right decision. Increase the resource consumption of the internal action to reflect the n-times computation and the voting overhead. |
Transaction Logging (NOT YET SUPPORTED) | Reliability (Software) | Log the steps of transactions to persistent storage so that certain steps can be redone after a system failure | --- | Explicitly model how the system recovers from a failure and follows an alternative execution path. |
Rejuvenation Techniques | Reliability (Software) | Automatically restart components, application servers, or operating systems after failures to ensure high availability | ? | You need a model of how the internal action failure probability increases over time. Such a model, combined with a regular restart, yields an average failure probability that can be used as a fixed value for the internal action. |
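The hardware-related formulas in the table (redundant hardware and heartbeat) can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the table: steady state, U(n) = U(1)^n, and an unchanged failure/repair cycle length MTTF(n) + MTTR(n) = MTTF(1) + MTTR(1). The function names are hypothetical and are not part of PCM or any Palladio API; the resulting values would have to be entered manually as the MTTF and MTTR of the physical resource.

```python
def unavailability(mttf: float, mttr: float) -> float:
    """Steady-state unavailability U = MTTR / (MTTF + MTTR)."""
    return mttr / (mttf + mttr)


def redundant_resource(mttf_1: float, mttr_1: float, n: int) -> tuple[float, float]:
    """Return (MTTF(n), MTTR(n)) for an n-times redundant resource.

    Assumes U(n) = U(1)^n and MTTF(n) + MTTR(n) = MTTF(1) + MTTR(1),
    as stated in the table above.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    cycle = mttf_1 + mttr_1                   # assumed to stay constant for all n
    mttr_n = mttr_1 ** n / cycle ** (n - 1)   # follows from U(n) = U(1)^n
    mttf_n = cycle - mttr_n
    return mttf_n, mttr_n


def heartbeat_mttr(check_interval: float, repair_time: float) -> float:
    """MTTR = M/2 + R: a failure is detected after half a check interval on average."""
    return check_interval / 2 + repair_time
```

For example, a resource with MTTF(1) = 990 h and MTTR(1) = 10 h that is made 2-times redundant gets MTTR(2) = 0.1 h and MTTF(2) = 999.9 h, so its unavailability drops from 0.01 to 0.0001. A heartbeat with check interval M = 2 h and repair time R = 4 h gives MTTR = 5 h.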
Notes
- data diversity
- environment diversity
- incorporate sensitivity analysis to know where to start with the heuristic
- n-version programming: very high costs for additional component (new development, not only licensing)
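The modeling rules for n-version programming and for rejuvenation in the table above can be sketched as follows. This is a minimal illustration, not a PCM feature: all function and parameter names are hypothetical. The n-version calculation assumes majority voting, independent version failures with the same failure probability, and that a request fails whenever no strict majority of versions is correct. The rejuvenation calculation assumes that the caller supplies the aging model p(t) and that a restart resets the failure probability to p(0); restart downtime is ignored.

```python
from math import comb
from typing import Callable


def n_version_failure_probability(p_fail: float, n: int, p_voter_fail: float = 0.0) -> float:
    """Overall failure probability of an internal action executed by n versions.

    Assumes independent versions, each failing with probability p_fail,
    and a majority voter that itself fails with probability p_voter_fail.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not (0.0 <= p_fail <= 1.0 and 0.0 <= p_voter_fail <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    needed = n // 2 + 1          # strict majority of correct versions
    p_ok = 1.0 - p_fail
    p_majority_correct = sum(
        comb(n, k) * p_ok ** k * p_fail ** (n - k) for k in range(needed, n + 1)
    )
    return 1.0 - (1.0 - p_voter_fail) * p_majority_correct


def average_failure_probability(
    p_of_t: Callable[[float], float], restart_interval: float, steps: int = 1000
) -> float:
    """Average of p_of_t over one restart interval (midpoint rule)."""
    if restart_interval <= 0 or steps < 1:
        raise ValueError("restart_interval and steps must be positive")
    dt = restart_interval / steps
    return sum(p_of_t((i + 0.5) * dt) for i in range(steps)) / steps
```

With three versions that each fail with probability 0.1 and a perfect voter, the overall failure probability is 0.028. For an assumed linear aging model p(t) = 0.001 + 0.0001 t and a restart every 10 time units, the average failure probability is 0.0015. Either value could then be used as the fixed failure probability of the internal action.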