Reliability Improvements

This page collects known architectural tactics and model improvements for optimizing the reliability of software architectures.

Reliability Improvements Collection

Reference Name	Type of Improvement	Short Description	Reference	How to model in PCM
High-reliability components	Reliability (Software)	Spend higher effort on implementation and testing process of components to achieve higher component reliability	---	Decrease internal action failure probabilities
High-availability hardware	Availability (Hardware)	Spend more money on buying better hardware	---	Increase physical resource MTTF
High-reliability network	Reliability (Network)	Spend more money on more reliable / higher capacity network links	---	Decrease communication link failure probability
Change component deployment	Topological	Change deployment of components to servers in a way such that "reliability-sensitive" components are located on servers with high availability.	---	Change PCM allocation model
Change component assembly	Topological	If possible, change assembly of components such that services are provided by the least "reliability-sensitive" components	---	Change PCM system model
Optimize external services (NOT YET SUPPORTED)	Reliability (external Services)	Spend more money on system-external services with higher reliability.	---	Increase reliability of system-external services
Redundant hardware (also: fault-tolerant hardware, failover)	Availability (Hardware)	Spend more money on redundant physical resources (e.g., RAID arrays, redundant CPUs, redundant servers)	Rozanski2005	Assume that the steady-state unavailability of an n-times redundant resource is U(n) = U(1)^n. Under the assumption MTTF(n) + MTTR(n) = MTTF(1) + MTTR(1), it follows that MTTR(n) = MTTR(1)^n / (MTTF(1) + MTTR(1))^(n-1) and MTTF(n) = MTTF(1) + MTTR(1) - MTTR(n).
Heartbeat (also: ping/echo)	Availability (Hardware)	Spend additional money on a monitoring component / system that periodically tests the availability of physical resources. If a resource turns out to be unavailable, an immediate repair action can be taken.	Bass2003 Kim2009	Decrease the steady-state MTTR of the monitored resource to MTTR = M/2 + R, with (average) check interval M and (average) repair time R. Possibly decrease the processing speed of the monitored resource if the monitoring puts load on it.
Design diversity / n-version programming	Reliability (Software)	Realize one algorithm in n different ways. Let each computation request be handled by all versions simultaneously. Apply a voting algorithm that collects all results and uses a certain strategy to choose one of them (e.g., majority voting). Higher costs arise from designing n algorithms and from the n-times computational load at run-time.	Bass2003 Kienzle2003	Think of an internal action as being executed n times. Assume a failure probability for each version and a certain voting strategy. Calculate the overall failure probability from the individual failure probabilities and the probability that the voting algorithm selects a correct result. Increase the resource consumption of the internal action to reflect the n-times computation and the voting overhead.
Transaction Logging (NOT YET SUPPORTED)	Reliability (Software)	Log the steps of transactions to a persistent storage, to be able to redo certain steps upon system failure	Rozanski2005 Kienzle2003	Explicitly model how the system recovers from a failure, and follows an alternative execution path.
Rejuvenation Techniques	Reliability (Software)	Automatically restart components, application servers, or operating systems after failures to ensure high availability	?	You need a model of how the internal action failure probability increases over time. Such a model, combined with a regular restart, yields an average failure probability that can be used as a fixed value for the internal action.
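
The MTTF/MTTR adjustments given in the table for redundant hardware and heartbeat monitoring can be checked numerically before entering them into a PCM resource environment. The following is a minimal Python sketch of those two formulas only; the function names and the example figures are illustrative assumptions and are not part of PCM or its tooling.

```python
def unavailability(mttf, mttr):
    """Steady-state unavailability U = MTTR / (MTTF + MTTR)."""
    return mttr / (mttf + mttr)


def redundant_resource(mttf_1, mttr_1, n):
    """Return (MTTF(n), MTTR(n)) for an n-times redundant resource.

    Uses the table's assumptions: U(n) = U(1)^n and
    MTTF(n) + MTTR(n) = MTTF(1) + MTTR(1).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    cycle = mttf_1 + mttr_1                    # MTTF(1) + MTTR(1), kept constant
    mttr_n = mttr_1 ** n / cycle ** (n - 1)    # MTTR(n)
    mttf_n = cycle - mttr_n                    # MTTF(n)
    return mttf_n, mttr_n


def heartbeat_mttr(check_interval, repair_time):
    """MTTR of a monitored resource: MTTR = M/2 + R."""
    return check_interval / 2 + repair_time


# Illustrative values (assumed, in hours): a single server with
# MTTF = 990 and MTTR = 10, duplicated once (n = 2).
mttf_2, mttr_2 = redundant_resource(990.0, 10.0, 2)
print(mttf_2, mttr_2, unavailability(mttf_2, mttr_2))
```

With these assumed values the single resource has U(1) = 0.01, so the duplicated resource should come out at roughly U(2) = 0.0001, which is a quick sanity check on the derived MTTF(2) and MTTR(2).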

Notes

data diversity
environment diversity
incorporate sensitivity in analysis to know where to start with heuristic
n-version programming: very high costs for additional component (new development, not only licensing)
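
For the n-version programming entry, the overall failure probability of the internal action can be estimated as described in the table. The sketch below is one possible calculation under strong simplifying assumptions that are not taken from the source: all versions fail independently with the same probability, the voter uses strict majority voting, and the voter itself fails independently with a fixed probability. Correlated failures between versions would make the result optimistic. All function and parameter names are hypothetical.

```python
from math import comb


def majority_voting_failure_probability(p_fail, n, p_voter_fail=0.0):
    """Failure probability of an internal action realized as n versions.

    Assumes independent, identically distributed version failures and a
    strict majority vote (more than n/2 versions must be correct).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not (0.0 <= p_fail <= 1.0 and 0.0 <= p_voter_fail <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    needed = n // 2 + 1  # smallest number of correct versions forming a majority
    p_majority_correct = sum(
        comb(n, k) * (1.0 - p_fail) ** k * p_fail ** (n - k)
        for k in range(needed, n + 1)
    )
    # The action succeeds only if a majority is correct and the voter works.
    return 1.0 - (1.0 - p_voter_fail) * p_majority_correct


def n_version_resource_demand(demand, n, voting_overhead=0.0):
    """Resource demand of the internal action: n computations plus voting."""
    return n * demand + voting_overhead


# Illustrative values (assumed): three versions, each failing with 0.1.
print(majority_voting_failure_probability(0.1, 3))
print(n_version_resource_demand(100.0, 3, voting_overhead=5.0))
```

The resulting value would replace the failure probability of the internal action, and the scaled demand would replace its resource demand.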