Reliability Prediction Case Study: Media Store

From SDQ-Wiki

This page gives an overview of a case study on PCM-based reliability modelling and prediction, which was reported in our QoSA 2011 publication. The study focuses on a web-based media store product line. The model is inspired by common data storage solutions and offers functionality similar to the iTunes Music Store.

Media Store Model

The media store system provides centralized storage for media files, such as audio or video files, together with upload and download functionality. For upload, a user sends a list of files to the system, which stores the files in its central database. Large files are compressed before they are stored, to save disk space. For download, the user sends a list of file names to the system, which retrieves the corresponding files from the database and sends them back to the user. Figure 1 summarizes the different products and design alternatives of the media store product line.


Figure 1: Media Store Products and Design Alternatives


Products

As a product line, the media store contains three basic product configurations:

  • standard: provides basic functionality and a small installation for a limited number of users
  • comfort: provides extensions for a more complex business logic and a more sophisticated user interface
  • power: provides high-performance hardware and software extensions to support a larger number of users

Core Components

The core functionality is provided through seven component types:

  • UserInteraction
    • used in the standard and power products
    • contains the business logic to process user requests
  • UserInteractionComfort
    • used in the comfort product
    • provides additional functionality collecting user statistics
  • FileLoader
    • used in the standard and comfort products
    • controls the up- and download of media files
    • contains a cache memory for fast retrieval of files requested for download (average hit rate 10%)
  • FileLoaderPower
    • used in the power product
    • includes a bigger cache for better file retrieval performance (hit rate 60%)
  • Encoder
    • used in the standard and comfort products
    • compresses large uploaded media files before they are stored in the database
  • EncoderPower
    • used in the power product
    • incorporates fast compression algorithms to serve more concurrent user requests (but is less reliable than the Encoder component)
  • DataAccess
    • used in all products
    • stores media files (and statistical data) in the database, and retrieves them again

All components are deployed on a single application server, except for the database and the DataAccess component, which are deployed on one (or optionally two) database server(s). Our model contains a hard disk drive (HDD) and a CPU resource for each server.
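The assignment of component variants to products can be written down as a small lookup table. The following is only an illustrative sketch in Python, not part of the PCM model itself; the names `PRODUCT_COMPONENTS` and `products_using` are hypothetical helpers introduced here.

```python
# Component variants per product, transcribed from the list above.
# The dictionary and helper names are illustrative, not PCM identifiers.
PRODUCT_COMPONENTS = {
    "standard": ["UserInteraction", "FileLoader", "Encoder", "DataAccess"],
    "comfort": ["UserInteractionComfort", "FileLoader", "Encoder", "DataAccess"],
    "power": ["UserInteraction", "FileLoaderPower", "EncoderPower", "DataAccess"],
}


def products_using(component: str) -> list[str]:
    """Return the products whose configuration contains the given component."""
    return [
        product
        for product, components in PRODUCT_COMPONENTS.items()
        if component in components
    ]


if __name__ == "__main__":
    for name in ("FileLoader", "EncoderPower", "DataAccess"):
        print(name, "->", products_using(name))
```

Each product uses exactly four core components, one per role (user interaction, file loading, encoding, data access), which is why the seven component types collapse to four per configuration.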

Failure Types

During up- and download of media files, different types of failures may occur in the involved component instances:

  • BusinessLogicFailure: may occur during the processing of user requests in the UserInteraction[Comfort] component due to software faults
  • CacheAccessFailure: may occur in the FileLoader[Power] component as a software failure induced by malfunctioning cache memory
  • EncodingFailure: may occur due to bugs in the compression algorithm of the Encoder[Power] component
  • DataAccessFailure: may occur in the DataAccess component due to internal database errors or faults in the database server's file system
  • CommunicationFailure: occurs when messages sent between the media store servers are corrupted or lost, which is mainly due to network overload
  • CPUFailure: occurs if a CPU is unavailable while being accessed during service execution
  • HDDFailure: occurs if a hard disk drive is unavailable while being accessed during service execution

Failure Probabilities

Regarding failure probabilities, we generally assume a value of 10^-5 for each individual point of failure and software failure type in the model, with the following exceptions:

  • In the UserInteractionComfort component, the probability of BusinessLogicFailures is 10^-4 because of the more complex business logic compared to the standard variant.
  • Compression algorithms are generally complex and may fail with a probability of 10^-4 in the Encoder component, and with 2 x 10^-4 in the EncoderPower component.
  • For all hardware resources, we assume an MTTF of one year and an MTTR of 50 minutes, resulting in a steady-state availability of approximately 99.99%.

In other settings, these values could have been extracted from log files of existing similar systems.
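The hardware availability figure can be checked with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The sketch below assumes a 365-day year; the function name `steady_state_availability` is a hypothetical helper, not a PCM API.

```python
# Steady-state availability from MTTF and MTTR (both in the same time unit).
# Assumption: one year is taken as 365 days, ignoring leap years.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600 minutes


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Fraction of time a resource is available in the long run."""
    return mttf / (mttf + mttr)


availability = steady_state_availability(MINUTES_PER_YEAR, 50)
unavailability = 1.0 - availability

if __name__ == "__main__":
    print(f"availability:   {availability:.6f}")
    print(f"unavailability: {unavailability:.2e}")
```

Under these assumptions the unavailability of a single hardware resource is close to 10^-4, which is about one order of magnitude above the default software failure probability of 10^-5.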

Fault Tolerance Mechanisms

Fault tolerance mechanisms may optionally be introduced into each of the media store products, in terms of additional components (which are shown in gray in Figure 1):

  • UserInteractionFT
    • may be put in front of UserInteraction[Comfort]
    • has the ability to buffer incoming requests, to re-initialize the business logic in case of a BusinessLogicFailure, and to retry the failed request
  • FileLoaderFT
    • may be put in front of FileLoader[Power]
    • in case of a CacheAccessFailure, retries the failed download forcing direct data retrieval from the database without cache access
  • EncoderFT
    • may be put in front of Encoder[Power]
    • buffers files to compress, and applies its own compression algorithm in case of an EncodingFailure
    • uses a slow, but very reliable compression algorithm
  • DataAccessFT
    • may be put in front of DataAccess
    • makes use of a backup database server, to retry any store or retrieve request in case of a hardware failure of the main database server

Each of the described fault tolerance mechanisms can be used for each product, and more than one mechanism may be applied in parallel. However, due to space limitations we focus on cases where at most one mechanism is used, which means that five design alternatives exist for each product (either no fault tolerance is used, or one of four fault tolerance mechanisms is used). Figure 2 summarizes all possible media store configurations - including fault tolerance mechanisms - in a feature diagram.
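The size of the design space can be made explicit by enumerating it. The following sketch counts the alternatives considered here (at most one mechanism per product) and, for comparison, the number of configurations if any subset of the four mechanisms were allowed. The variable names are illustrative only.

```python
from itertools import combinations, product

PRODUCTS = ["standard", "comfort", "power"]
FT_MECHANISMS = ["UserInteractionFT", "FileLoaderFT", "EncoderFT", "DataAccessFT"]

# Alternatives studied on this page: no mechanism (None) or exactly one.
single_choices = [None] + FT_MECHANISMS
at_most_one = list(product(PRODUCTS, single_choices))

# For comparison: every subset of mechanisms, from none to all four.
subsets = [
    frozenset(combo)
    for size in range(len(FT_MECHANISMS) + 1)
    for combo in combinations(FT_MECHANISMS, size)
]
any_subset = list(product(PRODUCTS, subsets))

if __name__ == "__main__":
    print(len(at_most_one), "configurations with at most one mechanism")
    print(len(any_subset), "configurations with arbitrary combinations")
```

With three products and five alternatives each, the restricted design space contains 15 configurations; allowing arbitrary combinations would give 3 x 2^4 = 48.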


Figure 2: Media Store Feature Diagram


Usage Profile

We generally assume a usage profile in which each request is either an upload (probability: 20%) or a download (probability: 80%). The number of files per request is fixed at 10. For upload requests, we assume that each file has a probability of 30% of being large, meaning that it requires compression by the Encoder[Power] component.
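To get a feeling for how the usage profile interacts with the failure probabilities, the following back-of-the-envelope sketch estimates the chance that an upload request hits at least one EncodingFailure. It assumes that files fail independently and ignores all other failure types and fault tolerance mechanisms, so it is a simplification and not the analysis performed by the PCM reliability solver. All names are hypothetical.

```python
# Usage profile values from the text above.
P_UPLOAD = 0.2           # probability that a request is an upload
FILES_PER_REQUEST = 10   # fixed number of files per request
P_LARGE = 0.3            # probability that an uploaded file needs compression

# Encoding failure probabilities per compressed file (from "Failure Probabilities").
P_FAIL_ENCODER = 1e-4
P_FAIL_ENCODER_POWER = 2e-4


def p_encoding_failure_per_upload(p_fail: float) -> float:
    """Probability of at least one encoding failure in one upload request,
    assuming independent files (a simplifying assumption)."""
    p_file_fails = P_LARGE * p_fail
    return 1.0 - (1.0 - p_file_fails) ** FILES_PER_REQUEST


# Expected number of files sent to the encoder per request (upload or download).
expected_compressions = P_UPLOAD * FILES_PER_REQUEST * P_LARGE

if __name__ == "__main__":
    print(f"expected compressions per request: {expected_compressions:.2f}")
    print(f"Encoder:      {p_encoding_failure_per_upload(P_FAIL_ENCODER):.3e}")
    print(f"EncoderPower: {p_encoding_failure_per_upload(P_FAIL_ENCODER_POWER):.3e}")
```

Under these assumptions an upload with the Encoder component fails due to encoding with a probability of roughly 3 x 10^-4, and roughly twice that with EncoderPower.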

Download

Download the complete media store model instance here.

Evaluation Results

We calculated the expected system reliability for each media store product and design alternative. Each calculation took less than one second on a standard PC with a 2.2 GHz CPU and 2.00 GB RAM. The results of the evaluation are shown in Figures 3 to 9. Further explanation and interpretation of the results can be found in our QoSA 2011 publication.


Figure 3: System Reliability of all Media Store Products and Design Alternatives


Figure 4: Relative Reduction of System Failures of the Media Store Design Alternatives, compared to the Alternative without Fault Tolerance


Figure 5: Failure Probabilities of the Media Store Products without Fault Tolerance, cumulated over all Failure Types


Figure 6: Failure Probabilities per Failure Type of the Media Store Products without Fault Tolerance


Figure 7: Failure Probabilities per Failure Type of the Media Store Power Product without Fault Tolerance, with the Number of Files requested per Upload/Download varied between 2 and 20


Figure 8: System Reliability of the Media Store Comfort Product for all Design Alternatives, with Alterations of single Failure Type Probabilities


Figure 9: System Reliability of all Media Store Products in the UserInteractionFT Design Alternative, compared to Results of the Reliability Simulation