Availability and Performance in Software Architecture

by hajinny 2021. 10. 8.

Here, we discuss two requirements that a software system might need to meet, and how each requirement can be met through different tactics.

Requirement 1: Availability

Availability: the proportion of the time that the system can serve client requests

 

We want ATM machines to work 24/7 as per specification. We expect a desktop application to work from 8am to 12pm, so we don't care about its availability outside those hours. Availability is concerned with whether the system is available to carry out what its specification says it will do. It is fine for a system to be unavailable during a maintenance period that the client has already agreed to as part of the specification: if the system is available at all other times, then it is 100% available.

 

Availability is inextricably linked to failure (an externally observed fault), because we can define unavailability of the service as the situation in which a failure occurs. Conversely, if faults are 100% masked, then we will not see any failure, so the system is 100% available.

 

Note: a fault is a part of the code that has the potential to cause incorrect behaviour.

 

To summarise, availability is a function of failure. Specifically, we want to maximise mean time to failure and minimise mean time to repair. Failures fall into two categories: incorrect behaviour of the system (incorrect output, incorrect timing) and unavailable service (no output is produced at all).
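One common way to make this relationship concrete is the steady-state formula availability = MTTF / (MTTF + MTTR). The sketch below is only an illustration: the function name and the sample figures are made up, not taken from any real system.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: a failure every 1000 hours on average, 1 hour to repair.
print(f"{availability(1000, 1):.4%}")
# Halving the repair time raises availability without touching MTTF.
print(f"{availability(1000, 0.5):.4%}")
```

Under this formula, shortening repair time and lengthening time to failure both push availability towards 100%, which is why the tactics below target detection and recovery.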

 

Availability tactics

We want to devise tactics to prevent, mask, or repair faults so that a fault is not manifested as a failure.

Tactic 1: To prevent a fault, we need to detect it

We can have a central monitor that notifies us when faults are discovered from the worker components. How do we discover faults? 

 

Tactic method | Mechanism | Description | Evaluation
Ping | Request from monitor and response from worker components. | Every x seconds, the monitor asks 'are you alive?' and the component replies 'yes I am'. | Requires a round trip of a packet between monitor and worker component.
Heartbeat | Worker component reports to the monitor (one-way only). | Worker is responsible for notifying the monitor that it is alive. | Requires a packet to travel from worker to monitor.
Self-test | - | Worker self-tests its functionality and reports to the monitor if it malfunctions. The monitor could also tell the worker to self-test, but this wasn't mentioned. | Relies on the component to self-test.
Exception | - | Error in component -> raise exception -> the related part (eg monitor) is informed of the problem so it can take the necessary actions. | -
Timestamp | Client puts a timestamp on packets. | Ensures the sequence of events is preserved between sending and receiving (packets go from client to server). | -
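As a rough sketch of the heartbeat row, the monitor can remember when it last heard from each worker and suspect any worker that has been silent for too long. The class name, the timeout value and the injected fake clock are all assumptions made for this illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and flags workers that go silent."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the example can use a fake clock
        self.last_seen = {}

    def heartbeat(self, worker_id: str) -> None:
        # Called whenever a worker's one-way "I am alive" message arrives.
        self.last_seen[worker_id] = self.clock()

    def suspected_failed(self) -> list:
        # A worker is suspected once it has been silent longer than the timeout.
        now = self.clock()
        return sorted(w for w, t in self.last_seen.items() if now - t > self.timeout_s)


# Usage with a fake clock so the run is deterministic.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.heartbeat("worker-a")
monitor.heartbeat("worker-b")
fake_now[0] = 2.0
monitor.heartbeat("worker-a")      # worker-b stays silent
fake_now[0] = 4.0
print(monitor.suspected_failed())  # ['worker-b']
```

A ping-based monitor would look similar, except that the monitor initiates each exchange and waits for the reply.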

Overhead with Ping and Heartbeat

Ping and heartbeat both have overhead, since each one is an extra packet to send.

Solution: include the heartbeat/ping information in the messages we already send. By piggybacking the heartbeat on packets that carry task-relevant data, we minimise network overhead.
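A minimal sketch of piggybacking, assuming messages are JSON envelopes: the worker adds liveness fields to a message it was sending anyway, and the monitor side reads them before handing on the payload. The field names and both helper functions are hypothetical.

```python
import json
import time


def wrap_with_heartbeat(worker_id: str, payload: dict, now: float) -> str:
    """Attach liveness fields to a message the worker was sending anyway."""
    envelope = {"worker": worker_id, "alive_at": now, "payload": payload}
    return json.dumps(envelope)


def record_liveness(raw_message: str, last_seen: dict) -> dict:
    """Monitor side: update liveness from the envelope, then return the payload."""
    envelope = json.loads(raw_message)
    last_seen[envelope["worker"]] = envelope["alive_at"]
    return envelope["payload"]


last_seen = {}
message = wrap_with_heartbeat("worker-a", {"order_id": 17, "status": "shipped"}, now=time.time())
payload = record_liveness(message, last_seen)
print(payload, last_seen)
```

One caveat with this design: a worker that has nothing to send produces no piggybacked heartbeat, so a standalone heartbeat is probably still needed for idle periods.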

 

Tactic 2: After detecting fault, we have to repair fault (Recovery)

Detecting a fault allows us (or an autonomous system) to intervene and fix it. However, to ensure availability, we don't want that intervention to be manifested as a delayed response. In other words, we want to make sure that the system still produces the correct output for the client's request with the correct timing; this is also called masking the fault. We usually do so by redundancy (having many servers that can provide the same service).

Tactic method | Mechanism | Description
Active redundancy | Maintain several active servers along with one main server. All servers are on and share the same state, so they can process the client request equivalently. They all attempt to process the client request at the same time. | If the main server doesn't respond, the response of another server (with the exact same state) is used by the client.
Passive redundancy | Maintain several servers along with one main server. All servers are on and share the same state, so they can process the client request equivalently. Non-main servers only attempt to process the client request when required. | If the main server doesn't respond, the request is made to the next available server, and so on until the request can be served.
Voting | Maintain a voting system in the middle, between client and servers. | The client sends a request to the voting system, which collects all servers' responses. The answer that the majority agrees on is passed to the client. The servers in the minority are suspected to have failed.
Spare | This is mostly manual replacement of a malfunctioning server or of components of the server. This obviously takes time. | When the main server is down, the spare tactic either boots up a dormant server or replaces some components with ones that work.
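The voting row can be sketched as a small majority function. This is one possible reading, assuming a strict majority is required; the function name and the server names are made up for the example.

```python
from collections import Counter


def vote(responses: dict):
    """Return (majority answer, servers that disagreed).

    `responses` maps a server name to the answer it returned.
    Raises if no answer is held by a strict majority.
    """
    if not responses:
        raise RuntimeError("no responses to vote on")
    counts = Counter(responses.values())
    answer, votes = counts.most_common(1)[0]
    if votes * 2 <= len(responses):
        raise RuntimeError("no majority among server responses")
    suspects = sorted(name for name, value in responses.items() if value != answer)
    return answer, suspects


answer, suspects = vote({"s1": 42, "s2": 42, "s3": 41})
print(answer, suspects)  # 42 ['s3']
```

The suspects list is what a monitor could use to take the disagreeing servers out of rotation for repair.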


Tactic 3: After discovering a fault, we want to fix it and then reintroduce the fixed server to the system

The following actions are all probably necessary in order to get a fixed server back on the job.

Tactic method | Description
Shadow | Run the fixed server in parallel with the currently running servers, and ensure that their behaviours are equivalent.
State resynchronisation | Ensure the state of the reintroduced, fixed server is the same as the state of the currently running servers.
Rollback | Roll back all the states of the system to valid states.
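Rollback and state resynchronisation can be illustrated with a toy in-memory state that keeps a copy of the last known-good version. The class and its methods are hypothetical; a real system would persist checkpoints rather than hold them in memory.

```python
import copy


class CheckpointedState:
    """Keeps a copy of the last state known to be valid so it can be restored."""

    def __init__(self, initial: dict):
        self.state = copy.deepcopy(initial)
        self._checkpoint = copy.deepcopy(initial)

    def commit(self) -> None:
        # Record the current state as the new known-good checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Discard everything since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def resync_from(self, other: "CheckpointedState") -> None:
        # State resynchronisation: adopt a running server's state before rejoining.
        self.state = copy.deepcopy(other.state)
        self.commit()


primary = CheckpointedState({"balance": 100})
primary.state["balance"] = 80
primary.commit()
primary.state["balance"] = -999      # a faulty update corrupts the state
primary.rollback()
print(primary.state)                 # {'balance': 80}

repaired = CheckpointedState({"balance": 0})
repaired.resync_from(primary)
print(repaired.state == primary.state)  # True
```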

Exercise: Apply availability tactics

Imagine the scenario below:

This is the thought process to answer this question:

The purpose of the tactics is to make sure that the stimulus (crash) produces the response (recording the failure) while satisfying the measure (within 10 minutes, no loss of service).


The measure here requires:
1. No loss of service => availability: redundancy (server crashed)

- Automated redundancy: active/passive redundancy doesn't lose service (any loss is barely visible to the client)

2. Failure recorded within 10 minutes => need failure detection 

- Failure detection: Heartbeat/ping detects failure, where HB/ping period is less than 10 minutes. 

- Switching to available server might require failure detection, so the HB/Ping period might have to be frequent.
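To check whether a heartbeat period fits the 10-minute measure, we can estimate the worst-case detection delay. The model below is a simplifying assumption (the monitor declares failure after a fixed number of silent periods and checks at a fixed interval, with network delay ignored), and every parameter value is illustrative.

```python
def worst_case_detection_s(period_s: float, missed_allowed: int, check_interval_s: float) -> float:
    """Upper bound on the delay between a crash and the monitor noticing it.

    Assumes the monitor declares failure after `missed_allowed` heartbeat
    periods of silence and only evaluates that rule every `check_interval_s`.
    """
    return period_s * missed_allowed + check_interval_s


DEADLINE_S = 10 * 60  # "within 10 minutes" from the scenario

for period in (30, 120, 300):
    bound = worst_case_detection_s(period, missed_allowed=3, check_interval_s=10)
    verdict = "meets" if bound <= DEADLINE_S else "misses"
    print(f"period={period}s -> worst case {bound}s, {verdict} the deadline")
```

Under these assumptions a 5-minute period misses the deadline even though it is shorter than 10 minutes, because the monitor waits for several missed heartbeats before declaring failure.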

 

 

 

Requirement 2: Performance

Performance: the time taken to correctly answer a client request

 

Measuring performance depends on processing time (the time the system spends working towards the response) and blocked time (the time during which no progress is made towards the response).
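A rough way to see the two components in practice is to compare wall-clock time with CPU time: the difference approximates the time spent blocked. This is only a proxy, and the two sample tasks are stand-ins invented for the example.

```python
import time


def measure(task):
    """Return (wall seconds, CPU seconds, approximate blocked seconds) for task()."""
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    task()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu, max(wall - cpu, 0.0)


def mostly_blocked():
    time.sleep(0.2)                     # stands in for waiting on a lock, disk or network


def mostly_processing():
    sum(i * i for i in range(500_000))  # pure computation


for task in (mostly_blocked, mostly_processing):
    wall, cpu, blocked = measure(task)
    print(f"{task.__name__}: wall={wall:.3f}s cpu={cpu:.3f}s blocked~{blocked:.3f}s")
```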

 

Processing time 

Factor 1: Resource consumption.

Description: Resource here refers to hardware (CPU, data storage, memory, network) and software (buffer management). It is important to consider the impact of resource consumption on processing time.

Examples: Overloading the CPU increases processing time. Heavy memory consumption can crash the software, which also increases processing time.

 

Factor 2: Work that is not directly related to the task

Example: You might have already prepared the response to the client request, but you still need to marshal the data to send it over the network, which takes time.


Blocked time 

Factor 1: Multiple clients needing the same resource -> contention, deadlock, delays

Factor 2: Resource unavailable -> this resource could be a component in a failure state

Factor 3: Dependency on other components -> waiting for other components to finish their computation

 

Performance tactics

We want to ensure that an event (a stimulus, which can be periodic, sporadic or stochastic) is processed within some time bound.

Tactic 1: Reduce demand for resource (Resource control)

There are three ways to reduce the demand for resources:

1. Reduce resources required for processing stimuli

Tactic action | For example... | How it helps performance
Increase computational efficiency | eg use a better algorithm | Reduces the resources needed; reduces CPU/memory load
Reduce computational overhead | Minimise intercomponent communication | Obvious

2. Decrease number of events to process

Tactic action | For example... | How it helps performance
Decrease the frequency of sampling | Where real-time data updates are not necessary | Less sampling means less blocking/processing time
Reduce number of events to process | DDoS could be avoided if the server just drops new requests while under load; use a cache | Obvious
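The cache example can be sketched with Python's built-in `functools.lru_cache`: repeated requests for the same item are answered from memory, so fewer events reach the expensive backend. The lookup function and the counter are hypothetical stand-ins.

```python
from functools import lru_cache

calls_to_backend = 0


@lru_cache(maxsize=128)
def product_details(product_id: int) -> tuple:
    """Hypothetical expensive lookup; the counter shows how often it really runs."""
    global calls_to_backend
    calls_to_backend += 1
    return (product_id, f"product-{product_id}")


for pid in (1, 2, 1, 1, 2):
    product_details(pid)

print(calls_to_backend)              # 2
print(product_details.cache_info())
```

Five requests result in only two backend calls; the trade-off is that cached answers can go stale if the underlying data changes.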

3. Control resources consumed

Tactic action | For example... | How it helps performance
Bound execution time | If processing a request takes too long, we just stop processing that request (note that the specification of the system might not accept this). Eg a shop vendor might be busy and say 'please visit later'. | Obvious
Bound queue sizes | We can have a queue, so that if the queue is full we just drop the request. We can also increase the queue size if more needs to be processed. | Obvious
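Bounding the queue size can be sketched with the standard library's `queue.Queue`, which accepts a `maxsize` and raises `queue.Full` on a non-blocking put when there is no room. The bound of 3 and the helper name are arbitrary choices for the illustration.

```python
import queue

pending = queue.Queue(maxsize=3)   # the bound; 3 is an arbitrary illustrative value


def try_enqueue(request_id: int) -> bool:
    """Accept the request if there is room, otherwise shed it immediately."""
    try:
        pending.put_nowait(request_id)
        return True
    except queue.Full:
        return False               # the caller can answer "please visit later"


results = [try_enqueue(i) for i in range(5)]
print(results)            # [True, True, True, False, False]
print(pending.qsize())    # 3
```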

Tactic 2: Resource management (Response side)

Method | Description
Increase computational efficiency | We can do the searching on the database side rather than loading everything in. Use a faster search algorithm.
Increase available resources | Add a new server, or upgrade the server's configuration with larger memory or a better CPU; we might move to the cloud. We can have many servers to support concurrent search -> load balancer.
Database efficiency | Keep multiple copies of the database: with a copy of the data on every server, there is no need to connect to a remote database, since the local copy can be used.
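The load balancer mentioned above could, in its simplest form, hand requests to servers in round-robin order. This is just one possible policy; the class and the server names are invented for the sketch.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in turn so concurrent requests spread evenly."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._next_server = cycle(list(servers))

    def pick(self) -> str:
        return next(self._next_server)


balancer = RoundRobinBalancer(["search-1", "search-2", "search-3"])
assignments = [balancer.pick() for _ in range(5)]
print(assignments)
# ['search-1', 'search-2', 'search-3', 'search-1', 'search-2']
```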

Tactic 3: Resource arbitration

Resource arbitration decides which request a resource is allocated to, by setting a resource allocation policy.

Example 1: apply performance tactic to online shopping

Example 2: resource arbitration

Allocating a resource to a time-sensitive process, eg an alarm when there is a fire:

- use either semantic importance (better) or an earliest-deadline-first policy (set the alarm's deadline very early so that it is processed first) for scheduling the alarm
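The two policies can be compared with a small priority-queue sketch. The task list, the importance scores and the deadlines are all made-up values, and the `schedule` helper is hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    importance: int     # higher means more important (semantic importance)
    deadline_s: float   # seconds from now by which the task should finish


def schedule(tasks, policy: str):
    """Return task names in the order the chosen policy would run them."""
    if policy == "semantic":
        key = lambda t: -t.importance          # most important first
    elif policy == "edf":
        key = lambda t: t.deadline_s           # earliest deadline first
    else:
        raise ValueError(f"unknown policy: {policy}")
    heap = [(key(t), i, t.name) for i, t in enumerate(tasks)]  # i breaks ties stably
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


tasks = [
    Task("log rotation", importance=1, deadline_s=5.0),
    Task("fire alarm", importance=10, deadline_s=0.5),
    Task("report email", importance=3, deadline_s=60.0),
]
print(schedule(tasks, "semantic"))  # ['fire alarm', 'report email', 'log rotation']
print(schedule(tasks, "edf"))       # ['fire alarm', 'log rotation', 'report email']
```

Both policies run the alarm first here, but only because its deadline was deliberately set very early for the earliest-deadline-first case; semantic importance expresses the intent directly.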