Availability and Performance in Software Architecture

Here, we discuss two requirements that a software system might need to meet, availability and performance, and the tactics that can be used to meet them.

Requirement 1: Availability

Availability: the proportion of client requests that the system is able to serve

We want ATMs to work 24/7 as per their specification. We may expect a desktop application to work only from 8am to 12pm, in which case we don't care about the hours outside that window. Availability is concerned with whether the system is available to do what its specification says it will do. It is fine for a system to be unavailable during a maintenance period that the client has already agreed to as part of the specification: if the system is available at all other times, it is 100% available.

Availability is inextricably linked to failure (an externally observed fault), because we can define unavailability of the service as the situation in which a failure occurs. Conversely, if faults are 100% masked, no failure is observed, so the system is 100% available.

Note: a fault is a part of the code that has the potential to cause incorrect behaviour.

To summarise, availability is a function of failure. Specifically, we want to maximise the mean time to failure and minimise the mean time to repair. Failures fall into two categories: incorrect behaviour of the system (incorrect output, incorrect timing) and unavailable service (no output is produced at all).
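One common way to turn these two quantities into a single number is the steady-state formula availability = MTTF / (MTTF + MTTR). The post itself does not state this formula, so treat the sketch below as an assumption about how the relationship is usually quantified; the function name and the numbers are made up for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming the usual
    steady-state formula MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical system: fails on average every 1000 h, takes 1 h to repair.
print(f"{availability(1000, 1):.4%}")  # prints 99.9001%
```

Raising MTTF or lowering MTTR both push the result towards 1, which is why the tactics below focus on detecting faults quickly and masking them while repair happens.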

Availability tactics

We want tactics that prevent, mask, or repair faults so that a fault does not manifest as a failure.

Tactic 1. To prevent fault, we need to detect fault

We can have a central monitor that notifies us when faults are discovered in the worker components. How do we discover faults?

Tactics method	Mechanism	Description	Evaluation
Ping	Monitor sends a request and the worker component responds.	Every x seconds the monitor asks 'are you alive?' and the component answers 'yes I am'.	Requires a round trip of packets between the monitor and the worker component
Heartbeat	Worker component reports to the monitor (one-way only).	The worker is responsible for notifying the monitor that it is alive.	Requires a packet to travel from the worker to the monitor
Self-test		The worker tests its own functionality and reports to the monitor if it malfunctions. (The monitor could also tell the worker to self-test, but this wasn't mentioned.)	Relies on the component to self-test
Exception		An error in a component raises an exception, and the related part (e.g. the monitor) is informed of the problem so that it can take the necessary action.
Timestamp	Client puts a timestamp on packets	Lets the receiver check that the sequence of events was sent and received in order (packets go from client to server)
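As a rough illustration of the heartbeat row above, here is a minimal sketch of the monitor side: it remembers when each worker last reported and suspects any worker that has been silent for longer than a timeout. The class name, method names and the timeout value are assumptions made for this example, not part of the original notes.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per worker (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def beat(self, worker_id: str, now: Optional[float] = None) -> None:
        # Called whenever a heartbeat packet arrives from a worker.
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    def suspected_failed(self, now: Optional[float] = None) -> List[str]:
        # A worker is suspected once it has been silent longer than the timeout.
        now = time.monotonic() if now is None else now
        return sorted(
            worker for worker, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

For example, with a 3 second timeout, a worker that last reported at t=0 is suspected at t=5, while one that reported at t=4 is not. The `now` parameter exists only so the behaviour can be checked without real waiting.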

Overhead with Ping and Heartbeat

Ping and heartbeat both have overhead, since both require packets to be sent.

Solution: include the heartbeat/ping information in our usual messages. By piggybacking the heartbeat on packets that are already being sent to carry task-relevant data, we minimise network overhead.
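A minimal sketch of the piggybacking idea, under the assumption that every data message already carries the sender's id and a send time: any data message then doubles as a heartbeat, and a standalone heartbeat is sent only when the worker has been idle for a full interval. `Message`, `Worker` and their fields are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    worker_id: str
    sent_at: float
    payload: Optional[bytes] = None  # None marks a standalone heartbeat


class Worker:
    def __init__(self, worker_id: str, heartbeat_interval_s: float) -> None:
        self.worker_id = worker_id
        self.heartbeat_interval_s = heartbeat_interval_s
        self.last_sent_at: Optional[float] = None

    def send_data(self, payload: bytes, now: float) -> Message:
        # A normal data message also proves liveness to the monitor.
        self.last_sent_at = now
        return Message(self.worker_id, now, payload)

    def maybe_heartbeat(self, now: float) -> Optional[Message]:
        # Skip the extra packet if a data message was sent recently enough.
        recently_sent = (
            self.last_sent_at is not None
            and now - self.last_sent_at < self.heartbeat_interval_s
        )
        if recently_sent:
            return None
        self.last_sent_at = now
        return Message(self.worker_id, now)
```

Under steady traffic this worker sends no dedicated heartbeat packets at all; they only appear during idle periods.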

Tactic 2: After detecting fault, we have to repair fault (Recovery)

Detecting a fault allows us to intervene (or allows autonomous systems to intervene) to fix it. However, to ensure availability, we don't want that intervention to show up as a delayed response. In other words, we want the system to keep producing correct output in response to the client's request, with correct timing; this is called masking the fault. We usually achieve this through redundancy (having several servers that can provide the same service).

Tactics Method	Mechanism	Description
Active redundancy	Maintain several active servers alongside one main server. All servers are on and share the same state, so they can process the client request equivalently. They all attempt to process the client request at the same time.	If the main server doesn't respond, the response of another server (which has exactly the same state) is used by the client.
Passive redundancy	Maintain several servers alongside one main server. All servers are on and share the same state, so they can process the client request equivalently. Non-main servers only attempt to process the client request when required.	If the main server doesn't respond, the request is sent to the next available server, and so on until the request can be served.
Voting	Maintain a voting system between the client and the servers.	The client sends a request to the voting system, which collects all servers' responses. The answer that the majority agrees on is passed to the client. Servers in the minority are suspected of having failed.
Spare	Mostly manual replacement of a malfunctioning server or of components of the server.	This takes time. When the main server is down, the spare tactic means either booting up a dormant server or replacing faulty components with working ones.
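The voting row can be sketched as a small function that takes each server's response, returns the majority answer, and lists the servers that disagreed. This is only an illustration: the function name, the dictionary input and the choice to raise an error when there is no strict majority are all assumptions, and responses are assumed to be hashable values.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def vote(responses: Dict[str, Hashable]) -> Tuple[Hashable, List[str]]:
    """Return (majority answer, servers suspected of having failed)."""
    if not responses:
        raise ValueError("no responses to vote on")
    answer, count = Counter(responses.values()).most_common(1)[0]
    if count <= len(responses) / 2:
        # Assumption: without a strict majority we refuse to answer.
        raise RuntimeError("no majority among server responses")
    suspects = sorted(s for s, r in responses.items() if r != answer)
    return answer, suspects
```

For instance, `vote({"s1": 42, "s2": 42, "s3": 41})` returns `(42, ["s3"])`: the client receives 42 and s3 is flagged for investigation.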

Tactic 3: after discovering fault, we want to fix it then reintroduce the fixed server to system

The following actions are all probably necessary to get a fixed server back on the job.

Tactics Method	Description
Shadow	Run the fixed server in parallel with the currently running servers, and check that their behaviours are equivalent
State resynchronisation	Ensure that the state of the reintroduced, fixed server is the same as the state of the currently running servers
Rollback	Roll back all the states of the system to valid states.
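To make the shadow tactic concrete, the sketch below replays the same requests against the running server and the fixed server and collects every request on which they disagree. Both servers are modelled as plain callables, which is an assumption made for brevity; `shadow_check` is a hypothetical helper name.

```python
from typing import Any, Callable, Iterable, List, Tuple


def shadow_check(
    primary: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    requests: Iterable[Any],
) -> List[Tuple[Any, Any, Any]]:
    """Return (request, primary result, candidate result) for each mismatch."""
    mismatches = []
    for request in requests:
        expected = primary(request)  # this result is what the client gets
        try:
            actual = candidate(request)  # the shadow result is only compared
        except Exception as exc:
            mismatches.append((request, expected, repr(exc)))
            continue
        if actual != expected:
            mismatches.append((request, expected, actual))
    return mismatches
```

An empty result over a representative set of requests is the signal that the fixed server behaves equivalently and can be reintroduced.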

Exercise: Apply availability tactics

Imagine the scenario below:

This is the thought process to answer this question:

The purpose of the tactic is to make sure that the stimulus (crash) produces the response (recording the failure), while satisfying the measure (within 10 minutes, no loss of service).

The measure requires:

1. No loss of service => redundancy is needed (a server has crashed)

- Automated redundancy: with active/passive redundancy, service is not lost (the loss is almost invisible to the client).

2. Failure recorded within 10 minutes => failure detection is needed

- Failure detection: heartbeat/ping detects the failure, provided the HB/ping period is less than 10 minutes.

- Switching to an available server might itself require failure detection, so the HB/ping period might have to be much shorter than that.
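The reasoning about the heartbeat period can be checked with a little arithmetic. The sketch assumes a simple model that is not in the original exercise: the monitor declares a failure after a fixed number of missed heartbeats, and it scans for silent workers at its own check interval, so the worst-case detection time is roughly the timeout plus one check interval. All names and numbers are illustrative.

```python
def worst_case_detection_s(
    period_s: float, missed_beats_allowed: int, check_interval_s: float
) -> float:
    """Upper bound on crash-to-detection time under the assumed model."""
    timeout_s = period_s * missed_beats_allowed
    return timeout_s + check_interval_s


def meets_deadline(
    period_s: float,
    missed_beats_allowed: int,
    check_interval_s: float,
    deadline_s: float = 600.0,  # the 10 minute measure from the exercise
) -> bool:
    return (
        worst_case_detection_s(period_s, missed_beats_allowed, check_interval_s)
        <= deadline_s
    )
```

With a 30 second period, 3 missed beats and a 10 second check interval, the bound is 100 seconds, well within 10 minutes. A 5 minute period with the same settings gives 910 seconds and misses the measure even though the period itself is under 10 minutes.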

Requirement 2: Performance

Performance: the time taken to correctly answer a client request

Performance depends on processing time (time during which the system is working towards the response) and blocked time (time during which no progress is made towards the response).

Processing time

Factor 1: Resource consumption.

Description: Resource here refers to hardware (CPU, data storage, memory, network) and software (buffer management). It is important to consider the impact of resource consumption on processing time.

Examples: Overloading the CPU increases processing time. Heavy memory consumption can lead to a software crash, and hence to increased processing time.

Factor 2: Work that is not directly related to the task

Examples: You might already have prepared the response to a client request, but you still need to marshal the data to send it over the network, which takes time.

Blocked time

Factor 1: Multiple clients needing the same resource -> contention, deadlock, delays

Factor 2: Resource unavailable -> the resource could be a component in a failure state

Factor 3: Dependency on other components -> waiting for other components to finish their computation

Performance tactics

We want to ensure that an event (a stimulus, which can be periodic, sporadic or stochastic) is processed within some time bound.

Tactic 1: Reduce demand for resource (Resource control)

There are three ways to reduce the demand for resources:

1. Reduce resources required for processing stimuli

Tactic action	For example...	How it helps performance
Increase computational efficiency (e.g. use a better algorithm)	Reduces the resources needed	Reduces CPU/memory load
Reduce computational overhead	Minimise inter-component communication	Less time is spent on communication rather than on the task itself

2. Decrease number of events to process

Tactic action	For example...	How it helps performance
Decrease the frequency of sampling	Where real-time data updates are not necessary	Less sampling means less blocking/processing time
Reduce number of events to process	- Drop new requests while the server is under load (this can blunt a DDoS attack) - Use a cache	Fewer requests compete for the same resources
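The "use a cache" entry can be illustrated with `functools.lru_cache` from the Python standard library: repeated requests for the same item are answered from memory, so the expensive backend is hit only once per distinct item. The function `product_details` and the `backend_calls` counter are hypothetical stand-ins for a slow lookup.

```python
from functools import lru_cache

# Counts how often the (pretend) expensive backend is actually reached.
backend_calls = {"count": 0}


@lru_cache(maxsize=128)
def product_details(product_id: int) -> str:
    backend_calls["count"] += 1  # stands in for a slow database query
    return f"details for product {product_id}"
```

Calling `product_details(1)` twice and `product_details(2)` once reaches the backend only twice; `product_details.cache_info()` reports the hits and misses. The trade-off is that cached answers can become stale, so this suits data that changes rarely.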

3. Control resources consumed

Tactic action	For example...	How it helps performance
Bound execution time	If processing a request takes too long, we simply stop processing it (note that the system's specification might not allow this). E.g. a busy shop vendor saying 'please come back later'.	No single request can hold resources indefinitely
Bound queue sizes	Keep a queue of requests, and drop new requests when the queue is full. The queue size can be increased if more requests need to be processed.	Caps the waiting time and memory that queued requests can consume
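A bounded queue is straightforward to sketch with the standard library's `queue.Queue`, whose `put_nowait` raises `queue.Full` once `maxsize` items are waiting. The wrapper class below and its names are assumptions for illustration; it drops requests that arrive while the queue is full and counts them.

```python
import queue
from typing import Any, Optional


class BoundedIntake:
    """Accepts requests until the queue is full, then drops new ones."""

    def __init__(self, max_pending: int) -> None:
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def submit(self, request: Any) -> bool:
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            self.dropped += 1  # the caller could be told to retry later
            return False

    def next_request(self) -> Optional[Any]:
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None
```

With `max_pending=2`, the third request submitted before any is processed is rejected; once a request is taken off the queue, new ones are accepted again.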

Tactic 2: Resource management (Response side)

Methods	Description
Increase computational efficiency	Do the searching on the database side rather than loading everything in. Use a faster search algorithm.
Increase available resources	Add a new server, or upgrade the server's configuration with more memory or a better CPU; we might move to the cloud. We can have many servers to support concurrent searches, placed behind a load balancer.
Database efficiency	Keep multiple copies of the database: with a copy of the data on every server, there is no need to connect to a remote database, since the local copy can be used.
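The load balancer mentioned in the table can follow many policies; the notes do not say which one, so the sketch below assumes the simplest, round robin, where requests are handed to the servers in turn. The class name and server labels are illustrative.

```python
import itertools
from typing import Iterator, Sequence


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers: Sequence[str]) -> None:
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle: Iterator[str] = itertools.cycle(list(servers))

    def pick(self) -> str:
        # Each call returns the next server, wrapping around at the end.
        return next(self._cycle)
```

With servers a, b and c, five consecutive picks give a, b, c, a, b. Round robin ignores how busy each server is; a real deployment might weight servers or skip ones that the failure detector has flagged.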

Tactic 3: Resource arbitration

Decides which request to allocate a resource to, i.e. sets the resource allocation policy.

Example 1: apply performance tactic to online shopping

Example 2: resource arbitration

Allocating a resource to a time-sensitive process, e.g. an alarm when there is a fire:

- Use either semantic importance (better) or an earliest-deadline-first policy (setting the alarm's deadline very early so that it is processed first) to schedule the alarm.
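The two policies can be compared with a small priority-queue sketch built on `heapq`. The `Task` fields, the key functions and the example values are assumptions made for illustration: the same scheduler orders tasks either by semantic importance (highest first) or by earliest deadline first.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Task:
    name: str
    importance: int    # higher means semantically more important
    deadline_s: float  # seconds from now by which it should run


def by_importance(task: Task) -> float:
    return -task.importance  # negate so the most important pops first


def by_deadline(task: Task) -> float:
    return task.deadline_s   # earliest deadline pops first


class Scheduler:
    def __init__(self, key: Callable[[Task], float]) -> None:
        self._key = key
        self._heap: List[Tuple[float, int, Task]] = []
        self._order = itertools.count()  # tie-breaker keeps arrival order

    def submit(self, task: Task) -> None:
        heapq.heappush(self._heap, (self._key(task), next(self._order), task))

    def next_task(self) -> Optional[Task]:
        return heapq.heappop(self._heap)[2] if self._heap else None
```

If a report task (importance 1, deadline 5 s) and a fire alarm (importance 10, deadline 60 s) are both waiting, the importance policy runs the alarm first, while earliest deadline first runs the report first. That is why, under earliest deadline first, the alarm's deadline has to be set very early for it to be served first.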


Hajin's blog
