Here, we discuss two requirements that a software system might need to meet, and the tactics that can be used to meet them.
Requirement 1: Availability
Availability: the proportion of the time for which clients' requests can be served
We want ATMs to work 24/7, as per their specification. We expect a desktop application to work from 8am to 12pm, so we don't care what happens outside those hours. Availability is concerned with whether the system is available to carry out what its specification says it will do. It is fine for a system to be unavailable during a maintenance period that the client has already agreed to as part of the specification: if the system is available at all other times, then it is 100% available.
Availability is inextricably linked to failure (externally observed fault), because we can define unavailability of the service as the situation in which failure occurs. Conversely, if the faults are 100% masked, then we will not see failure, so the system is 100% available.
Note: a fault is a part of the code that has the potential to cause incorrect behaviour
To summarise, availability is a function of failure. Specifically, we want to maximise the mean time to failure and minimise the mean time to repair. Failures fall into two categories: incorrect behaviour of the system (incorrect output, incorrect timing) and unavailable service (no output is produced at all).
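The link between the two mean times and availability can be made concrete with the commonly used steady-state formula, availability = MTTF / (MTTF + MTTR). The formula is not stated in the notes above, and the function name and figures below are purely illustrative; treat this as a minimal sketch.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up.

    mttf_hours: mean time to failure, mttr_hours: mean time to repair.
    """
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("mean times must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: a failure every 999 hours on average, 1 hour to repair.
baseline = availability(999.0, 1.0)
# Halving the repair time raises availability without touching the failure rate.
faster_repair = availability(999.0, 0.5)
print(baseline, faster_repair > baseline)
```

Raising MTTF and lowering MTTR both push the ratio towards 1, which is why the tactics below focus on detecting faults early and recovering quickly.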
Availability tactics
We want to devise tactics to prevent, mask and repair faults so that a fault does not manifest as a failure.
Tactic 1: To prevent faults, we need to detect them
We can have a central monitor that notifies us when faults are discovered in the worker components. How do we discover faults?
Tactic method | Mechanism | Description | Evaluation |
Ping | Request from the monitor, response from the worker component | Every x seconds the monitor asks 'are you alive?' and the component answers 'yes, I am' | Requires a round trip of packets between the monitor and the worker component |
Heartbeat | Worker component reports to the monitor (one-way only) | The worker is responsible for notifying the monitor that it is alive | Requires one packet to travel from the worker to the monitor |
Self-test | Worker tests its own functionality and reports to the monitor if it malfunctions (the monitor could also tell the worker to self-test, although this wasn't mentioned) | Relies on the component to test itself | |
Exception | Error in a component -> exception is raised -> the related part (e.g. the monitor) is informed of the problem | The informed part takes the necessary action | |
Timestamp | Client puts a timestamp on each packet | Lets the receiver check that events are received in the sequence in which they were sent (packets go from client to server) | |
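The heartbeat row can be sketched as a small monitor that remembers when each worker last reported and flags the ones that have gone silent. The class name, the timeout value and the injectable clock are assumptions made for illustration, not part of the original notes.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per worker and flags workers that went silent."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the example is deterministic
        self.last_seen = {}         # worker id -> time of the last heartbeat

    def beat(self, worker_id: str) -> None:
        """Called whenever a heartbeat from `worker_id` arrives."""
        self.last_seen[worker_id] = self.clock()

    def suspected_failed(self) -> list:
        """Workers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(w for w, t in self.last_seen.items()
                      if now - t > self.timeout_s)


# Deterministic demo with a fake clock instead of real waiting.
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=3.0, clock=lambda: fake_now[0])
monitor.beat("worker-a")
monitor.beat("worker-b")
fake_now[0] = 2.0
monitor.beat("worker-a")            # worker-b stays silent from here on
fake_now[0] = 4.0
print(monitor.suspected_failed())   # ['worker-b']
```

A ping-based monitor would look similar, except that the monitor initiates each exchange and waits for the reply, which costs the round trip mentioned in the table.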
Overhead with Ping and Heartbeat
Ping and heartbeat both have overhead, since each one is a packet that has to be sent.
Solution: piggyback the heartbeat/ping on the usual messages that already carry task-relevant data, so that no extra packets are needed and network overhead is minimised.
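One way to picture piggybacking is to attach liveness fields to a message that was going to be sent anyway, and to strip them off again on the monitor side. The field names `hb_worker` and `hb_time` and both helper functions are hypothetical; this is a sketch of the idea rather than a real protocol.

```python
import time


def with_heartbeat(payload: dict, worker_id: str, now=None) -> dict:
    """Return a copy of an ordinary message with liveness fields attached."""
    message = dict(payload)
    message["hb_worker"] = worker_id
    message["hb_time"] = time.time() if now is None else now
    return message


def split_heartbeat(message: dict):
    """Monitor side: separate the liveness fields from the task data."""
    data = dict(message)
    worker_id = data.pop("hb_worker")
    sent_at = data.pop("hb_time")
    return data, worker_id, sent_at


# The task message {"order": 17} carries the heartbeat for free.
msg = with_heartbeat({"order": 17}, "worker-a", now=100.0)
print(split_heartbeat(msg))
```

This only saves packets while regular traffic is flowing; an idle worker would presumably still need to send a standalone heartbeat.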
Tactic 2: After detecting a fault, we have to repair it (Recovery)
Detecting a fault allows us (or an autonomous system) to intervene and fix it. However, to ensure availability, we don't want that intervention to show up as a delayed response. In other words, we want the system to keep producing the correct output for the client's request with the correct timing; this is also called masking the fault. We usually do so through redundancy (having many servers that can provide the same service).
Tactic method | Mechanism | Description |
Active redundancy | Maintain several active servers alongside one main server. All servers are on and share the same state, so each can process a client request equivalently. They all process every client request at the same time | If the main server doesn't respond, the response of another server (which has exactly the same state) is used by the client |
Passive redundancy | Maintain several servers alongside one main server. All servers are on and share the same state, so each can process a client request equivalently. Non-main servers only process a client request when required | If the main server doesn't respond, the request is sent to the next available server, and so on until the request is served |
Voting | Place a voter between the client and the servers | The client sends a request to the voter, which collects every server's response. The answer that the majority agrees on is passed to the client, and any server in the minority is suspected of having failed |
Spare | Mostly manual replacement of a malfunctioning server or of its components | This takes time. When the main server is down, the spare tactic either boots up a dormant server or swaps faulty components for working ones |
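Two of the rows above, passive redundancy and voting, can be sketched in a few lines. Servers are modelled here as plain Python callables and a crashed server as one that raises `ConnectionError`; these modelling choices and all the names are assumptions for illustration.

```python
from collections import Counter


def passive_failover(servers, request):
    """Try the main server first, then each standby in order."""
    for server in servers:
        try:
            return server(request)
        except ConnectionError:
            continue  # treated as "no response": move on to the next server
    raise RuntimeError("no server could serve the request")


def majority_vote(servers, request):
    """Ask every server; return the majority answer and the dissenting indices."""
    answers = [server(request) for server in servers]
    winner, votes = Counter(answers).most_common(1)[0]
    if votes <= len(answers) // 2:
        raise RuntimeError("no majority among the servers")
    suspects = [i for i, answer in enumerate(answers) if answer != winner]
    return winner, suspects


def healthy(x):
    return x * 2


def crashed(x):
    raise ConnectionError("server is down")


def faulty(x):
    return x * 2 + 1  # responds, but with the wrong answer


print(passive_failover([crashed, healthy], 21))        # 42
print(majority_vote([healthy, faulty, healthy], 21))   # (42, [1])
```

Failover masks a server that does not respond at all, while voting also masks a server that responds incorrectly, at the cost of running every request on every server.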
Tactic 3: After discovering a fault, we want to fix it and then reintroduce the fixed server to the system
The following actions are all likely to be needed to get a fixed server back on the job.
Tactic method | Description |
Shadow | Run the fixed server in parallel with the currently running servers and check that their behaviours are equivalent |
State resynchronisation | Ensure that the state of the reintroduced, fixed server is the same as the state of the currently running servers |
Rollback | Roll the system's state back to a previous valid state |
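Rollback and state resynchronisation can both be illustrated with a state object that keeps a checkpoint of its last known-valid state. The class, the `resynchronise` helper and the bank-balance example are hypothetical, and real systems would checkpoint to durable storage rather than memory.

```python
import copy


class CheckpointedState:
    """Holds a server's state plus a checkpoint of the last valid state."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Record the current state as the last known-valid one."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Discard the current state and return to the checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def resynchronise(recovered: CheckpointedState, running: CheckpointedState) -> None:
    """Copy the running server's state onto the reintroduced server."""
    recovered.state = copy.deepcopy(running.state)
    recovered.checkpoint()


running = CheckpointedState({"balance": 100})
running.state["balance"] = 250
running.checkpoint()
running.state["balance"] = -5      # a fault corrupts the state
running.rollback()                 # back to the last valid state

fixed = CheckpointedState({"balance": 0})   # freshly repaired server
resynchronise(fixed, running)
print(running.state, fixed.state)
```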
Exercise: Apply availability tactics
Imagine the scenario below:
This is the thought process to answer this question:
The purpose of the tactics is to make sure that the stimulus (a crash) produces the response (the failure is recorded) while satisfying the measure (within 10 minutes, with no loss of service).
The measure here requires:
1. No loss of service => availability: redundancy (the server crashed)
- Automated redundancy: active/passive redundancy doesn't lose service (the switch-over is almost invisible to the client)
2. Failure recorded within 10 minutes => need failure detection
- Failure detection: heartbeat/ping detects the failure, provided the heartbeat/ping period is less than 10 minutes.
- Switching to an available server might itself rely on failure detection, so the heartbeat/ping period might have to be much shorter than that.
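The constraint on the heartbeat period can be checked with a little arithmetic. The model below is a deliberate simplification: the worker is assumed to crash right after sending a heartbeat, the monitor declares a failure after a chosen number of missed heartbeats, and network delay is ignored. The function name and the example settings are made up for illustration.

```python
def worst_case_detection_s(period_s: float, missed_beats: int) -> float:
    """Longest delay between a crash and its detection in this simplified model.

    The crash is assumed to happen just after a heartbeat was sent, and the
    monitor declares failure once `missed_beats` expected heartbeats are absent.
    """
    if period_s <= 0 or missed_beats < 1:
        raise ValueError("period must be positive and missed_beats at least 1")
    return period_s * missed_beats


RECORDING_DEADLINE_S = 10 * 60  # "within 10 minutes" from the exercise

# Hypothetical settings: one heartbeat per minute, failure after 3 missed beats.
delay = worst_case_detection_s(period_s=60.0, missed_beats=3)
print(delay, delay < RECORDING_DEADLINE_S)  # 180.0 True
```

If failover itself is triggered by the same detection, the no-loss-of-service requirement would push the period far below this bound.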
Requirement 2: Performance
Performance: the time taken to correctly answer a client request
Performance depends on processing time (the time the system spends working towards the response) and blocked time (the time during which no progress is made towards the response).
Processing time
Factor 1: Resource consumption.
Description: Resources here are hardware (CPU, data storage, memory, network) and software (e.g. buffer management). It is important to consider the impact of resource consumption.
Examples: Overloading the CPU increases processing time. Heavy memory consumption can crash the software, which also increases processing time.
Factor 2: Work that is not directly related to the task
Examples: The response to a client request may already be prepared, but marshalling the data to send it over the network still takes time.
Blocked time
Factor 1: Multiple clients need the same resource -> contention, deadlock, delays
Factor 2: Resource unavailable -> the resource could be a component in a failure state
Factor 3: Dependency on other components -> waiting for other components to finish their computation
Performance tactics
We want to ensure that an event (a stimulus, which can be periodic, sporadic or stochastic) is processed within some time bound.
Tactic 1: Reduce demand for resource (Resource control)
There are three ways to reduce the demand for resources:
1. Reduce resources required for processing stimuli
Tactic action | For example... | How it helps performance |
Increase computational efficiency | Use a better algorithm | Reduces the resources needed; reduces CPU/memory load |
Reduce computational overhead | Minimise intercomponent communication | Less time is spent on work that is not part of the task itself |
2. Decrease number of events to process
Tactic action | For example... | How it helps performance |
Decrease the frequency of sampling | Where real-time data updates are not necessary | Less sampling means less blocked and processing time |
Reduce the number of events to process | The impact of a DDoS attack can be limited if the server simply drops new requests while under load; use a cache | Fewer events reach the processing stage |
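The caching example can be sketched with Python's standard `functools.lru_cache`: repeated requests for the same item are answered from the cache, so the expensive path runs only once. The `product_details` function and the call counter are hypothetical stand-ins for a slow lookup.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive path really runs


@lru_cache(maxsize=128)
def product_details(product_id: int) -> str:
    """Stand-in for an expensive lookup, e.g. a database query."""
    calls["count"] += 1
    return f"details for product {product_id}"


for _ in range(5):
    product_details(7)   # five identical requests

print(calls["count"])    # 1
```

The trade-off is staleness: a cached answer may no longer match the underlying data, so this suits data that changes rarely.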
3. Control resources consumed
Tactic action | For example... | How it helps performance |
Bound execution time | If processing a request takes too long, drop that request (note that the system's specification might not allow this). E.g. a busy shop vendor says 'please come back later' | No single request can hold resources indefinitely |
Bound queue sizes | Keep a queue of requests and drop new requests when the queue is full. The queue size can be increased if more requests need to be processed | The backlog, and so the waiting time, stays bounded |
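A bounded queue that drops requests when full can be sketched with the standard `queue.Queue` and its `maxsize` argument. The `try_enqueue` helper and the request names are made up for the example.

```python
import queue


def try_enqueue(pending: queue.Queue, request) -> bool:
    """Add the request if there is room; otherwise drop it and report False."""
    try:
        pending.put_nowait(request)
        return True
    except queue.Full:
        return False


pending = queue.Queue(maxsize=2)   # at most two requests may wait
accepted = [try_enqueue(pending, r) for r in ("r1", "r2", "r3")]
print(accepted)          # [True, True, False]
print(pending.qsize())   # 2
```

Raising `maxsize` accepts more requests at the price of longer waits, which matches the trade-off described in the table.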
Tactic 2: Resource management (Response side)
Method | Description |
Increase computational efficiency | Do the search on the database side rather than loading everything into the application; use a faster search algorithm |
Increase available resources | Add new servers, upgrade a server's configuration with more memory or a better CPU, or move to the cloud. Many servers can support concurrent searches behind a load balancer |
Database efficiency | Keep multiple copies of the database: with a copy of the data on every server, a server can use its local copy instead of connecting to a remote database |
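The load balancer mentioned above can be sketched with a simple round-robin policy, which hands each new request to the next server in turn. Round-robin is only one possible policy and is chosen here for brevity; the class and server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Spreads requests over the servers by picking them in a fixed rotation."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(servers)

    def pick(self):
        """Return the server that should handle the next request."""
        return next(self._rotation)


balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
picks = [balancer.pick() for _ in range(4)]
print(picks)  # ['server-1', 'server-2', 'server-3', 'server-1']
```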
Tactic 3: Resource arbitration
Resource arbitration decides which request a resource is allocated to, by setting a resource allocation policy.
Example 1: apply performance tactic to online shopping
Example 2: resource arbitration
Allocating resources to a time-sensitive process, e.g. an alarm when there is a fire:
- use either a semantic importance policy (better) or an earliest-deadline-first policy (set the alarm's deadline very early so that it is processed first) to schedule the alarm
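The two policies can be compared with a small priority queue built on the standard `heapq` module. The `Arbiter` class, the request names, and the importance and deadline values are all invented for the example.

```python
import heapq
from itertools import count


class Arbiter:
    """Serves pending requests in the order given by a policy function."""

    def __init__(self, policy):
        self._policy = policy      # maps a request to a sort key (smaller first)
        self._heap = []
        self._order = count()      # tie-breaker: earlier submissions win

    def submit(self, name: str, importance: int, deadline_s: float) -> None:
        request = {"name": name, "importance": importance, "deadline_s": deadline_s}
        heapq.heappush(self._heap, (self._policy(request), next(self._order), request))

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]["name"]


def by_importance(request):
    return -request["importance"]   # most important request first


def by_deadline(request):
    return request["deadline_s"]    # earliest deadline first


def first_served(policy) -> str:
    arbiter = Arbiter(policy)
    arbiter.submit("report", importance=1, deadline_s=5.0)
    arbiter.submit("fire alarm", importance=10, deadline_s=30.0)
    arbiter.submit("backup", importance=2, deadline_s=60.0)
    return arbiter.next_request()


print(first_served(by_importance))  # fire alarm
print(first_served(by_deadline))    # report
```

With these made-up values, earliest-deadline-first serves the report before the alarm, which is why that policy only works for the alarm if its deadline is set very early.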