Quality Attributes of Computer Programs: Implement Software Robustness
- August 15, 2024
- 11 min read
- Software quality
Introduction to Robustness
Robustness is a crucial software quality attribute that measures how well software functions under adverse conditions.
This article explores key aspects to consider when making your software robust.
The IEEE system and software engineering vocabulary standard defines robustness as the:
degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions.
ISO/IEC/IEEE 24765:2017 Systems and software engineering — Vocabulary
This definition highlights a system’s ability to withstand external stressors. To fully grasp the scope of software robustness, we break it down into two key attributes: error tolerance and fault tolerance.
Error tolerance
Ability of a system or component to continue normal operation despite the presence of erroneous inputs.
ISO/IEC/IEEE 24765:2017 Systems and software engineering — Vocabulary
Error tolerance is essential for robust software, ensuring that a system continues to function correctly even when it encounters invalid or unexpected inputs.
Software with good error tolerance should not fail or enter an unexpected state when faced with invalid input.
​
Implementing Error Tolerance
To make a program error-tolerant:
- Input Validation: Validate and sanitize all inputs (including configurations) before processing. Check for the correct format, range, and type. Error handling is the usual mechanism for rejecting values that fail these checks.
- Error Handling: Use try-except blocks (or equivalents) to catch and handle potential input errors gracefully. When an error is caught, report it to the user.
- User Feedback: Provide clear, informative error messages or logs that state what was wrong with the input.
- Fail-Safe Mechanisms: Ensure that the software defaults to a safe state when input errors are encountered, for example by asking again or falling back to a known-good default.
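The same four steps apply to configuration values, not only to interactive input. Here is a minimal sketch, assuming a hypothetical `timeout_seconds` setting, an illustrative default of 5 seconds, and an illustrative upper bound of 300 seconds. None of these come from the program discussed in this article.

```python
import logging

DEFAULT_TIMEOUT_SECONDS = 5.0  # hypothetical safe default


def read_timeout(config: dict) -> float:
    """Return a validated timeout, falling back to a safe default."""
    raw = config.get("timeout_seconds")
    try:
        # Input validation: type/format check, then range check
        timeout = float(raw)
        if not (0 < timeout <= 300):  # also rejects NaN and infinity
            raise ValueError("timeout must be in (0, 300] seconds")
    except (TypeError, ValueError) as e:
        # User feedback, then fail-safe: log the problem and use the default
        logging.warning("Invalid timeout_seconds=%r (%s); using default %s",
                        raw, e, DEFAULT_TIMEOUT_SECONDS)
        return DEFAULT_TIMEOUT_SECONDS
    return timeout
```

Falling back to a default suits settings where a conservative value is always acceptable. For settings where a wrong value could be harmful, refusing to start may be the safer state.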
​
Example: Handling Score Input Errors
Consider a simple program that reads a student’s score (0-100) and outputs the corresponding grade (A, B, C, D, F). It ensures robustness by handling input errors like non-numeric values or scores outside the valid range.
import bisect

def score_to_grade():
    while True:
        # Read the input
        score_str = input("Enter the student's score (0-100): ")
        try:
            # Convert the input to float
            score = float(score_str)
            # Check the score's range
            if not (0 <= score <= 100):
                raise ValueError("Score must be between 0 and 100")
        except ValueError as e:
            # Handle error from conversion or range check
            print(f"Invalid input: {e}. Please try again.")
        else:
            # All went well (input validated). Sanitized input is `score`
            break

    # Breakpoints for grades
    breakpoints = [60, 70, 80, 90]
    # Corresponding grades
    grades = ["F", "D", "C", "B", "A"]
    # Find the appropriate grade using bisect
    the_grade = grades[bisect.bisect(breakpoints, score)]

    print(f"The student's grade for {score_str} is {the_grade}")

if __name__ == "__main__":
    score_to_grade()
In this example, error tolerance is implemented as follows:
- Convert the input to a float and check the score is within 0-100.
- Catch any conversion or out-of-range error.
- Inform the user and prompt the user to try again.
​
Fault tolerance
1. degree to which a system, product or component operates as intended despite the presence of hardware or software faults
2. pertaining to the study of errors, faults, and failures, and of methods for enabling systems to continue normal operation in the presence of faults.
ISO/IEC/IEEE 24765:2017 Systems and software engineering — Vocabulary
Fault tolerance refers to a system’s ability to continue functioning correctly even when some of its components fail. This concept is vital in designing reliable software systems.
Fault tolerance is linked to the robustness of the non-failed software components: their ability to withstand the stress caused by failed components.
It involves dependency-fault tolerance, environmental tolerance, and failure tolerance.
​
Dependency-Fault Tolerance
Dependency-fault tolerance involves ensuring that faults in dependencies (like external libraries) do not cause software-wide failures.
​
Strategies to Enhance Dependency-Fault Tolerance
To improve dependency-fault tolerance, apply error tolerance strategies to the return values of API calls to your dependencies.
​
Example: Handling Faults in a Library Call
When your software depends on an external library to perform tasks, it’s important to validate the returned values to make sure they make sense.
Let’s say you’re adding a feature to the above “score to grade” program where the score is sent to a library that fetches the school ranking for that score from a remote server.
import logging

def get_ranking(api_client, score):
    try:
        # Pass the score so the library can look up the matching ranking
        ranking = api_client.get_school_ranking(score)

        # Fault tolerance: check that the ranking is a valid positive integer
        # (the exact type check also rejects booleans, which subclass int)
        if type(ranking) is not int or ranking <= 0:
            raise ValueError("Received an invalid ranking value")

        return ranking

    except (ValueError, ConnectionError) as e:
        # Handle the fault: log the error and return a safe default value
        logging.error(e)
        return None
In this example:
- The `get_ranking` function uses an external API client to fetch ranking data.
- Dependency-fault tolerance is implemented by checking if the returned ranking is a positive integer. If not, a `ValueError` is raised, which is then caught and handled.
- The error is logged, and a default value is returned, ensuring that an issue with the external library doesn’t cause the whole program to fail.
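Returning `None` as a default is only safe if the caller checks for it. The sketch below shows one way the calling code might do that. The `format_result` helper and its message wording are hypothetical, added here for illustration only.

```python
from typing import Optional


def format_result(grade: str, ranking: Optional[int]) -> str:
    """Build the user-facing message; the ranking is optional."""
    if ranking is None:
        # The dependency failed: still report the grade, which is the core result
        return f"Grade: {grade} (school ranking unavailable)"
    return f"Grade: {grade}, school ranking: {ranking}"
```

With this shape, a fault in the ranking library costs the user one optional detail rather than the whole answer.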
​
Environmental Tolerance
Environmental tolerance ensures that software continues functioning correctly under stressful or adverse conditions, such as resource limitations, resource failures, or hardware failures.
​
Strategies to Enhance Environmental Tolerance
To improve environmental tolerance in your software, consider the following approaches:
​
Approach 1 - Handling
- Timeouts: When accessing external resources, use timeouts to prevent the program from hanging indefinitely.
- Retries: Implement retry logic to handle transient errors in resource access, ensuring that temporary issues do not lead to permanent failures.
- User Feedback: Provide clear, informative error messages or logs.
- Redundancy (system-level implementation): Incorporate redundancy in resource management, such as using fallback servers or redundant data storage systems, to ensure continuity of service even when some resources fail.
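Retry logic is often factored into a small reusable helper. The following is a minimal sketch, assuming exponential backoff (doubling the wait after each failure) and treating only `ConnectionError` and `TimeoutError` as transient. The helper name `retry` and its parameters are hypothetical.

```python
import time


def retry(operation, attempts=3, base_delay=0.5,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the failure
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Exceptions outside `retry_on` propagate immediately, so programming errors are not retried. The `sleep` parameter exists only so the waiting can be replaced in tests.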
​
Example: Handling
When a program attempts to access a resource, such as a file or a network service, and that resource is unavailable or inaccessible, the program should handle this gracefully.
This could involve retrying the access, switching to a backup resource, or providing a clear error message. The actual strategy depends on the type of software.
Let’s look at how we might implement the library used in our “score to grade” program. This library uses a TCP socket connection to communicate with a remote server to get the school ranking based on the score.
import socket
import time
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')

PRIMARY_SERVER = ('primeranking.buildsoftwaresystems.com', 7000)
BACKUP_SERVER = ('secondranking.buildsoftwaresystems.com', 7000)
RETRY_COUNT = 3
RETRY_DELAY = 2  # seconds

def connect_to_server_and_get_rank(server_address, score):
    """Attempts to connect to the specified server."""
    try:
        with socket.create_connection(server_address, timeout=5) as sock:
            logging.info(f"Successfully connected to {server_address[0]}:{server_address[1]}")
            # TODO: Send the score to the remote server and receive the school rank
            rank = ...

            return rank
    except (socket.timeout, socket.error) as e:
        logging.error(f"Failed to connect to {server_address[0]}:{server_address[1]} - {e}")
        return None

def get_school_rank(score):
    for attempt in range(RETRY_COUNT):
        rank = connect_to_server_and_get_rank(PRIMARY_SERVER, score)
        if rank is not None:
            return rank
        logging.info(f"Retrying connection to primary server in {RETRY_DELAY} seconds... ({attempt + 1}/{RETRY_COUNT})")
        time.sleep(RETRY_DELAY)

    logging.info("Primary server unavailable. Attempting to connect to backup server...")
    rank = connect_to_server_and_get_rank(BACKUP_SERVER, score)
    if rank is not None:
        return rank

    raise ConnectionError("Failed to connect to backup server. Please check your network or server status.")
In this example:
- The `connect_to_server_and_get_rank` function tries to connect to the specified server and get the school rank for a given score. If the connection fails, it logs the error and returns None.
- The `get_school_rank` function first tries to get the rank by connecting to the primary server. If that fails after a few retries, it switches to the backup server.
- If both servers are unreachable, it raises a `ConnectionError`. This setup ensures the program can still function even if the primary server is down, while clearly signaling an issue if both servers fail.
​
Approach 2 - Prevention
- Resource Monitoring: Implement monitoring systems that track the usage of critical resources like memory, disk space, and CPU. This allows the software to take proactive measures before resource exhaustion leads to failure.
- Graceful Degradation: Design your software to degrade gracefully under resource constraints. For example, when memory is low, the software could disable non-essential features or reduce its functionality while still maintaining core operations.
- User Feedback: Provide clear, informative error messages or logs.
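Graceful degradation can be as simple as deciding which features to enable from a resource reading. The sketch below assumes a hypothetical 500 MiB free-disk threshold and made-up feature names (`serve_requests`, `access_log`, `thumbnail_cache`). Only `shutil.disk_usage` is a real library call.

```python
import shutil

MIN_FREE_BYTES = 500 * 1024 * 1024  # hypothetical threshold: 500 MiB


def select_features(free_bytes, min_free_bytes=MIN_FREE_BYTES):
    """Return the set of features to enable for the given free disk space."""
    features = {"serve_requests"}  # core operation, always on
    if free_bytes >= min_free_bytes:
        # Non-essential extras are enabled only when disk space allows
        features |= {"access_log", "thumbnail_cache"}
    return features


def current_features(path="/"):
    """Read the free disk space for `path` and pick the feature set."""
    return select_features(shutil.disk_usage(path).free)
```

Keeping the decision (`select_features`) separate from the measurement (`current_features`) makes the degradation rule easy to test without filling a disk.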
​
Example: Prevention
A server that logs its activity to files can monitor disk usage and automatically disable logging if disk space becomes critically low.
Before disabling, it should log a warning message. Logging can be re-enabled when disk space returns to a safe level.
import shutil
import time
import logging
import threading

logging.basicConfig(filename='server.log', level=logging.INFO, format='%(asctime)s - %(message)s')

def get_disk_usage(path='/'):
    """Function to check disk usage percentage"""
    total, used, _free = shutil.disk_usage(path)
    return (used / total) * 100

def monitor_disk_usage():
    """Main loop to monitor disk usage and control logging"""
    thresholds = {'high': 90.0, 'low': 85.0}

    while True:
        disk_usage = get_disk_usage()
        # False once logging.disable(logging.CRITICAL) has been called
        logging_enabled = logging.getLogger().isEnabledFor(logging.CRITICAL)
        if disk_usage >= thresholds['high'] and logging_enabled:
            # Log a warning before going silent
            logging.warning("Disk usage critically high. Disabling logging.")
            # Disable logging at all levels up to and including CRITICAL
            logging.disable(logging.CRITICAL)
        elif disk_usage < thresholds['low'] and not logging_enabled:
            # Re-enable logging
            logging.disable(logging.NOTSET)
            logging.info("Disk usage back to safe level. Enabling logging.")
        # Wait for a while before checking again
        time.sleep(5)

def server_logic():
    # placeholder for server logic
    while True:
        logging.info("sleeping 1s")
        time.sleep(1)

if __name__ == "__main__":
    threading.Thread(target=monitor_disk_usage).start()
    threading.Thread(target=server_logic).start()
​
Failure Tolerance
Failure tolerance is the ability of software to continue functioning correctly even when certain components (processes, threads, async tasks, etc.) or the whole software fail, that is, when they cannot provide the required service to the user.
​
Strategies to Enhance Failure Tolerance
To improve failure tolerance in your software, consider implementing the following approaches:
- Graceful Degradation: Allow the software to continue operating with reduced functionality in the event of a partial failure. Implement this by splitting the software into independent tasks that do not directly affect each other at runtime (e.g. OS processes).
- Monitoring and Alerts: Set up monitoring to detect issues early and trigger alerts for quick response.
- Automatic Recovery: Implement checkpointing and restart failed processes so that they continue the operation from a known good state.
- Redundancy and Replication (system-level implementation): Ensure that critical services have backups, so if one fails, another can take over.
- Automatic Failover (system-level implementation): Implement systems that can automatically switch to backup components when a failure is detected.
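Checkpointing for automatic recovery can be sketched with a small progress file. This is a minimal illustration, assuming a job that processes a list of items in order. The function names and the JSON layout (`{"next_index": N}`) are hypothetical.

```python
import json
import os


def load_checkpoint(path):
    """Return the index of the next item to process (0 if no usable checkpoint)."""
    try:
        with open(path, encoding="utf-8") as f:
            next_index = json.load(f)["next_index"]
    except (OSError, ValueError, KeyError, TypeError):
        return 0  # Missing or corrupt checkpoint: start from the beginning
    return next_index if isinstance(next_index, int) and next_index >= 0 else 0


def save_checkpoint(path, next_index):
    """Write to a temporary file, then rename it over the old checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
    # A crash during the write above leaves the previous checkpoint untouched
    os.replace(tmp_path, path)


def process_all(items, handle, checkpoint_path):
    """Process items in order, resuming after the last completed one."""
    for index in range(load_checkpoint(checkpoint_path), len(items)):
        handle(items[index])
        save_checkpoint(checkpoint_path, index + 1)
```

If the process dies between `handle` and `save_checkpoint`, the item is handled again after the restart, so `handle` should tolerate being called twice for the same item.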
​
Example: Watchdog Process
To make an unreliable program more robust, a watchdog process can monitor and automatically restart it if it crashes.
import subprocess
import time

def start_unreliable_process(command_list):
    return subprocess.Popen(command_list)

def watchdog(unreliable_process, command_list):
    while True:
        # Check if the process is still running
        if unreliable_process.poll() is not None:
            # If the process has terminated, restart it
            print("Unreliable process terminated. Restarting...")
            unreliable_process = start_unreliable_process(command_list)
        time.sleep(5)  # Monitor every 5 seconds

def run_unreliable_process(command_list):
    # Start the unreliable process
    unreliable_process = start_unreliable_process(command_list)

    # Start the watchdog to monitor and restart if necessary
    watchdog(unreliable_process, command_list)
In this example:
- The `watchdog` function monitors the unreliable process by checking if it’s still running.
- If the unreliable process crashes, the watchdog detects this and calls `start_unreliable_process` to restart the program.
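A watchdog that restarts unconditionally can loop forever on a program that crashes at startup. One possible variation, sketched below, caps the number of restarts and stops when the program exits cleanly. The `supervise` function, its parameters, and the choice to treat exit code 0 as "done" are assumptions for illustration.

```python
import subprocess
import time


def supervise(command_list, max_restarts=3, poll_interval=0.1):
    """Run a command, restarting it on non-zero exit up to max_restarts times.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    process = subprocess.Popen(command_list)
    while True:
        exit_code = process.poll()
        if exit_code is None:
            time.sleep(poll_interval)  # Still running: check again later
        elif exit_code == 0:
            return restarts  # Clean exit: nothing to restart
        elif restarts >= max_restarts:
            # Give up and surface the failure instead of looping forever
            raise RuntimeError(
                f"giving up after {restarts} restarts (exit code {exit_code})")
        else:
            restarts += 1
            process = subprocess.Popen(command_list)
```

Raising after the limit hands the problem to a higher level, such as an alerting system or an operator, which fits the monitoring and alerts strategy listed earlier.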
Conclusion
Robust software can handle all types of inputs and responses from external resources, even under unexpected conditions or failures.
To achieve robustness, software must be error-tolerant and fault-tolerant, including environmental and failure tolerance. It must:
- Handle All Forms of Inputs: Validate and sanitize inputs to avoid unexpected behaviors.
- Check for Resource Access Errors: Verify access to external resources and handle issues gracefully.
- Implement Retry Mechanisms: For service programs, retry resource access in case of failure.
- Ignore Non-Critical Resource Access Errors: If a retry is not appropriate or feasible, handle resource access errors in a way that minimizes disruption.
- Use Timeouts: Prevent operations from hanging indefinitely by implementing timeouts.
- User Feedback: Provide clear, informative error messages or logs.
For more on robustness and related quality attributes, see this article.
Do you consider robustness in your software development? How do you implement it?
Let us know in the comments below.