Quality Attributes of Computer Programs: Implement Software Robustness

Introduction to Robustness

Robustness is a crucial software quality attribute that measures a software system’s ability to function correctly under adverse conditions.

Figure: Software adversities

This article explores key aspects to consider when making your software robust.

The IEEE system and software engineering vocabulary standard defines robustness as the:

degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions.

ISO/IEC/IEEE 24765:2017 Systems and software engineering — Vocabulary

This definition highlights a system’s ability to withstand external stressors. To fully grasp the scope of software robustness, we break it down into two key attributes: error tolerance and fault tolerance.

Figure: Software robustness: stressors and strategies

Error tolerance

Ability of a system or component to continue normal operation despite the presence of erroneous inputs.

ISO/IEC/IEEE 24765:2017 Systems and software engineering — Vocabulary

Error tolerance is essential for robust software, ensuring that a system continues to function correctly even when it encounters invalid or unexpected inputs.

Software with good error tolerance should not fail or enter an unexpected state when faced with invalid input.

Implementing Error Tolerance

To make a program error-tolerant:

  • Input Validation: Validate and sanitize all inputs (including configuration values) before processing. Check for the correct format, range, and type. Error handling is one way to implement these checks.
  • Error Handling: Use try-except blocks (or equivalents) to catch and handle potential input errors gracefully. When an error occurs, follow up with user feedback.
  • User Feedback: Provide clear, informative error messages or logs, then apply a fail-safe mechanism.
  • Fail-Safe Mechanisms: Ensure that the software defaults to a safe state when input errors are encountered.

Example: Handling Score Input Errors

Consider a simple program that reads a student’s score (0-100) and outputs the corresponding grade (A, B, C, D, F). It ensures robustness by handling input errors like non-numeric values or scores outside the valid range.

import bisect

def score_to_grade():
    while True:
        # Read the input
        score_str = input("Enter the student's score (0-100): ")
        try:
            # Convert the input to float
            score = float(score_str)
            # Check the score's range
            if not (0 <= score <= 100):
                raise ValueError("Score must be between 0 and 100")
        except ValueError as e:
            # Handle error from conversion or range check
            print(f"Invalid input: {e}. Please try again.")
        else:
            # All went well (input validated). Sanitized input is `score`
            break

    # Breakpoints for grades
    breakpoints = [60, 70, 80, 90]
    # Corresponding grades
    grades = ["F", "D", "C", "B", "A"]
    # Find the appropriate grade using bisect
    the_grade = grades[bisect.bisect(breakpoints, score)]

    print(f"The student's grade for {score_str} is {the_grade}")

if __name__ == "__main__":
    score_to_grade()

In this example, error tolerance is implemented as follows:

  • Convert the input to a float and check the score is within 0-100.
  • Catch any conversion or out-of-range error.
  • Inform the user and prompt the user to try again.
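The example above re-prompts an interactive user. For inputs that cannot be re-requested, such as configuration values, the fail-safe strategy listed earlier can be sketched as a fallback to a safe default. This is a minimal sketch under stated assumptions: the `pass_mark` setting, `DEFAULT_PASS_MARK`, and `read_pass_mark` are hypothetical names introduced only for illustration.

```python
import logging

# Hypothetical safe default used when the configured value is unusable
DEFAULT_PASS_MARK = 60.0

def read_pass_mark(config: dict) -> float:
    """Return a validated pass mark, falling back to a safe default."""
    raw = config.get("pass_mark", DEFAULT_PASS_MARK)
    try:
        # Validate the type (must convert to float) and the range (0-100)
        value = float(raw)
        if not (0 <= value <= 100):
            raise ValueError(f"pass_mark out of range: {value}")
    except (TypeError, ValueError) as e:
        # User feedback via the log, then fail safe instead of crashing
        logging.warning("Invalid pass_mark %r (%s); using default %s",
                        raw, e, DEFAULT_PASS_MARK)
        return DEFAULT_PASS_MARK
    return value
```

With this sketch, `read_pass_mark({"pass_mark": "abc"})` logs a warning and returns the default rather than raising.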

Fault tolerance

1. degree to which a system, product or component operates as intended despite the presence of hardware or software faults.
2. pertaining to the study of errors, faults, and failures, and of methods for enabling systems to continue normal operation in the presence of faults.

ISO/IEC/IEEE 24765:2017 Systems and software engineering — Vocabulary

Fault tolerance refers to a system’s ability to continue functioning correctly even when some of its components fail. This concept is vital in designing reliable software systems.

Fault tolerance is linked to the robustness of the components that have not failed: they must withstand the stress caused by the components that have.

It involves dependency-fault tolerance, environmental tolerance, and failure tolerance.

Dependency-Fault Tolerance

Dependency-fault tolerance involves ensuring that faults in dependent components (like external libraries) do not cause software-wide failures.

Strategies to Enhance Dependency-Fault Tolerance

To improve dependency-fault tolerance, apply error tolerance strategies to the return values of API calls to your dependencies.

Example: Handling Faults in a Library Call

When your software depends on an external library to perform tasks, it’s important to validate the returned values to make sure they make sense.

Let’s say you’re adding a feature to the above “score to grade” program where the score is sent to a library that fetches the school ranking for that score from a remote server.

import logging

def get_ranking(api_client, score):
    try:
        # Pass the score to the library so it can look up the matching ranking
        ranking = api_client.get_school_ranking(score)

        # Fault tolerance: Check if the ranking is a valid positive integer
        if type(ranking) is not int or ranking <= 0:
            raise ValueError("Received an invalid ranking value")

        return ranking

    except (ValueError, ConnectionError) as e:
        # Handle the fault: log the error and return a safe default value
        logging.error(e)
        return None

In this example:

  • The get_ranking function uses an external API client to fetch ranking data.
  • Dependency-fault tolerance is implemented by checking if the returned ranking is a positive integer. If not, a ValueError is raised, which is then caught and handled.
  • The error is logged, and a default value is returned, ensuring that an issue with the external library doesn’t cause the whole program to fail.
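The same pattern can be factored into a small reusable wrapper when several dependency calls need the same treatment. The following is only a sketch: `call_dependency`, `is_positive_int`, and the choice of `ConnectionError` and `TimeoutError` as expected errors are assumptions made for illustration, not part of any particular library.

```python
import logging

def call_dependency(func, is_valid, default=None,
                    expected_errors=(ConnectionError, TimeoutError)):
    """Call func(), validate its result, and return default on a fault."""
    try:
        result = func()
        # Treat an implausible return value like any other dependency fault
        if not is_valid(result):
            raise ValueError(f"dependency returned an invalid value: {result!r}")
        return result
    except (ValueError, *expected_errors) as e:
        # Log the fault and fall back so the caller keeps running
        logging.error("Dependency fault: %s", e)
        return default

def is_positive_int(value):
    """True only for real integers greater than zero (booleans are rejected)."""
    return type(value) is int and value > 0
```

A call such as `call_dependency(lambda: api_client.get_school_ranking(score), is_positive_int)` would then behave like `get_ranking` above, with the validation rule passed in rather than hard-coded.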

Environmental Tolerance

Environmental tolerance ensures that software continues functioning correctly under stressful or adverse conditions, such as resource limitations, resource failures, or hardware failures.

Strategies to Enhance Environmental Tolerance

To improve environmental tolerance in your software, consider the following approaches:

Approach 1 - Handling

  • Timeouts: When accessing external resources, use timeouts to prevent the program from hanging indefinitely.
  • Retries: Implement retry logic to handle transient errors in resource access, ensuring that temporary issues do not lead to permanent failures.
  • User Feedback: Provide clear, informative error messages or logs.
  • Redundancy (system-level implementation): Incorporate redundancy in resource management, such as using fallback servers or redundant data storage systems, to ensure continuity of service even when some resources fail.

Example: Handling

When a program attempts to access a resource, such as a file or a network service, and that resource is unavailable or inaccessible, the program should handle this gracefully.

This could involve retrying the access, switching to a backup resource, or providing a clear error message. The actual strategy depends on the type of software.

Let’s look at how we might implement the library used in our “score to grade” program. This library uses a TCP socket connection to communicate with a remote server to get the school ranking based on the score.

import socket
import time
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')

PRIMARY_SERVER = ('primeranking.buildsoftwaresystems.com', 7000)
BACKUP_SERVER = ('secondranking.buildsoftwaresystems.com', 7000)
RETRY_COUNT = 3
RETRY_DELAY = 2  # seconds

def connect_to_server_and_get_rank(server_address, score):
    """Attempts to connect to the specified server."""
    try:
        with socket.create_connection(server_address, timeout=5) as sock:
            logging.info(f"Successfully connected to {server_address[0]}:{server_address[1]}")
            # TODO: Send the score to the remote server and receive the school rank
            rank = ...

            return rank
    except OSError as e:
        # In Python 3, socket.timeout and socket.error are OSError (sub)classes
        logging.error(f"Failed to connect to {server_address[0]}:{server_address[1]} - {e}")
        return None

def get_school_rank(score):
    for attempt in range(RETRY_COUNT):
        rank = connect_to_server_and_get_rank(PRIMARY_SERVER, score)
        if rank is not None:
            return rank
        if attempt < RETRY_COUNT - 1:
            # Wait before the next attempt; no delay is needed after the last one
            logging.info(f"Retrying connection to primary server in {RETRY_DELAY} seconds... ({attempt + 1}/{RETRY_COUNT})")
            time.sleep(RETRY_DELAY)

    logging.info("Primary server unavailable. Attempting to connect to backup server...")
    rank = connect_to_server_and_get_rank(BACKUP_SERVER, score)
    if rank is not None:
        return rank

    raise ConnectionError("Failed to connect to backup server. Please check your network or server status.")
In this example:

The connect_to_server_and_get_rank function tries to connect to the specified server and get the school rank for a given score. If the connection fails, it logs the error and returns None.

The get_school_rank function first tries to get the rank by connecting to the primary server. If that fails after a few retries, it switches to the backup server.

If both servers are unreachable, it raises a ConnectionError. This setup ensures the program can still function even if the primary server is down, while clearly signaling an issue if both servers fail.
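The retry loop above waits a fixed delay between attempts. A common variation, which the example above does not use, is exponential backoff with random jitter, so that many clients do not retry in lockstep. The helper below is a sketch under that assumption; `retry_with_backoff` and all of its parameters are hypothetical names chosen for illustration.

```python
import random
import time

def retry_with_backoff(operation, attempts=3, base_delay=0.5, max_delay=8.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # The delay doubles on each attempt, is capped at max_delay,
            # and gets up to 50% random jitter added on top
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists only so the waiting can be replaced in tests. Because `ConnectionError` is a subclass of `OSError`, the default `retry_on` also covers connection failures.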

Approach 2 - Prevention

  • Resource Monitoring: Implement monitoring systems that track the usage of critical resources like memory, disk space, and CPU. This allows the software to take proactive measures before resource exhaustion leads to failure.
  • Graceful Degradation: Design your software to degrade gracefully under resource constraints. For example, when memory is low, the software could disable non-essential features or reduce its functionality while still maintaining core operations.
  • User Feedback: Provide clear, informative error messages or logs.

Example: Prevention

A server that logs its activity to files can monitor disk usage and automatically disable logging if disk space becomes critically low.

Before disabling, it should log a warning message. Logging can be re-enabled when disk space returns to a safe level.

import shutil
import time
import logging
import threading

logging.basicConfig(filename='server.log', level=logging.INFO, format='%(asctime)s - %(message)s')

def get_disk_usage(path='/'):
    """Function to check disk usage percentage"""
    total, used, free = shutil.disk_usage(path)
    return (used / total) * 100

def monitor_disk_usage():
    """Main loop to monitor disk usage and control logging"""
    thresholds = {'high': 90.0, 'low': 85.0}

    while True:
        disk_usage = get_disk_usage()
        # False once logging.disable(logging.CRITICAL) has been called
        logging_enabled = logging.getLogger().isEnabledFor(logging.CRITICAL)
        if disk_usage >= thresholds['high'] and logging_enabled:
            # Log a warning before logging is switched off
            logging.warning("Disk usage critically high. Disabling logging.")
            # Disable logging at all levels up to and including CRITICAL
            logging.disable(logging.CRITICAL)
        elif disk_usage < thresholds['low'] and not logging_enabled:
            # Re-enable logging
            logging.disable(logging.NOTSET)
            logging.info("Disk usage back to safe level. Enabling logging.")
        # Wait for a while before checking again
        time.sleep(5)

def server_logic():
    # placeholder for server logic
    while True:
        logging.info("sleeping 1s")
        time.sleep(1)

if __name__ == "__main__":
    threading.Thread(target=monitor_disk_usage).start()
    threading.Thread(target=server_logic).start()
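The monitor above uses two thresholds (90% to disable, 85% to re-enable) rather than one. A plausible reason is to avoid rapid toggling when usage hovers around a single limit. The sketch below isolates that decision in a pure function so it can be checked without touching the disk; `next_logging_state` is a hypothetical helper, not part of the example above.

```python
def next_logging_state(disk_usage, enabled, high=90.0, low=85.0):
    """Return whether logging should be enabled after this reading."""
    if enabled and disk_usage >= high:
        return False  # crossed the upper threshold: switch off
    if not enabled and disk_usage < low:
        return True   # dropped below the lower threshold: switch back on
    return enabled    # inside the band: keep the current state

# Simulated disk usage readings (percent)
readings = [80.0, 91.0, 88.0, 86.0, 84.0, 89.0]
state = True
history = []
for usage in readings:
    state = next_logging_state(usage, state)
    history.append(state)

print(history)  # [True, False, False, False, True, True]
```

Note that the readings of 88% and 86% do not re-enable logging, and the later 89% does not disable it again: the state changes only when usage leaves the band between the two thresholds.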

Failure Tolerance

Failure tolerance is the ability of software to keep providing its service, or to recover it, when certain components (processes, threads, async tasks, etc.) or the whole program fail, that is, cannot provide the required service to the user.

Strategies to Enhance Failure Tolerance

To improve failure tolerance in your software, consider implementing all the following approaches:

  • Graceful Degradation: Allow the software to continue operating with reduced functionality in the event of a partial failure. Implement this by splitting the software into independent tasks that do not directly affect each other at runtime (e.g. OS processes).
  • Monitoring and Alerts: Set up monitoring to detect issues early and trigger alerts for quick response.
  • Automatic Recovery: Implement checkpointing and restart failed processes so that they resume operation from a known good state.
  • Redundancy and Replication (system-level implementation): Ensure that critical services have backups, so if one fails, another can take over.
  • Automatic Failover (system-level implementation): Implement systems that can automatically switch to backup components when a failure is detected.

Example: Watchdog Process

To make an unreliable program more robust, a watchdog process can monitor and automatically restart it if it crashes.

import subprocess
import time

def start_unreliable_process(command_list):
    return subprocess.Popen(command_list)

def watchdog(unreliable_process, command_list):
    while True:
        # Check if the process is still running
        if unreliable_process.poll() is not None:
            # If the process has terminated, restart it
            print("Unreliable process terminated. Restarting...")
            unreliable_process = start_unreliable_process(command_list)
        time.sleep(5)  # Monitor every 5 seconds

def run_unreliable_process(command_list):
    # Start the unreliable process
    unreliable_process = start_unreliable_process(command_list)

    # Start the watchdog to monitor and restart if necessary
    watchdog(unreliable_process, command_list)

In this example:

  • The watchdog function monitors the unreliable process by checking if it’s still running.
  • If the unreliable process crashes, the watchdog detects this and calls start_unreliable_process to restart the program.

Conclusion

Robust software can handle all types of inputs and responses from external resources, even under unexpected conditions or failures.

To achieve robustness, software must be error-tolerant and fault-tolerant, including environmental and failure tolerance. It must:

  • Handle All Forms of Inputs: Validate and sanitize inputs to avoid unexpected behaviors.
  • Check for Resource Access Errors: Verify access to external resources and handle issues gracefully.
  • Implement Retry Mechanisms: For service programs, retry resource access in case of failure.
  • Ignore Non-Critical Resource Access Errors: If a retry is not appropriate or feasible, handle resource access errors in a way that minimizes disruption.
  • Use Timeouts: Prevent operations from hanging indefinitely by implementing timeouts.
  • User Feedback: Provide clear, informative error messages or logs.

For more on robustness and related quality attributes, see this article.

Do you consider robustness in your software development? How do you implement it?
Let us know in the comments below.
