Software Robustness and Timeout Retry Backoff Paradigms

Software Robustness and Timeout Retry Backoff Paradigms

Table of Contents

Introduction

Computer programs often need to access external resources, like I/O (Input/Output) devices (network interfaces, storage disks) managed by the operating system, and remote services over the Internet.

While these resources enable interaction with the outside world, they are often unreliable due to issues such as network congestion, service outages, device restarts, and any resource fault.

Note

Logical resources like locks, especially spinlocks, also fall into the category of unreliable resources due to their trial-and-error acquisition nature.

Since unreliable resources can’t be avoided, programs must be designed to handle their unreliability. Specifically, programs need to address the following challenges:

  1. Resource access might take too long or never return.
  2. Resource access might result in an error.
  3. A resource might become overloaded when accessed by many programs simultaneously, leading to errors or unresponsiveness.

To make programs more robust against these issues, the following strategies are commonly used:

  1. Timeouts: Set a timeout for resource access to avoid waiting indefinitely. If the timeout elapses, assume an error.
    Timeout Illustrated
    Timeout Illustration
  2. Retry Policies: If a resource access fails, retry the access based on a predefined policy that dictates when and how to retry.
  3. Backoff and Jitter: To prevent overloading resources or the CPU during retries, use a backoff strategy to introduce a delay before each retry. Adding jitter (random variation) to the backoff helps prevent multiple programs from overloading a resource simultaneously.
    Retry with backoff illustrated
    Retry with Backoff Illustration

The table below summarizes common resource unreliability issues and the corresponding strategies to address them:

Resource UnreliabilityProgram Robustness MeasureDescription
Access Never ReturnsAccess TimeoutSet a timeout equal to the expected timeframe for access results. If no timeframe is specified, the timeout is infinite.
Access ErrorRetry PolicyRetry access until successful for autonomous programs, or stop after a set number of attempts for interactive programs. Selectively retry based on the type of error.
Resource OverloadingRetry with BackoffRetry access after a delay (backoff) to prevent overloading the resources, if resource policy allows.
Concurrent Access OverloadingJitter in BackoffAdd random perturbation to the backoff to avoid simultaneous retries. This helps prevent worsening access errors caused by resource overload.

How are Robustness Measures Applied?

Robustness measures are set up using specific parameter values.

Each resource accessed by a program has its own robustness measure, meaning different resources accessed by the same program can have different configurations. Likewise, you can apply different settings to multiple accesses of the same resource.

This flexibility allows you to adjust configurations based on how well the resource is performing.

The overall robustness measure for a program accessing a resource follows the workflow shown below:

Overall robustness measure workflow
Overall robustness measure workflow

First, the resource is accessed with a timeout, which can be either finite or infinite.

If the access is successful, the program continues as expected.

If the access times out or results in an error, the retry policy is checked.

If no retry is allowed, the program considers the access a failure and handles the error (e.g., reporting it to the user, aborting, or logging and ignoring it).

If a retry is allowed, the backoff time is calculated, and the program waits accordingly before trying to access the resource again.

Reliable Resource Access Strategies

Use of Timeout

To prevent a program from waiting indefinitely for a resource that may not respond, a response time frame, known as a timeout, is set. If the resource doesn’t respond within the timeout period, it’s treated as an error.

It’s recommended to always use a timeout when accessing resources that might take too long to respond.

How is the Timeout Value Decided?

When a program tries to access a resource, it has to wait for that resource to respond before it can move on to tasks that depend on the result.

During this wait, the program is essentially “on pause” for that particular function. But if you set the timeout too short, the program might give up too quickly, leading to unnecessary failures.

So, when setting a timeout, you need to find the right balance between two things: how long the resource usually takes to respond and how quickly the program needs to stay responsive.

These two factors often pull in opposite directions. A longer timeout lowers the chances of false alarms but makes the program slower to react if something goes wrong. On the flip side, a shorter timeout keeps the program snappy but might cause it to give up too soon.

If you’re using third-party libraries to access resources, watch out for their default timeout settings. Some libraries might set a ridiculously long timeout or even an infinite one by default.

As a Bearer blog post wisely points out, “Libraries can’t be trusted.”

That’s why it’s important to set your timeout based on a careful look at how your system actually works.

For service connections and requests, a Zalando engineering blog post from July 26, 2023, offers some solid advice on setting timeouts.

You can setup a connection timeout which is some multiple of your expected RTT. Connection timeout = RTT * 3 is commonly used as a conservative approach, but you can adjust it based on your specific needs.
In general, the connection timeout for a microservice should be set low enough so that it can quickly detect an unreachable service, but high enough to allow the service to start up or recover from a short-lived problem

Zalando engineers

They recommend setting the connection timeout based on the round-trip time (RTT) and keeping it as short as possible. You can figure out the RTT using the ping network utility tool.

Set request timeout based on collected metrics and SLA.

Zalando engineers

They suggest setting the request timeout based on the program’s service-level agreement (SLA), the resource’s SLA, and any latency metrics you’ve collected.

In the context of services, Kevin Ahlquist shared how they decided on timeout settings at Bluestem in a blog post from July 21, 2016.

1. Load test to find the service’s behavior under maximum steady-state load (i.e. how much load can this service sustain indefinitely before it tips?).
2. Gather empirical data on throughput, latency, and error rate from those load tests.
3. Set baseline timeouts (and alerts) based on this data.
4. Refine timeouts based on continued data-gathering to bring error rates to acceptable levels.

Kevin Ahlquist

The key takeaway is to set timeout values based on solid benchmark data. For more details on how they used this data to determine timeouts, check out the full blog post.

Do We Retry?

When a resource access results in an error — whether due to a timeout, unavailability, or an internal issue — one way to handle it is by retrying. But before retrying, it’s important to decide if a retry is worth it. This decision is guided by a retry policy.

If the policy calls for a retry, a backoff period is calculated and used. The backoff value is determined based on backoff strategies and jitter strategies.

Two key elements control how retries are handled: retry policies and backoff delay.

Retry Policies

Retry policies decide whether a retry should happen after an access error. In simple terms, they define when to stop retrying. A retry policy specifies how many times (max-retries) or for how long (max-delay) a resource access should be retried.

A retry policy can set different max-retries or max-delay based on the type of error that occurred. For example, the policy might allow 10 retries for a timeout but no retries for other errors.

  • max-retries is an integer value such that:

    • 0 means no retry.
    • A negative value means infinite retries.
    • A positive value represents the number of retries before giving up.
  • max-delay is a real number set similarly to max-retries.

How Are the max-retries/max-delay Values Decided?

Before deciding on retries, consider the following:

  • Errors must be transient: Don’t retry if the error is permanent, as retrying won’t fix it.
  • Accesses must be idempotent: The retry should not cause any issues if repeated. The resource should act as if the failed attempts never happened.
  • Consider time-sensitive accesses: There’s no point in retrying if the result would be useless by the time it’s received. For instance, if real-time data is needed within 2 seconds, retrying after that would be pointless.

    This is especially true for interactive programs where quick responses are key to a good user experience, as defined by the service-level agreement (SLA).

    For time-sensitive accesses, you can set the retry cut-off as max-delay, representing the total time elapsed since the first attempt.

Follow these general guidelines for setting retry cut-off:

  1. If the error is permanent or the resource access isn’t idempotent, don’t retry.
  2. If the program can wait indefinitely (like some microservices), retry infinitely.
  3. Otherwise, set max-retries/max-delay to a positive number based on the SLA (useful for more interactive programs).

Note

There are more advanced retry policies, such as:

Marc Brooker’s 2022 blog post evaluates various retry policies.

Setting the right policy is crucial, especially when it comes to service access.

That doesn’t mean that backing off between retries is a bad idea. It’s a good idea, but only helps for short term overload (spikes, etc). It does not reduce the total work in the system, or total amplification factor of retries.

Marc Brooker

Backoff Strategies

Backoff strategies are algorithms used to calculate the time to wait before retrying a failed operation. Let’s break down some common backoff strategies.

First, here are some key terms to know:

  • Base Backoff Value \(v_{base}\) : This is the smallest possible backoff time. It’s the starting point for calculating the backoff. Its value is non-negative.
  • Maximum Backoff Value \(v_{max}\) : This is the upper limit for the backoff time. The backoff can’t exceed this value.
  • Retries Count \(x\) : The number of consecutive retries attempted so far. It starts at 0 and increases with each retry, resetting to 0 after a successful resource access.
  • Backoff Time \(v_{x}\) : This is the backoff time after \(x\) retries. It’s calculated based on the specific strategy you’re using.

Also, we represent the minimum of two numbers \(a\) and \(b\) by \(min(a, b)\) .

Note

In backoff strategies, the values of \(v_{base}\) , \(v_{max}\) and \(x\) are the inputs used by the algorithms. The result of these algorithms is \(v_x\) , which is the calculated backoff time.

Essentially, \(v_x\) is the amount of time you’ll wait before retrying after \(x\) retries, as determined by the strategy you’re using.

Now, let’s explore the different backoff strategies.

All backoff strategies
Backoff strategies with \(v_{base}=1, v_{max}=60, m=2, L=2, p1=2\)
Exponential Backoff

Exponential backoff is one of the most common strategies in distributed systems. Here, the backoff time increases exponentially with each retry.

Given a multiplier \(m\) (where \(m \ge 1\) ):

\( v_x = min(v_{max}, v_{base} * m^{x}) \)

Be cautious with exponential backoff, as it can quickly slow down your program due to the rapidly increasing wait times.

The most common value for \(m\) is 2.

Quick Note: Some implementations simplify this by always setting the multiplier to the base value, making the formula: \(v_x = min(v_{max}, (v_{base})^{x+1})\) .

Linear or Interval Backoff

With linear backoff, the wait time increases by a fixed interval \(L\) after each retry.

Given an interval \(L\) (where \(L \ge 0\) ):

\( v_x = min(v_{max}, v_{base} + L * x) \)

Quick Note: Some implementations always set the interval to the base value, making the formula: \(v_x = min(v_{max}, v_{base} * (x + 1))\) .

Fibonaccial Backoff

This strategy uses the Fibonacci sequence to calculate the backoff time, adding more variability than linear backoff.

Given the Fibonacci number \(Fib(i)\) such that \(Fib(0)\) is the first Fibonacci number:

\( v_x = min(v_{max}, v_{base} + Fib(x)) \)

Polynomial Backoff

Polynomial backoff uses a polynomial function to calculate the backoff time, which provides a more customizable increase in wait times.

Given \(n\) polynomial exponents \(p_{1},p_{2},...,p_{n}\) (where each \(p > 1\) ):

\( v_x = min(v_{max}, v_{base} + x^{p_{1}} + x^{p_{2}} + ... + x^{p_{n}}) \)

Random Backoff

Random backoff adds an element of unpredictability by selecting a random value between \(v_{base}\) and \(v_{max}\) for each retry.

\( v_x = random(v_{base}, v_{max}) \)

Constant Backoff

Here, the backoff time remains constant, always equal to the initial backoff value.

\( v_x = v_{base} \)

Quick Note: Constant backoff can be implemented using exponential backoff with \(m=1\) , linear backoff with \(L=0\) , or by setting \(v_{max} = v_{base}\) in any other strategy.

Quick Note: Some implementations include a No Backoff strategy, which is just a special case of constant backoff where \( v_{base} = 0 \) .

Use of Jitter

When multiple programs try to access the same resource simultaneously, it can lead to errors due to concurrent access.

If these programs use the same timeout and backoff retry policies, they might retry at the exact same time, causing repeated access errors.

This is especially common with servers accessed by many identical clients, like web front-ends.

When many clients send requests at once, the server can become overloaded, resulting in errors. Since these clients follow the same backoff strategy, they’ll all retry simultaneously, leading to further issues.

To prevent this, adding randomness to the backoff time — known as Jitter — is a common solution.

Jitter introduces random variations to the backoff time \( v_x \) , calculated by backoff strategies, and/or to the retry count \( x \) .

There are several algorithms to add jitter to the backoff time, resulting in a jittered_backoff.

Below are some commonly used jitter algorithms. Each takes the backoff \(v_{x}\) calculated using backoff strategies and returns a jittered_backoff. They may also modify the value of \(x\) .

All jitter algorithms
Jitter algorithms illustrated for Exponential backoff, with \(v_{base}=1, v_{max}=60, m=2\)

The # retries attempted (x) marked in red are points where the jitter had an impact, meaning randomness was added to the final backoff calculation.

Illustration of Jitter Usefulness
Illustration of Jitter Usefulness: Exponential backoff with Full Jitter
Full Jitter

This algorithm randomly chooses either the calculated backoff time or no backoff at all. The jittered_backoff is calculated as:

\(jittered\_backoff = random\_choice(0, backoff)\)

Where \(random\_choice(a, b)\) is a function that randomly returns on of its arguments.

Equal Jitter

This algorithm applies full jitter to half of the calculated backoff:

\(jittered\_backoff = (backoff / 2) + random\_choice(0, backoff / 2)\)

Decorrelated Jitter

This algorithm resets the retry count \(x\) (used as a parameter in backoff strategies) with a probability of 0.5.

The idea is to make the backoff algorithm “forget” past retries, then use the newly calculated backoff value.

\(\text{(1) set } x = random\_choice(0, x)\text{ then }\\\text{(2) }\text{Calculate }backoff\text{ then}\\\text{(3) set } jittered\_backoff = backoff\)

No Jitter

No jitter is applied. The jittered_backoff is simply the calculated backoff:

\(jittered\_backoff = backoff\)

Scale the Jittered Backoff

To scale the jittered backoff by a factor, let \(f\) be a real number where \(f > 0\) . The final backoff will be:

\(final\_backoff = jittered\_backoff * f\)

This factor is useful for increasing or decreasing the computed backoff values without changing other parameters of the backoff calculation.

Implementation

Robust resource access strategies are usually implemented as libraries that wrap resource accesses to handle timeouts and retries. In this section, we’ll discuss how to design such libraries.

Core Design

At the core of the design, each resource access should be implemented within a function. The program then uses the robustness library to call this function.

The library provides a top-level function, which we’ll call access_wrapper, that wraps the resource access.

The access_wrapper function takes the resource access function as a parameter, referred to here as resource_access.

You can find a sample implementation in the Implementation Example section.

Core of Robustness

The library should also allow configuration of timeout and retry mechanisms.

Timeout

Timeout should be configurable, where:

  • A negative value indicates an infinite timeout.
  • A non-negative value sets the actual timeout duration.

Retry

The retry mechanism involves several key configurations:

  • Retry-policy configuration: Defines the policies to determine when to retry.
  • Backoff strategy configuration: Specifies the backoff strategy to apply.
  • Jitter strategy configuration: Defines the jitter strategy to add randomness to the backoff.
  • Backoff scaling factor: Determines the scaling factor to adjust the backoff.
Retry Policies Configuration

Each retry policy includes configuration parameters to initialize its state. Important state variables might include max-retry, max-delay, the current number of retries, and the timestamp of the first try.

The policy defines a retry_allowed function to decide whether to retry. This function returns a boolean and updates the retry policy state based on parameters like the current state and the observed error.

It also defines a new_access function, which resets the retry count and the first try timestamp.

Different retry policies should be configurable for specific error types.

Backoff Strategy Configuration

Each backoff strategy has configuration parameters to calculate the backoff. These typically include base backoff, maximum backoff, and strategy-specific parameters like multiplier for exponential backoff.

A calculate_backoff function computes the raw backoff value based on the number of retries and the configuration parameters.

The library should let you choose different backoff strategies for different error types, so you don’t have to use the same one for all errors.

Jitter strategy configuration

Each jitter strategy includes an apply_jitter function that applies jitter to the raw backoff.

This function takes a reference to the number of retries (which it may modify) and the calculate_backoff function. It returns the jittered_backoff.

The library should let you choose different jitter strategies for different error types.

Scaling the Backoff

The jittered_backoff is scaled by multiplying it by a configured positive factor. A scale_backoff function performs this scaling, taking the jittered_backoff and the scaling factor, and returning the scaled final_backoff.

The library should let you choose different scaling factors for different error types.

Additional Features

Sometimes, the program needs to be aware of retries and perform specific actions upon access errors or successes.

For this, the library should allow the access_wrapper function to accept two additional parameters—callbacks that are triggered on success or error:

  • on_success: This callback is called when the resource access succeeds. It takes no parameters and returns nothing.
  • on_error: This callback is called when the resource access fails. It takes the observed error and possibly the number of retries as parameters, and it returns nothing.

The Python Backoff-Utils library implements these features.

Implementation Example

Here’s an example of how to implement timeout and retry with backoff in a sample library called robustlib.

This library uses exponential backoff with full jitter. The retry policy is set to only retry when encountering Python’s ValueError or TimeoutError.

  1import time
  2import random
  3
  4# Library for the Robust Resource Access Library
  5
  6class RetryPolicy:
  7    def __init__(self, max_retry, max_delay):
  8        self.max_retry = max_retry
  9        self.max_delay = max_delay
 10        self.current_retries = 0
 11        self.first_try_timestamp = time.time()
 12    
 13    def retry_allowed(self, error):
 14        """Determine if a retry is allowed based on the current state and error."""
 15        if not isinstance(error, (ValueError, TimeoutError)):
 16            # Retry only for ValueError
 17            return False  
 18
 19        if self.current_retries >= self.max_retry:
 20            # Do not retry if max retries have been reached
 21            return False  
 22
 23        if time.time() - self.first_try_timestamp > self.max_delay:
 24            # Do not retry if max delay have been reached
 25            return False  
 26
 27        return True
 28    
 29    def new_access(self):
 30        """Reset the retry state for a new resource access attempt."""
 31        self.current_retries = 0
 32        self.first_try_timestamp = time.time()
 33
 34    @property
 35    def retry_count(self):
 36        return self.current_retries
 37
 38    @retry_count.setter
 39    def retry_count(self, new_value):
 40        assert new_value >= 0, "Invalid new_value"
 41        self.current_retries = new_value
 42
 43class BackoffStrategy:
 44    def __init__(self, *, base_backoff, max_backoff, multiplier=2):
 45        self.base_backoff = base_backoff
 46        self.max_backoff = max_backoff
 47        self.multiplier = multiplier
 48    
 49    def calculate_backoff(self, retry_count):
 50        """Calculate the backoff delay based on the retry count."""
 51        backoff = min(
 52            self.base_backoff * (self.multiplier ** retry_count), 
 53            self.max_backoff
 54        )
 55        return backoff
 56
 57class JitterStrategy:
 58    def apply_jitter(self, calculate_backoff_func, retry_count_ref):
 59        """Apply full jitter to the backoff calculated by the strategy."""
 60        base_backoff = calculate_backoff_func(retry_count_ref.retry_count)
 61        jittered_backoff = random.uniform(0, base_backoff)
 62        return jittered_backoff
 63
 64class AccessWrapper:
 65    def __init__(self, *, retry_policy, backoff_strategy, jitter_strategy, scaling_factor=1.0, timeout=-1, 
 66                 on_success=None, on_error=None):
 67        self.timeout = timeout
 68        self.retry_policy = retry_policy
 69        self.backoff_strategy = backoff_strategy
 70        self.jitter_strategy = jitter_strategy
 71        self.scaling_factor = scaling_factor
 72        self.on_success = on_success
 73        self.on_error = on_error
 74    
 75    def wrap(self, resource_access):
 76        """Wrap the resource access function with robustness strategies."""
 77        def wrapper(*args, **kwargs):
 78            retry_policy_state = self.retry_policy
 79            retry_policy_state.new_access()
 80
 81            while True:
 82                try:
 83                    # Attempt the resource access
 84                    result = resource_access(self.timeout, *args, **kwargs)
 85
 86                    # Call on_success callback if provided
 87                    if self.on_success:
 88                        self.on_success()
 89
 90                    return result
 91                except Exception as error:
 92                    # Call on_error callback if provided
 93                    if self.on_error:
 94                        self.on_error(error, retry_policy_state.retry_count)
 95                    
 96                    # Check if retry is allowed
 97                    if not retry_policy_state.retry_allowed(error):
 98                        raise
 99
100                    # Calculate backoff with jitter and scaling
101                    calculate_backoff = self.backoff_strategy.calculate_backoff
102                    jittered_backoff = self.jitter_strategy.apply_jitter(calculate_backoff, retry_policy_state)
103                    final_backoff = jittered_backoff * self.scaling_factor
104
105                    # Wait before retrying
106                    time.sleep(final_backoff)
107                    
108                    # Increment retry count
109                    retry_policy_state.retry_count += 1
110        
111        return wrapper

Here’s an example of how to use the robustlib library:

 1import random
 2import from datetime import datetime
 3import robustlib
 4
 5# Configure the access wrapper
 6retry_policy = robustlib.RetryPolicy(max_retry=5, max_delay=60)
 7backoff_strategy = robustlib.BackoffStrategy(base_backoff=1, max_backoff=32, multiplier=2)
 8jitter_strategy = robustlib.JitterStrategy()
 9
10access_wrapper = robustlib.AccessWrapper(
11    retry_policy=retry_policy,
12    backoff_strategy=backoff_strategy,
13    jitter_strategy=jitter_strategy,
14    scaling_factor=1.0,
15    on_success=lambda: print(f"[{datetime.now().time()}] on_success called"),
16    on_error=lambda e, c: print(f"[{datetime.now().time()}] on_error: attempt {c} failed with: {e}")
17)
18
19# Define a resource access function
20def example_resource_access(timeout):
21    # Simulate resource access that may timeout
22    if random.random() < 0.3:
23        raise TimeoutError("Simulated timeout error")
24
25    # Simulate resource access that may fail
26    if random.random() < 0.3:
27        raise ValueError("Simulated resource access error")
28
29    return "Resource accessed successfully"
30
31# Wrap the resource access function
32wrapped_resource_access = access_wrapper.wrap(example_resource_access)
33
34# Attempt to access the resource
35try:
36    result = wrapped_resource_access()
37    print(result)
38except Exception as e:
39    print(f"Failed to access resource: {e}")

Sample output:

[05:34:16.749457] on_error: attempt 0 failed with: Simulated timeout error
[05:34:16.895572] on_error: attempt 1 failed with: Simulated timeout error
[05:34:17.063954] on_error: attempt 2 failed with: Simulated resource access error
[05:34:21.044946] on_error: attempt 3 failed with: Simulated timeout error
[05:34:27.702886] on_error: attempt 4 failed with: Simulated resource access error
[05:34:41.832590] on_success called
Resource accessed successfully

Conclusion

A good approach to retries combines backoff, jitter, and a good retry policy.

Marc Brooker

In this guide, we’ve explored strategies to enhance program robustness during resource access, emphasizing the importance of combining timeout, backoff, jitter, and retry policies.

These elements, when properly configured, offer a flexible and adaptable approach to managing resource access, especially in distributed systems.

Other Applications

These strategies aren’t limited to service access — they’re also crucial in areas like network protocols, database access, file locks, and hardware interactions. Here are a few examples:

  • Network Protocols: Exponential backoff with full jitter helps prevent network congestion during retries.
  • Database Access: A fixed retry policy (non-infinite retries) with gradual (increasing) backoff can minimize the impact on locking and transaction states.
  • File Locks: Immediate retries with minimal backoff are useful when locks are expected to be released quickly.

In addition to the common retry strategies found in libraries like Python’s Backoff-Utils, Rust’s Backoff, and Rust’s Tokio-Retry, it’s also possible to create custom strategies tailored to specific needs.

A flexible library should offer an interface that allows developers to implement these custom strategies easily.

Further Reading

By understanding and correctly applying these strategies, you can significantly enhance the resilience and reliability of your systems across various applications.

Related Posts

Quality Attributes of Computer Programs: Implement Software Robustness

Quality Attributes of Computer Programs: Implement Software Robustness

Robustness is a crucial software quality attribute that measures a software’s ability to function correctly under adverse conditions.

Read More
Illustrative Explanation of Fault, Error, Failure, bug, and Defect in Software

Illustrative Explanation of Fault, Error, Failure, bug, and Defect in Software

Software do not always behave as expected. Mistakes in the implementation or in the requirements specification cause issues in software.

Read More