Software Robustness and Timeout Retry Backoff Paradigms
- September 2, 2024
- 20 min read
- Software quality
Introduction
Computer programs often need to access external resources, like I/O (Input/Output) devices (network interfaces, storage disks) managed by the operating system, and remote services over the Internet.
While these resources enable interaction with the outside world, they are often unreliable due to issues such as network congestion, service outages, device restarts, and other resource faults.
Note
Since unreliable resources can’t be avoided, programs must be designed to handle their unreliability. Specifically, programs need to address the following challenges:
- Resource access might take too long or never return.
- Resource access might result in an error.
- A resource might become overloaded when accessed by many programs simultaneously, leading to errors or unresponsiveness.
To make programs more robust against these issues, the following strategies are commonly used:
- Timeouts: Set a timeout for resource access to avoid waiting indefinitely. If the timeout elapses, assume an error.
- Retry Policies: If a resource access fails, retry the access based on a predefined policy that dictates when and how to retry.
- Backoff and Jitter: To prevent overloading resources or the CPU during retries, use a backoff strategy to introduce a delay before each retry. Adding jitter (random variation) to the backoff helps prevent multiple programs from overloading a resource simultaneously.
The table below summarizes common resource unreliability issues and the corresponding strategies to address them:
Resource Unreliability | Program Robustness Measure | Description |
---|---|---|
Access Never Returns | Access Timeout | Set a timeout equal to the expected timeframe for access results. If no timeframe is specified, the timeout is infinite. |
Access Error | Retry Policy | Retry access until successful for autonomous programs, or stop after a set number of attempts for interactive programs. Selectively retry based on the type of error. |
Resource Overloading | Retry with Backoff | Retry access after a delay (backoff) to prevent overloading the resources, if resource policy allows. |
Concurrent Access Overloading | Jitter in Backoff | Add random perturbation to the backoff to avoid simultaneous retries. This helps prevent worsening access errors caused by resource overload. |
How are Robustness Measures Applied?
Robustness measures are set up using specific parameter values.
Each resource accessed by a program has its own robustness measure, meaning different resources accessed by the same program can have different configurations. Likewise, you can apply different settings to multiple accesses of the same resource.
This flexibility allows you to adjust configurations based on how well the resource is performing.
The overall robustness measure for a program accessing a resource follows the workflow shown below:
First, the resource is accessed with a timeout, which can be either finite or infinite.
If the access is successful, the program continues as expected.
If the access times out or results in an error, the retry policy is checked.
If no retry is allowed, the program considers the access a failure and handles the error (e.g., reporting it to the user, aborting, or logging and ignoring it).
If a retry is allowed, the backoff time is calculated, and the program waits accordingly before trying to access the resource again.
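The workflow above can be sketched in a few lines of Python (an illustration only; the function and parameter names are placeholders, not a particular library's API):

```python
import time

def access_with_robustness(resource_access, timeout, retry_allowed, compute_backoff):
    """Minimal sketch of the workflow above: access with a timeout,
    consult the retry policy on error, back off, and try again."""
    retries = 0
    while True:
        try:
            # Access the resource with a (finite or infinite) timeout.
            return resource_access(timeout=timeout)
        except Exception as error:
            # No retry allowed: treat the access as a failure.
            if not retry_allowed(error, retries):
                raise
            # Wait for the computed backoff before the next attempt.
            time.sleep(compute_backoff(retries))
            retries += 1
```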
Reliable Resource Access Strategies
Use of Timeout
To prevent a program from waiting indefinitely for a resource that may not respond, a response time frame, known as a timeout, is set. If the resource doesn’t respond within the timeout period, it’s treated as an error.
It’s recommended to always use a timeout when accessing resources that might take too long to respond.
How is the Timeout Value Decided?
When a program tries to access a resource, it has to wait for that resource to respond before it can move on to tasks that depend on the result.
During this wait, the program is essentially “on pause” for that particular function. But if you set the timeout too short, the program might give up too quickly, leading to unnecessary failures.
So, when setting a timeout, you need to find the right balance between two things: how long the resource usually takes to respond and how quickly the program needs to stay responsive.
These two factors often pull in opposite directions. A longer timeout lowers the chances of false alarms but makes the program slower to react if something goes wrong. On the flip side, a shorter timeout keeps the program snappy but might cause it to give up too soon.
If you’re using third-party libraries to access resources, watch out for their default timeout settings. Some libraries might set a ridiculously long timeout or even an infinite one by default.
As a Bearer blog post wisely points out, “Libraries can’t be trusted.”
That’s why it’s important to set your timeout based on a careful look at how your system actually works.
For service connections and requests, a Zalando engineering blog post from July 26, 2023, offers some solid advice on setting timeouts.
You can setup a connection timeout which is some multiple of your expected RTT. Connection timeout = RTT * 3 is commonly used as a conservative approach, but you can adjust it based on your specific needs.
In general, the connection timeout for a microservice should be set low enough so that it can quickly detect an unreachable service, but high enough to allow the service to start up or recover from a short-lived problem
They recommend setting the connection timeout based on the round-trip time (RTT) and keeping it as short as possible. You can figure out the RTT using the ping network utility tool.
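The RTT rule of thumb quoted above is simple to compute (a sketch; the multiplier of 3 comes from the quote, and measuring the RTT is left to a tool such as ping):

```python
def connection_timeout_from_rtt(rtt_seconds, multiplier=3):
    """Conservative connection timeout as a multiple of the measured RTT."""
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    return rtt_seconds * multiplier
```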
Set request timeout based on collected metrics and SLA.
They suggest setting the request timeout based on the program’s service-level agreement (SLA), the resource’s SLA, and any latency metrics you’ve collected.
In the context of services, Kevin Ahlquist shared how they decided on timeout settings at Bluestem in a blog post from July 21, 2016.
1. Load test to find the service’s behavior under maximum steady-state load (i.e. how much load can this service sustain indefinitely before it tips?).
2. Gather empirical data on throughput, latency, and error rate from those load tests.
3. Set baseline timeouts (and alerts) based on this data.
4. Refine timeouts based on continued data-gathering to bring error rates to acceptable levels.
The key takeaway is to set timeout values based on solid benchmark data. For more details on how they used this data to determine timeouts, check out the full blog post.
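One common way to turn latency data from such load tests into a baseline timeout is to take a high percentile of the observed latencies and add headroom. The sketch below illustrates that idea; the percentile and margin are illustrative choices, not the exact method from the post:

```python
import math

def baseline_timeout(latencies, percentile=0.99, headroom=1.5):
    """Baseline timeout: a high percentile of observed latencies,
    multiplied by a safety margin."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    index = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return ordered[index] * headroom
```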
Do We Retry?
When a resource access results in an error — whether due to a timeout, unavailability, or an internal issue — one way to handle it is by retrying. But before retrying, it’s important to decide if a retry is worth it. This decision is guided by a retry policy.
If the policy calls for a retry, a backoff period is calculated and used. The backoff value is determined based on backoff strategies and jitter strategies.
Two key elements control how retries are handled: retry policies and backoff delay.
Retry Policies
Retry policies decide whether a retry should happen after an access error. In simple terms, they define when to stop retrying. A retry policy specifies how many times (max-retries) or for how long (max-delay) a resource access should be retried.
A retry policy can set different max-retries or max-delay values based on the type of error that occurred. For example, the policy might allow 10 retries for a timeout but no retries for other errors.
- max-retries is an integer value such that:
  - 0 means no retry.
  - A negative value means infinite retries.
  - A positive value represents the number of retries before giving up.
- max-delay is a real number set similarly to max-retries.
How Are the max-retries/max-delay Values Decided?
Before deciding on retries, consider the following:
- Errors must be transient: Don’t retry if the error is permanent, as retrying won’t fix it.
- Accesses must be idempotent: The retry should not cause any issues if repeated. The resource should act as if the failed attempts never happened.
- Consider time-sensitive accesses: There’s no point in retrying if the result would be useless by the time it’s received. For instance, if real-time data is needed within 2 seconds, retrying after that would be pointless.
This is especially true for interactive programs where quick responses are key to a good user experience, as defined by the service-level agreement (SLA).
For time-sensitive accesses, you can set the retry cut-off as max-delay, representing the total time elapsed since the first attempt.
Follow these general guidelines for setting retry cut-off:
- If the error is permanent or the resource access isn’t idempotent, don’t retry.
- If the program can wait indefinitely (like some microservices), retry infinitely.
- Otherwise, set max-retries/max-delay to a positive number based on the SLA (useful for more interactive programs).
Note
There are more advanced retry policies, such as:
- Adaptive retries (a.k.a. retry token bucket), where successful access increases retry chances and failures decrease them.
- Circuit breaker, where retries only happen if the failure rate is below a certain threshold.
Marc Brooker’s 2022 blog post evaluates various retry policies.
Setting the right policy is crucial, especially when it comes to service access.
That doesn’t mean that backing off between retries is a bad idea. It’s a good idea, but only helps for short term overload (spikes, etc). It does not reduce the total work in the system, or total amplification factor of retries.
Backoff Strategies
Backoff strategies are algorithms used to calculate the time to wait before retrying a failed operation. Let’s break down some common backoff strategies.
First, here are some key terms to know:
- Base Backoff Value \(v_{base}\) : This is the smallest possible backoff time. It’s the starting point for calculating the backoff. Its value is non-negative.
- Maximum Backoff Value \(v_{max}\) : This is the upper limit for the backoff time. The backoff can’t exceed this value.
- Retries Count \(x\) : The number of consecutive retries attempted so far. It starts at 0 and increases with each retry, resetting to 0 after a successful resource access.
- Backoff Time \(v_{x}\) : This is the backoff time after \(x\) retries. It’s calculated based on the specific strategy you’re using.
Also, we represent the minimum of two numbers \(a\) and \(b\) by \(min(a, b)\) .
Note
In backoff strategies, the values of \(v_{base}\), \(v_{max}\) and \(x\) are the inputs used by the algorithms. The result of these algorithms is \(v_x\), which is the calculated backoff time. Essentially, \(v_x\) is the amount of time you’ll wait before retrying after \(x\) retries, as determined by the strategy you’re using.
Now, let’s explore the different backoff strategies.
Exponential Backoff
Exponential backoff is one of the most common strategies in distributed systems. Here, the backoff time increases exponentially with each retry.
Given a multiplier \(m\) (where \(m \ge 1\) ):
\( v_x = min(v_{max}, v_{base} * m^{x}) \)
Be cautious with exponential backoff, as it can quickly slow down your program due to the rapidly increasing wait times.
The most common value for \(m\) is 2.
Quick Note: Some implementations simplify this by always setting the multiplier to the base value, making the formula: \(v_x = min(v_{max}, (v_{base})^{x+1})\) .
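The formula translates directly into code (a sketch using the symbols defined above):

```python
def exponential_backoff(x, v_base, v_max, m=2):
    """v_x = min(v_max, v_base * m**x)"""
    return min(v_max, v_base * m ** x)
```

With \(v_{base} = 1\) and \(v_{max} = 32\), successive retries wait 1, 2, 4, 8, 16, 32, 32, ... seconds.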
Linear or Interval Backoff
With linear backoff, the wait time increases by a fixed interval \(L\) after each retry.
Given an interval \(L\) (where \(L \ge 0\) ):
\( v_x = min(v_{max}, v_{base} + L * x) \)
Quick Note: Some implementations always set the interval to the base value, making the formula: \(v_x = min(v_{max}, v_{base} * (x + 1))\) .
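Likewise, a minimal sketch of linear backoff:

```python
def linear_backoff(x, v_base, v_max, interval):
    """v_x = min(v_max, v_base + L * x), with interval playing the role of L."""
    return min(v_max, v_base + interval * x)
```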
Fibonacci Backoff
This strategy uses the Fibonacci sequence to calculate the backoff time, adding more variability than linear backoff.
Given the Fibonacci number \(Fib(i)\) such that \(Fib(0)\) is the first Fibonacci number:
\( v_x = min(v_{max}, v_{base} + Fib(x)) \)
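A sketch of this strategy, computing the Fibonacci number iteratively. The definition above only says \(Fib(0)\) is the first Fibonacci number; here we assume the sequence starts 1, 1, 2, 3, 5, ... (conventions vary, so treat the indexing as an assumption):

```python
def fibonacci_backoff(x, v_base, v_max):
    """v_x = min(v_max, v_base + Fib(x)), assuming Fib(0) = Fib(1) = 1."""
    a, b = 1, 1
    for _ in range(x):
        a, b = b, a + b
    return min(v_max, v_base + a)
```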
Polynomial Backoff
Polynomial backoff uses a polynomial function to calculate the backoff time, which provides a more customizable increase in wait times.
Given \(n\) polynomial exponents \(p_{1},p_{2},...,p_{n}\) (where each \(p > 1\) ):
\( v_x = min(v_{max}, v_{base} + x^{p_{1}} + x^{p_{2}} + ... + x^{p_{n}}) \)
Random Backoff
Random backoff adds an element of unpredictability by selecting a random value between \(v_{base}\) and \(v_{max}\) for each retry.
\( v_x = random(v_{base}, v_{max}) \)
Constant Backoff
Here, the backoff time remains constant, always equal to the initial backoff value.
\( v_x = v_{base} \)
Quick Note: Constant backoff can be implemented using exponential backoff with \(m=1\) , linear backoff with \(L=0\) , or by setting \(v_{max} = v_{base}\) in any other strategy.
Quick Note: Some implementations include a No Backoff strategy, which is just a special case of constant backoff where \( v_{base} = 0 \) .
Use of Jitter
When multiple programs try to access the same resource simultaneously, it can lead to errors due to concurrent access.
If these programs use the same timeout and backoff retry policies, they might retry at the exact same time, causing repeated access errors.
This is especially common with servers accessed by many identical clients, like web front-ends.
When many clients send requests at once, the server can become overloaded, resulting in errors. Since these clients follow the same backoff strategy, they’ll all retry simultaneously, leading to further issues.
To prevent this, adding randomness to the backoff time — known as Jitter — is a common solution.
Jitter introduces random variations to the backoff time \( v_x \) , calculated by backoff strategies, and/or to the retry count \( x \) .
There are several algorithms to add jitter to the backoff time, resulting in a jittered_backoff.
Below are some commonly used jitter algorithms. Each takes the backoff \(v_{x}\) calculated using backoff strategies and returns a jittered_backoff. They may also modify the value of \(x\), the number of retries attempted, which feeds back into subsequent backoff calculations.
Full Jitter
This algorithm picks a uniformly random backoff between zero and the calculated backoff time. The jittered_backoff is calculated as:
\(jittered\_backoff = random\_between(0, backoff)\)
Where \(random\_between(a, b)\) is a function that returns a uniformly random value between \(a\) and \(b\).
Equal Jitter
This algorithm keeps half of the calculated backoff and applies full jitter to the remaining half:
\(jittered\_backoff = (backoff / 2) + random\_between(0, backoff / 2)\)
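In many implementations, including the robustlib sample later in this post, the random component is drawn uniformly from the whole interval with random.uniform. A sketch of both algorithms on that basis, where backoff is the value produced by a backoff strategy:

```python
import random

def full_jitter(backoff):
    """Uniformly random wait in [0, backoff]."""
    return random.uniform(0, backoff)

def equal_jitter(backoff):
    """Keep half the backoff, apply full jitter to the other half."""
    return backoff / 2 + random.uniform(0, backoff / 2)
```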
Decorrelated Jitter
This algorithm resets the retry count \(x\) (used as a parameter in backoff strategies) with a probability of 0.5.
The idea is to make the backoff algorithm “forget” past retries, then use the newly calculated backoff value.
\(\text{(1) set } x = 0 \text{ with probability } 0.5\text{, then }\\\text{(2) }\text{calculate }backoff\text{, then}\\\text{(3) set } jittered\_backoff = backoff\)
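A sketch of the version described above. Note that the name “decorrelated jitter” is also used elsewhere (for example, in Marc Brooker's post) for a different formula, so treat this as an illustration of this description rather than the canonical algorithm:

```python
import random

def decorrelated_jitter(x, calculate_backoff):
    """With probability 0.5, reset the retry count to zero, then use
    the freshly calculated backoff as the jittered backoff."""
    if random.random() < 0.5:
        x = 0
    return x, calculate_backoff(x)
```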
No Jitter
No jitter is applied.
The jittered_backoff is simply the calculated backoff:
\(jittered\_backoff = backoff\)
Scale the Jittered Backoff
To scale the jittered backoff by a factor, let \(f\) be a real number where \(f > 0\) . The final backoff will be:
\(final\_backoff = jittered\_backoff * f\)
This factor is useful for increasing or decreasing the computed backoff values without changing other parameters of the backoff calculation.
Implementation
Robust resource access strategies are usually implemented as libraries that wrap resource accesses to handle timeouts and retries. In this section, we’ll discuss how to design such libraries.
Core Design
At the core of the design, each resource access should be implemented within a function. The program then uses the robustness library to call this function.
The library provides a top-level function, which we’ll call access_wrapper, that wraps the resource access. The access_wrapper function takes the resource access function as a parameter, referred to here as resource_access.
You can find a sample implementation in the Implementation Example section.
Core of Robustness
The library should also allow configuration of timeout and retry mechanisms.
Timeout
Timeout should be configurable, where:
- A negative value indicates an infinite timeout.
- A non-negative value sets the actual timeout duration.
Retry
The retry mechanism involves several key configurations:
- Retry-policy configuration: Defines the policies to determine when to retry.
- Backoff strategy configuration: Specifies the backoff strategy to apply.
- Jitter strategy configuration: Defines the jitter strategy to add randomness to the backoff.
- Backoff scaling factor: Determines the scaling factor to adjust the backoff.
Retry Policies Configuration
Each retry policy includes configuration parameters to initialize its state. Important state variables might include max-retry, max-delay, the current number of retries, and the timestamp of the first try.
The policy defines a retry_allowed function to decide whether to retry. This function returns a boolean and updates the retry policy state based on parameters like the current state and the observed error.
It also defines a new_access function, which resets the retry count and the first-try timestamp.
Different retry policies should be configurable for specific error types.
Backoff Strategy Configuration
Each backoff strategy has configuration parameters to calculate the backoff. These typically include base backoff, maximum backoff, and strategy-specific parameters like multiplier for exponential backoff.
A calculate_backoff function computes the raw backoff value based on the number of retries and the configuration parameters.
The library should let you choose different backoff strategies for different error types, so you don’t have to use the same one for all errors.
Jitter Strategy Configuration
Each jitter strategy includes an apply_jitter function that applies jitter to the raw backoff.
This function takes a reference to the number of retries (which it may modify) and the calculate_backoff function. It returns the jittered_backoff.
The library should let you choose different jitter strategies for different error types.
Scaling the Backoff
The jittered_backoff is scaled by multiplying it by a configured positive factor. A scale_backoff function performs this scaling, taking the jittered_backoff and the scaling factor, and returning the scaled final_backoff.
The library should let you choose different scaling factors for different error types.
Additional Features
Sometimes, the program needs to be aware of retries and perform specific actions upon access errors or successes.
For this, the library should allow the access_wrapper function to accept two additional parameters, callbacks that are triggered on success or error:
- on_success: Called when the resource access succeeds. It takes no parameters and returns nothing.
- on_error: Called when the resource access fails. It takes the observed error and possibly the number of retries as parameters, and returns nothing.
The Python Backoff-Utils library implements these features.
Implementation Example
Here’s an example of how to implement timeout and retry with backoff in a sample library called robustlib.
This library uses exponential backoff with full jitter. The retry policy is set to only retry when encountering Python’s ValueError or TimeoutError.
import time
import random

# robustlib: a sample robust resource access library

class RetryPolicy:
    def __init__(self, max_retry, max_delay):
        self.max_retry = max_retry
        self.max_delay = max_delay
        self.current_retries = 0
        self.first_try_timestamp = time.time()

    def retry_allowed(self, error):
        """Determine if a retry is allowed based on the current state and error."""
        if not isinstance(error, (ValueError, TimeoutError)):
            # Retry only for ValueError and TimeoutError
            return False

        if self.current_retries >= self.max_retry:
            # Do not retry if max retries have been reached
            return False

        if time.time() - self.first_try_timestamp > self.max_delay:
            # Do not retry if the max delay has been reached
            return False

        return True

    def new_access(self):
        """Reset the retry state for a new resource access attempt."""
        self.current_retries = 0
        self.first_try_timestamp = time.time()

    @property
    def retry_count(self):
        return self.current_retries

    @retry_count.setter
    def retry_count(self, new_value):
        assert new_value >= 0, "Invalid new_value"
        self.current_retries = new_value

class BackoffStrategy:
    def __init__(self, *, base_backoff, max_backoff, multiplier=2):
        self.base_backoff = base_backoff
        self.max_backoff = max_backoff
        self.multiplier = multiplier

    def calculate_backoff(self, retry_count):
        """Calculate the backoff delay based on the retry count."""
        backoff = min(
            self.base_backoff * (self.multiplier ** retry_count),
            self.max_backoff
        )
        return backoff

class JitterStrategy:
    def apply_jitter(self, calculate_backoff_func, retry_count_ref):
        """Apply full jitter to the backoff calculated by the strategy."""
        base_backoff = calculate_backoff_func(retry_count_ref.retry_count)
        jittered_backoff = random.uniform(0, base_backoff)
        return jittered_backoff

class AccessWrapper:
    def __init__(self, *, retry_policy, backoff_strategy, jitter_strategy, scaling_factor=1.0, timeout=-1,
                 on_success=None, on_error=None):
        self.timeout = timeout
        self.retry_policy = retry_policy
        self.backoff_strategy = backoff_strategy
        self.jitter_strategy = jitter_strategy
        self.scaling_factor = scaling_factor
        self.on_success = on_success
        self.on_error = on_error

    def wrap(self, resource_access):
        """Wrap the resource access function with robustness strategies."""
        def wrapper(*args, **kwargs):
            retry_policy_state = self.retry_policy
            retry_policy_state.new_access()

            while True:
                try:
                    # Attempt the resource access
                    result = resource_access(self.timeout, *args, **kwargs)

                    # Call on_success callback if provided
                    if self.on_success:
                        self.on_success()

                    return result
                except Exception as error:
                    # Call on_error callback if provided
                    if self.on_error:
                        self.on_error(error, retry_policy_state.retry_count)

                    # Check if retry is allowed
                    if not retry_policy_state.retry_allowed(error):
                        raise

                    # Calculate backoff with jitter and scaling
                    calculate_backoff = self.backoff_strategy.calculate_backoff
                    jittered_backoff = self.jitter_strategy.apply_jitter(calculate_backoff, retry_policy_state)
                    final_backoff = jittered_backoff * self.scaling_factor

                    # Wait before retrying
                    time.sleep(final_backoff)

                    # Increment retry count
                    retry_policy_state.retry_count += 1

        return wrapper
Here’s an example of how to use the robustlib library:
import random
from datetime import datetime

import robustlib

# Configure the access wrapper
retry_policy = robustlib.RetryPolicy(max_retry=5, max_delay=60)
backoff_strategy = robustlib.BackoffStrategy(base_backoff=1, max_backoff=32, multiplier=2)
jitter_strategy = robustlib.JitterStrategy()

access_wrapper = robustlib.AccessWrapper(
    retry_policy=retry_policy,
    backoff_strategy=backoff_strategy,
    jitter_strategy=jitter_strategy,
    scaling_factor=1.0,
    on_success=lambda: print(f"[{datetime.now().time()}] on_success called"),
    on_error=lambda e, c: print(f"[{datetime.now().time()}] on_error: attempt {c} failed with: {e}")
)

# Define a resource access function
def example_resource_access(timeout):
    # Simulate resource access that may time out
    if random.random() < 0.3:
        raise TimeoutError("Simulated timeout error")

    # Simulate resource access that may fail
    if random.random() < 0.3:
        raise ValueError("Simulated resource access error")

    return "Resource accessed successfully"

# Wrap the resource access function
wrapped_resource_access = access_wrapper.wrap(example_resource_access)

# Attempt to access the resource
try:
    result = wrapped_resource_access()
    print(result)
except Exception as e:
    print(f"Failed to access resource: {e}")
Sample output:
[05:34:16.749457] on_error: attempt 0 failed with: Simulated timeout error
[05:34:16.895572] on_error: attempt 1 failed with: Simulated timeout error
[05:34:17.063954] on_error: attempt 2 failed with: Simulated resource access error
[05:34:21.044946] on_error: attempt 3 failed with: Simulated timeout error
[05:34:27.702886] on_error: attempt 4 failed with: Simulated resource access error
[05:34:41.832590] on_success called
Resource accessed successfully
Conclusion
A good approach to retries combines backoff, jitter, and a good retry policy.
In this guide, we’ve explored strategies to enhance program robustness during resource access, emphasizing the importance of combining timeout, backoff, jitter, and retry policies.
These elements, when properly configured, offer a flexible and adaptable approach to managing resource access, especially in distributed systems.
​
Other Applications
These strategies aren’t limited to service access — they’re also crucial in areas like network protocols, database access, file locks, and hardware interactions. Here are a few examples:
- Network Protocols: Exponential backoff with full jitter helps prevent network congestion during retries.
- Database Access: A fixed retry policy (non-infinite retries) with gradual (increasing) backoff can minimize the impact on locking and transaction states.
- File Locks: Immediate retries with minimal backoff are useful when locks are expected to be released quickly.
In addition to the common retry strategies found in libraries like Python’s Backoff-Utils, Rust’s Backoff, and Rust’s Tokio-Retry, it’s also possible to create custom strategies tailored to specific needs.
A flexible library should offer an interface that allows developers to implement these custom strategies easily.
By understanding and correctly applying these strategies, you can significantly enhance the resilience and reliability of your systems across various applications.