Retry and back-off resiliency policies
1 - Retry resiliency policies
Requests can fail due to transient errors, like encountering network congestion, reroutes to overloaded instances, and more. Sometimes, requests can fail due to other resiliency policies set in place, like triggering a defined timeout or circuit breaker policy.
In these cases, configuring retries can either:
- Send the same request to a different instance, or
- Retry sending the request after the condition has cleared.
Retries and timeouts work together, with timeouts ensuring your system fails fast when needed, and retries recovering from temporary glitches.
Dapr provides default resiliency policies, which you can overwrite with user-defined retry policies.
Important
Each pub/sub component has its own built-in retry behaviors. Explicitly applying a Dapr resiliency policy doesn’t override these implicit retry policies. Rather, the resiliency policy augments the built-in retry, which can cause repetitive clustering of messages.
Retry policy format
Example 1
spec:
  policies:
    # Retries are named templates for retry configurations and are instantiated for life of the operation.
    retries:
      pubsubRetry:
        policy: constant
        duration: 5s
        maxRetries: 10
      retryForever:
        policy: exponential
        maxInterval: 15s
        maxRetries: -1 # Retry indefinitely
Example 2
spec:
  policies:
    retries:
      retry5xxOnly:
        policy: constant
        duration: 5s
        maxRetries: 3
        matching:
          httpStatusCodes: "429,500-599" # retry the HTTP status codes in this range. All others are not retried.
          gRPCStatusCodes: "1-4,8-11,13,14" # retry gRPC status codes in these ranges and separate single codes.
Spec metadata
The following retry options are configurable:
Retry option | Description |
---|---|
policy | Determines the back-off and retry interval strategy. Valid values are constant and exponential. Defaults to constant. |
duration | Determines the time interval between retries. Only applies to the constant policy. Valid values are of the form 200ms, 15s, 2m, etc. Defaults to 5s. |
maxInterval | Determines the maximum interval between retries to which the exponential back-off policy can grow. Additional retries always occur after a duration of maxInterval. Defaults to 60s. Valid values are of the form 5s, 1m, 1m30s, etc. |
maxRetries | The maximum number of retries to attempt. -1 denotes an unlimited number of retries, while 0 means the request will not be retried (essentially behaving as if the retry policy were not set). Defaults to -1. |
matching.httpStatusCodes | Optional: a comma-separated string of HTTP status codes or code ranges to retry. Status codes not listed are not retried. Valid values: 100-599, Reference. Format: <code> or range <start>-<end>. Example: “429,501-503”. Default: empty string "" or field is not set, which retries on all HTTP errors. |
matching.gRPCStatusCodes | Optional: a comma-separated string of gRPC status codes or code ranges to retry. Status codes not listed are not retried. Valid values: 0-16, Reference. Format: <code> or range <start>-<end>. Example: “4,8,14”. Default: empty string "" or field is not set, which retries on all gRPC errors. |
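To make the maxRetries semantics concrete, here is a minimal Python sketch of a constant-interval retry loop. It is an illustration only, not how Dapr implements retries: the function name `call_with_retries`, its parameters, and the injectable `sleep` argument are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    duration: float = 5.0,   # seconds between retries (constant policy)
    max_retries: int = -1,   # -1 = retry indefinitely, 0 = never retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on any exception with a constant interval.

    Hypothetical helper mirroring the `duration` and `maxRetries` options
    described in the table; it does not call any Dapr API.
    """
    retries_done = 0
    while True:
        try:
            return operation()
        except Exception:
            # Give up once the retry budget is spent (unless unlimited).
            if max_retries != -1 and retries_done >= max_retries:
                raise
            retries_done += 1
            sleep(duration)
```

With `max_retries=0` the operation runs exactly once and the first error is raised, which matches the "as if the retry policy were not set" behavior described for that value.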
Exponential back-off policy
The exponential back-off window uses the following formula:
BackOffDuration = PreviousBackOffDuration * (Random value from 0.5 to 1.5) * 1.5
if BackOffDuration > maxInterval {
  BackOffDuration = maxInterval
}
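As a rough illustration of the formula above, the following Python sketch generates a sequence of back-off intervals. The function name `backoff_intervals`, the `initial` interval, and the injectable `rand` parameter are assumptions made for this example; the formula only states how each interval derives from the previous one, and this is not Dapr's own implementation.

```python
import random
from typing import Callable, Iterator


def backoff_intervals(
    initial: float,
    max_interval: float,
    count: int,
    rand: Callable[[float, float], float] = random.uniform,
) -> Iterator[float]:
    """Yield `count` back-off durations in seconds.

    `initial` (the first interval) is an assumption of this sketch.
    """
    duration = min(initial, max_interval)
    for _ in range(count):
        yield duration
        # Next interval: previous * (random value from 0.5 to 1.5) * 1.5
        duration = duration * rand(0.5, 1.5) * 1.5
        # Cap the interval at maxInterval
        if duration > max_interval:
            duration = max_interval


if __name__ == "__main__":
    for i, d in enumerate(backoff_intervals(0.5, 15.0, 8), start=1):
        print(f"retry {i}: wait {d:.2f}s")
```

Because of the random factor, individual intervals can shrink as well as grow, but on average they increase until they reach the cap, after which every further retry waits `max_interval`.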
Retry status codes
When applications span multiple services, especially on dynamic environments like Kubernetes, services can disappear for all kinds of reasons and network calls can start hanging. Status codes provide a glimpse into our operations and where they may have failed in production.
HTTP
The following table includes some examples of HTTP status codes you may receive and whether you should or should not retry certain operations.
HTTP Status Code | Retry Recommended? | Description |
---|---|---|
404 Not Found | No | The resource doesn’t exist. |
400 Bad Request | No | Your request is invalid. |
401 Unauthorized | No | Try getting new credentials. |
408 Request Timeout | Yes | The server timed out waiting for the request. |
429 Too Many Requests | Yes | (Respect the Retry-After header, if present). |
500 Internal Server Error | Yes | The server encountered an unexpected condition. |
502 Bad Gateway | Yes | A gateway or proxy received an invalid response. |
503 Service Unavailable | Yes | Service might recover. |
504 Gateway Timeout | Yes | Temporary network issue. |
gRPC
The following table includes some examples of gRPC status codes you may receive and whether you should or should not retry certain operations.
gRPC Status Code | Retry Recommended? | Description |
---|---|---|
Code 1 CANCELLED | No | N/A |
Code 3 INVALID_ARGUMENT | No | N/A |
Code 4 DEADLINE_EXCEEDED | Yes | Retry with backoff |
Code 5 NOT_FOUND | No | N/A |
Code 8 RESOURCE_EXHAUSTED | Yes | Retry with backoff |
Code 14 UNAVAILABLE | Yes | Retry with backoff |
Retry filter based on status codes
The retry filter enables granular control over retry policies by allowing users to specify HTTP and gRPC status codes or ranges for which retries should apply.
spec:
  policies:
    retries:
      retry5xxOnly:
        # ...
        matching:
          httpStatusCodes: "429,500-599" # retry the HTTP status codes in this range. All others are not retried.
          gRPCStatusCodes: "4,8-11,13,14" # retry gRPC status codes in these ranges and separate single codes.
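To show how a filter string such as "429,500-599" can be read, the Python sketch below parses the comma-separated codes and ranges and checks whether a given status code would be retried. The helpers `parse_codes` and `should_retry` are hypothetical names for this example; this is not Dapr's parser, and it does not enforce the valid value bounds (100-599 for HTTP, 0-16 for gRPC).

```python
def parse_codes(spec: str) -> list[range]:
    """Parse a string like "429,500-599" into a list of inclusive ranges."""
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, sep, end = part.partition("-")
        low = int(start)
        high = int(end) if sep else low  # a single code is a one-element range
        if low > high:
            raise ValueError(f"invalid range: {part!r}")
        ranges.append(range(low, high + 1))
    return ranges


def should_retry(code: int, spec: str) -> bool:
    """Return True if `code` is retried under the filter `spec`.

    An empty spec means no filter is set, so every error code is retried.
    """
    ranges = parse_codes(spec)
    if not ranges:
        return True
    return any(code in r for r in ranges)
```

For the filter in the example above, `should_retry(503, "429,500-599")` is true and `should_retry(404, "429,500-599")` is false, while an empty filter retries every error code.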
Note
Field values for status codes must follow the format specified above. An incorrectly formatted value produces an error log (“Could not read resiliency policy”) and the daprd startup sequence will proceed.
Demo
Watch a demo presented during Diagrid’s Dapr v1.15 celebration to see how to set retry status code filters using Diagrid Conductor.
Next steps
- Learn how to override default retry policies for specific APIs.
- Learn how to target your retry policies from the resiliency spec.
Related links
Try out one of the Resiliency quickstarts:
2 - Override default retry resiliency policies
Dapr provides default retries for any unsuccessful request, such as failures and transient errors. Within a resiliency spec, you have the option to override Dapr’s default retry logic by defining policies with reserved, named keywords. For example, defining a policy with the name DaprBuiltInServiceRetries overrides the default retries for failures between sidecars via service-to-service requests. Policy overrides are not applied to specific targets.
Note: Although you can override default values with more robust retries, you cannot override with lesser values than the provided default value, or completely remove default retries. This prevents unexpected downtime.
Below is a table that describes Dapr’s default retries and the policy keywords to override them:
Capability | Override Keyword | Default Retry Behavior | Description |
---|---|---|---|
Service Invocation | DaprBuiltInServiceRetries | Per call retries are performed with a backoff interval of 1 second, up to a threshold of 3 times. | Sidecar-to-sidecar requests (a service invocation method call) that fail and result in a gRPC code Unavailable or Unauthenticated |
Actors | DaprBuiltInActorRetries | Per call retries are performed with a backoff interval of 1 second, up to a threshold of 3 times. | Sidecar-to-sidecar requests (an actor method call) that fail and result in a gRPC code Unavailable or Unauthenticated |
Actor Reminders | DaprBuiltInActorReminderRetries | Per call retries are performed with an exponential backoff with an initial interval of 500ms, up to a maximum of 60s for a duration of 15mins | Requests that fail to persist an actor reminder to a state store |
Initialization Retries | DaprBuiltInInitializationRetries | Per call retries are performed 3 times with an exponential backoff, an initial interval of 500ms and for a duration of 10s | Failures when making a request to an application to retrieve a given spec. For example, failure to retrieve a subscription, component or resiliency specification |
The resiliency spec example below shows overriding the default retries for all service invocation requests by using the reserved, named keyword ‘DaprBuiltInServiceRetries’.
Also defined is a retry policy called ‘retryForever’ that is only applied to the appB target. appB uses the ‘retryForever’ retry policy, while failed service invocation requests to all other applications use the overridden ‘DaprBuiltInServiceRetries’ default policy.
spec:
  policies:
    retries:
      DaprBuiltInServiceRetries: # Overrides default retry behavior for service-to-service calls
        policy: constant
        duration: 5s
        maxRetries: 10
      retryForever: # A user defined retry policy replaces default retries. Targets rely solely on the applied policy.
        policy: exponential
        maxInterval: 15s
        maxRetries: -1 # Retry indefinitely
  targets:
    apps:
      appB: # app-id of the target service
        retry: retryForever
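The selection described above can be modeled with a small Python sketch: a target with its own retry policy uses that policy, and every other app falls back to the overridden default. The dictionaries mirror the spec above, while the function `retry_policy_for` and the app ID `appA` are hypothetical; this is a simplified model of the behavior as described, not Dapr's actual policy resolver.

```python
# Reserved keyword that overrides the default service invocation retries.
DEFAULT_SERVICE_RETRY_KEY = "DaprBuiltInServiceRetries"

# Mirrors spec.policies.retries from the example above.
policies = {
    "DaprBuiltInServiceRetries": {"policy": "constant", "duration": "5s", "maxRetries": 10},
    "retryForever": {"policy": "exponential", "maxInterval": "15s", "maxRetries": -1},
}

# Mirrors spec.targets.apps from the example above.
targets = {"appB": {"retry": "retryForever"}}


def retry_policy_for(app_id: str) -> dict:
    """Return the retry policy that applies to calls to `app_id`."""
    target = targets.get(app_id, {})
    # Targets with an explicit retry rely solely on that policy;
    # all others use the overridden default.
    name = target.get("retry", DEFAULT_SERVICE_RETRY_KEY)
    return policies[name]


print(retry_policy_for("appB")["policy"])  # exponential
print(retry_policy_for("appA")["policy"])  # constant
```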