feat(client): add RetryTransport for automatic retry with exponential backoff #901

Open
cchinchilla-dev wants to merge 2 commits into a2aproject:1.0-dev from cchinchilla-dev:feat/retry-transport-871

Conversation

@cchinchilla-dev
Contributor

Description

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • Make your Pull Request title follow the https://www.conventionalcommits.org/ specification.
    • Important Prefixes for release-please:
      • fix: which represents bug fixes, and correlates to a SemVer patch.
      • feat: represents a new feature, and correlates to a SemVer minor.
      • feat!:, or fix!:, refactor!:, etc., which represent a breaking change (indicated by the !) and will result in a SemVer major.
  • Ensure the tests and linter pass (Run bash scripts/format.sh from the repository root to format)
  • Appropriate docs were updated (if necessary)

Closes #871 🦕

Problem

The SDK's transports raise exceptions immediately on transient failures — network errors, timeouts, and server-side errors — with no built-in retry mechanism. Callers must implement their own retry logic at every call site, independently deciding which errors are retriable, implementing correct backoff, and inspecting cause chains for HTTP status codes.

Implementation

This PR adds a RetryTransport class in src/a2a/client/transports/retry.py that wraps any ClientTransport using the decorator pattern, following the existing TenantTransportDecorator:

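from a2a.client.transports.jsonrpc import JsonRpcTransport
from a2a.client.transports.retry import RetryTransport

# `client` (an httpx.AsyncClient), `card` (the AgentCard), and `params`
# are assumed to be created elsewhere.
inner = JsonRpcTransport(httpx_client=client, agent_card=card)
transport = RetryTransport(
    base=inner,
    max_retries=3,    # retries after the initial attempt
    base_delay=1.0,   # seconds; doubles on each attempt
    max_delay=30.0,   # upper bound on the backoff window
)

async with transport:
    result = await transport.send_message(params)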

Key design decisions:

  • Decorator over ClientTransport, not via interceptors — ClientCallInterceptor.after() never sees exceptions, so interceptors cannot implement retry.
  • default_retry_predicate classifies errors by inspecting __cause__ chains: A2AClientTimeoutError (always), httpx.RequestError (network), httpx.HTTPStatusError 429/502/503/504, grpc.aio.AioRpcError UNAVAILABLE/RESOURCE_EXHAUSTED. Domain errors (TaskNotFoundError, etc.) are never retried.
  • Streaming retries only pre-stream failures; once the first event is yielded, errors propagate as-is.
  • Exponential backoff with full jitter: delay = random.uniform(0, min(base_delay * 2**attempt, max_delay)).
  • close() bypasses retry — lifecycle operation, not data exchange.
  • Constructor validation for max_retries, base_delay, max_delay.
  • Configurable retry_predicate and on_retry callback for custom logic/logging.
  • default_retry_predicate exported for users who want to extend it.
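
For reviewers who want the retry loop in one place, here is a minimal sketch of the backoff and callback behaviour described in the list above. It is illustrative only and is not the code in retry.py: full_jitter_delay, call_with_retry, and the on_retry argument order are hypothetical names and choices made for this example.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import Optional, TypeVar

T = TypeVar("T")


def full_jitter_delay(attempt: int, base_delay: float, max_delay: float) -> float:
    """Return a random delay in [0, min(base_delay * 2**attempt, max_delay)]."""
    cap = min(base_delay * (2 ** attempt), max_delay)
    return random.uniform(0.0, cap)


async def call_with_retry(
    call: Callable[[], Awaitable[T]],
    *,
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    retry_predicate: Callable[[BaseException], bool] = lambda exc: True,
    # Hypothetical callback shape: (retry number, error, chosen delay).
    on_retry: Optional[Callable[[int, BaseException, float], None]] = None,
) -> T:
    """Hypothetical helper: retry `call` while the predicate accepts the error."""
    attempt = 0
    while True:
        try:
            return await call()
        except Exception as exc:
            # Give up when retries are exhausted or the error is not retriable.
            if attempt >= max_retries or not retry_predicate(exc):
                raise
            delay = full_jitter_delay(attempt, base_delay, max_delay)
            if on_retry is not None:
                on_retry(attempt + 1, exc, delay)
            await asyncio.sleep(delay)
            attempt += 1
```

With max_retries=3 the wrapped call runs at most four times: the initial attempt plus three retries. Full jitter spreads retries from many clients across the whole backoff window instead of clustering them at its upper edge.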

Non-breaking, purely additive. No changes to existing code. No new dependencies.

Tests

54 unit tests in tests/client/transports/test_retry.py covering predicate classification, retry/no-retry behavior, backoff timing, streaming edge cases, custom predicates, callbacks, and constructor validation. Includes transport-level integration tests against Starlette servers simulating transient 503s.

3 additional end-to-end tests in tests/integration/test_retry_integration.py exercising the full stack (ClientFactory → BaseClient → RetryTransport → transport → server) for both REST and JSON-RPC transports.


Happy to iterate on any of this based on maintainer feedback.

@cchinchilla-dev cchinchilla-dev requested a review from a team as a code owner March 25, 2026 21:49
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a RetryTransport decorator to the client SDK, providing a robust and configurable mechanism for automatically retrying operations that encounter transient failures. This enhancement significantly improves the client's resilience against network issues, timeouts, and temporary server-side errors, reducing the need for callers to implement custom retry logic and making the SDK more reliable out-of-the-box.

Highlights

  • New RetryTransport Decorator: Introduced a new RetryTransport decorator class that wraps any ClientTransport to provide automatic retry logic for transient failures.
  • Configurable Retry Predicate and Backoff: Implemented a default_retry_predicate to classify retriable errors (e.g., network errors, timeouts, specific HTTP status codes like 429, 502-504, gRPC UNAVAILABLE/RESOURCE_EXHAUSTED) and an exponential backoff strategy with full jitter for delay calculation. Users can provide custom predicates and on_retry callbacks.
  • Streaming Operation Handling: Designed streaming methods (send_message_streaming, subscribe) to only retry failures that occur before the stream starts, ensuring that errors during an active stream propagate immediately.
  • Comprehensive Testing: Added 54 unit tests covering predicate classification, retry/no-retry behavior, backoff timing, streaming edge cases, custom logic, and constructor validation. Additionally, 3 end-to-end integration tests were included to validate the full client stack with transient server failures.
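
The streaming rule in the highlights (retry only before the first event) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: stream_with_pre_stream_retry and open_stream are hypothetical names, a fixed delay stands in for the jittered backoff, and every pre-stream exception is treated as retriable for brevity.

```python
import asyncio
from collections.abc import AsyncIterator, Callable


async def stream_with_pre_stream_retry(
    open_stream: Callable[[], AsyncIterator[str]],
    *,
    max_retries: int = 3,
    delay: float = 0.0,
) -> AsyncIterator[str]:
    """Retry opening a stream until its first event arrives, then pass it through."""
    attempt = 0
    while True:
        stream = open_stream()
        try:
            # Pull the first event; a failure here happens before anything
            # has been yielded to the caller, so it is safe to start over.
            first = await stream.__anext__()
        except StopAsyncIteration:
            return  # The stream ended without producing any events.
        except Exception:
            if attempt >= max_retries:
                raise
            attempt += 1
            await asyncio.sleep(delay)
            continue
        break

    yield first
    # From here on the caller has seen data, so errors propagate as-is.
    async for event in stream:
        yield event
```

Retrying after the first event could replay events the caller has already processed, which is why mid-stream errors are left to the caller.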

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new RetryTransport for the A2A client, enabling configurable retry logic with exponential backoff and jitter for transient errors such as timeouts, network issues, and specific HTTP status codes (429, 502, 503, 504). It includes a default_retry_predicate to determine retriable exceptions and handles both regular and streaming operations, with streaming retries limited to pre-stream failures. Comprehensive unit and integration tests have been added to validate the retry mechanism. A suggestion was made to refactor repeated ASGI middleware logic in test_retry.py into a reusable helper function for improved maintainability.
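
As a rough illustration of how a predicate can classify errors by walking the __cause__ chain, consider the sketch below. It uses only the standard library: StatusError and ClientError are hypothetical stand-ins for an HTTP status error and an SDK wrapper exception, and is_retriable is not the PR's default_retry_predicate.

```python
RETRIABLE_STATUS_CODES = frozenset({429, 502, 503, 504})


class StatusError(Exception):
    """Hypothetical stand-in for an HTTP status error carrying a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


class ClientError(Exception):
    """Hypothetical stand-in for an SDK-level exception that wraps a cause."""


def iter_cause_chain(exc: BaseException):
    """Yield exc and each exception linked through __cause__, guarding against cycles."""
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        yield current
        current = current.__cause__


def is_retriable(exc: BaseException) -> bool:
    """Return True if any link in the cause chain looks like a transient failure."""
    for err in iter_cause_chain(exc):
        # Timeouts and connection failures are treated as transient.
        if isinstance(err, (TimeoutError, ConnectionError)):
            return True
        # Only selected status codes are retried; a 404, for example, is not.
        if isinstance(err, StatusError) and err.status_code in RETRIABLE_STATUS_CODES:
            return True
    return False
```

A wrapper raised with `raise ClientError(...) from StatusError(503)` is classified as retriable, while the same wrapper around a 404, or with no cause at all, is not.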

Comment on lines +653 to +673
async def transient_failure_app(scope, receive, send):
    nonlocal failure_count
    if scope['type'] == 'http' and failure_count < fail_limit:
        failure_count += 1
        await send(
            {
                'type': 'http.response.start',
                'status': 503,
                'headers': [
                    [b'content-type', b'text/plain'],
                ],
            }
        )
        await send(
            {
                'type': 'http.response.body',
                'body': b'Service Unavailable',
            }
        )
        return
    await app(scope, receive, send)

low

The transient_failure_app ASGI middleware is defined here and also in test_retry_with_jsonrpc_transport_recovers_from_503. Additionally, a similar middleware always_fail_app is defined in test_retry_exhaustion_with_persistent_503. To improve code reuse and maintainability, you could extract this logic into a helper function within this test module. A similar pattern is used in tests/integration/test_retry_integration.py with the _wrap_with_transient_503 helper, which could serve as a good example.
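
One possible shape for such a helper is sketched below. The name wrap_with_transient_failures and its signature are hypothetical, and this is not the existing _wrap_with_transient_503 helper; it only shows how the inline closure above could be parameterised.

```python
def wrap_with_transient_failures(app, fail_limit):
    """Wrap an ASGI app so its first `fail_limit` HTTP requests receive a 503."""
    failure_count = 0

    async def middleware(scope, receive, send):
        nonlocal failure_count
        if scope['type'] == 'http' and failure_count < fail_limit:
            failure_count += 1
            await send(
                {
                    'type': 'http.response.start',
                    'status': 503,
                    'headers': [[b'content-type', b'text/plain']],
                }
            )
            await send(
                {
                    'type': 'http.response.body',
                    'body': b'Service Unavailable',
                }
            )
            return
        # Once the failure budget is spent, requests reach the real app.
        await app(scope, receive, send)

    return middleware
```

A test that needs a permanently failing server could pass a fail_limit larger than the number of attempts it makes, which would cover the always_fail_app case as well.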

Contributor Author

Thanks for the suggestion. The current tests in this repo use inline definitions, so I’ve kept that pattern for consistency. I’m happy to refactor if the maintainers prefer extracting them.

@github-actions

github-actions bot commented Mar 25, 2026

🧪 Code Coverage (vs 1.0-dev)

| File | Base | PR | Delta |
|---|---|---|---|
| src/a2a/client/transports/retry.py (new) | | 92.76% | |
| Total | 91.47% | 91.49% | 🟢 +0.03% |

Generated by coverage-comment.yml
