feat(client): add RetryTransport for automatic retry with exponential backoff #901
cchinchilla-dev wants to merge 2 commits into a2aproject:1.0-dev
Code Review
This pull request introduces a new RetryTransport for the A2A client, enabling configurable retry logic with exponential backoff and jitter for transient errors such as timeouts, network issues, and specific HTTP 5xx status codes. It includes a default_retry_predicate to determine retriable exceptions and handles both regular and streaming operations, with streaming retries limited to pre-stream failures. Comprehensive unit and integration tests have been added to validate the retry mechanism. A suggestion was made to refactor repeated ASGI middleware logic in test_retry.py into a reusable helper function for improved maintainability.
    async def transient_failure_app(scope, receive, send):
        nonlocal failure_count
        if scope['type'] == 'http' and failure_count < fail_limit:
            failure_count += 1
            await send(
                {
                    'type': 'http.response.start',
                    'status': 503,
                    'headers': [
                        [b'content-type', b'text/plain'],
                    ],
                }
            )
            await send(
                {
                    'type': 'http.response.body',
                    'body': b'Service Unavailable',
                }
            )
            return
        await app(scope, receive, send)
The transient_failure_app ASGI middleware is defined here and also in test_retry_with_jsonrpc_transport_recovers_from_503. Additionally, a similar middleware always_fail_app is defined in test_retry_exhaustion_with_persistent_503. To improve code reuse and maintainability, you could extract this logic into a helper function within this test module. A similar pattern is used in tests/integration/test_retry_integration.py with the _wrap_with_transient_503 helper, which could serve as a good example.
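One possible shape for such a helper is sketched below. The name `make_transient_503_app` and its `fail_limit` parameter are hypothetical and are not taken from this PR; the signature of the existing `_wrap_with_transient_503` helper may differ. Passing `fail_limit=None` is an assumed convention here for the "always fail" case.

```python
from typing import Any, Awaitable, Callable, Optional

# Loose ASGI callable alias, sufficient for this sketch.
ASGIApp = Callable[..., Awaitable[None]]


def make_transient_503_app(
    app: ASGIApp, fail_limit: Optional[int] = None
) -> ASGIApp:
    """Wrap `app` so the first `fail_limit` HTTP requests get a 503.

    Hypothetical helper: `fail_limit=None` means every request fails.
    """
    failure_count = 0

    async def transient_failure_app(scope: Any, receive: Any, send: Any) -> None:
        nonlocal failure_count
        should_fail = fail_limit is None or failure_count < fail_limit
        if scope['type'] == 'http' and should_fail:
            failure_count += 1
            await send(
                {
                    'type': 'http.response.start',
                    'status': 503,
                    'headers': [[b'content-type', b'text/plain']],
                }
            )
            await send(
                {
                    'type': 'http.response.body',
                    'body': b'Service Unavailable',
                }
            )
            return
        # Past the failure budget: delegate to the real application.
        await app(scope, receive, send)

    return transient_failure_app
```

Each test could then call the factory with its own `fail_limit` instead of redefining the middleware inline.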
Thanks for the suggestion. The current tests in this repo use inline definitions, so I’ve kept that pattern for consistency. I’m happy to refactor if the maintainers prefer extracting them.
🧪 Code Coverage (vs
| File | Base | PR | Delta |
|---|---|---|---|
| src/a2a/client/transports/retry.py (new) | — | 92.76% | — |
| Total | 91.47% | 91.49% | 🟢 +0.03% |
Generated by coverage-comment.yml
Description
Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- `CONTRIBUTING` Guide.
- `fix:` which represents bug fixes, and correlates to a SemVer patch.
- `feat:` represents a new feature, and correlates to a SemVer minor.
- `feat!:`, or `fix!:`, `refactor!:`, etc., which represent a breaking change (indicated by the `!`) and will result in a SemVer major.
- Run `bash scripts/format.sh` from the repository root to format.

Closes #871 🦕
Problem
The SDK's transports raise exceptions immediately on transient failures — network errors, timeouts, and server-side errors — with no built-in retry mechanism. Callers must implement their own retry logic at every call site, independently deciding which errors are retriable, implementing correct backoff, and inspecting cause chains for HTTP status codes.
Implementation
New `RetryTransport` decorator class in `src/a2a/client/transports/retry.py` that wraps any `ClientTransport` using the decorator pattern (following `TenantTransportDecorator`).

Key design decisions:

- Wraps `ClientTransport`, not via interceptors: `ClientCallInterceptor.after()` never sees exceptions, so interceptors cannot implement retry.
- `default_retry_predicate` classifies errors by inspecting `__cause__` chains: `A2AClientTimeoutError` (always), `httpx.RequestError` (network), `httpx.HTTPStatusError` 429/502/503/504, `grpc.aio.AioRpcError` UNAVAILABLE/RESOURCE_EXHAUSTED. Domain errors (`TaskNotFoundError`, etc.) are never retried.
- Exponential backoff with jitter: `delay = random.uniform(0, min(base * 2^attempt, max))`.
- `close()` bypasses retry: it is a lifecycle operation, not data exchange.
- Configurable `max_retries`, `base_delay`, `max_delay`.
- `retry_predicate` and `on_retry` callback for custom logic/logging.
- `default_retry_predicate` exported for users who want to extend it.

Non-breaking, purely additive. No changes to existing code. No new dependencies.
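The cause-chain classification and the jittered backoff formula described above can be illustrated with a small standalone sketch. This is not the code in `retry.py`: `TransientError`, `is_retriable`, `full_jitter_delay`, and `call_with_retry` are hypothetical names, and the injectable `sleep` parameter is an assumption added only to keep the example testable without real delays.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional, Set, TypeVar

T = TypeVar('T')


class TransientError(Exception):
    """Hypothetical stand-in for a retriable low-level error."""


def is_retriable(exc: BaseException) -> bool:
    """Return True if exc or anything in its __cause__ chain is transient."""
    seen: Set[int] = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        if isinstance(current, (TransientError, TimeoutError)):
            return True
        seen.add(id(current))  # guard against cyclic cause chains
        current = current.__cause__
    return False


def full_jitter_delay(attempt: int, base: float, cap: float) -> float:
    """Compute uniform(0, min(base * 2**attempt, cap))."""
    return random.uniform(0, min(base * (2**attempt), cap))


async def call_with_retry(
    func: Callable[[], Awaitable[T]],
    *,
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 10.0,
    predicate: Callable[[BaseException], bool] = is_retriable,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await func(), retrying up to max_retries times on retriable errors."""
    attempt = 0
    while True:
        try:
            return await func()
        except Exception as exc:
            # Give up when the budget is spent or the error is not transient.
            if attempt >= max_retries or not predicate(exc):
                raise
            await sleep(full_jitter_delay(attempt, base_delay, max_delay))
            attempt += 1
```

Drawing the delay uniformly from zero up to the capped exponential bound spreads out retries from many clients, so they are less likely to hit a recovering server at the same moment.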
Tests
54 unit tests in `tests/client/transports/test_retry.py` covering predicate classification, retry/no-retry behavior, backoff timing, streaming edge cases, custom predicates, callbacks, and constructor validation. Includes transport-level integration tests against Starlette servers simulating transient 503s.

3 additional end-to-end tests in `tests/integration/test_retry_integration.py` exercising the full stack (`ClientFactory` → `BaseClient` → `RetryTransport` → transport → server) for both REST and JSON-RPC transports.

Happy to iterate on any of this based on maintainer feedback.