Fix retries when uploading databases #3564
Conversation
Pull request overview
This PR fixes database upload retries by avoiding reuse of a partially-consumed ReadStream across retry attempts, keeping uploads streamed (rather than buffering multi-GB artifacts) while ensuring each attempt uses a fresh stream.
Changes:
- Implement manual retry logic for database uploads with a fresh `fs.createReadStream` per attempt, and disable Octokit per-request retries for that call.
- Centralize "do not retry" HTTP statuses in `src/api-client.ts` for reuse across the codebase.
- Update/add unit tests and document the fix in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/database-upload.ts | Replaces Octokit retry behavior for database uploads with a manual retry loop that recreates the upload stream per attempt. |
| src/database-upload.test.ts | Updates tests to distinguish retryable vs non-retryable failures and avoids real sleep delays during backoff. |
| src/api-client.ts | Exports DO_NOT_RETRY_STATUSES and reuses it in Octokit retry configuration. |
| src/api-client.test.ts | Updates expected retry configuration to reference DO_NOT_RETRY_STATUSES. |
| CHANGELOG.md | Adds an UNRELEASED entry describing the database upload retry fix. |
| lib/* | Generated JavaScript output updated to match TypeScript changes (not reviewed). |
esbena
left a comment
LGTM, with some minor insight improvement suggestions.
src/database-upload.ts
Outdated
```ts
if (!isRetryable || attempt === maxAttempts) {
  throw e;
```

Minor debugging improvement. Suggested change:

```ts
if (!isRetryable) {
  throw e;
} else if (attempt === maxAttempts) {
  logger.error(`Maximum retry attempts exhausted (${attempt}), re-throwing otherwise retryable error...`)
  throw e;
```
```ts
  language,
  zipped_upload_size_bytes: bundledDbSize,
  is_overlay_base: shouldUploadOverlayBase,
  upload_duration_ms: uploadDurationMs,
```

I wonder: do we separately and in general track how many failures we have for API requests that succeed on retry?

We don't, but that's a good idea, although for another PR :)
mbg
left a comment
This broadly looks good and makes sense. I have a few suggestions, particularly around the "high risk" assessment. I'd suggest we add a FF just in case, which shouldn't be too hard? Otherwise, we probably do want to make sure that we have reasonable test coverage here.
src/database-upload.ts
Outdated
```ts
const maxAttempts = 4; // 1 initial attempt + 3 retries, identical to the default retry behavior of Octokit
let uploadDurationMs: number | undefined;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  const bundledDbReadStream = fs.createReadStream(bundledDb);
```

I see that you marked this PR as high risk. It doesn't seem like it would be too hard to refactor this slightly so that the old code / loop body goes into a new function, and then, depending on a new FF, we either just call that directly or call it in the loop?

I replied to this as part of your overall review comment.
```ts
  bundledDbReadStream.close();
);
uploadDurationMs = performance.now() - attemptStartTime;
break;
```

Minor: maybe a mild preference for using a boolean in the loop condition to indicate when the upload was successful, rather than using `break`, since that can be brittle if we refactor this in the future. That might not matter / be necessary if you move the upload code into a separate function as suggested in my other comment, though.
I disagree — we hit this error very rarely, and the problem is remedied by a retry. I don't think we'd get much signal with a feature flag. The changes require careful review, but I think a feature flag is overkill. I'd suggest instead we check the telemetry in a month's time and make sure we're no longer seeing "Response body object should not be disturbed or locked".
mbg
left a comment
The changes since my last review look good, thanks!
I see that you reacted with 👍🏻 to this comment, but I don't see it addressed yet -- did you mean to?
I disagree — we hit this error very rarely, and the problem is remedied by a retry. I don't think we'd get much signal with a feature flag.
I don't necessarily agree with this line of reasoning, because the upload code itself gets hit by most analyses and any changes to it will affect them all. Whether or not the upload needs to be retried is not relevant to the reasoning for/against a FF. That said, I don't feel strongly enough about this to be blocking in this instance since we have test coverage which shows that the changes to the code behave as expected in the "no retry" case. Adding a FF also introduces additional changes and so there's an argument that may outweigh any benefit here.
Thanks, pushed.
The current retry mechanism would pass a read stream that might have already been partially consumed. Instead, use a custom retry mechanism and pass a fresh read stream on each attempt.
An alternative that wouldn't need the custom retry mechanism would have been to read the database into a buffer, but databases might be multiple GBs so I wanted to keep a streaming approach.
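The custom retry mechanism described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual `src/database-upload.ts` code: `uploadWithRetry`, `makeStream`, `upload`, and the status values in `DO_NOT_RETRY_STATUSES` are all assumed stand-ins.

```typescript
// Sketch of the retry strategy from this PR: each attempt constructs a fresh
// payload source, so a partially consumed stream is never reused. All names
// and status values here are illustrative assumptions.
const DO_NOT_RETRY_STATUSES = [400, 401, 403, 404, 422]; // assumed values

async function uploadWithRetry<T>(
  makeStream: () => T,                  // fresh stream per attempt
  upload: (stream: T) => Promise<void>,
  maxAttempts = 4,                      // 1 initial attempt + 3 retries
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const stream = makeStream();        // never reuse a consumed stream
    try {
      await upload(stream);
      return attempt;                   // number of attempts used
    } catch (e: unknown) {
      const status = (e as { status?: number }).status;
      const isRetryable =
        status === undefined || !DO_NOT_RETRY_STATUSES.includes(status);
      if (!isRetryable || attempt === maxAttempts) {
        throw e;
      }
      // Exponential backoff between attempts (kept tiny for the sketch).
      await new Promise((resolve) => setTimeout(resolve, 2 ** attempt));
    }
  }
  throw new Error("unreachable");
}

// Demo: an upload that fails twice with a retryable error, then succeeds.
async function demo(): Promise<void> {
  let streamsCreated = 0;
  let calls = 0;
  const attempts = await uploadWithRetry(
    () => ({ id: ++streamsCreated }),   // each attempt gets a new "stream"
    async () => {
      calls++;
      if (calls < 3) throw { status: 500 }; // retryable server error
    },
  );
  console.log(`attempts=${attempts} streams=${streamsCreated}`);
}

demo();
```

The key point is that `makeStream()` runs inside the loop: in the real fix this is where `fs.createReadStream(bundledDb)` is called per attempt, rather than once before Octokit's built-in retries.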
Risk assessment
For internal use only. Please select the risk level of this change:

Which use cases does this change impact?

- Workflow types: dynamic workflows (Default Setup, Code Quality, ...).
- Products: analysis-kinds: code-scanning.
- Environments: github.com and/or GitHub Enterprise Cloud with Data Residency.

How did/will you validate this change?

Unit tests (`.test.ts` files).

If something goes wrong after this change is released, what are the mitigation and rollback strategies?
How will you know if something goes wrong after this change is released?
Are there any special considerations for merging or releasing this change?
Merge / deployment checklist