Skip to content

fix: Do not add a request that is already in progress to MemoryRequestQueueClient#1384

Merged
vdusek merged 1 commit intoapify:masterfrom
Mantisus:fix-memory-storage
Sep 3, 2025
Merged

fix: Do not add a request that is already in progress to MemoryRequestQueueClient#1384
vdusek merged 1 commit intoapify:masterfrom
Mantisus:fix-memory-storage

Conversation

@Mantisus
Copy link
Collaborator

@Mantisus Mantisus commented Sep 1, 2025

Description

  • Do not add a request that is already in progress for MemoryRequestQueueClient.

Issues

@Mantisus Mantisus requested a review from vdusek September 1, 2025 21:33
@Mantisus Mantisus self-assigned this Sep 1, 2025
@Mantisus Mantisus changed the title fix: Do not add a request that is already in progress to MemoryRequestQueueClient. fix: Do not add a request that is already in progress to MemoryRequestQueueClient Sep 1, 2025
@vdusek
Copy link
Collaborator

vdusek commented Sep 2, 2025

Old version:

┌───────────────────────────────┬────────────────┐
│ requests_finished             │ 9024           │
│ requests_failed               │ 0              │
│ retry_histogram               │ [9024]         │
│ request_avg_failed_duration   │ None           │
│ request_avg_finished_duration │ 482.4ms        │
│ requests_finished_per_minute  │ 541            │
│ requests_failed_per_minute    │ 0              │
│ request_total_duration        │ 1h 12min 33.1s │
│ requests_total                │ 9024           │
│ crawler_runtime               │ 16min 40.5s    │
└───────────────────────────────┴────────────────┘

New version:

┌───────────────────────────────┬─────────────┐
│ requests_finished             │ 4512        │
│ requests_failed               │ 0           │
│ retry_histogram               │ [4512]      │
│ request_avg_failed_duration   │ None        │
│ request_avg_finished_duration │ 444.8ms     │
│ requests_finished_per_minute  │ 579         │
│ requests_failed_per_minute    │ 0           │
│ request_total_duration        │ 33min 26.9s │
│ requests_total                │ 4512        │
│ crawler_runtime               │ 7min 47.7s  │
└───────────────────────────────┴─────────────┘

With the new version, we only process about half as many requests.

However, based on the SDK's deduplication, there should only be 2362 unique pages, so there still seems to be an issue.

cc @Pijukatel

@Mantisus
Copy link
Collaborator Author

Mantisus commented Sep 2, 2025

However, based on the SDK's deduplication, there should only be 2362 unique pages, so there still seems to be an issue.

2363 if the start URL is https://crawlee.dev/
4512 if the start URL is http://crawlee.dev/

This is a non-deduplication error.

In the JS version, similarly

@vdusek
Copy link
Collaborator

vdusek commented Sep 2, 2025

Oh, thanks for the explanation

@vdusek vdusek requested a review from Pijukatel September 2, 2025 11:47
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's wait for @Pijukatel as well, but I believe this is fine

@vdusek vdusek merged commit 3af326c into apify:master Sep 3, 2025
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MemoryStorageClient issue with deduplication

3 participants