
[Bug]: Multiple replica jobs assigned to same instance due to busy_blocks race #3683

@Bihan

Description


Steps to reproduce

  1. Create an SSH fleet with multiple hosts (e.g. 4 CPU hosts):
type: fleet
name: simple-cpu-fleet

ssh_config:
  user: bihan
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 89.169.102.8
    - 89.169.120.141
    - 89.169.121.216
    - 89.169.103.68
  2. Create a service with four replicas (replicas: 4):
type: service
name: simple-http
python: 3.12

commands:
  - python3 -m http.server 8000

replicas: 4
port: 8000
resources:
  cpu: 4
  3. After the service is running, inspect the containers on each host with sudo docker ps.

Actual behaviour

Several replica jobs are placed on the same instance. For example, one host shows three replica containers:

sudo docker ps
CONTAINER ID   IMAGE                                 COMMAND                  CREATED          STATUS          PORTS     NAMES
db3de0c31670   dstackai/base:0.12-base-ubuntu22.04   "/bin/sh -c '( : && …"   52 seconds ago   Up 49 seconds             simple-http-0-2-5594b688
8b1b74135c49   dstackai/base:0.12-base-ubuntu22.04   "/bin/sh -c '( : && …"   52 seconds ago   Up 49 seconds             simple-http-0-1-500c5c8d
37dbddee69cf   dstackai/base:0.12-base-ubuntu22.04   "/bin/sh -c '( : && …"   52 seconds ago   Up 49 seconds             simple-http-0-3-b09943b7

Expected behaviour

Each instance runs exactly one replica job, so the four replicas are spread across the four hosts.

dstack version

commit ff712e0

Server logs

Additional information

This also occurs with cloud fleet instances after they become idle. When idle cloud instances are reused for new jobs, the same race condition causes multiple jobs to be assigned to the same instance instead of being spread across available instances.
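
To illustrate the kind of check-then-act race described above, here is a minimal, self-contained sketch. It is not dstack's actual scheduling code: the Instance class, the total_blocks field, and the functions pick_free, assign_racy and assign_locked are hypothetical names made up for this example. Only busy_blocks is taken from the issue title. The racy variant lets every job choose an instance from the same stale view before any job records its claim. The locked variant makes "check and claim" one atomic step.

```python
import threading
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical model: one block per host, so one job per host.
    name: str
    total_blocks: int = 1
    busy_blocks: int = 0


def pick_free(instances):
    """Return the first instance that still has a free block, or None."""
    for inst in instances:
        if inst.busy_blocks < inst.total_blocks:
            return inst
    return None


def assign_racy(instances, jobs):
    """Simulate the race: all jobs select before any job commits."""
    # Every job reads busy_blocks while it is still 0 everywhere,
    # so every job selects the same first instance.
    choices = {job: pick_free(instances) for job in jobs}
    for inst in choices.values():
        inst.busy_blocks += 1  # commits happen too late to affect selection
    return {job: inst.name for job, inst in choices.items()}


def assign_locked(instances, jobs, lock):
    """Select and commit under one lock, so each job sees prior claims."""
    placement = {}
    for job in jobs:
        with lock:
            inst = pick_free(instances)
            if inst is None:
                raise RuntimeError(f"no free instance for {job}")
            inst.busy_blocks += 1
            placement[job] = inst.name
    return placement


if __name__ == "__main__":
    jobs = [f"replica-{i}" for i in range(4)]

    hosts = [Instance(f"host-{i}") for i in range(4)]
    print("racy:  ", assign_racy(hosts, jobs))

    hosts = [Instance(f"host-{i}") for i in range(4)]
    print("locked:", assign_locked(hosts, jobs, threading.Lock()))
```

In a real server the same idea would more likely be expressed as a database row lock or an atomic conditional update than as an in-process lock. That is an assumption about a possible fix, not a description of how dstack resolves this issue.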
