Open
Labels
bug (Something isn't working)
Description
Steps to reproduce
- Create an SSH fleet with multiple hosts (e.g. 4 CPU hosts):
  type: fleet
  name: simple-cpu-fleet
  ssh_config:
    user: bihan
    identity_file: ~/.ssh/id_rsa
    hosts:
      - 89.169.102.8
      - 89.169.120.141
      - 89.169.121.216
      - 89.169.103.68
- Create a service with replicas: 4:
  type: service
  name: simple-http
  python: 3.12
  commands:
    - python3 -m http.server 8000
  replicas: 4
  port: 8000
  resources:
    cpu: 4
- After the service is running, inspect containers on each host: sudo docker ps
Actual behaviour
Several replica jobs are placed on the same instance, as shown below:
sudo docker ps
CONTAINER ID   IMAGE                                 COMMAND                  CREATED          STATUS          PORTS   NAMES
db3de0c31670   dstackai/base:0.12-base-ubuntu22.04   "/bin/sh -c '( : && …"   52 seconds ago   Up 49 seconds           simple-http-0-2-5594b688
8b1b74135c49   dstackai/base:0.12-base-ubuntu22.04   "/bin/sh -c '( : && …"   52 seconds ago   Up 49 seconds           simple-http-0-1-500c5c8d
37dbddee69cf   dstackai/base:0.12-base-ubuntu22.04   "/bin/sh -c '( : && …"   52 seconds ago   Up 49 seconds           simple-http-0-3-b09943b7
Expected behaviour
Each instance runs one job.
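To make the expectation checkable, here is a minimal sketch that flags hosts running more than one container of the same run. It assumes the container names have already been collected per host (for example from `sudo docker ps --format '{{.Names}}'`); the function name `overloaded_hosts` and the host label `host-a` are hypothetical and not part of dstack.

```python
def overloaded_hosts(containers_by_host, run_name):
    """Return {host: [container names]} for hosts that run more than
    one container whose name starts with "<run_name>-"."""
    prefix = run_name + "-"
    result = {}
    for host, names in containers_by_host.items():
        matching = [name for name in names if name.startswith(prefix)]
        if len(matching) > 1:
            result[host] = matching
    return result


# Container names taken from the `docker ps` output in this report;
# "host-a" is a placeholder label for the affected host.
observed = {
    "host-a": [
        "simple-http-0-2-5594b688",
        "simple-http-0-1-500c5c8d",
        "simple-http-0-3-b09943b7",
    ],
}

if __name__ == "__main__":
    for host, names in overloaded_hosts(observed, "simple-http").items():
        print(f"{host}: {len(names)} replicas on one instance: {names}")
```

With the expected behaviour, the function returns an empty dict for every run.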
dstack version
commit ff712e0
Server logs
Additional information
This also occurs with cloud fleet instances after they become idle: when idle cloud instances are reused for new jobs, the same race condition assigns multiple jobs to one instance instead of spreading them across the available instances.
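For illustration only, the kind of race suspected here can be sketched as a check-then-assign pattern. This is not dstack's actual scheduling code, which this report does not show. The function names and the `replica-N` job labels are hypothetical. The first function simulates the unlucky interleaving in which every job reads the idle list before any job commits its choice. The second makes the check and the claim a single atomic step.

```python
import threading

HOSTS = ["89.169.102.8", "89.169.120.141", "89.169.121.216", "89.169.103.68"]
JOBS = ["replica-0", "replica-1", "replica-2", "replica-3"]


def assign_racy(jobs, instances):
    """Simulate the bad interleaving: all reads happen before any commit."""
    busy = set()
    # Phase 1: every job picks the first instance that still looks idle.
    picks = {job: next(i for i in instances if i not in busy) for job in jobs}
    # Phase 2: the picks are committed only after all jobs have chosen.
    busy.update(picks.values())
    return picks


def assign_with_claim(jobs, instances):
    """Pick and mark an instance as busy under one lock, per job."""
    lock = threading.Lock()
    busy = set()
    picks = {}

    def worker(job):
        with lock:  # check and claim are one atomic step
            for inst in instances:
                if inst not in busy:
                    busy.add(inst)
                    picks[job] = inst
                    return

    threads = [threading.Thread(target=worker, args=(job,)) for job in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return picks


if __name__ == "__main__":
    print("racy:   ", assign_racy(JOBS, HOSTS))
    print("claimed:", assign_with_claim(JOBS, HOSTS))
```

In the racy variant all four jobs land on the first host, which matches the symptom above. In the claimed variant each job gets a distinct host. In a real server the claim would more likely be a database row lock or a conditional update than an in-process lock.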