Conversation
runner/internal/shim/docker.go
Outdated
Entrypoint: []string{"/bin/sh", "-c"},
ExposedPorts: exposePorts(dockerParams.DockerPorts()...),
Env: []string{
    "PJRT_DEVICE=TPU",
Does this mean we always set PJRT_DEVICE=TPU, even when running on CUDA rather than a TPU? I think we should set PJRT_DEVICE=TPU only when running on TPUs.
@r4victor Should I set it as an optional flag in docker run, similar to privileged_flag below?
nohup dstack-shim {dev_flag} docker --keep-container {privileged_flag} {pjrt_device} > {DSTACK_WORKING_DIR}/shim.log 2>&1 &
I'd set PJRT_DEVICE in the runner where we execute the job and set other envs:
dstack/runner/internal/executor/executor.go, lines 196 to 207 in 62d25c3
This is good because we avoid introducing another place where we set envs. It would require passing whether a TPU is used to the runner API, but that should not be hard to do.
Setting PJRT_DEVICE via shim arg would probably work as well. Feel free to go this route if it works.
@r4victor "This is good because we avoid introducing another place where we set envs." I think is a very valid point. I will set in the runner's jobEnvs
runner/internal/shim/docker.go
Outdated
Entrypoint: []string{"/bin/sh", "-c"},
ExposedPorts: exposePorts(dockerParams.DockerPorts()...),
Env: []string{
    fmt.Sprintf("PJRT_DEVICE=%s", dockerParams.DockerPJRTDevice()),
If the pjrt-device arg is not set, we still set PJRT_DEVICE="", which may not be the same as not setting PJRT_DEVICE at all. Let's add PJRT_DEVICE to Env only if DockerPJRTDevice() returns a non-empty string?
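A hedged sketch of that conditional append (containerEnv is a hypothetical helper, not the shim's actual function; it only shows the empty-string check):

```go
package shim

import "fmt"

// containerEnv builds the Env slice for the container config, adding
// PJRT_DEVICE only when the shim was given a non-empty value, so an unset
// pjrt-device arg does not inject PJRT_DEVICE="".
func containerEnv(pjrtDevice string) []string {
	envs := []string{}
	if pjrtDevice != "" {
		envs = append(envs, fmt.Sprintf("PJRT_DEVICE=%s", pjrtDevice))
	}
	return envs
}
```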
This code:
- is conda-specific
- is dead
- overwrites user/image-defined LD_LIBRARY_PATH in rare cases when it works

Basically, the function (ab)uses the python3-config [1] utility (which, BTW, is not present in dstack base images since the conda -> uv migration) to calculate the path to conda-installed shared objects and export it via LD_LIBRARY_PATH (the proper way to add conda-installed libs would be to use ld.so's configuration files, that is, ld.so.conf.d/*). Although this code was added in #1354, it is not related to TPU support at all.

[1]: https://manpages.debian.org/testing/python3-dev/python3-config.1.en.html
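For context, a rough sketch of the pattern described above (illustrative names only, not the removed dstack function):

```go
package shim

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// condaLDLibraryPathEnv sketches the (ab)use of python3-config: it asks the
// utility for the Python prefix and exposes its lib/ directory through
// LD_LIBRARY_PATH. It fails on images without python3-config and, depending
// on how the result is applied, can clobber a user/image-defined value.
func condaLDLibraryPathEnv() (string, error) {
	out, err := exec.Command("python3-config", "--prefix").Output()
	if err != nil {
		return "", err
	}
	libDir := filepath.Join(strings.TrimSpace(string(out)), "lib")
	return fmt.Sprintf("LD_LIBRARY_PATH=%s:%s", os.Getenv("LD_LIBRARY_PATH"), libDir), nil
}
```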
Fixes
- --privileged flag added to docker run (see the sketch below)
- PJRT_DEVICE set to TPU. This is necessary, otherwise a warning is issued while running the training script.
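A hedged sketch of where the privileged flag ends up on the Docker API side (the HostConfig field is from the Docker Go SDK; wiring it from a shim flag is assumed):

```go
package shim

import "github.com/docker/docker/api/types/container"

// hostConfigFor is a sketch: the --privileged shim flag is forwarded to the
// Privileged field of the container's HostConfig when the job container is
// created.
func hostConfigFor(privileged bool) *container.HostConfig {
	return &container.HostConfig{
		Privileged: privileged,
	}
}
```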
v2, v3, v4, v5p, and v5litepod are different TPU versions provided by GCP. Except for v4, all TPU versions named v{X}-{number} with number <= 8 are single-VM TPUs. All v4 TPUs are TPU Pods.
Note: To use TPU versions v4, v5p, and v5litepod, quotas must be requested.
Before Starting the Test
How to test TPU
% dstack run . -b gcp --gpu tpu-v2-8
(workflow) root@t1v-n-345435e8-w-0:~ export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$(python3-config --prefix)/lib"
(workflow) root@t1v-n-345435e8-w-0:~ pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
(workflow) root@t1v-n-345435e8-w-0:~ git clone --recursive https://github.com/pytorch/xla.git
(workflow) root@t1v-n-345435e8-w-0:~ python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

Test Using Task