
tpu initial release #1354

Merged
r4victor merged 5 commits into dstackai:master from Bihan:tpu_initial_release
Jun 26, 2024
Conversation

@Bihan Bihan commented Jun 24, 2024

Fixes

  1. TPU is detected from actual workloads
    The --privileged flag is added to docker run.
  2. The env variable PJRT_DEVICE is set to TPU
    This is necessary; otherwise a warning is issued while running the training script.
  3. Ensure all single-VM TPUs can be used
    v2, v3, v4, v5p, and v5litepod are the TPU versions provided by GCP. Except for v4, every TPU configuration whose name ends in a number <= 8 (e.g. v2-8) is a single-VM TPU; all v4 configurations are TPU Pods.
    Note: to use TPU versions v4, v5p, and v5litepod, quotas must be requested.
  4. dstack-runner automatically sets the env LD_LIBRARY_PATH in the container
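The LD_LIBRARY_PATH handling in fix 4 can be sketched roughly as below, following what step 2 of the manual test does with `$(python3-config --prefix)/lib`. This is an illustrative sketch, not the actual runner code; the helper name `ldLibraryPath` and the fallback prefix are made up here:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// ldLibraryPath returns the value LD_LIBRARY_PATH should be extended to,
// appending <pythonPrefix>/lib to the current value (hypothetical helper,
// not the real dstack-runner implementation).
func ldLibraryPath(current, pythonPrefix string) string {
	libDir := filepath.Join(pythonPrefix, "lib")
	if current == "" {
		return libDir
	}
	return current + ":" + libDir
}

func main() {
	// Query python3-config for the interpreter prefix, as the manual test
	// step does with $(python3-config --prefix); fall back if unavailable.
	prefix := "/usr"
	if out, err := exec.Command("python3-config", "--prefix").Output(); err == nil {
		prefix = strings.TrimSpace(string(out))
	}
	fmt.Println("LD_LIBRARY_PATH=" + ldLibraryPath(os.Getenv("LD_LIBRARY_PATH"), prefix))
}
```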

Before Starting the Test

  1. Since dstack-shim and dstack-runner were modified, ensure that the latest dstack-shim-linux-amd64 and dstack-runner binaries are used.

How to test TPU

  1. % dstack run . -b gcp --gpu tpu-v2-8
  2. After provisioning is complete, set the env LD_LIBRARY_PATH in the container as below:
    (workflow) root@t1v-n-345435e8-w-0:~ export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$(python3-config --prefix)/lib"
  3. (workflow) root@t1v-n-345435e8-w-0:~ pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  4. (workflow) root@t1v-n-345435e8-w-0:~ git clone --recursive https://github.com/pytorch/xla.git
  5. (workflow) root@t1v-n-345435e8-w-0:~ python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

Test Using Task


```yaml
python: "3.11"

commands:
  - pip install torch~=2.3.0 torch_xla[tpu]~=2.3.0 torchvision -f https://storage.googleapis.com/libtpu-releases/index.html
  - git clone --recursive https://github.com/pytorch/xla.git
  - python3 xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1

# (Optional) Configure `gpu`, `memory`, `disk`, etc.
resources:
  gpu: tpu-v2-8
```



```go
Entrypoint:   []string{"/bin/sh", "-c"},
ExposedPorts: exposePorts(dockerParams.DockerPorts()...),
Env: []string{
	"PJRT_DEVICE=TPU",
```
r4victor (Collaborator) commented:
Does it mean we always set PJRT_DEVICE=TPU? So even when not running TPU but CUDA? I think we should set PJRT_DEVICE=TPU when running on TPUs only.

Bihan (Collaborator, Author) commented:

@r4victor Should I set it as an optional flag in docker run, something similar to privileged_flag below?
nohup dstack-shim {dev_flag} docker --keep-container {privileged_flag} {pjrt_device} > {DSTACK_WORKING_DIR}/shim.log 2>&1 &

r4victor (Collaborator) commented:

I'd set PJRT_DEVICE in the runner where we execute the job and set other envs:

```go
jobEnvs := map[string]string{
	"RUN_NAME":              ex.run.RunName, // deprecated, remove in 0.19
	"REPO_ID":               ex.run.RepoId,  // deprecated, remove in 0.19
	"DSTACK_RUN_NAME":       ex.run.RunName,
	"DSTACK_REPO_ID":        ex.run.RepoId,
	"DSTACK_MASTER_NODE_IP": ex.clusterInfo.MasterJobIP,
	"DSTACK_NODE_RANK":      strconv.Itoa(node_rank),
	"DSTACK_NODES_NUM":      strconv.Itoa(nodes_num),
	"DSTACK_GPUS_PER_NODE":  strconv.Itoa(gpus_per_node_num),
	"DSTACK_GPUS_NUM":       strconv.Itoa(gpus_num),
}
```

This is good because we avoid introducing another place where we set envs. This would require passing whether tpu is used or not to the runner API but this should not be hard to do.

Setting PJRT_DEVICE via shim arg would probably work as well. Feel free to go this route if it works.
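The suggested runner-side approach, passing whether a TPU is used to the runner and setting PJRT_DEVICE alongside the other job envs, could look roughly like this. The `isTPU` parameter and the `buildJobEnvs` helper are hypothetical, shown only to illustrate the conditional:

```go
package main

import "fmt"

// buildJobEnvs mirrors the runner's jobEnvs map (abbreviated) and adds
// PJRT_DEVICE only when the job actually runs on a TPU, so CUDA runs are
// left untouched (hypothetical sketch, not the real runner code).
func buildJobEnvs(runName, repoID string, isTPU bool) map[string]string {
	envs := map[string]string{
		"DSTACK_RUN_NAME": runName,
		"DSTACK_REPO_ID":  repoID,
	}
	if isTPU {
		envs["PJRT_DEVICE"] = "TPU"
	}
	return envs
}

func main() {
	fmt.Println(buildJobEnvs("my-run", "my-repo", true)["PJRT_DEVICE"])
}
```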

Bihan (Collaborator, Author) commented:

@r4victor "This is good because we avoid introducing another place where we set envs." I think that is a very valid point. I will set it in the runner's jobEnvs.

```go
Entrypoint:   []string{"/bin/sh", "-c"},
ExposedPorts: exposePorts(dockerParams.DockerPorts()...),
Env: []string{
	fmt.Sprintf("PJRT_DEVICE=%s", dockerParams.DockerPJRTDevice()),
```
r4victor (Collaborator) commented:

If the pjrt-device arg is not set, we still set PJRT_DEVICE="", which may not be the same as not setting PJRT_DEVICE at all. Let's add PJRT_DEVICE to Env only if DockerPJRTDevice() is a non-empty string?
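The fix this review comment asks for, appending PJRT_DEVICE only when a device is configured, might look like the sketch below. The `appendPJRTDevice` helper is made up for illustration and is not the actual shim code:

```go
package main

import "fmt"

// appendPJRTDevice adds PJRT_DEVICE to the container env slice only when a
// device is actually configured, so an unset pjrt-device arg does not
// produce PJRT_DEVICE="" (illustrative helper, not shim code).
func appendPJRTDevice(env []string, device string) []string {
	if device == "" {
		return env
	}
	return append(env, fmt.Sprintf("PJRT_DEVICE=%s", device))
}

func main() {
	env := []string{"DSTACK_RUN_NAME=test"}
	fmt.Println(appendPJRTDevice(env, "TPU"))
	fmt.Println(appendPJRTDevice(env, ""))
}
```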

@r4victor r4victor merged commit e0d8906 into dstackai:master Jun 26, 2024
un-def added a commit that referenced this pull request Feb 19, 2026
This code:

- is conda-specific
- is dead
- overwrites user/image-defined LD_LIBRARY_PATH in rare cases
  when it works

Basically, the function (ab)uses python3-config [1] utility (which,
BTW, is not present in dstack base images since conda -> uv migration)
to calculate the path to conda-installed shared objects and export it
via LD_LIBRARY_PATH (the proper way to add conda-installed libs
would be to use ld.so's configuration files, that is, ld.so.conf.d/*).

Although this code was added in
#1354, it is not related to TPU
support at all.

[1]: https://manpages.debian.org/testing/python3-dev/python3-config.1.en.html
un-def added a commit that referenced this pull request Feb 20, 2026
This code is:

- conda-specific
- dead
