Add per-job hourly log quota enforced on runner #3668

Merged
peterschmidt85 merged 5 commits into master from log-quota-per-job-hour on Mar 17, 2026
@peterschmidt85
Contributor

Summary

  • Adds a per-job hourly log quota (default 50MB) enforced on the runner side, preventing runaway log costs (e.g., the $194+/day CloudWatch incident from excessive training job logs)
  • Quota is configurable via DSTACK_SERVER_LOG_QUOTA_PER_JOB_HOUR env var (bytes, 0 disables)
  • Jobs exceeding the quota are killed with status error and reason log quota exceeded
  • Byte counting happens post-ANSI-stripping, matching what gets stored in CloudWatch
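
The configuration and counting rules above can be sketched in Python as follows. This is a minimal illustration, not the code from this PR: the helper names (`read_log_quota`, `counted_bytes`) and the regular expression are hypothetical, and reading "50MB" as 50 MiB is an assumption. The pattern only covers CSI escape sequences, so the real ANSI stripper may remove more.

```python
import os
import re

# Assumption: the "50MB" default is interpreted as 50 MiB.
DEFAULT_LOG_QUOTA_PER_JOB_HOUR = 50 * 1024 * 1024

# Simplified pattern for ANSI CSI sequences (colors, cursor movement).
# Hypothetical: the runner's actual stripping logic may differ.
ANSI_CSI_RE = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")


def read_log_quota(environ=os.environ):
    """Return the quota in bytes; 0 means the quota is disabled."""
    raw = environ.get("DSTACK_SERVER_LOG_QUOTA_PER_JOB_HOUR")
    if raw is None or raw.strip() == "":
        return DEFAULT_LOG_QUOTA_PER_JOB_HOUR
    value = int(raw)
    if value < 0:
        raise ValueError("log quota must be a non-negative number of bytes")
    return value


def counted_bytes(chunk: bytes) -> int:
    """Count bytes after removing ANSI sequences, i.e. what would be stored."""
    return len(ANSI_CSI_RE.sub(b"", chunk))
```

Counting after stripping means that color codes emitted by a training script do not consume quota, so the count tracks the bytes that actually reach log storage.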

Changes

Go (runner)

  • types.go: Add TerminationReasonLogQuotaExceeded constant
  • schemas.go: Add LogQuotaHour field to JobSpec
  • logs.go: Add quota tracking to appendWriter with out-of-band signaling via channel (needed because ansistrip is async and swallows downstream errors)
  • executor.go: Add copyOutputWithQuota() method, wire quota via SetJob(), handle quota error in Run()
  • executor_test.go: Add TestExecutor_LogQuota
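
The out-of-band signaling idea from logs.go and executor.go can be illustrated with a short Python sketch. It is a rendition of the concept only, not a translation of the Go code: `QuotaWriter` and `copy_output_with_quota` are hypothetical names, and a `threading.Event` stands in for the Go channel. The point is that the writer never raises on overflow (an intermediate layer could swallow the error); it sets a flag that the copy loop checks separately.

```python
import io
import threading


class QuotaWriter:
    """Hypothetical writer that counts bytes and flags quota overflow."""

    def __init__(self, sink, quota_bytes):
        self._sink = sink
        self._quota = quota_bytes  # 0 disables the quota
        self._written = 0
        # Out-of-band signal: set once, read by whoever drives the copy loop.
        self.exceeded = threading.Event()

    def write(self, data: bytes) -> int:
        # The chunk is still stored, so partial logs remain available.
        self._sink.write(data)
        self._written += len(data)
        if self._quota > 0 and self._written > self._quota:
            self.exceeded.set()
        return len(data)


def copy_output_with_quota(source, writer, chunk_size=4096):
    """Copy source to writer; return False if stopped because of the quota."""
    while True:
        if writer.exceeded.is_set():
            return False
        chunk = source.read(chunk_size)
        if not chunk:
            return not writer.exceeded.is_set()
        writer.write(chunk)


if __name__ == "__main__":
    sink = io.BytesIO()
    writer = QuotaWriter(sink, quota_bytes=5000)
    ok = copy_output_with_quota(io.BytesIO(b"x" * 10000), writer)
    print(ok, len(sink.getvalue()))
```

Because the check runs once per chunk, the total can overshoot the quota by up to one chunk before copying stops.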

Python (server)

  • runs.py: Add LOG_QUOTA_EXCEEDED to JobTerminationReason (maps to FAILED), add log_quota_hour to JobSpec
  • settings.py: Add SERVER_LOG_QUOTA_PER_JOB_HOUR setting (default 50MB)
  • runner.py: Add log_quota_hour to SubmitBody include set
  • base.py: Add _log_quota_hour() method, wire into _get_job_spec()

Test plan

  • TestExecutor_LogQuota passes
  • All Go tests pass (go test ./...)
  • All Python tests pass (2222 passed)
  • E2E with remote backend: 1000-byte quota — job terminates immediately with log quota exceeded
  • E2E with remote backend: 50MB quota — job runs ~5 min, terminates at ~52MB with log quota exceeded
  • dstack ps -v shows status error and error log quota exceeded
  • dstack logs shows partial logs captured before termination

🤖 Generated with Claude Code

Prevents runaway log costs by limiting log output per job per calendar hour.
Default quota is 50MB/hour, configurable via DSTACK_SERVER_LOG_QUOTA_PER_JOB_HOUR
(0 disables). Jobs exceeding the quota are terminated with reason log_quota_exceeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@peterschmidt85 peterschmidt85 requested a review from un-def March 16, 2026 11:41
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Andrey Cheptsov and others added 3 commits March 17, 2026 03:38
Avoids susceptibility to wall clock adjustments (e.g., NTP sync)
by tracking elapsed hours from a monotonic start time instead of
calendar-hour boundaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
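
A minimal Python sketch of the elapsed-hour approach described in this commit message, assuming a counter that resets whenever the hour index (measured from a monotonic start time) changes. The class name `HourlyQuota` and the injectable `clock` parameter are hypothetical and exist only to make the idea testable; the runner's Go code may be structured differently.

```python
import time

SECONDS_PER_HOUR = 3600


class HourlyQuota:
    """Hypothetical per-hour byte counter keyed on monotonic elapsed time."""

    def __init__(self, quota_bytes, clock=time.monotonic):
        self._quota = quota_bytes  # 0 disables the quota
        self._clock = clock
        self._start = clock()
        self._hour = 0
        self._used = 0

    def add(self, n: int) -> bool:
        """Record n bytes; return True if the current hour's quota is exceeded."""
        # Hours are counted from the start time, not from calendar boundaries,
        # so wall-clock adjustments cannot shift or repeat a window.
        hour = int((self._clock() - self._start) // SECONDS_PER_HOUR)
        if hour != self._hour:
            self._hour = hour
            self._used = 0
        self._used += n
        return self._quota > 0 and self._used > self._quota
```

With a fake clock, advancing time by 3600 seconds starts a fresh window, while a wall-clock change has no effect because `time.monotonic` is unaffected by system clock updates.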
…lity

JobSpec is sent client→server as part of RunPlan.current_resource, so new
fields break older servers. log_quota_hour is only needed server→runner,
so it belongs on SubmitBody instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without Field(include=True), pydantic's .json() drops fields that lack
an include annotation when other fields in the model use Field(include=...).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@peterschmidt85 peterschmidt85 requested a review from un-def March 17, 2026 11:20
@peterschmidt85 peterschmidt85 merged commit 002ea51 into master Mar 17, 2026
28 checks passed
@peterschmidt85 peterschmidt85 deleted the log-quota-per-job-hour branch March 17, 2026 13:48
3 participants