
compute: Surface prometheus metrics through introspection #35256

Open

antiguru wants to merge 17 commits into MaterializeInc:main from antiguru:prom_metrics

Conversation

@antiguru (Member) commented on Feb 27, 2026

Add `mz_compute_prometheus_metrics_per_worker` to the `mz_introspection` schema, exposing per-replica Prometheus metrics as queryable SQL data.

A custom timely source operator periodically calls `registry.gather()`, diffs against the previous snapshot, and emits changes into an arranged trace.
The scrape interval is controlled by the `compute_prometheus_scrape_interval` dyncfg (default 30s).
Histograms are flattened into bucket/sum/count rows and summaries into quantile/sum/count rows, matching the Prometheus text exposition format.
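The snapshot-diffing step can be sketched as a standalone model. The key and value types below are simplified stand-ins invented for illustration (not the PR's actual snapshot types): unchanged series emit nothing, while changed or vanished series produce retraction/insertion pairs suitable for a differential trace.

```rust
use std::collections::BTreeMap;

/// (metric_name, sorted label pairs) -- a simplified stand-in key.
type Key = (String, Vec<(String, String)>);

/// Diff two metric snapshots into differential updates: retract series
/// that changed or disappeared, insert new or changed values, and emit
/// nothing for series whose value is unchanged.
fn diff_snapshots(
    old: &BTreeMap<Key, f64>,
    new: &BTreeMap<Key, f64>,
) -> Vec<(Key, f64, i64)> {
    let mut updates = Vec::new();
    for (key, &value) in new {
        match old.get(key) {
            Some(&prev) if prev == value => {} // unchanged: no update
            Some(&prev) => {
                updates.push((key.clone(), prev, -1)); // retract old value
                updates.push((key.clone(), value, 1)); // insert new value
            }
            None => updates.push((key.clone(), value, 1)),
        }
    }
    for (key, &prev) in old {
        if !new.contains_key(key) {
            updates.push((key.clone(), prev, -1)); // series disappeared
        }
    }
    updates
}
```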

Schema: `metric_name text`, `metric_type text`, `labels map[text=>text]`, `value float8`, `help text`, `worker_id uint8`.

A cluster mzcompose test (`workflow_test_prometheus_metrics`) validates that the source populates with counter and gauge metrics on a single-worker replica, with the scrape interval lowered to 1s.

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: `compute:`, `storage:`, `adapter:`, `sql:`

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@antiguru antiguru marked this pull request as ready for review February 27, 2026 21:25
@antiguru antiguru requested review from a team as code owners February 27, 2026 21:25
@antiguru antiguru requested a review from SangJunBak February 27, 2026 21:25
@antiguru antiguru requested review from a team as code owners March 1, 2026 20:02
@antiguru antiguru requested a review from teskje March 2, 2026 10:00
@antiguru antiguru force-pushed the prom_metrics branch 2 times, most recently from 415f6f2 to aee8039 on March 3, 2026 09:06
@teskje (Contributor) left a comment
Only looked at the docs so far, I'll probably wait until the per-process (instead of per-worker) change is in.

I'm also wondering if we should feature-flag the introspection source? It seems like it has the potential to be quite expensive, with all the strings, especially if we include the help text in every update.

@antiguru antiguru requested a review from teskje March 3, 2026 14:40
@teskje (Contributor) left a comment
Looks good! I have mostly nits, but we should not call the introspection source `_per_worker`.

```rust
    worker_config: Rc<ConfigSet>,
    workers_per_process: usize,
) -> Return {
    let variant = LogVariant::Compute(ComputeLog::PrometheusMetrics);
```
Contributor comment:
I've always assumed that the LogVariant variants match the logging dataflows that produce them, so I would have expected this to be LogVariant::Prometheus. But then I realized LogVariant::Timely(Reachability) also violates my imagined rule, so I guess the rule doesn't exist.

So nothing to do for this PR, but it seems like we could consider removing the {Timely, Differential, Compute} split in LogVariant since it doesn't seem to serve any purpose.

antiguru and others added 7 commits March 11, 2026 20:24
Add a new `mz_compute_prometheus_metrics_per_worker` introspection table
that exposes per-replica Prometheus metrics as queryable SQL data. A custom
timely source operator periodically calls `registry.gather()`, diffs
against the previous snapshot, and emits changes into an arranged trace.

The scrape interval is controlled by the `compute_prometheus_scrape_interval`
dyncfg (default 30s). Histograms are flattened into bucket/sum/count rows
and summaries into quantile/sum/count rows, matching the Prometheus text
exposition format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
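The histogram flattening described above can be illustrated with a standalone sketch (the function name and row tuple are hypothetical, not the PR's actual `flatten_metrics` signature): each bucket becomes a `_bucket` row carrying its upper bound in an `le` label, followed by `_sum` and `_count` rows, mirroring the text exposition format.

```rust
/// Flatten one histogram into (metric_name, labels, value) rows:
/// one `_bucket` row per bucket with its upper bound as the `le` label,
/// plus `_sum` and `_count` rows. Hypothetical standalone sketch; the
/// real code avoids bare `as f64` casts in favor of lossy-cast helpers.
fn flatten_histogram(
    name: &str,
    buckets: &[(f64, u64)], // (upper_bound, cumulative_count)
    sample_sum: f64,
    sample_count: u64,
) -> Vec<(String, Vec<(String, String)>, f64)> {
    let mut rows = Vec::new();
    for &(le, count) in buckets {
        rows.push((
            format!("{name}_bucket"),
            vec![("le".to_string(), le.to_string())],
            count as f64,
        ));
    }
    rows.push((format!("{name}_sum"), vec![], sample_sum));
    rows.push((format!("{name}_count"), vec![], sample_count as f64));
    rows
}
```

Summaries would flatten analogously, with one row per quantile instead of per bucket.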
Switch from CapacityContainerBuilder<Vec<_>> to ColumnBuilder for the
source operator output, matching the columnar pattern used elsewhere.
Pass references instead of cloning snapshot data into sessions. Use
data.drain() in the unary operator to take ownership of columnar data.

Fix the scrape timing to use a next_scrape instant gated by the dyncfg
interval, and reschedule based on the logging interval rather than the
scrape interval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
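The `next_scrape` gating can be modeled roughly as follows. This is a simplified sketch with invented names; the real operator works with Timely capabilities and logical timestamps rather than raw instants. The point is that scraping is gated by a coarse next-scrape deadline, while the operator reschedules itself at the finer logging interval.

```rust
use std::time::{Duration, Instant};

/// Simplified model of the scrape scheduling in the source operator.
struct ScrapeSchedule {
    next_scrape: Instant,
    scrape_interval: Duration,  // from the dyncfg
    logging_interval: Duration, // the operator's wakeup cadence
}

impl ScrapeSchedule {
    /// Returns true if a scrape is due at `now`, advancing the
    /// deadline by the scrape interval when it fires.
    fn should_scrape(&mut self, now: Instant) -> bool {
        if now >= self.next_scrape {
            self.next_scrape = now + self.scrape_interval;
            true
        } else {
            false
        }
    }

    /// Reschedule based on the logging interval, not the (usually
    /// coarser) scrape interval.
    fn next_wakeup(&self, now: Instant) -> Instant {
        now + self.logging_interval
    }
}
```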
Add `workflow_test_prometheus_metrics` to the cluster mzcompose tests.
The test creates an unorchestrated single-worker replica, sets the
scrape interval to 1s, and verifies that
`mz_compute_prometheus_metrics_per_worker` contains rows with both
counter and gauge metric types and the expected worker_id.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only scrape prometheus metrics on one worker per process by dropping the
capability for disabled workers so the frontier can advance. Fix the
inverted scrape timing condition. Replace `as f64` casts with
`f64::cast_lossy` to satisfy clippy. Update SLT, testdrive, and
datadriven test expectations for the new introspection source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add catalog documentation for the new introspection source and register
compute_prometheus_scrape_interval as uninteresting in parallel-workload
and mzcompose test flag lists. Update expected index count in
catalog.td.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the separate scrape interval configuration flag and scrape
metrics every time the logging operator wakes (controlled by the
logging interval). Guard against spurious wakeups using the logging
interval.

Add RELATION_SPEC_NO_COMMENTS marker to lint-docs-catalog.py for
relations whose catalog does not store column comments, such as
BuiltinLog items.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Moritz Hoffmann <mh@materialize.com>
antiguru and others added 10 commits March 11, 2026 20:24
Rename `worker_id` column to `process_id` and compute it as
`scope.index() / scope.peers()` to correctly reflect that metrics
are per-process, not per-worker.

Add `ENABLE_COMPUTE_PROMETHEUS_METRICS` dyncfg that can disable
scraping at runtime. The flag is checked on each scrape cycle
inside the operator closure.

Refactor `RELATION_SPEC_NO_COMMENTS` into a `NO_COMMENTS` modifier
on `RELATION_SPEC` in lint-docs-catalog.py.

Update documentation: add summary flattening note, remove `untyped`
from metric_type description, update column name and description.

Expand the cluster test to use a multi-process cluster (scale=2,
workers=2) and verify both process IDs appear.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define key `(metric_name, labels, process_id)` for the
PrometheusMetrics relation, matching the unique identity of a
metric time series per process.

Remove the unused `include_comments = True` initialization in
lint-docs-catalog.py, since `include_comments` is always assigned
from the modifier check before use.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `enable_compute_prometheus_metrics` to uninteresting system
parameters in both mzcompose and parallel-workload to satisfy the
lint-test-flags check.

Note in the documentation that metrics with the legacy Prometheus
type `untyped` may appear.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous code used `scope.index() / scope.peers()` to compute the
process ID, but `scope.peers()` returns the total number of workers
across all processes, not per-process. This caused process_id to always
be 0 and only one worker globally to scrape metrics.

Thread `workers_per_process` from `TimelyConfig::workers` through the
call chain to `prometheus::construct()`, and use it to correctly compute
`process_id = scope.index() / workers_per_process` and
`enable = scope.index() % workers_per_process == 0`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
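The corrected arithmetic amounts to the following (the helper name is invented for illustration): with 2 workers per process, global workers 0 and 1 belong to process 0, workers 2 and 3 to process 1, and only the first worker of each process scrapes.

```rust
/// Map a global Timely worker index to its process ID and decide
/// whether it is the designated scraper for that process.
/// `workers_per_process` corresponds to `TimelyConfig::workers`.
fn scrape_role(worker_index: usize, workers_per_process: usize) -> (usize, bool) {
    let process_id = worker_index / workers_per_process;
    let enabled = worker_index % workers_per_process == 0;
    (process_id, enabled)
}
```

Note that dividing by the total worker count (`scope.peers()`) instead would collapse every worker into process 0, which is exactly the bug this commit fixes.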
Merge the "PrometheusMetrics" source operator and the "ToRow
PrometheusMetrics" unary operator into a single source operator that
packs Row pairs directly. This eliminates the intermediate
`(SnapshotKey, SnapshotValue)` tuple representation and the label
serialization round-trip (serialize to `k=v,k=v` string, then parse
back), following the pattern already used by `compute.rs`.

Additional optimizations:
* Store label pairs as `Vec<(String, String)>` in the snapshot key
  instead of a serialized string, removing `parse_labels` entirely.
* Take `name` and `labels` by value in `insert_row` since owned values
  are needed for the snapshot map.
* Use `&'static str` for `metric_type` in `SnapshotValue` since
  type strings are always literals.
* Take `MetricFamily` vec by value in `flatten_metrics` to avoid
  borrowing the `gather()` result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Rename relation from `mz_compute_prometheus_metrics_per_worker` to
  `mz_compute_prometheus_metrics` since it is per-process.
* Move `process_id` to the first column for consistency with other
  logging relations that put identifier columns first.
* Rename dyncfg to `enable_compute_prometheus_metrics_introspection`
  to avoid ambiguity.
* Downgrade capability to the next-scrape timestamp instead of the
  current time, avoiding frontier churn from premature wakeups.
* Retract existing data when the dyncfg is disabled instead of
  leaving stale rows, matching the documented behavior.
* Soft-panic on unexpected untyped metrics instead of silently
  ingesting them.
* Simplify `labels.sort_by(...)` to `labels.sort()`.
* Use a managed cluster in the mzcompose test to reduce boilerplate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change `enable_compute_prometheus_metrics_introspection` (bool) to
`compute_prometheus_scrape_interval` (Duration, default 1s).
The operator uses `max(prom_interval, logging_interval)` as the
effective scrape interval. Setting the interval to 0s disables
scraping and retracts existing data.

Add a test that verifies setting the interval to 0s empties the
relation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
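A sketch of that interval policy (the function name is invented), assuming the semantics described above: the effective interval is never finer than the logging interval, and a zero interval disables scraping entirely.

```rust
use std::time::Duration;

/// Effective scrape interval for the introspection source.
/// Returns None when scraping is disabled (interval set to 0s),
/// in which case existing data is retracted.
fn effective_scrape_interval(prom: Duration, logging: Duration) -> Option<Duration> {
    if prom.is_zero() {
        None // disabled: retract existing rows
    } else {
        Some(prom.max(logging)) // never scrape finer than logging
    }
}
```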
Update parallel_workload and mzcompose to use the renamed
`compute_prometheus_scrape_interval` flag, and rewrite the
introspection relations SLT to reflect the merged operator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include "introspection" in the flag name to clarify it controls the
Prometheus introspection source. Also remove the premature-wakeup
early return from the operator, since it only self-activates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>