
compute: Surface prometheus metrics through introspection #35256

Open

antiguru wants to merge 17 commits into MaterializeInc:main from antiguru:prom_metrics

Conversation

@antiguru (Member) commented on Feb 27, 2026

Add `mz_compute_prometheus_metrics_per_worker` to the `mz_introspection` schema, exposing per-replica Prometheus metrics as queryable SQL data.

A custom timely source operator periodically calls `registry.gather()`, diffs against the previous snapshot, and emits changes into an arranged trace.
The scrape interval is controlled by the `compute_prometheus_scrape_interval` dyncfg (default 30s).
Histograms are flattened into bucket/sum/count rows and summaries into quantile/sum/count rows, matching the Prometheus text exposition format.
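The snapshot-diffing step can be sketched as a standalone model. The key and value types below are simplified stand-ins invented for illustration (not the PR's actual snapshot types): unchanged series emit nothing, while changed or vanished series produce retraction/insertion pairs suitable for a differential trace.

```rust
use std::collections::BTreeMap;

/// (metric_name, sorted label pairs) -- a simplified stand-in key.
type Key = (String, Vec<(String, String)>);

/// Diff two metric snapshots into differential updates: retract series
/// that changed or disappeared, insert new or changed values, and emit
/// nothing for series whose value is unchanged.
fn diff_snapshots(
    old: &BTreeMap<Key, f64>,
    new: &BTreeMap<Key, f64>,
) -> Vec<(Key, f64, i64)> {
    let mut updates = Vec::new();
    for (key, &value) in new {
        match old.get(key) {
            Some(&prev) if prev == value => {} // unchanged: no update
            Some(&prev) => {
                updates.push((key.clone(), prev, -1)); // retract old value
                updates.push((key.clone(), value, 1)); // insert new value
            }
            None => updates.push((key.clone(), value, 1)),
        }
    }
    for (key, &prev) in old {
        if !new.contains_key(key) {
            updates.push((key.clone(), prev, -1)); // series disappeared
        }
    }
    updates
}
```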

Schema: `metric_name text`, `metric_type text`, `labels map[text=>text]`, `value float8`, `help text`, `worker_id uint8`.

A cluster mzcompose test (`workflow_test_prometheus_metrics`) validates that the source populates with counter and gauge metrics on a single-worker replica, with the scrape interval lowered to 1s.

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: `compute:`, `storage:`, `adapter:`, `sql:`

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@antiguru antiguru marked this pull request as ready for review February 27, 2026 21:25
@antiguru antiguru requested review from a team as code owners February 27, 2026 21:25
@antiguru antiguru requested a review from SangJunBak February 27, 2026 21:25
@antiguru antiguru requested review from a team as code owners March 1, 2026 20:02
@antiguru antiguru requested a review from teskje March 2, 2026 10:00
@antiguru antiguru force-pushed the prom_metrics branch 2 times, most recently from 415f6f2 to aee8039 on March 3, 2026 09:06
@teskje (Contributor) left a comment
Only looked at the docs so far, I'll probably wait until the per-process (instead of per-worker) change is in.

I'm also wondering if we should feature-flag the introspection source? It seems like it has the potential to be quite expensive, with all the strings, especially if we include the help text in every update.

@antiguru antiguru requested a review from teskje March 3, 2026 14:40
@teskje (Contributor) left a comment
Looks good! I have mostly nits, but we should not call the introspection source `_per_worker`.

```rust
    worker_config: Rc<ConfigSet>,
    workers_per_process: usize,
) -> Return {
    let variant = LogVariant::Compute(ComputeLog::PrometheusMetrics);
```
Contributor comment:
I've always assumed that the LogVariant variants match the logging dataflows that produce them, so I would have expected this to be LogVariant::Prometheus. But then I realized LogVariant::Timely(Reachability) also violates my imagined rule, so I guess the rule doesn't exist.

So nothing to do for this PR, but it seems like we could consider removing the {Timely, Differential, Compute} split in LogVariant since it doesn't seem to serve any purpose.

antiguru and others added 7 commits March 11, 2026 20:24
Add a new `mz_compute_prometheus_metrics_per_worker` introspection table
that exposes per-replica Prometheus metrics as queryable SQL data. A custom
timely source operator periodically calls `registry.gather()`, diffs
against the previous snapshot, and emits changes into an arranged trace.

The scrape interval is controlled by the `compute_prometheus_scrape_interval`
dyncfg (default 30s). Histograms are flattened into bucket/sum/count rows
and summaries into quantile/sum/count rows, matching the Prometheus text
exposition format.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
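The histogram flattening described above can be illustrated with a standalone sketch (the function name and row tuple are hypothetical, not the PR's actual `flatten_metrics` signature): each bucket becomes a `_bucket` row carrying its upper bound in an `le` label, followed by `_sum` and `_count` rows, mirroring the text exposition format.

```rust
/// Flatten one histogram into (metric_name, labels, value) rows:
/// one `_bucket` row per bucket with its upper bound as the `le` label,
/// plus `_sum` and `_count` rows. Hypothetical standalone sketch; the
/// real code avoids bare `as f64` casts in favor of lossy-cast helpers.
fn flatten_histogram(
    name: &str,
    buckets: &[(f64, u64)], // (upper_bound, cumulative_count)
    sample_sum: f64,
    sample_count: u64,
) -> Vec<(String, Vec<(String, String)>, f64)> {
    let mut rows = Vec::new();
    for &(le, count) in buckets {
        rows.push((
            format!("{name}_bucket"),
            vec![("le".to_string(), le.to_string())],
            count as f64,
        ));
    }
    rows.push((format!("{name}_sum"), vec![], sample_sum));
    rows.push((format!("{name}_count"), vec![], sample_count as f64));
    rows
}
```

Summaries would flatten analogously, with one row per quantile instead of per bucket.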
Switch from CapacityContainerBuilder<Vec<_>> to ColumnBuilder for the
source operator output, matching the columnar pattern used elsewhere.
Pass references instead of cloning snapshot data into sessions. Use
data.drain() in the unary operator to take ownership of columnar data.

Fix the scrape timing to use a next_scrape instant gated by the dyncfg
interval, and reschedule based on the logging interval rather than the
scrape interval.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
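The `next_scrape` gating can be modeled roughly as follows. This is a simplified sketch with invented names; the real operator works with Timely capabilities and logical timestamps rather than raw instants. The point is that scraping is gated by a coarse next-scrape deadline, while the operator reschedules itself at the finer logging interval.

```rust
use std::time::{Duration, Instant};

/// Simplified model of the scrape scheduling in the source operator.
struct ScrapeSchedule {
    next_scrape: Instant,
    scrape_interval: Duration,  // from the dyncfg
    logging_interval: Duration, // the operator's wakeup cadence
}

impl ScrapeSchedule {
    /// Returns true if a scrape is due at `now`, advancing the
    /// deadline by the scrape interval when it fires.
    fn should_scrape(&mut self, now: Instant) -> bool {
        if now >= self.next_scrape {
            self.next_scrape = now + self.scrape_interval;
            true
        } else {
            false
        }
    }

    /// Reschedule based on the logging interval, not the (usually
    /// coarser) scrape interval.
    fn next_wakeup(&self, now: Instant) -> Instant {
        now + self.logging_interval
    }
}
```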
Add `workflow_test_prometheus_metrics` to the cluster mzcompose tests.
The test creates an unorchestrated single-worker replica, sets the
scrape interval to 1s, and verifies that
`mz_compute_prometheus_metrics_per_worker` contains rows with both
counter and gauge metric types and the expected worker_id.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only scrape prometheus metrics on one worker per process by dropping the
capability for disabled workers so the frontier can advance. Fix the
inverted scrape timing condition. Replace `as f64` casts with
`f64::cast_lossy` to satisfy clippy. Update SLT, testdrive, and
datadriven test expectations for the new introspection source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add catalog documentation for the new introspection source and register
compute_prometheus_scrape_interval as uninteresting in parallel-workload
and mzcompose test flag lists. Update expected index count in
catalog.td.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the separate scrape interval configuration flag and scrape
metrics every time the logging operator wakes (controlled by the
logging interval). Guard against spurious wakeups using the logging
interval.

Add RELATION_SPEC_NO_COMMENTS marker to lint-docs-catalog.py for
relations whose catalog does not store column comments, such as
BuiltinLog items.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Moritz Hoffmann <mh@materialize.com>
antiguru and others added 10 commits March 11, 2026 20:24
Rename `worker_id` column to `process_id` and compute it as
`scope.index() / scope.peers()` to correctly reflect that metrics
are per-process, not per-worker.

Add `ENABLE_COMPUTE_PROMETHEUS_METRICS` dyncfg that can disable
scraping at runtime. The flag is checked on each scrape cycle
inside the operator closure.

Refactor `RELATION_SPEC_NO_COMMENTS` into a `NO_COMMENTS` modifier
on `RELATION_SPEC` in lint-docs-catalog.py.

Update documentation: add summary flattening note, remove `untyped`
from metric_type description, update column name and description.

Expand the cluster test to use a multi-process cluster (scale=2,
workers=2) and verify both process IDs appear.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define key `(metric_name, labels, process_id)` for the
PrometheusMetrics relation, matching the unique identity of a
metric time series per process.

Remove the unused `include_comments = True` initialization in
lint-docs-catalog.py, since `include_comments` is always assigned
from the modifier check before use.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `enable_compute_prometheus_metrics` to uninteresting system
parameters in both mzcompose and parallel-workload to satisfy the
lint-test-flags check.

Note in the documentation that metrics with the legacy Prometheus
type `untyped` may appear.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous code used `scope.index() / scope.peers()` to compute the
process ID, but `scope.peers()` returns the total number of workers
across all processes, not per-process. This caused process_id to always
be 0 and only one worker globally to scrape metrics.

Thread `workers_per_process` from `TimelyConfig::workers` through the
call chain to `prometheus::construct()`, and use it to correctly compute
`process_id = scope.index() / workers_per_process` and
`enable = scope.index() % workers_per_process == 0`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
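The corrected arithmetic amounts to the following (the helper name is invented for illustration): with 2 workers per process, global workers 0 and 1 belong to process 0, workers 2 and 3 to process 1, and only the first worker of each process scrapes.

```rust
/// Map a global Timely worker index to its process ID and decide
/// whether it is the designated scraper for that process.
/// `workers_per_process` corresponds to `TimelyConfig::workers`.
fn scrape_role(worker_index: usize, workers_per_process: usize) -> (usize, bool) {
    let process_id = worker_index / workers_per_process;
    let enabled = worker_index % workers_per_process == 0;
    (process_id, enabled)
}
```

Note that dividing by the total worker count (`scope.peers()`) instead would collapse every worker into process 0, which is exactly the bug this commit fixes.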
Merge the "PrometheusMetrics" source operator and the "ToRow
PrometheusMetrics" unary operator into a single source operator that
packs Row pairs directly. This eliminates the intermediate
`(SnapshotKey, SnapshotValue)` tuple representation and the label
serialization round-trip (serialize to `k=v,k=v` string, then parse
back), following the pattern already used by `compute.rs`.

Additional optimizations:
* Store label pairs as `Vec<(String, String)>` in the snapshot key
  instead of a serialized string, removing `parse_labels` entirely.
* Take `name` and `labels` by value in `insert_row` since owned values
  are needed for the snapshot map.
* Use `&'static str` for `metric_type` in `SnapshotValue` since
  type strings are always literals.
* Take `MetricFamily` vec by value in `flatten_metrics` to avoid
  borrowing the `gather()` result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Rename relation from `mz_compute_prometheus_metrics_per_worker` to
  `mz_compute_prometheus_metrics` since it is per-process.
* Move `process_id` to the first column for consistency with other
  logging relations that put identifier columns first.
* Rename dyncfg to `enable_compute_prometheus_metrics_introspection`
  to avoid ambiguity.
* Downgrade capability to the next-scrape timestamp instead of the
  current time, avoiding frontier churn from premature wakeups.
* Retract existing data when the dyncfg is disabled instead of
  leaving stale rows, matching the documented behavior.
* Soft-panic on unexpected untyped metrics instead of silently
  ingesting them.
* Simplify `labels.sort_by(...)` to `labels.sort()`.
* Use a managed cluster in the mzcompose test to reduce boilerplate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change `enable_compute_prometheus_metrics_introspection` (bool) to
`compute_prometheus_scrape_interval` (Duration, default 1s).
The operator uses `max(prom_interval, logging_interval)` as the
effective scrape interval. Setting the interval to 0s disables
scraping and retracts existing data.

Add a test that verifies setting the interval to 0s empties the
relation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
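A sketch of that interval policy (the function name is invented), assuming the semantics described above: the effective interval is never finer than the logging interval, and a zero interval disables scraping entirely.

```rust
use std::time::Duration;

/// Effective scrape interval for the introspection source.
/// Returns None when scraping is disabled (interval set to 0s),
/// in which case existing data is retracted.
fn effective_scrape_interval(prom: Duration, logging: Duration) -> Option<Duration> {
    if prom.is_zero() {
        None // disabled: retract existing rows
    } else {
        Some(prom.max(logging)) // never scrape finer than logging
    }
}
```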
Update parallel_workload and mzcompose to use the renamed
`compute_prometheus_scrape_interval` flag, and rewrite the
introspection relations SLT to reflect the merged operator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include "introspection" in the flag name to clarify it controls the
Prometheus introspection source. Also remove the premature-wakeup
early return from the operator, since it only self-activates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>