compute: Surface prometheus metrics through introspection#35256
antiguru wants to merge 17 commits into MaterializeInc:main from
Conversation
415f6f2 to aee8039
teskje left a comment
Only looked at the docs so far, I'll probably wait until the per-process (instead of per-worker) change is in.
I'm also wondering if we should feature flag the introspection source? Seems like it has the potential to be quite expensive, with all the strings. Especially if we include the help text in every update.
teskje left a comment
Looks good! I have mostly nits, but we should not call the introspection source _per_worker.
    worker_config: Rc<ConfigSet>,
    workers_per_process: usize,
) -> Return {
    let variant = LogVariant::Compute(ComputeLog::PrometheusMetrics);
I've always assumed that the LogVariant variants match the logging dataflows that produce them, so I would have expected this to be LogVariant::Prometheus. But then I realized LogVariant::Timely(Reachability) also violates my imagined rule, so I guess the rule doesn't exist.
So nothing to do for this PR, but it seems like we could consider removing the {Timely, Differential, Compute} split in LogVariant since it doesn't seem to serve any purpose.
Add a new `mz_compute_prometheus_metrics_per_worker` introspection table that exposes per-replica Prometheus metrics as queryable SQL data. A custom timely source operator periodically calls `registry.gather()`, diffs against the previous snapshot, and emits changes into an arranged trace. The scrape interval is controlled by the `compute_prometheus_scrape_interval` dyncfg (default 30s). Histograms are flattened into bucket/sum/count rows and summaries into quantile/sum/count rows, matching the Prometheus text exposition format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
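The diff-against-the-previous-snapshot step can be sketched as follows. This is a simplified illustration with hypothetical types (the real operator keys on metric name, labels, and process, and emits into a columnar session rather than a `Vec`):

```rust
use std::collections::BTreeMap;

/// Sketch: diff two metric snapshots into (key, value, diff) updates,
/// the shape a differential logging trace expects. Changed values are
/// retracted at the old value and inserted at the new one.
fn diff_snapshots(
    old: &BTreeMap<String, f64>,
    new: &BTreeMap<String, f64>,
) -> Vec<(String, f64, i64)> {
    let mut updates = Vec::new();
    // Retract entries that disappeared or changed.
    for (k, v) in old {
        if new.get(k) != Some(v) {
            updates.push((k.clone(), *v, -1));
        }
    }
    // Insert entries that are new or changed.
    for (k, v) in new {
        if old.get(k) != Some(v) {
            updates.push((k.clone(), *v, 1));
        }
    }
    updates
}
```

Unchanged series produce no updates, so a steady-state gauge costs nothing per scrape.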
Switch from CapacityContainerBuilder<Vec<_>> to ColumnBuilder for the source operator output, matching the columnar pattern used elsewhere. Pass references instead of cloning snapshot data into sessions. Use data.drain() in the unary operator to take ownership of columnar data. Fix the scrape timing to use a next_scrape instant gated by the dyncfg interval, and reschedule based on the logging interval rather than the scrape interval. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `workflow_test_prometheus_metrics` to the cluster mzcompose tests. The test creates an unorchestrated single-worker replica, sets the scrape interval to 1s, and verifies that `mz_compute_prometheus_metrics_per_worker` contains rows with both counter and gauge metric types and the expected worker_id. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only scrape prometheus metrics on one worker per process by dropping the capability for disabled workers so the frontier can advance. Fix the inverted scrape timing condition. Replace `as f64` casts with `f64::cast_lossy` to satisfy clippy. Update SLT, testdrive, and datadriven test expectations for the new introspection source. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add catalog documentation for the new introspection source and register compute_prometheus_scrape_interval as uninteresting in parallel-workload and mzcompose test flag lists. Update expected index count in catalog.td. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the separate scrape interval configuration flag and scrape metrics every time the logging operator wakes (controlled by the logging interval). Guard against spurious wakeups using the logging interval. Add RELATION_SPEC_NO_COMMENTS marker to lint-docs-catalog.py for relations whose catalog does not store column comments, such as BuiltinLog items. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Moritz Hoffmann <mh@materialize.com>
Rename `worker_id` column to `process_id` and compute it as `scope.index() / scope.peers()` to correctly reflect that metrics are per-process, not per-worker. Add `ENABLE_COMPUTE_PROMETHEUS_METRICS` dyncfg that can disable scraping at runtime. The flag is checked on each scrape cycle inside the operator closure. Refactor `RELATION_SPEC_NO_COMMENTS` into a `NO_COMMENTS` modifier on `RELATION_SPEC` in lint-docs-catalog.py. Update documentation: add summary flattening note, remove `untyped` from metric_type description, update column name and description. Expand the cluster test to use a multi-process cluster (scale=2, workers=2) and verify both process IDs appear. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Define key `(metric_name, labels, process_id)` for the PrometheusMetrics relation, matching the unique identity of a metric time series per process. Remove the unused `include_comments = True` initialization in lint-docs-catalog.py, since `include_comments` is always assigned from the modifier check before use. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `enable_compute_prometheus_metrics` to uninteresting system parameters in both mzcompose and parallel-workload to satisfy the lint-test-flags check. Note in the documentation that metrics with the legacy Prometheus type `untyped` may appear. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous code used `scope.index() / scope.peers()` to compute the process ID, but `scope.peers()` returns the total number of workers across all processes, not per-process. This caused process_id to always be 0 and only one worker globally to scrape metrics. Thread `workers_per_process` from `TimelyConfig::workers` through the call chain to `prometheus::construct()`, and use it to correctly compute `process_id = scope.index() / workers_per_process` and `enable = scope.index() % workers_per_process == 0`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
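The corrected index arithmetic described above can be sketched as two free functions (hypothetical names; in the PR this logic lives inside the operator, fed by the threaded-through `workers_per_process`):

```rust
/// Sketch: with `workers_per_process` available, a global worker index maps
/// to its process ID, and exactly one worker per process scrapes.
fn process_id(worker_index: usize, workers_per_process: usize) -> usize {
    worker_index / workers_per_process
}

/// The first worker of each process (index divisible by workers_per_process)
/// is the designated scraper; the others drop their capability.
fn is_scraper(worker_index: usize, workers_per_process: usize) -> bool {
    worker_index % workers_per_process == 0
}
```

With `scope.peers()` (total workers across all processes) in the denominator instead, every index divides to 0, which is exactly the bug this commit fixes.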
Merge the "PrometheusMetrics" source operator and the "ToRow PrometheusMetrics" unary operator into a single source operator that packs Row pairs directly. This eliminates the intermediate `(SnapshotKey, SnapshotValue)` tuple representation and the label serialization round-trip (serialize to `k=v,k=v` string, then parse back), following the pattern already used by `compute.rs`. Additional optimizations:

* Store label pairs as `Vec<(String, String)>` in the snapshot key instead of a serialized string, removing `parse_labels` entirely.
* Take `name` and `labels` by value in `insert_row` since owned values are needed for the snapshot map.
* Use `&'static str` for `metric_type` in `SnapshotValue` since type strings are always literals.
* Take the `MetricFamily` vec by value in `flatten_metrics` to avoid borrowing the `gather()` result.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
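The sorted-label-pairs idea can be sketched as follows (simplified, hypothetical `snapshot_key` helper; the real key also carries the process ID):

```rust
/// Sketch: sorted label pairs act directly as a map key, avoiding the
/// serialize-to-"k=v,k=v"/parse round-trip the merged operator removes.
type Labels = Vec<(String, String)>;

fn snapshot_key(name: &str, mut labels: Labels) -> (String, Labels) {
    labels.sort(); // canonical order, independent of gather() ordering
    (name.to_string(), labels)
}
```

Because `Vec<(String, String)>` derives `Ord`, the pair is usable as a `BTreeMap` key as-is, and two permutations of the same label set compare equal after sorting.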
* Rename relation from `mz_compute_prometheus_metrics_per_worker` to `mz_compute_prometheus_metrics` since it is per-process.
* Move `process_id` to the first column for consistency with other logging relations that put identifier columns first.
* Rename dyncfg to `enable_compute_prometheus_metrics_introspection` to avoid ambiguity.
* Downgrade capability to the next-scrape timestamp instead of the current time, avoiding frontier churn from premature wakeups.
* Retract existing data when the dyncfg is disabled instead of leaving stale rows, matching the documented behavior.
* Soft-panic on unexpected untyped metrics instead of silently ingesting them.
* Simplify `labels.sort_by(...)` to `labels.sort()`.
* Use a managed cluster in the mzcompose test to reduce boilerplate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change `enable_compute_prometheus_metrics_introspection` (bool) to `compute_prometheus_scrape_interval` (Duration, default 1s). The operator uses `max(prom_interval, logging_interval)` as the effective scrape interval. Setting the interval to 0s disables scraping and retracts existing data. Add a test that verifies setting the interval to 0s empties the relation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
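The interval logic described above can be sketched as (hypothetical helper; in the PR the check happens inside the operator closure on each activation):

```rust
use std::time::Duration;

/// Sketch: the effective scrape interval is the max of the dyncfg interval
/// and the logging interval; a zero dyncfg interval disables scraping.
fn effective_interval(prom: Duration, logging: Duration) -> Option<Duration> {
    if prom.is_zero() {
        None // scraping disabled; existing data is retracted
    } else {
        Some(prom.max(logging))
    }
}
```

Taking the max means the scrape can never outpace the logging interval, so lowering the dyncfg below the logging interval is a no-op rather than extra load.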
Update parallel_workload and mzcompose to use the renamed `compute_prometheus_scrape_interval` flag, and rewrite the introspection relations SLT to reflect the merged operator. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Include "introspection" in the flag name to clarify it controls the Prometheus introspection source. Also remove the premature-wakeup early return from the operator, since it only self-activates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add `mz_compute_prometheus_metrics_per_worker` to the `mz_introspection` schema, exposing per-replica Prometheus metrics as queryable SQL data. A custom timely source operator periodically calls `registry.gather()`, diffs against the previous snapshot, and emits changes into an arranged trace. The scrape interval is controlled by the `compute_prometheus_scrape_interval` dyncfg (default 30s). Histograms are flattened into bucket/sum/count rows and summaries into quantile/sum/count rows, matching the Prometheus text exposition format.

Schema: `metric_name text, metric_type text, labels map[text=>text], value float8, help text, worker_id uint8`.

A cluster mzcompose test (`workflow_test_prometheus_metrics`) validates that the source populates with counter and gauge metrics on a single-worker replica with the scrape interval lowered to 1s.
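The histogram flattening can be sketched like this. The struct names and the row representation are hypothetical simplifications (the real code works on `prometheus::proto::MetricFamily` and packs `Row`s), but the bucket/sum/count shape mirrors the text exposition format:

```rust
/// Hypothetical simplified shapes for one histogram metric.
struct Bucket {
    upper_bound: f64,
    cumulative_count: u64,
}
struct Histogram {
    name: String,
    buckets: Vec<Bucket>,
    sum: f64,
    count: u64,
}

/// Sketch: one histogram becomes one row per bucket plus _sum and _count
/// rows, matching the Prometheus text exposition format.
fn flatten_histogram(h: &Histogram) -> Vec<(String, f64)> {
    let mut rows = Vec::new();
    for b in &h.buckets {
        rows.push((
            format!("{}_bucket{{le=\"{}\"}}", h.name, b.upper_bound),
            b.cumulative_count as f64,
        ));
    }
    rows.push((format!("{}_sum", h.name), h.sum));
    rows.push((format!("{}_count", h.name), h.count as f64));
    rows
}
```

Summaries flatten the same way, with quantile rows in place of bucket rows.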