25.8 Antalya backport of #90825: Add role-based access to Glue catalog #1428

Merged
zvonand merged 9 commits into antalya-25.8 from backports/antalya-25.8/90825
Feb 26, 2026

Conversation

@zvonand
Collaborator

@zvonand zvonand commented Feb 18, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add role-based access to Glue catalog. Use settings aws_role_arn and, optionally, aws_role_session_name. (ClickHouse#90825 by @antonio2368)

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Feb 18, 2026

Workflow [PR], commit [6a98e01]

zvonand force-pushed the backports/antalya-25.8/90825 branch from cef9232 to fba0893 on February 20, 2026 at 09:54
zvonand force-pushed the backports/antalya-25.8/90825 branch from fba0893 to c96e9ea on February 20, 2026 at 10:05
@Selfeer
Collaborator

Selfeer commented Feb 20, 2026

Integration Test Failure

All 12 test_database_glue tests fail during fixture setup, before any test logic runs. The 1 other failure (test_storage_s3_queue::test_list_and_delete_race) is unrelated and flaky.

Root Cause

The new run_s3_mocks() function added in this PR passes a 4-tuple to start_mock_servers():

start_mock_servers(
    started_cluster, script_dir,
    [("mock_sts.py", "sts.us-east-1.amazonaws.com", "80", args)],  # 4 elements
)

But helpers/mock_servers.py on the antalya-25.8 branch only supports 3-tuples:

for server_name, container, port in mocks:  # ValueError: too many values to unpack

The upstream repo has an updated mock_servers.py that accepts the 4th args element, but that change was not included in the backport.

Traceback

test_database_glue/test.py:280 → run_s3_mocks(cluster)
test_database_glue/test.py:39  → start_mock_servers(...)
helpers/mock_servers.py:23     → ValueError: too many values to unpack (expected 3)
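A backward-compatible unpacking (a hypothetical sketch; the actual upstream `mock_servers.py` change may differ) would accept both the legacy 3-tuples and the new 4-tuples with extra script arguments, so callers on older branches keep working:

```python
def normalize_mock_specs(mocks):
    """Accept both (name, container, port) and (name, container, port, args)."""
    normalized = []
    for mock in mocks:
        if len(mock) == 3:
            server_name, container, port = mock
            args = []
        elif len(mock) == 4:
            server_name, container, port, args = mock
        else:
            raise ValueError(f"expected 3 or 4 elements, got {len(mock)}")
        normalized.append((server_name, container, port, list(args)))
    return normalized

# Mixed legacy and new specs unpack without ValueError:
specs = normalize_mock_specs([
    ("mock_s3.py", "resolver", "8080"),
    ("mock_sts.py", "sts.us-east-1.amazonaws.com", "80", ["extra-arg"]),
])
```

With this shape, the backported `run_s3_mocks()` 4-tuple and the branch's existing 3-tuple callers both work against the same helper.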

Impact

This is directly caused by this PR. Since the fixture crashes, all 12 test_database_glue tests (both existing and the new test_sts_smoke) fail without executing.

@zvonand
Collaborator Author

zvonand commented Feb 20, 2026

@Selfeer I saw that, I'll fix it.

The problem is that regression also fails.

@Selfeer
Collaborator

Selfeer commented Feb 20, 2026

Regression aarch64 swarms - Failure Analysis

PR: #1428 | Workflow: 22222536503 | Version: 25.8.16.20001.altinityantalya | Arch: aarch64

Reason: Not related to this specific PR

Report: report.html

Summary

The Regression aarch64 swarms job failed with 6 failed scenarios out of 20 (14 OK). All 6 failures are in the node failure feature.

Failed Scenarios & Historical Flakiness

Scenario | Pass Rate | Fail | OK | Flaky?
node failure/network failure | 79.9% | 80 | 318 | Yes (highest flakiness)
node failure/check restart swarm node | 80.2% | 77 | 319 | Yes
node failure/check restart clickhouse on swarm node | 85.9% | 52 | 342 | Yes
node failure/swarm out of disk space | 91.7% | 30 | 363 | Yes
node failure/cpu overload | 91.9% | 29 | 364 | Yes
node failure/initiator out of disk space | 95.7% | 14 | 379 | Yes (least flaky)

Root Cause Analysis

All 6 failures originate from swarms/tests/node_failure.py and share a common error pattern:

  1. Primary failure (4 scenarios): Test expects DB::Exception: Query was cancelled. but receives DB::Exception: Query is killed in pending state. (QUERY_WAS_CANCELLED) — an assertion message mismatch where the error code 394 is correct but the message text differs.

  2. Secondary failure (2 scenarios — disk space): check_preprocessed_config_is_updated() assertion fails (exitcode == 1 instead of 0), and subsequent query gets ALL_CONNECTION_TRIES_FAILED (Code: 279) — a timing/infrastructure issue where ClickHouse config reload doesn't complete in time.

Conclusion

  • Not a regression. These failures have a long history across multiple PRs and versions (25.8.14, 25.8.16).

Appendix: Queries Used

All queries below were run against the gh-data database at github-checks.tenant-a.staging.altinity.cloud:8443.

1. Historical pass/fail ratios per scenario (all-time)

This query computes the historical pass/fail percentage for each of the failed scenarios across all recorded runs.

SELECT
    test_name,
    result,
    count() AS cnt,
    round(count() * 100.0 / sum(count()) OVER (PARTITION BY test_name), 1) AS pct
FROM `gh-data`.clickhouse_regression_results
WHERE test_name IN (
    '/swarms/feature/node failure/check restart clickhouse on swarm node',
    '/swarms/feature/node failure/check restart swarm node',
    '/swarms/feature/node failure/cpu overload',
    '/swarms/feature/node failure/initiator out of disk space',
    '/swarms/feature/node failure/network failure',
    '/swarms/feature/node failure/swarm out of disk space',
    '/swarms/feature/swarm joins',
    '/swarms/feature/swarm union'
)
GROUP BY test_name, result
ORDER BY test_name, result;

Result:

test_name | result | cnt | pct
node failure/check restart clickhouse on swarm node | Error | 2 | 0.5%
node failure/check restart clickhouse on swarm node | Fail | 52 | 13.1%
node failure/check restart clickhouse on swarm node | OK | 342 | 85.9%
node failure/check restart clickhouse on swarm node | Skip | 2 | 0.5%
node failure/check restart swarm node | Fail | 77 | 19.3%
node failure/check restart swarm node | OK | 319 | 80.2%
node failure/check restart swarm node | Skip | 2 | 0.5%
node failure/cpu overload | Error | 1 | 0.3%
node failure/cpu overload | Fail | 29 | 7.3%
node failure/cpu overload | OK | 364 | 91.9%
node failure/cpu overload | Skip | 2 | 0.5%
node failure/initiator out of disk space | Error | 1 | 0.3%
node failure/initiator out of disk space | Fail | 14 | 3.5%
node failure/initiator out of disk space | OK | 379 | 95.7%
node failure/initiator out of disk space | Skip | 2 | 0.5%
node failure/network failure | Fail | 80 | 20.1%
node failure/network failure | OK | 318 | 79.9%
node failure/swarm out of disk space | Error | 1 | 0.3%
node failure/swarm out of disk space | Fail | 30 | 7.6%
node failure/swarm out of disk space | OK | 363 | 91.7%
node failure/swarm out of disk space | Skip | 2 | 0.5%
swarm joins | Error | 6 | 1.4%
swarm joins | Fail | 151 | 35.1%
swarm joins | OK | 270 | 62.8%
swarm joins | Skip | 3 | 0.7%
swarm union | Error | 2 | 0.5%
swarm union | Fail | 23 | 5.6%
swarm union | OK | 385 | 93.2%
swarm union | Skip | 3 | 0.7%

2. Recent timeline for disk space scenarios

This query shows the recent pass/fail timeline for the initiator out of disk space and swarm out of disk space scenarios, to confirm that failures are intermittent and not trending.

SELECT
    test_name,
    result,
    start_time,
    clickhouse_version,
    job_url
FROM `gh-data`.clickhouse_regression_results
WHERE test_name LIKE '%swarm%'
  AND (
    test_name LIKE '%kill swarm%'
    OR test_name LIKE '%out of disk%'
    OR test_name LIKE '%initiator%'
    OR test_name LIKE '%kill initiator%'
    OR test_name LIKE '%node failure%feature%'
  )
ORDER BY start_time DESC
LIMIT 200;

Key observation: The most recent ~30 entries for both initiator out of disk space and swarm out of disk space show predominantly OK results, with only sporadic Fail entries, confirming the flaky nature rather than a consistent regression.

@Selfeer
Collaborator

Selfeer commented Feb 20, 2026

'/tiered storage/with s3gcs/background move/max move factor': https://github.com/Altinity/ClickHouse/actions/runs/22222536503/job/64284577377

This also is not related and looks like an infra failure: the test queries system.part_log for path_on_disk of moved parts, and the output contains an empty line (between the paths of a previous table and the current table). The test iterates over all lines and asserts that each starts with /var/lib/clickhouse/disks/external/; the empty string '' fails that assertion. This is a timing/cleanup artifact, not a logic error.
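A possible hardening sketch (the helper name is illustrative, not the actual tiered-storage suite code) filters blank separator lines before the per-line assertion:

```python
EXTERNAL_DISK_PREFIX = "/var/lib/clickhouse/disks/external/"

def assert_paths_on_external_disk(part_log_output: str,
                                  prefix: str = EXTERNAL_DISK_PREFIX):
    """Drop blank separator lines that system.part_log output can contain
    between tables, then assert each remaining path is on the external disk."""
    paths = [line for line in part_log_output.splitlines() if line.strip()]
    for path in paths:
        assert path.startswith(prefix), f"unexpected path: {path!r}"
    return paths
```

With the blank line filtered out, the assertion only ever sees real paths, so the cleanup-timing artifact can no longer fail the scenario.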

@Selfeer
Collaborator

Selfeer commented Feb 26, 2026

swarms fail: https://altinity-build-artifacts.s3.amazonaws.com/REFs/1428/merge/6a98e013d2bec58de43a4daff7f5f5838fccb7c9/regression/aarch64/with_analyzer/zookeeper/without_thread_fuzzer/swarms/report.html

There is a swarms failure that doesn't seem related to the changes in this PR; in fact, these are the same failures described here: #1428 (comment)

@alsugiliazova can we mark this as verified regardless of the issue that was raised? ClickHouse#97858

@zvonand
Collaborator Author

zvonand commented Feb 26, 2026

I think it is OK (as long as this behavior matches upstream and they are OK with that).

Also, it does not look like a problem to me at all.

@zvonand
Collaborator Author

zvonand commented Feb 26, 2026

This PR does not include the S3 credentials cache (it was added in 25.11 and 25.12 and is not in the scope of this backport), so performance may differ from the latest upstream.

Backporting the cache logic would be quite heavy.

Actually, the code in this PR is somewhat similar to what upstream had before 91706 (which is related, but its backport was not requested).

@alsugiliazova
Member

25.8 does not include CredentialsProviderCache, so the behavior described in this issue does not apply to this backport.

  • Permission changes: They take effect after recreating the database (DROP + CREATE). The new database gets a new catalog and a new credential provider, so the next Glue/S3 use triggers a fresh AssumeRole and sees the current role policies.
  • Trade-off: The benefits of CredentialsProviderCache (e.g. reusing the same provider/STS session across databases with the same role, fewer AssumeRole calls) are not present in this backport.

@alsugiliazova alsugiliazova added the verified Verified by QA label Feb 26, 2026
@zvonand zvonand merged commit 801ee9e into antalya-25.8 Feb 26, 2026
442 of 444 checks passed
@Selfeer
Collaborator

Selfeer commented Mar 6, 2026

Deep Audit Report: PR #1428 - Glue role-based access backport

PR: 25.8 Antalya backport of #90825: Add role-based access to Glue catalog
Audit mode: Static deep audit (no runtime execution), focused on changed-code call paths and test adequacy.

1. Scope and partitions

  • Partition A - Settings propagation: DatabaseDataLakeSettings, DataLakeStorageSettings, CatalogSettings, DatabaseDataLake::getCatalog, DataLakeConfiguration::getCatalog.
  • Partition B - Glue credential provider wiring: GlueCatalog constructor and GlueClient creation path.
  • Partition C - S3 credential handoff to metadata reads: GlueCatalog::setCredentials and metadata-resolution path.
  • Partition D - Integration coverage: tests/integration/test_database_glue/test.py, s3_mocks/mock_sts.py, compose fixture updates.
  • Partition E - Build/backport mechanics: contrib/curl-cmake/CMakeLists.txt source list updates.

2. Call graph

Entrypoints

  • CREATE DATABASE ... ENGINE=DataLakeCatalog(...) SETTINGS ...
  • Table-engine path via DataLakeConfiguration::getCatalog(...)
  • Query-time table metadata fetch and object read through Glue + S3 credentials

Credential path

  1. Settings parse:
    • DB-level: aws_role_arn, aws_role_session_name
    • Table-engine-level: storage_aws_role_arn, storage_aws_role_session_name
  2. Wiring into CatalogSettings
  3. GlueCatalog constructor builds S3CredentialsProviderChain
  4. If role is provided, wrap chain with AwsAuthSTSAssumeRoleCredentialsProvider
  5. Use provider for:
    • Glue API client auth
    • S3 credentials injected into Iceberg metadata/object reads

Key behavior change

  • Static AWSCredentials field was replaced with provider-based credential retrieval, so temporary/session credentials can propagate to storage operations.
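A toy Python analogue of this wiring (class and function names are illustrative, not the actual C++ types) shows why provider-based retrieval lets temporary session credentials propagate:

```python
from dataclasses import dataclass

@dataclass
class Credentials:
    access_key: str
    secret_key: str
    session_token: str = ""

class StaticChainProvider:
    """Stands in for the base S3CredentialsProviderChain."""
    def __init__(self, creds):
        self._creds = creds

    def get_credentials(self):
        return self._creds

class AssumeRoleProvider:
    """Stands in for the STS assume-role wrapper: a real one would call
    STS AssumeRole with the base chain's credentials; here we only tag
    the result to show the wrapping."""
    def __init__(self, base, role_arn, session_name):
        self.base = base
        self.role_arn = role_arn
        self.session_name = session_name

    def get_credentials(self):
        base = self.base.get_credentials()
        return Credentials(base.access_key, base.secret_key,
                           session_token=f"sts:{self.role_arn}:{self.session_name}")

def build_provider(role_arn=None, session_name="session"):
    # Mirrors the catalog constructor: wrap the chain only when a role is set.
    chain = StaticChainProvider(Credentials("minio", "secret"))
    if role_arn:
        return AssumeRoleProvider(chain, role_arn, session_name)
    return chain
```

Because both the Glue client and the S3 metadata reads call `get_credentials()` at use time, refreshed session credentials flow through instead of a cached static copy, which is the point of the behavior change above.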

3. Transition matrix

Transition | Entry | Processing | Side effect | Invariant
Settings -> catalog config | DB/table settings | Map role fields into CatalogSettings | Role config reaches catalog ctor | No role fields lost
Catalog ctor -> provider chain | GlueCatalog::GlueCatalog | Build chain, optional STS assume-role wrapper | Runtime provider created | With role ARN, STS provider is used
Provider -> Glue requests | Glue API operations | GlueClient uses provider | Authenticated Glue RPCs | Credentials must be non-empty/refreshable
Provider -> S3 metadata reads | setCredentials + metadata fetch | Pull current credentials from provider | S3 reads use same auth context | No stale static credentials after rotation
Integration fixture startup | started_cluster fixture | Start STS mock + run tests | CI signal | Test should validate both role ARN and role session behavior

4. Logical testing summary

  • PR adds a new integration smoke test that validates STS session-name-sensitive behavior (negative + positive path).
  • Fixture and compose adjustments align object storage access with credential-based control instead of public bucket policy.
  • Static review confirms role settings are propagated for both database and table-engine paths.
  • No direct code path found that bypasses credential provider usage once GlueCatalog is instantiated.

5. Findings (by severity)

Medium - Integration test does not validate aws_role_arn semantics

  • Type: Test coverage defect (can hide functional regressions).
  • Where: tests/integration/test_database_glue/s3_mocks/mock_sts.py, tests/integration/test_database_glue/test.py
  • What: The STS mock determines success by checking RoleSessionName in the request URL and does not assert RoleArn value. The new smoke test sets aws_role_arn, but success/failure is effectively controlled by session name only.
  • Impact: A regression where role ARN handling is broken (or ignored) could still pass CI, reducing confidence in the advertised "role-based access" behavior.
  • Minimal repro idea: Keep aws_role_session_name correct and deliberately break role ARN propagation in code; current test is likely to remain green.
  • Fix direction: Extend mock_sts.py to parse and validate RoleArn plus RoleSessionName, and add a negative test for wrong role ARN.

Evidence

From mock_sts.py (the success decision keys only on RoleSessionName):

if len(sys.argv) >= 3:
    expected_role = sys.argv[2]
else:
    expected_role = 'miniorole'

@route("/", method="POST")
def sts():
    access_key = "minio"
    secret_access_key = "wrong_key"

    if f"RoleSessionName={expected_role}" in str(request.url):
        secret_access_key = "ClickHouse_Minio_P@ssw0rd"

From test.py (the smoke test sets aws_role_arn, but the mock never checks it):

create_clickhouse_glue_database(
    started_cluster,
    node,
    db_name_success,
    additional_settings={
        "aws_role_arn": "arn::role",
        "aws_role_session_name": "miniorole",
    },
    with_credentials=False,
)

# Query should succeed with correct role_session_name
result = node.query(f"SELECT sum(value) FROM {db_name_success}.`{root_namespace}.{table_name}`")
High/Low code defects

  • High: none confirmed in merged PR state.
  • Low: none confirmed in merged PR state.

6. Coverage accounting and stop condition

  • Reviewed all changed production files listed in PR metadata and their direct call paths.
  • Reviewed integration-test and fixture deltas tied to new role settings.
  • Included PR discussion context around early integration failures and final merged behavior.
  • Stop condition reached for static audit scope (changed files + directly connected runtime/auth path).

7. Assumptions and limits

  • Static analysis only; no local integration execution.
  • Audit targets merged PR state (not intermediate failing revisions during review iteration).
  • External AWS behavior and long-lived credential rotation were not runtime-validated.

8. Confidence

  • Overall confidence: Medium-High.
  • Why: setting propagation and provider wiring are explicit and traceable; remaining uncertainty is runtime/auth-environment dependent.
  • Confidence raiser: add a strict ARN+session STS mock contract test and run the existing Glue integration suite with that assertion enabled.

9. Residual risks

  • Runtime auth failures may still appear only under specific IAM/policy environments not replicated by current mocks.
  • Lack of role ARN validation in test doubles can allow subtle authorization regressions to pass.
