25.8 Antalya backport of #90825: Add role-based access to Glue catalog #1428

Merged
zvonand merged 9 commits into antalya-25.8 from backports/antalya-25.8/90825
Feb 26, 2026

Conversation

@zvonand
Collaborator

@zvonand zvonand commented Feb 18, 2026

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add role-based access to Glue catalog. Use settings aws_role_arn and, optionally, aws_role_session_name. (ClickHouse#90825 by @antonio2368)

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Feb 18, 2026

Workflow [PR], commit [6a98e01]

zvonand force-pushed the backports/antalya-25.8/90825 branch from cef9232 to fba0893 on February 20, 2026 at 09:54
zvonand force-pushed the backports/antalya-25.8/90825 branch from fba0893 to c96e9ea on February 20, 2026 at 10:05
@Selfeer
Collaborator

Selfeer commented Feb 20, 2026

Integration Test Failure

All 12 test_database_glue tests fail during fixture setup, before any test logic runs. The 1 other failure (test_storage_s3_queue::test_list_and_delete_race) is unrelated and flaky.

Root Cause

The new run_s3_mocks() function added in this PR passes a 4-tuple to start_mock_servers():

start_mock_servers(
    started_cluster, script_dir,
    [("mock_sts.py", "sts.us-east-1.amazonaws.com", "80", args)],  # 4 elements
)

But helpers/mock_servers.py on the antalya-25.8 branch only supports 3-tuples:

for server_name, container, port in mocks:  # ValueError: too many values to unpack

The upstream repo has an updated mock_servers.py that accepts the 4th args element, but that change was not included in the backport.

Traceback

test_database_glue/test.py:280 → run_s3_mocks(cluster)
test_database_glue/test.py:39  → start_mock_servers(...)
helpers/mock_servers.py:23     → ValueError: too many values to unpack (expected 3)
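A backward-compatible unpacking (a hypothetical sketch; the actual upstream `mock_servers.py` change may differ) would accept both the legacy 3-tuples and the new 4-tuples with extra script arguments, so callers on older branches keep working:

```python
def normalize_mock_specs(mocks):
    """Accept both (name, container, port) and (name, container, port, args)."""
    normalized = []
    for mock in mocks:
        if len(mock) == 3:
            server_name, container, port = mock
            args = []
        elif len(mock) == 4:
            server_name, container, port, args = mock
        else:
            raise ValueError(f"expected 3 or 4 elements, got {len(mock)}")
        normalized.append((server_name, container, port, list(args)))
    return normalized

# Mixed legacy and new specs unpack without ValueError:
specs = normalize_mock_specs([
    ("mock_s3.py", "resolver", "8080"),
    ("mock_sts.py", "sts.us-east-1.amazonaws.com", "80", ["extra-arg"]),
])
```

With this shape, the backported `run_s3_mocks()` 4-tuple and the branch's existing 3-tuple callers both work against the same helper.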

Impact

This is directly caused by this PR. Since the fixture crashes, all 12 test_database_glue tests (both existing and the new test_sts_smoke) fail without executing.

@zvonand
Collaborator Author

zvonand commented Feb 20, 2026

@Selfeer I saw that, I'll fix it.

The problem is that regression also fails.

@Selfeer
Collaborator

Selfeer commented Feb 20, 2026

Regression aarch64 swarms - Failure Analysis

PR: #1428 | Workflow: 22222536503 | Version: 25.8.16.20001.altinityantalya | Arch: aarch64

Reason: Not related to this specific PR

Report: report.html

Summary

The Regression aarch64 swarms job failed with 6 failed scenarios out of 20 (14 OK). All 6 failures are in the node failure feature.

Failed Scenarios & Historical Flakiness

Scenario | Pass Rate | Fail | OK | Flaky?
node failure/network failure | 79.9% | 80 | 318 | Yes (highest flakiness)
node failure/check restart swarm node | 80.2% | 77 | 319 | Yes
node failure/check restart clickhouse on swarm node | 85.9% | 52 | 342 | Yes
node failure/swarm out of disk space | 91.7% | 30 | 363 | Yes
node failure/cpu overload | 91.9% | 29 | 364 | Yes
node failure/initiator out of disk space | 95.7% | 14 | 379 | Yes (least flaky)

Root Cause Analysis

All 6 failures originate from swarms/tests/node_failure.py and share a common error pattern:

  1. Primary failure (4 scenarios): Test expects DB::Exception: Query was cancelled. but receives DB::Exception: Query is killed in pending state. (QUERY_WAS_CANCELLED) — an assertion message mismatch where the error code 394 is correct but the message text differs.

  2. Secondary failure (2 scenarios — disk space): check_preprocessed_config_is_updated() assertion fails (exitcode == 1 instead of 0), and subsequent query gets ALL_CONNECTION_TRIES_FAILED (Code: 279) — a timing/infrastructure issue where ClickHouse config reload doesn't complete in time.

Conclusion

  • Not a regression. These failures have a long history across multiple PRs and versions (25.8.14, 25.8.16).

Appendix: Queries Used

All queries below were run against the gh-data database at github-checks.tenant-a.staging.altinity.cloud:8443.

1. Historical pass/fail ratios per scenario (all-time)

This query computes the historical pass/fail percentage for each of the failed scenarios across all recorded runs.

SELECT
    test_name,
    result,
    count() AS cnt,
    round(count() * 100.0 / sum(count()) OVER (PARTITION BY test_name), 1) AS pct
FROM `gh-data`.clickhouse_regression_results
WHERE test_name IN (
    '/swarms/feature/node failure/check restart clickhouse on swarm node',
    '/swarms/feature/node failure/check restart swarm node',
    '/swarms/feature/node failure/cpu overload',
    '/swarms/feature/node failure/initiator out of disk space',
    '/swarms/feature/node failure/network failure',
    '/swarms/feature/node failure/swarm out of disk space',
    '/swarms/feature/swarm joins',
    '/swarms/feature/swarm union'
)
GROUP BY test_name, result
ORDER BY test_name, result;

Result:

test_name | result | cnt | pct
node failure/check restart clickhouse on swarm node | Error | 2 | 0.5%
node failure/check restart clickhouse on swarm node | Fail | 52 | 13.1%
node failure/check restart clickhouse on swarm node | OK | 342 | 85.9%
node failure/check restart clickhouse on swarm node | Skip | 2 | 0.5%
node failure/check restart swarm node | Fail | 77 | 19.3%
node failure/check restart swarm node | OK | 319 | 80.2%
node failure/check restart swarm node | Skip | 2 | 0.5%
node failure/cpu overload | Error | 1 | 0.3%
node failure/cpu overload | Fail | 29 | 7.3%
node failure/cpu overload | OK | 364 | 91.9%
node failure/cpu overload | Skip | 2 | 0.5%
node failure/initiator out of disk space | Error | 1 | 0.3%
node failure/initiator out of disk space | Fail | 14 | 3.5%
node failure/initiator out of disk space | OK | 379 | 95.7%
node failure/initiator out of disk space | Skip | 2 | 0.5%
node failure/network failure | Fail | 80 | 20.1%
node failure/network failure | OK | 318 | 79.9%
node failure/swarm out of disk space | Error | 1 | 0.3%
node failure/swarm out of disk space | Fail | 30 | 7.6%
node failure/swarm out of disk space | OK | 363 | 91.7%
node failure/swarm out of disk space | Skip | 2 | 0.5%
swarm joins | Error | 6 | 1.4%
swarm joins | Fail | 151 | 35.1%
swarm joins | OK | 270 | 62.8%
swarm joins | Skip | 3 | 0.7%
swarm union | Error | 2 | 0.5%
swarm union | Fail | 23 | 5.6%
swarm union | OK | 385 | 93.2%
swarm union | Skip | 3 | 0.7%

2. Recent timeline for disk space scenarios

This query shows the recent pass/fail timeline for the initiator out of disk space and swarm out of disk space scenarios, to confirm that failures are intermittent and not trending.

SELECT
    test_name,
    result,
    start_time,
    clickhouse_version,
    job_url
FROM `gh-data`.clickhouse_regression_results
WHERE test_name LIKE '%swarm%'
  AND (
    test_name LIKE '%kill swarm%'
    OR test_name LIKE '%out of disk%'
    OR test_name LIKE '%initiator%'
    OR test_name LIKE '%kill initiator%'
    OR test_name LIKE '%node failure%feature%'
  )
ORDER BY start_time DESC
LIMIT 200;

Key observation: The most recent ~30 entries for both initiator out of disk space and swarm out of disk space show predominantly OK results, with only sporadic Fail entries, confirming the flaky nature rather than a consistent regression.

@Selfeer
Collaborator

Selfeer commented Feb 20, 2026

'/tiered storage/with s3gcs/background move/max move factor': https://github.com/Altinity/ClickHouse/actions/runs/22222536503/job/64284577377

This also is not related and looks like an infra failure: the test queries system.part_log for path_on_disk of moved parts, and the output contains an empty line (between the paths of a previous table and the current table). The test iterates over all lines and asserts that each starts with /var/lib/clickhouse/disks/external/; the empty string '' fails that assertion. This is a timing/cleanup artifact, not a logic error.
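A possible hardening sketch (the helper name is illustrative, not the actual tiered-storage suite code) filters blank separator lines before the per-line assertion:

```python
EXTERNAL_DISK_PREFIX = "/var/lib/clickhouse/disks/external/"

def assert_paths_on_external_disk(part_log_output: str,
                                  prefix: str = EXTERNAL_DISK_PREFIX):
    """Drop blank separator lines that system.part_log output can contain
    between tables, then assert each remaining path is on the external disk."""
    paths = [line for line in part_log_output.splitlines() if line.strip()]
    for path in paths:
        assert path.startswith(prefix), f"unexpected path: {path!r}"
    return paths
```

With the blank line filtered out, the assertion only ever sees real paths, so the cleanup-timing artifact can no longer fail the scenario.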

@Selfeer
Collaborator

Selfeer commented Feb 26, 2026

swarms fail: https://altinity-build-artifacts.s3.amazonaws.com/REFs/1428/merge/6a98e013d2bec58de43a4daff7f5f5838fccb7c9/regression/aarch64/with_analyzer/zookeeper/without_thread_fuzzer/swarms/report.html

There is a swarms failure that doesn't seem related to the changes in this PR; in fact, these are the same failures described here: #1428 (comment)

@alsugiliazova can we mark this as verified regardless of the issue that was raised? ClickHouse#97858

@zvonand
Collaborator Author

zvonand commented Feb 26, 2026

I think it is OK (as long as this behavior matches upstream and they are OK with that).

Also, it does not look like a problem to me at all.

@zvonand
Collaborator Author

zvonand commented Feb 26, 2026

This PR does not include the S3 credentials cache (it was added in 25.11 and 25.12 and is not in the scope of this backport), so performance may differ from the latest upstream.

Backporting the cache logic would be quite heavy.

Actually, the code in this PR is somewhat similar to what upstream had before 91706 (which is related, but its backport was not requested).

@alsugiliazova
Member

25.8 does not include CredentialsProviderCache, so the behavior described in this issue does not apply to this backport.

  • Permission changes: They take effect after recreating the database (DROP + CREATE). The new database gets a new catalog and a new credential provider, so the next Glue/S3 use triggers a fresh AssumeRole and sees the current role policies.
  • Trade-off: The benefits of CredentialsProviderCache (e.g. reusing the same provider/STS session across databases with the same role, fewer AssumeRole calls) are not present in this backport.

@alsugiliazova alsugiliazova added the verified Verified by QA label Feb 26, 2026
@zvonand zvonand merged commit 801ee9e into antalya-25.8 Feb 26, 2026
442 of 444 checks passed
@Selfeer
Collaborator

Selfeer commented Mar 6, 2026

Deep Audit Report: PR #1428 - Glue role-based access backport

PR: 25.8 Antalya backport of #90825: Add role-based access to Glue catalog
Audit mode: Static deep audit (no runtime execution), focused on changed-code call paths and test adequacy.

1. Scope and partitions

  • Partition A - Settings propagation: DatabaseDataLakeSettings, DataLakeStorageSettings, CatalogSettings, DatabaseDataLake::getCatalog, DataLakeConfiguration::getCatalog.
  • Partition B - Glue credential provider wiring: GlueCatalog constructor and GlueClient creation path.
  • Partition C - S3 credential handoff to metadata reads: GlueCatalog::setCredentials and metadata-resolution path.
  • Partition D - Integration coverage: tests/integration/test_database_glue/test.py, s3_mocks/mock_sts.py, compose fixture updates.
  • Partition E - Build/backport mechanics: contrib/curl-cmake/CMakeLists.txt source list updates.

2. Call graph

Entrypoints

  • CREATE DATABASE ... ENGINE=DataLakeCatalog(...) SETTINGS ...
  • Table-engine path via DataLakeConfiguration::getCatalog(...)
  • Query-time table metadata fetch and object read through Glue + S3 credentials

Credential path

  1. Settings parse:
    • DB-level: aws_role_arn, aws_role_session_name
    • Table-engine-level: storage_aws_role_arn, storage_aws_role_session_name
  2. Wiring into CatalogSettings
  3. GlueCatalog constructor builds S3CredentialsProviderChain
  4. If role is provided, wrap chain with AwsAuthSTSAssumeRoleCredentialsProvider
  5. Use provider for:
    • Glue API client auth
    • S3 credentials injected into Iceberg metadata/object reads

Key behavior change

  • Static AWSCredentials field was replaced with provider-based credential retrieval, so temporary/session credentials can propagate to storage operations.
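A toy Python analogue of this wiring (class and function names are illustrative, not the actual C++ types) shows why provider-based retrieval lets temporary session credentials propagate:

```python
from dataclasses import dataclass

@dataclass
class Credentials:
    access_key: str
    secret_key: str
    session_token: str = ""

class StaticChainProvider:
    """Stands in for the base S3CredentialsProviderChain."""
    def __init__(self, creds):
        self._creds = creds

    def get_credentials(self):
        return self._creds

class AssumeRoleProvider:
    """Stands in for the STS assume-role wrapper: a real one would call
    STS AssumeRole with the base chain's credentials; here we only tag
    the result to show the wrapping."""
    def __init__(self, base, role_arn, session_name):
        self.base = base
        self.role_arn = role_arn
        self.session_name = session_name

    def get_credentials(self):
        base = self.base.get_credentials()
        return Credentials(base.access_key, base.secret_key,
                           session_token=f"sts:{self.role_arn}:{self.session_name}")

def build_provider(role_arn=None, session_name="session"):
    # Mirrors the catalog constructor: wrap the chain only when a role is set.
    chain = StaticChainProvider(Credentials("minio", "secret"))
    if role_arn:
        return AssumeRoleProvider(chain, role_arn, session_name)
    return chain
```

Because both the Glue client and the S3 metadata reads call `get_credentials()` at use time, refreshed session credentials flow through instead of a cached static copy, which is the point of the behavior change above.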

3. Transition matrix

Transition | Entry | Processing | Side effect | Invariant
Settings -> catalog config | DB/table settings | Map role fields into CatalogSettings | Role config reaches catalog ctor | No role fields lost
Catalog ctor -> provider chain | GlueCatalog::GlueCatalog | Build chain, optional STS assume-role wrapper | Runtime provider created | With role ARN, STS provider is used
Provider -> Glue requests | Glue API operations | GlueClient uses provider | Authenticated Glue RPCs | Credentials must be non-empty/refreshable
Provider -> S3 metadata reads | setCredentials + metadata fetch | Pull current credentials from provider | S3 reads use same auth context | No stale static credentials after rotation
Integration fixture startup | started_cluster fixture | Start STS mock + run tests | CI signal | Test should validate both role ARN and role session behavior

4. Logical testing summary

  • PR adds a new integration smoke test that validates STS session-name-sensitive behavior (negative + positive path).
  • Fixture and compose adjustments align object storage access with credential-based control instead of public bucket policy.
  • Static review confirms role settings are propagated for both database and table-engine paths.
  • No direct code path found that bypasses credential provider usage once GlueCatalog is instantiated.

5. Findings (by severity)

Medium - Integration test does not validate aws_role_arn semantics

  • Type: Test coverage defect (can hide functional regressions).
  • Where: tests/integration/test_database_glue/s3_mocks/mock_sts.py, tests/integration/test_database_glue/test.py
  • What: The STS mock determines success by checking RoleSessionName in the request URL and does not assert RoleArn value. The new smoke test sets aws_role_arn, but success/failure is effectively controlled by session name only.
  • Impact: A regression where role ARN handling is broken (or ignored) could still pass CI, reducing confidence in the advertised "role-based access" behavior.
  • Minimal repro idea: Keep aws_role_session_name correct and deliberately break role ARN propagation in code; current test is likely to remain green.
  • Fix direction: Extend mock_sts.py to parse and validate RoleArn plus RoleSessionName, and add a negative test for wrong role ARN.

Evidence

From mock_sts.py (the success decision keys only on RoleSessionName):

if len(sys.argv) >= 3:
    expected_role = sys.argv[2]
else:
    expected_role = 'miniorole'

@route("/", method="POST")
def sts():
    access_key = "minio"
    secret_access_key = "wrong_key"

    if f"RoleSessionName={expected_role}" in str(request.url):
        secret_access_key = "ClickHouse_Minio_P@ssw0rd"

From test.py (the smoke test sets aws_role_arn, but the mock never checks it):

create_clickhouse_glue_database(
    started_cluster,
    node,
    db_name_success,
    additional_settings={
        "aws_role_arn": "arn::role",
        "aws_role_session_name": "miniorole",
    },
    with_credentials=False,
)

# Query should succeed with correct role_session_name
result = node.query(f"SELECT sum(value) FROM {db_name_success}.`{root_namespace}.{table_name}`")
High/Low code defects

  • High: none confirmed in merged PR state.
  • Low: none confirmed in merged PR state.

6. Coverage accounting and stop condition

  • Reviewed all changed production files listed in PR metadata and their direct call paths.
  • Reviewed integration-test and fixture deltas tied to new role settings.
  • Included PR discussion context around early integration failures and final merged behavior.
  • Stop condition reached for static audit scope (changed files + directly connected runtime/auth path).

7. Assumptions and limits

  • Static analysis only; no local integration execution.
  • Audit targets merged PR state (not intermediate failing revisions during review iteration).
  • External AWS behavior and long-lived credential rotation were not runtime-validated.

8. Confidence

  • Overall confidence: Medium-High.
  • Why: setting propagation and provider wiring are explicit and traceable; remaining uncertainty is runtime/auth-environment dependent.
  • Confidence raiser: add a strict ARN+session STS mock contract test and run the existing Glue integration suite with that assertion enabled.

9. Residual risks

  • Runtime auth failures may still appear only under specific IAM/policy environments not replicated by current mocks.
  • Lack of role ARN validation in test doubles can allow subtle authorization regressions to pass.
