Skip to content

Fix wrong path encoding#1537

Open
ianton-ru wants to merge 5 commits intoantalya-25.8from
bugfix/antalya-25.8/1536_fix_wrong_path_decodig
Open

Fix wrong path encoding#1537
ianton-ru wants to merge 5 commits intoantalya-25.8from
bugfix/antalya-25.8/1536_fix_wrong_path_decodig

Conversation

@ianton-ru
Copy link

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Solved #1536
Fix incorrect encoding special characters in S3 paths.

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions
Copy link

github-actions bot commented Mar 16, 2026

Workflow [PR], commit [68a487f]

@ianton-ru ianton-ru force-pushed the bugfix/antalya-25.8/1536_fix_wrong_path_decodig branch from 58d944a to 38a7629 Compare March 17, 2026 09:21
@ianton-ru
Copy link
Author

AI audit note: This review comment was generated by AI (gpt-5.4).

Audit update for PR #1537 after the latest branch update:

Confirmed defects

No confirmed defects in reviewed scope.

Coverage summary

Scope reviewed: the updated DB::S3::URI::URI flow in src/IO/S3/URI.cpp, especially the two remaining changed paths Poco::URI(uri_str, false) and URIConverter::modifyURI(..., false), plus the Iceberg/object-storage call chain that feeds absolute S3 paths into S3::URI, existing S3 URI unit coverage in src/IO/tests/gtest_s3_uri.cpp, the new Iceberg integration coverage in tests/integration/test_database_iceberg/test.py, and adjacent scheme-mapper/archive-syntax coverage.
Categories failed: none.
Categories passed: raw %xx object-key preservation on initial parse; scheme-mapper remap preserving encoded path/query bytes; compatibility of the unchanged non-versionId ? fallback with existing S3 URI tests; archive-syntax split occurring before the changed parse step; no confirmed concurrency/lifetime/rollback issues in scope.
Assumptions/limits: static audit only, no runtime execution; there is still limited direct test coverage for combinations of encoded paths with scheme-mapper plus archive/glob edge cases, but I did not confirm a defect from the latest changed lines.

mkmkme
mkmkme previously approved these changes Mar 17, 2026
@subkanthi
Copy link
Collaborator

@ianton-ru , u are probably addressing this one too #1516 as part of this PR.

@ianton-ru
Copy link
Author

Test 03212_variant_dynamic_cast_or_default is flaky, see ClickHouse#87583

@ianton-ru
Copy link
Author

Issue was introduced in #1092.
In Iceberg metadata paths are already in correct form
S3::URI decodes path, and column=12%3A00%3A00  converted into column=12:00:00 . Before #1092 paths are used "as is".
It's impossible to encode it back later, because character =  also encoded into %3D in this case. You can make string as partition key, and put in string =  character, in this case path looks like column=foo%3Dbar , after decoding it is converted into column=foo=bar  and you lost information, which character was encoded and which wasn't.

mkmkme
mkmkme previously approved these changes Mar 18, 2026
@mkmkme
Copy link
Collaborator

mkmkme commented Mar 18, 2026

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b51668512f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

uri_str = uri_;

uri = Poco::URI(uri_str);
uri = Poco::URI(uri_str, enable_url_encoding);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pass enable_url_encoding through the wildcard ? reparse

The new enable_url_encoding plumbing is incomplete for URIs that contain a literal ? in the key. S3::URI still reparses those inputs via Poco::URI(uri_with_question_mark_encode) with the default encoding mode, and the existing gtest at src/IO/tests/gtest_s3_uri.cpp:29-33 shows that non-versionId ? segments are intentionally treated as part of key. In a path such as .../a%2Fb?marker, callers from resolveObjectStorageForPath(..., enable_url_encoding=false) will still decode %2F back to / after this second parse, so absolute metadata paths that rely on preserved percent-encoding continue to resolve the wrong object.

Useful? React with 👍 / 👎.

Copy link
Author

@ianton-ru ianton-ru Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This methos is called not with any URL, but only with path in S3 or Azure storage. Question mark should be encoded in S3 object name https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html as well as in Azure https://learn.microsoft.com/en-us/azure/devops/organizations/settings/naming-restrictions?view=azure-devops&tabs=git.

@ianton-ru
Copy link
Author

@ianton-ru , u are probably addressing this one too #1516 as part of this PR.

@subkanthi Looks like your solution gives the same result, and more elegant.

@ianton-ru
Copy link
Author

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e485a91f45

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

S3::URI s3_uri(normalized_path);

std::string key_to_use = s3_uri.key;
std::string key_to_use = target_decomposed.key;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve parsed S3 key for path-style URLs

Using target_decomposed.key here regresses path-style S3 HTTP(S) URIs (for example, https://s3.amazonaws.com/<bucket>/<key>): SchemeAuthorityKey keeps bucket/<key> in key, while s3_uri has already split bucket and object key correctly. The matching logic still uses s3_uri for bucket/endpoint checks, so when a storage match is found we now return a key with an extra bucket prefix, which makes object lookups fail with missing-file errors for those URLs.

Useful? React with 👍 / 👎.

This reverts commit ddd34ac.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants