don't stop on encoding-errors #393
Conversation
for filename in filenames:
    with open(filename, 'r', errors='ignore') as content:
        text_body = content.read()
    with io.open(filename, 'rb') as content:
I'd rather see io.open(filename, encoding='UTF-8') -- errors='ignore' can lead to false negatives
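For context, the two open() calls being weighed here behave quite differently on a file containing bytes that are not valid UTF-8; a rough sketch with a placeholder filename, not the hook's actual code:

```python
import io

filename = 'example.txt'  # placeholder path

# Strict decoding: raises UnicodeDecodeError on the first undecodable byte,
# which is the "stop on encoding errors" behaviour this PR is about.
with io.open(filename, encoding='UTF-8') as f:
    text_body = f.read()

# Lenient decoding: never raises, but silently drops whatever cannot be
# decoded, so the text that gets scanned may differ from the file contents.
with open(filename, 'r', errors='ignore') as f:
    text_body = f.read()
```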
Providing an encoding does not solve my issue (a file which is not UTF-8 encoded).
Using ignore means throwing away the characters which cannot be decoded as UTF-8. Since the AWS credentials will be ASCII strings (this is me claiming something without having any proof), it may only lead to false positives, which would not be that bad imo. I can try to provide further information on this and tests for various scenarios, but it will take a few weeks.
afaik, errors='ignore' and errors='replace' can remove bytes surrounding an error (consider a surrogate or continuation byte followed by garbage), so they could in fact remove part of the key, leading to no match
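For reference, this is what the two error handlers produce for a made-up byte sequence (the 0xff byte and the AKIA-style value below are purely illustrative):

```python
# A made-up config line with an invalid UTF-8 byte (\xff) next to an
# AKIA-style example value.
raw = b'aws_access_key_id = AKIA\xffIOSFODNN7EXAMPLE\n'

print(raw.decode('utf-8', errors='ignore'))
# aws_access_key_id = AKIAIOSFODNN7EXAMPLE
# -> the undecodable byte is silently dropped from the scanned text

print(raw.decode('utf-8', errors='replace'))
# aws_access_key_id = AKIA\ufffdIOSFODNN7EXAMPLE
# -> the undecodable byte is turned into U+FFFD
```

How much gets consumed around an error depends on the codec and the exact byte sequence, which is part of what makes this hard to reason about.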
Sounds plausible to me; I did not know about this. The question is: what is the expected behavior of detect-aws-credentials on non-UTF-8 files? I could live with skipping them, or with opening them as UTF-8 with ignore or replace - in none of these scenarios will credential detection work reliably for the file, as you pointed out. I cannot see how to build an encoding-agnostic solution. If you've got any idea I'd be happy to try to implement it.
another idea would be to scan the files as binary files and use the bytes representation of the credentials -- then you don't worry about the encoding
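A minimal sketch of that idea, assuming the credentials to search for are available as ASCII strings (the function name and arguments are made up for illustration):

```python
def find_secrets_in_file(filename, secrets):
    """Return the secrets that occur verbatim in the file's raw bytes."""
    with open(filename, 'rb') as f:
        contents = f.read()
    # Encode the known secrets instead of decoding the file: the file's
    # encoding never has to be guessed and undecodable bytes are a non-issue.
    return {secret for secret in secrets if secret.encode('utf-8') in contents}
```

This still assumes the file stores the key as its ASCII bytes, which holds for UTF-8 and the common single-byte encodings, but not for something like UTF-16.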
Here's one you can use as an example for that: https://github.com/pre-commit/pre-commit-hooks/blob/master/pre_commit_hooks/detect_private_key.py
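Roughly, that hook is a small argparse-based main() that reads each file as bytes and searches for known byte markers; a sketch of that shape (not a verbatim copy, and the marker values here are only illustrative):

```python
import argparse

# Byte markers to search for; these particular values are only illustrative.
BLACKLIST = (
    b'BEGIN RSA PRIVATE KEY',
    b'BEGIN OPENSSH PRIVATE KEY',
)


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('filenames', nargs='*', help='Filenames to check')
    args = parser.parse_args(argv)

    bad_files = []
    for filename in args.filenames:
        # Read raw bytes so the file's text encoding is irrelevant.
        with open(filename, 'rb') as f:
            contents = f.read()
        if any(marker in contents for marker in BLACKLIST):
            bad_files.append(filename)

    for filename in bad_files:
        print('Private key found: {}'.format(filename))

    # A non-zero return value makes the pre-commit hook fail.
    return int(bool(bad_files))


if __name__ == '__main__':
    raise SystemExit(main())
```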
thank you very much - I'll look into it and try to come up with a PR next weekend.
ssbarnea left a comment
I fail to understand the purpose of this change. If it is a bugfix, it should refer to an existing bug that clearly documents the problem.
Based on its title, I would close the PR outright, as I don't consider ignoring errors good practice in general.
we're iterating on the hook as seen above -- there's a way to make this work without dealing with encoding problems

via #453 -- thanks again for working on this!