Fix #951: Fallback gracefully if a library can't be loaded #1052
mdboom wants to merge 2 commits into NVIDIA:main
/ok to test
try:
    handle = load_nvidia_dynamic_lib("nvJitLink")._handle_uint
except DynamicLibNotFoundError:
    handle = 0
Is calling dlsym on a NULL pointer handle undefined behavior? It seems with this change that code path is now easily possible.
https://pubs.opengroup.org/onlinepubs/009604299/functions/dlsym.html says:
If handle does not refer to a valid object opened by dlopen(), or if the named symbol cannot be found within any of the objects associated with handle, dlsym() shall return NULL.
But then there's the issue of the special values of handle, RTLD_DEFAULT and RTLD_NEXT. I suspect either of these could be defined as (void*)0, but in any case the specific value doesn't matter, you just can't use whatever their value is unless you want that behavior.
From the code below, it seems that RTLD_DEFAULT is the desired behavior, so maybe that should be returned to ensure that we're not triggering undefined behavior.
FWIW, this is reverting to behavior prior to the move to using pathfinder, so it's been tested for a long time in the wild with handle == NULL in those situations.
But to your question, it's kinda/sorta. On GNU, the special value RTLD_DEFAULT is equal to 0, so it's defined behavior to look up the symbol in the global namespace -- for example, if the library is statically linked. All that is fine -- we are already doing that before looking in the .so. On other POSIX platforms, mileage may vary, of course.
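The GNU behavior described here can be probed from Python with ctypes. This is only a minimal sketch, assuming Linux, where a null handle is treated as RTLD_DEFAULT. `lookup_in_global_namespace` is a hypothetical helper and not part of cuda-python. It returns None on other platforms rather than guess at their semantics.

```python
import ctypes
import sys


def lookup_in_global_namespace(symbol: bytes):
    """Return the address of `symbol` via dlsym(NULL, symbol), or None.

    Hypothetical helper for illustration only. A null handle is assumed
    to mean RTLD_DEFAULT, which holds on GNU/Linux; other platforms are
    skipped entirely.
    """
    if not sys.platform.startswith("linux"):
        return None
    try:
        # CDLL(None) is dlopen(NULL); it is used here only to reach dlsym.
        dlsym = ctypes.CDLL(None).dlsym
    except (OSError, AttributeError):
        return None
    dlsym.restype = ctypes.c_void_p
    dlsym.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
    # Passing None sends a null pointer, i.e. handle == 0.
    # ctypes maps a NULL result to None.
    return dlsym(None, symbol)
```

With this sketch, a symbol that is not present anywhere in the process comes back as None instead of crashing, which is consistent with the POSIX text quoted earlier in the thread.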
Jinx. Yeah, I agree that maybe setting to RTLD_DEFAULT here would be better.
Windows is another kettle of fish -- something else might be better there.
Ok, yeah jinxing all over the place 🤣
For Windows it looks like its dlsym-equivalent has a similar behavior to RTLD_DEFAULT specifically when it is given NULL.
Not sure about this actually.
FWIW, this is reverting to behavior prior to the move to using pathfinder, so it's been tested for a long time in the wild with handle == NULL in those situations.
I was wondering about this too. Turns out we did not test this behavior at all. This is from 12.8.0 (whose behavior we should restore):
cuda-python/cuda_bindings/cuda/bindings/_internal/nvjitlink_linux.pyx
Lines 55 to 65 in c04025d
If a library is not loadable, we just raised an exception in load_library, so we never returned the null handle to complete the remaining symbol loading. I think to fix #951 we just need to swallow the exception raised by the pathfinder, and raise the one we used to raise?
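The suggestion to swallow the pathfinder exception and raise the older one could look roughly like the sketch below. Everything in it is a stand-in: the `DynamicLibNotFoundError` class and `load_nvidia_dynamic_lib` function only mimic the names used in this thread, and both the choice of plain `RuntimeError` and its message are assumptions, since the 12.8.0 code is not reproduced here.

```python
class DynamicLibNotFoundError(RuntimeError):
    """Stand-in for the pathfinder exception named in this thread."""


def load_nvidia_dynamic_lib(name):
    # Stand-in loader that always fails, to exercise the error path.
    raise DynamicLibNotFoundError(f"could not find {name}")


def load_library():
    try:
        return load_nvidia_dynamic_lib("nvJitLink")._handle_uint
    except DynamicLibNotFoundError as exc:
        # Assumed exception type and hypothetical message; the wording
        # used by 12.8.0 is not shown in this conversation.
        raise RuntimeError("Failed to load nvJitLink") from exc
```

Chaining with `from exc` keeps the pathfinder error available as `__cause__`, so the original search details are not lost when the older exception type is restored.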
Ah, ok. I missed that in all this. That certainly makes this less weird. Will change.
Hmm... That can't be the whole story. A DynamicLibNotFoundError is a subclass of RuntimeError. Puzzle to solve...
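The reason this is a puzzle can be shown in a few lines: any caller that already handles `RuntimeError` would also catch the subclass. The class below is a stand-in that only mirrors the inheritance relationship stated in the comment; it is not the pathfinder implementation.

```python
class DynamicLibNotFoundError(RuntimeError):
    """Stand-in mirroring the subclass relationship stated in the thread."""


def caught_by_runtime_error_handler():
    try:
        raise DynamicLibNotFoundError("nvJitLink not found")
    except RuntimeError:
        # A handler for the base class also catches the subclass.
        return True
    return False
```

So if existing code only ever caught `RuntimeError`, switching to pathfinder would not by itself explain a change in behavior.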
Closing -- I'm not sure it's fixing the right thing. See comment: #951 (comment)
Description
Allows falling back to another library more lazily.
closes #951
Checklist