Skip to content

Conversation

@eddiebergman
Copy link
Collaborator

@eddiebergman eddiebergman commented Nov 24, 2022

Reference Issue

Fixes #1159

What does this PR implement/fix? Explain your changes.

The Minio client does not automatically resolve proxies as is currently being done for requests. This PR uses requests strategy to automatically resolve the correct proxy and use this with the Minio client. With the shift to .pq files which are communcated with by Minio, this should help transition away from arff with less complaints.

Previously, Minio would do repeated calls to the download server but have long timeouts and ignore openml.config settings before finally failing and throwing an error.

How should this PR be tested?

I don't know, if you have a proxy, please use it along with this following script:

# Runs fine now
python test-proxy.py

# Will hang as we disable proxies which is apparently common in curl,wget,requests
no_proxy="*" python test-proxy.py
# test-proxy.py
import openml
import shutil
from pathlib import Path

did = 2
path = Path(openml.config.get_cache_directory()) / "datasets" / str(did)
if path.exists():
    shutil.rmtree(str(path))
    
assert not path.exists()

d = openml.datasets.get_dataset(did, download_data=True)
print(d)

assert path.exists()

I'm unfamiliar with python's networking libraries and how to test proxies in general so I excluded tests.

Any other comments?

  • Proxies are handled without user knowledge with requests and so Minio will do so too. I think this makes sense but if you wish to make it more explicit somehow, please let me know
  • I left a parameter proxy: str | bool in the _download_minio_file specifically to document the behavior and the user has no way to interact with this beyond environment variables (as is requests)
  • I didn't think this required an example as there is non for requests and proxies, they are now alligned.
  • Uni-freiburg and its cluster users would really appreciate a release with these changes! 🥂

Copy link
Collaborator

@PGijsbers PGijsbers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unfamiliar with python's networking libraries and how to test proxies in general so I excluded tests.

Same here, unfortunately. A manual test at least confirms the defaults don't break when not behind a proxy 😅

Uni-freiburg and its cluster users would really appreciate a release with these changes!

I was planning a release today, let's get it in! (btw, is it correct you are not on OpenML Slack?)

@eddiebergman
Copy link
Collaborator Author

I put in your "auto" suggestion and it's looking a little cleaner, thanks for that!

Same here, unfortunately. A manual test at least confirms the defaults don't break when not behind a proxy sweat_smile

That's good at least! I would be lying if I said I actually tried it without a proxy, I was hoping the tests would all catch that :) Brief look at the failing tests makes me think it's unrelated?

(btw, is it correct you are not on OpenML Slack?)
I'm not actually, if it's relatively active I'd be thrilled to join to know what's going on!

@PGijsbers PGijsbers merged commit a909a0c into openml:develop Nov 25, 2022
PGijsbers added a commit to PGijsbers/openml-python that referenced this pull request Nov 25, 2022
@PGijsbers
Copy link
Collaborator

I'm not actually, if it's relatively active I'd be thrilled to join to know what's going on!

Not always, but it can be very convenient to quickly discuss things or ask a question :) (for example, the server issue last week, or me notifying you of the new openml-python release so you can do a quick test later)

PGijsbers added a commit that referenced this pull request Nov 25, 2022
* Towards downloading buckets

* Download entire bucket instead of dataset file

* Dont download arff, skip files already cached

* Automatically unzip any downloaded archives

* Make downloading the bucket optional

Additionally, rename old cached files to the new filename format.

* Allow users to download the full bucket when pq is already cached

Otherwise the only way would be to delete the cache.

* Add unit test stub

* Remove redundant try/catch

* Remove commented out print statement

* Still download arff

* Towards downloading buckets

* Download entire bucket instead of dataset file

* Dont download arff, skip files already cached

* Automatically unzip any downloaded archives

* Make downloading the bucket optional

Additionally, rename old cached files to the new filename format.

* Allow users to download the full bucket when pq is already cached

Otherwise the only way would be to delete the cache.

* Add unit test stub

* Remove redundant try/catch

* Remove commented out print statement

* Still download arff

* ADD: download all files from minio bucket

* Add note for #1184

* Fix pre-commit issues (mypy, flake)

Co-authored-by: Matthias Feurer <[email protected]>
PGijsbers pushed a commit to Mirkazemi/openml-python that referenced this pull request Feb 23, 2023
* feat(minio): Allow for proxies

* fix: Declared proxy_client as None

* refactor(proxy): Change to `str | None` with "auto"
PGijsbers added a commit to Mirkazemi/openml-python that referenced this pull request Feb 23, 2023
* Towards downloading buckets

* Download entire bucket instead of dataset file

* Dont download arff, skip files already cached

* Automatically unzip any downloaded archives

* Make downloading the bucket optional

Additionally, rename old cached files to the new filename format.

* Allow users to download the full bucket when pq is already cached

Otherwise the only way would be to delete the cache.

* Add unit test stub

* Remove redundant try/catch

* Remove commented out print statement

* Still download arff

* Towards downloading buckets

* Download entire bucket instead of dataset file

* Dont download arff, skip files already cached

* Automatically unzip any downloaded archives

* Make downloading the bucket optional

Additionally, rename old cached files to the new filename format.

* Allow users to download the full bucket when pq is already cached

Otherwise the only way would be to delete the cache.

* Add unit test stub

* Remove redundant try/catch

* Remove commented out print statement

* Still download arff

* ADD: download all files from minio bucket

* Add note for openml#1184

* Fix pre-commit issues (mypy, flake)

Co-authored-by: Matthias Feurer <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NewConnectionError /datasets2/dataset_2.pq

2 participants