Skip to content

Conversation

@FabiPi3
Copy link
Contributor

@FabiPi3 FabiPi3 commented Nov 17, 2025

I have a use case where I propagate a large dictionary with many sub-dictionaries and many keys to the output of a job. I don't have full control over this dictionary as it is part of the user input. I use the addtional stores feature to store large parts of my output into additional stores. Now it could happen by chance that one of my additional store keys matches with one of the keys in the user-provided dictionary. Than this would be put into the additional store as well, which is not what I want.

A simple example showing this would be:

from jobflow import job, run_locally, JobStore
from maggma.stores import MemoryStore

store = JobStore(MemoryStore(), additional_stores={"data": MemoryStore()})

@job(data=["ab"])
def partial_data():
    return {"xyz": {"ab": 1}, "cd": 4, "xml": {"ab": 3}}

partial_job = partial_data()
run_locally(partial_job, store=store)
print(store.query_one({"uuid": partial_job.uuid})["output"])

You will see two blob_uuids in the output. But I only want a single one.

The option to do this I propose is to give the full path directly as a list:

@job(data=[["output", "xyz", "ab"]])
def partial_data():
    return {"xyz": {"ab": 1}, "cd": 4, "xml": {"ab": 3}}

partial_job = partial_data()
run_locally(partial_job, store=store)
print(store.query_one({"uuid": partial_job.uuid})["output"])

Now only the given path is put into the additional store and the other one not.

TODO

  • Should the output part in the list be omitted and prepended automatically?
  • One could think about doing something similar for the load procedure.

@FabiPi3
Copy link
Contributor Author

FabiPi3 commented Dec 1, 2025

Pinging @utf @gpetretto @davidwaroquiers. Any opinion on this? Thanks.

@davidwaroquiers
Copy link
Contributor

Hi @FabiPi3

I think it's interesting to have such an option. One thing that comes to mind is that maybe we could use mongo's dot notation instead of a list ? In your example, the decorator would be:

@job(data=["output.xyz.ab"])

I haven't looked at the implementation but just commented from the user's perspective at this stage.

Copy link
Contributor

@gpetretto gpetretto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @FabiPi3,

this seems like an interesting option. I think I would have had a use for this in the past.
I did not make a full review of the code, although it is quite small, so I don't see big problems. I just left a small comment.

As for @davidwaroquiers comment. Indeed it may have been nice to use use the dot notation instead of a list. However there may be two issues:

  1. MongoDB actually allows to have dots in the keyword names, although discouradged. How would you tell those apart?
  2. a "location" coming from find_key can contain an integer in case the path to the location is inside a list. But in the dot notation, how to tell apart an integer from a key that is instead a string including an integer?
    These are both boarderline cases, but since this is to provide a fine control on the data to be stored, I would suggest to stick to the list.

Cocerning your two questions:

  1. On one hand I think that in principle everybody would want to focus on something that is inside the output key. But on the other hand the standard behaviour can move to the additional store any part of the document, included the root keywords of the JobStoreDocuments (e.g. metadata). I am not sure what would be the best option here, but I would leave things as you implemented (with the need to include output), just to be consistent between the standard and the new option.
  2. You mean allowing to pass a list to load (or a list of lists) and this will only load the specified data. This may indeed be interesting, but maybe it is a bit more involved to implement?

for store_name, store_save in save_keys.items():
for save_key in store_save:
locations.extend(find_key(doc, save_key, include_end=True))
if isinstance(save_key, list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe here also match if it is a tuple?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@FabiPi3
Copy link
Contributor Author

FabiPi3 commented Dec 8, 2025

Thanks for your feedback. Sounds like there is no reason to not merge it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants