Bgym updates #90

TLSDC · 2024-10-23T15:58:40Z

No description provided.

* Update unit_tests.yml (#101) * request is done once and then reused * Patching minor stuff (#69) * fixing sample_std for single experience * making gradio shared server non default * missing requirement for xray * Improve agent xray app (#70) * 0.2.2 Release (#67) * downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * update * black * maybe fixing github workflow ? 
* make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com> * version bump --------- Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * Make share=TRue into a environment variable, disabled by default for security * fix floating point issue with std_reward in agent xray * Update src/agentlab/analyze/inspect_results.py * Update src/agentlab/analyze/agent_xray.py --------- Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com> Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * added tmlr definitive config (#71) * downgrading gradio version (#77) * Study refactor (#73) * adapting to new Benchmark class * fixing tests * fix tests * typo * not ready for gradio 5 * study 
id and a few fixes * fixing pricy tests --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * adding message class and updating generic agent accordingly (#68) * adding message class and updating generic agent accordingly * updating tests * Reproducibility test before message class * Adding inspect_result.ipynb to reprod white list * Reproducibility test after message class * L1 before message class * L1 after message class * added append as method to the Discussion class, to make it totally similar to a list * changed to_markdown behavior * updated most_basic_agent * updated ReproAgent * Update src/agentlab/analyze/agent_xray.py * format * new journal entry * immutable as default kwarg * removing __add__ and __radd__ * added deprecation warning * updating tests * version bump * Updating generic_agent to fit use BGym's goal_object (#83) * updating generic agent to goal_object * fixing image markdown display * updating tests * fixing intruction BaseMessage * added merge text in discussion * added merge to discussion class * added tests * Minor revert (#86) * minor revert * revert tests too * Add tabs (#84) * add tabs * make sure it's not computed if not visible * Fix reproduce study (#87) * add tabs * this workaround is worst * bug fix * fix reproduce study * make sure it's not computed if not visible * upgrading gradio dependency (#88) * bgym update (#90) * Workarena TMLR experiments (#89) * new entry * adding llm configs * new journal entries * handling sequntial in VWA (#91) * handling sequntial in VWA * enable comments * format --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * Tmlr workarena (#92) * adding llm configs * new L1 entries * tmp * reformat * adding assistantbench to reproducibility_util.py * gitignore (#97) * Vision fix (#105) * changing content name * Update src/agentlab/llm/llm_utils.py --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * L2 tmlr (#93) * adding llm configs * L2 entries * 
claude L3 * claude vision support * miniwob results * 405b L1 entry * Replacing Dask with Ray (#100) * dask-dependencies * minor * replace with ray * adjust tests and move a few things * markdown report * automatic relaunch * add dependencies * reformat * fix unit-test * catch timeout * fixing bugs and making things work * adress comments and black format * new dependencies viewer * Update benchmark to use visualwebarena instead of webarena * Fix import and uncomment code in get_ray_url.py * Add ignore_dependencies option to Study and _agents_on_benchmark functions * Update load_most_recent method to include contains parameter * Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark * Refactor backend preparation in Study class and improve logging for ignored dependencies * finallly some results with claude on webarena * Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs * black * ensure timeout is int (For the 3rd time?) 
* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function * black * Change parallel backend from "joblib" to "ray" in run_experiments function * Update src/agentlab/experiments/study.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * Update src/agentlab/analyze/inspect_results.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * switching to 2 for loops in _agents_on_benchmark (#107) * yet another way to kill timedout jobs (#108) * request is done once and then reused * switched to caching original function bc it doesnt break to tests * added a catch for some openrouter under-the-hood error --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com> Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * update * black * maybe fixing github workflow ? * make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't 
configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com> * version bump * Patching minor stuff (#69) * fixing sample_std for single experience * making gradio shared server non default * missing requirement for xray * Improve agent xray app (#70) * 0.2.2 Release (#67) * downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * update * black * maybe fixing github workflow ? 
* make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com> * version bump --------- Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * Make share=TRue into a environment variable, disabled by default for security * fix floating point issue with std_reward in agent xray * Update src/agentlab/analyze/inspect_results.py * Update src/agentlab/analyze/agent_xray.py --------- Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com> Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * added tmlr definitive config (#71) * downgrading gradio version (#77) * Study refactor (#73) * adapting to new Benchmark class * fixing tests * fix tests * typo * not ready for gradio 5 * study 
id and a few fixes * fixing pricy tests --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * adding message class and updating generic agent accordingly (#68) * adding message class and updating generic agent accordingly * updating tests * Reproducibility test before message class * Adding inspect_result.ipynb to reprod white list * Reproducibility test after message class * L1 before message class * L1 after message class * added append as method to the Discussion class, to make it totally similar to a list * changed to_markdown behavior * updated most_basic_agent * updated ReproAgent * Update src/agentlab/analyze/agent_xray.py * format * new journal entry * immutable as default kwarg * removing __add__ and __radd__ * added deprecation warning * updating tests * version bump * Updating generic_agent to fit use BGym's goal_object (#83) * updating generic agent to goal_object * fixing image markdown display * updating tests * fixing intruction BaseMessage * added merge text in discussion * added merge to discussion class * added tests * Minor revert (#86) * minor revert * revert tests too * Add tabs (#84) * add tabs * make sure it's not computed if not visible * Fix reproduce study (#87) * add tabs * this workaround is worst * bug fix * fix reproduce study * make sure it's not computed if not visible * upgrading gradio dependency (#88) * bgym update (#90) * Workarena TMLR experiments (#89) * new entry * adding llm configs * new journal entries * handling sequntial in VWA (#91) * handling sequntial in VWA * enable comments * format --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * Tmlr workarena (#92) * adding llm configs * new L1 entries * tmp * reformat * adding assistantbench to reproducibility_util.py * gitignore (#97) * Vision fix (#105) * changing content name * Update src/agentlab/llm/llm_utils.py --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * L2 tmlr (#93) * adding llm configs * L2 entries * 
claude L3 * claude vision support * miniwob results * 405b L1 entry * Replacing Dask with Ray (#100) * dask-dependencies * minor * replace with ray * adjust tests and move a few things * markdown report * automatic relaunch * add dependencies * reformat * fix unit-test * catch timeout * fixing bugs and making things work * adress comments and black format * new dependencies viewer * Update benchmark to use visualwebarena instead of webarena * Fix import and uncomment code in get_ray_url.py * Add ignore_dependencies option to Study and _agents_on_benchmark functions * Update load_most_recent method to include contains parameter * Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark * Refactor backend preparation in Study class and improve logging for ignored dependencies * finallly some results with claude on webarena * Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs * black * ensure timeout is int (For the 3rd time?) 
* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function * black * Change parallel backend from "joblib" to "ray" in run_experiments function * Update src/agentlab/experiments/study.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * Update src/agentlab/analyze/inspect_results.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * switching to 2 for loops in _agents_on_benchmark (#107) * yet another way to kill timedout jobs (#108) * Fix prompt formatting in Observation and add static method to Study class (#110) * Bug fix (#111) * Fix prompt formatting in Observation and add static method to Study class * Update gradio version to 5.5 to fix DataFrame scrolling issue * Fixing openrouter pricing rate limit (#112) * Update unit_tests.yml (#101) * request is done once and then reused * Patching minor stuff (#69) * fixing sample_std for single experience * making gradio shared server non default * missing requirement for xray * Improve agent xray app (#70) * 0.2.2 Release (#67) * downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * 
update * black * maybe fixing github workflow ? * make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com> * version bump --------- Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * Make share=TRue into a environment variable, disabled by default for security * fix floating point issue with std_reward in agent xray * Update src/agentlab/analyze/inspect_results.py * Update src/agentlab/analyze/agent_xray.py --------- Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com> Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * added tmlr definitive config (#71) * downgrading gradio version (#77) * Study refactor (#73) * adapting to new Benchmark class * fixing tests * 
fix tests * typo * not ready for gradio 5 * study id and a few fixes * fixing pricy tests --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * adding message class and updating generic agent accordingly (#68) * adding message class and updating generic agent accordingly * updating tests * Reproducibility test before message class * Adding inspect_result.ipynb to reprod white list * Reproducibility test after message class * L1 before message class * L1 after message class * added append as method to the Discussion class, to make it totally similar to a list * changed to_markdown behavior * updated most_basic_agent * updated ReproAgent * Update src/agentlab/analyze/agent_xray.py * format * new journal entry * immutable as default kwarg * removing __add__ and __radd__ * added deprecation warning * updating tests * version bump * Updating generic_agent to fit use BGym's goal_object (#83) * updating generic agent to goal_object * fixing image markdown display * updating tests * fixing intruction BaseMessage * added merge text in discussion * added merge to discussion class * added tests * Minor revert (#86) * minor revert * revert tests too * Add tabs (#84) * add tabs * make sure it's not computed if not visible * Fix reproduce study (#87) * add tabs * this workaround is worst * bug fix * fix reproduce study * make sure it's not computed if not visible * upgrading gradio dependency (#88) * bgym update (#90) * Workarena TMLR experiments (#89) * new entry * adding llm configs * new journal entries * handling sequntial in VWA (#91) * handling sequntial in VWA * enable comments * format --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * Tmlr workarena (#92) * adding llm configs * new L1 entries * tmp * reformat * adding assistantbench to reproducibility_util.py * gitignore (#97) * Vision fix (#105) * changing content name * Update src/agentlab/llm/llm_utils.py --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * L2 
tmlr (#93) * adding llm configs * L2 entries * claude L3 * claude vision support * miniwob results * 405b L1 entry * Replacing Dask with Ray (#100) * dask-dependencies * minor * replace with ray * adjust tests and move a few things * markdown report * automatic relaunch * add dependencies * reformat * fix unit-test * catch timeout * fixing bugs and making things work * adress comments and black format * new dependencies viewer * Update benchmark to use visualwebarena instead of webarena * Fix import and uncomment code in get_ray_url.py * Add ignore_dependencies option to Study and _agents_on_benchmark functions * Update load_most_recent method to include contains parameter * Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark * Refactor backend preparation in Study class and improve logging for ignored dependencies * finallly some results with claude on webarena * Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs * black * ensure timeout is int (For the 3rd time?) 
* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function * black * Change parallel backend from "joblib" to "ray" in run_experiments function * Update src/agentlab/experiments/study.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * Update src/agentlab/analyze/inspect_results.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * switching to 2 for loops in _agents_on_benchmark (#107) * yet another way to kill timedout jobs (#108) * request is done once and then reused * switched to caching original function bc it doesnt break to tests * added a catch for some openrouter under-the-hood error --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com> Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * updating max prompt configs, vision support (#109) * Cross-product deepcopy fix (#106) Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * slugify study_name (#114) * Improve timeout handling in task polling logic * Add method to override max_steps in Study class * add support for tab visibility in observation flags and update related components * fix tests * Fix sorting bug. 
improve directory content retrieval with summary statistics * fix test * black * Weblinx results (#104) * adding weblinx results * adding old weblinx results --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * Max new tokens fix (#118) * Lower max_new_tokens for OpenAI models * updating configs --------- Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com> Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * version bump (#119) * fix format (#120) * Clean pipeline (#117) * yet another way to kill timedout jobs * Improve timeout handling in task polling logic * Add method to override max_steps in Study class * add support for tab visibility in observation flags and update related components * fix tests * black * Improve timeout handling in task polling logic * yet another way to kill timedout jobs (#108) * Add method to override max_steps in Study class * add support for tab visibility in observation flags and update related components * fix tests * black * black * Fix sorting bug. improve directory content retrieval with summary statistics * fix test * black * tmp * add error report, add cum cost to summary and ray backend by default * black * fix test (chaing to joblib backend) * black --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> --------- Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com> Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> Co-authored-by: Léo Boisvert <leo.boisvert@hotmail.ca>

bgym update

7116ad5

TLSDC changed the base branch from main to dev October 23, 2024 15:59

TLSDC merged commit 176fe8a into dev Oct 23, 2024

TLSDC deleted the bgym_updates branch October 23, 2024 15:59

TLSDC added a commit that referenced this pull request Nov 7, 2024

bgym update (#90)

77b076c


2 participants
