Conversation
| @@ -0,0 +1,64 @@ | |||
| from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError | |||
This looks new, but it's just a file rename. In practice, this file should go away soon if everything goes well with ray.
| log_lines = log.split("\n") | ||
| # first 10 lines: | ||
| log_head = "\n".join(log_lines[:head_lines]) | ||
|
|
||
| try: | ||
| traceback_idx = log.rindex("Traceback (most recent call last):") | ||
| tail_idx = log.rindex("action:", 0, traceback_idx) | ||
| log_tail = log[tail_idx:] | ||
| except ValueError: | ||
| log_tail = "\n".join(log_lines[-tail_lines:]) | ||
|
|
||
| return log_head + "\n...\n...truncated middle of the log\n...\n" + log_tail | ||
|
|
||
|
|
There is an edge case not handled here: when len(log_lines) < head_lines + tail_lines, the head and tail slices overlap, so some lines appear twice around the truncation marker.
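One possible way to handle the edge case mentioned above is to return the log unchanged when it is too short to truncate. This is only a sketch based on the hunk shown in the review: the function name `truncate_log` and the default values for `head_lines` and `tail_lines` are assumptions, not the names used in the PR.

```python
def truncate_log(log: str, head_lines: int = 10, tail_lines: int = 10) -> str:
    """Keep the head and the tail of a log and drop the middle.

    Hypothetical helper: the name and the defaults are assumptions.
    """
    log_lines = log.split("\n")

    # Edge case from the review: a short log has no middle to drop,
    # so return it as-is instead of duplicating overlapping lines.
    if len(log_lines) <= head_lines + tail_lines:
        return log

    log_head = "\n".join(log_lines[:head_lines])

    try:
        # Prefer a tail that starts at the last "action:" before the last traceback.
        traceback_idx = log.rindex("Traceback (most recent call last):")
        tail_idx = log.rindex("action:", 0, traceback_idx)
        log_tail = log[tail_idx:]
    except ValueError:
        # No traceback (or no preceding "action:"): fall back to the last lines.
        log_tail = "\n".join(log_lines[-tail_lines:])

    return log_head + "\n...\n...truncated middle of the log\n...\n" + log_tail
```

This sketch does not address the case where the traceback-based tail itself starts inside the head; that would need a separate check.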
| run_exp = ray.remote(run_exp) | ||
|
|
||
|
|
||
| def execute_task_graph(exp_args_list: list[bgym.ExpArgs], avg_step_timeout=30): |
I get it, but I was a bit confused by the "task" naming, which is used for browsergym tasks in other parts of the code. Maybe change "task" to "exp" here?
| include_errors: str | ||
| Find all incomplete experiments and relaunch them. | ||
| - "incomplete_only": relaunch only the incomplete experiments. | ||
| - "incomplete_or_error": relaunch incomplete or errors. |
This seems to be a boolean now; should the docstring be updated?
src/agentlab/experiments/study.py
Outdated
| try: | ||
| self.benchmark.prepare_backends() | ||
| except ReadTimeout: | ||
| logger.warning("Backend preparation timed out. Continuing anyway.") | ||
|
|
You can remove this now; the webarena reset should no longer time out.
Suggested change
| - try: | |
| - self.benchmark.prepare_backends() | |
| - except ReadTimeout: | |
| - logger.warning("Backend preparation timed out. Continuing anyway.") | |
| + self.benchmark.prepare_backends() | |
| @@ -31,7 +31,14 @@ def make_assistant_message(content: str) -> dict: | |||
| class CheatMiniWoBLLM(AbstractChatModel): | |||
If it's only meant to be used in the tests, it should be moved to the test file.
…arning for ignored dependencies in _agents_on_benchmark
…gnored dependencies
…s; update get_results method to conditionally save outputs
…vg_step_timeout and rename test function
| run_exp = ray.remote(run_exp) | ||
|
|
||
|
|
||
| def execute_task_graph(exp_args_list: list[bgym.ExpArgs], avg_step_timeout=30): |
Look how beautiful and simple this is.
| elif parallel_backend == "ray": | ||
| from agentlab.experiments.graph_execution_ray import execute_task_graph, ray | ||
|
|
||
| context = ray.init(num_cpus=n_jobs, dashboard_host="127.0.0.1", dashboard_port=8265) |
I think we should remove the explicit dashboard_host and dashboard_port and let ray choose them.
gasse left a comment:
Two minor changes to add, otherwise LGTM :)
| import sys | ||
| from time import time, sleep | ||
|
|
||
| logger = logging.getLogger("agentlab." + __name__) # Get logger based on module name |
Shouldn't the "agentlab." part already be included in __name__?
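To illustrate the point about `__name__`: when a module is imported as part of a package, `__name__` is already its full dotted path, so the logger is a descendant of the package logger without any prefix. A minimal sketch, where the module path `agentlab.experiments.some_module` is a made-up stand-in for whatever `__name__` would be in the real file:

```python
import logging

# Hypothetical dotted path, standing in for the value of __name__
# inside a module imported from the agentlab package.
module_name = "agentlab.experiments.some_module"

# What logging.getLogger(__name__) would return in that module.
logger = logging.getLogger(module_name)

# The package-level logger that handlers and levels are usually attached to.
package_logger = logging.getLogger("agentlab")

# The module logger already sits under the package logger in the hierarchy.
print(logger.name.startswith("agentlab."))
print(logger.parent is package_logger)
```

One caveat: if the file is executed directly as a script, `__name__` is "__main__" rather than the dotted path, which may be a reason to hard-code a prefix.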
| try: | ||
| yield | ||
| finally: | ||
| signal.alarm(0) | ||
| signal.signal(signal.SIGALRM, previous_handler) |
This is black magic to me. I'll just trust you :)
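For readers who find the hunk above opaque, here is a sketch of the general pattern it appears to belong to: a context manager that uses SIGALRM to interrupt a block after a time limit. Only the `try`/`finally` part comes from the diff; the names `timeout_manager`, `run_with_timeout`, and `slow_task` are assumptions for illustration. The pattern is limited to Unix and to the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def timeout_manager(seconds: int):
    """Raise TimeoutError if the wrapped block runs longer than `seconds`."""

    def handler(signum, frame):
        raise TimeoutError(f"Operation timed out after {seconds} seconds")

    # Install our handler and remember the one that was there before.
    previous_handler = signal.signal(signal.SIGALRM, handler)
    # Ask the OS to deliver SIGALRM after `seconds` (must be an int).
    signal.alarm(seconds)
    try:
        yield
    finally:
        # Cancel any pending alarm, then restore the previous handler.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous_handler)


def slow_task(duration: float) -> None:
    """Stand-in for a long-running call."""
    time.sleep(duration)


def run_with_timeout(func, seconds: int) -> bool:
    """Return True if func finished in time, False if it was interrupted."""
    try:
        with timeout_manager(seconds):
            func()
        return True
    except TimeoutError:
        return False
```

The `finally` block is what keeps the pattern safe to nest or reuse: the alarm is always cancelled and the earlier handler is always put back, whether the block finished, timed out, or raised another exception.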
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
…ependency graph plotting; adjust node size and font size for better visualization
* dask-dependencies
* minor
* replace with ray
* adjust tests and move a few things
* markdown report
* automatic relaunch
* add dependencies
* reformat
* fix unit-test
* catch timeout
* fixing bugs and making things work
* adress comments and black format
* new dependencies viewer
* Update benchmark to use visualwebarena instead of webarena
* Fix import and uncomment code in get_ray_url.py
* Add ignore_dependencies option to Study and _agents_on_benchmark functions
* Update load_most_recent method to include contains parameter
* Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark
* Refactor backend preparation in Study class and improve logging for ignored dependencies
* finallly some results with claude on webarena
* Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs
* black
* ensure timeout is int (For the 3rd time?)
* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function
* black
* Change parallel backend from "joblib" to "ray" in run_experiments function
* Update src/agentlab/experiments/study.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Update src/agentlab/analyze/inspect_results.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization
---------
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
* Update unit_tests.yml (#101) * request is done once and then reused * Patching minor stuff (#69) * fixing sample_std for single experience * making gradio shared server non default * missing requirement for xray * Improve agent xray app (#70) * 0.2.2 Release (#67) * downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * update * black * maybe fixing github workflow ? 
* make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com> * version bump --------- Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * Make share=TRue into a environment variable, disabled by default for security * fix floating point issue with std_reward in agent xray * Update src/agentlab/analyze/inspect_results.py * Update src/agentlab/analyze/agent_xray.py --------- Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com> Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * added tmlr definitive config (#71) * downgrading gradio version (#77) * Study refactor (#73) * adapting to new Benchmark class * fixing tests * fix tests * typo * not ready for gradio 5 * study 
id and a few fixes * fixing pricy tests --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * adding message class and updating generic agent accordingly (#68) * adding message class and updating generic agent accordingly * updating tests * Reproducibility test before message class * Adding inspect_result.ipynb to reprod white list * Reproducibility test after message class * L1 before message class * L1 after message class * added append as method to the Discussion class, to make it totally similar to a list * changed to_markdown behavior * updated most_basic_agent * updated ReproAgent * Update src/agentlab/analyze/agent_xray.py * format * new journal entry * immutable as default kwarg * removing __add__ and __radd__ * added deprecation warning * updating tests * version bump * Updating generic_agent to fit use BGym's goal_object (#83) * updating generic agent to goal_object * fixing image markdown display * updating tests * fixing intruction BaseMessage * added merge text in discussion * added merge to discussion class * added tests * Minor revert (#86) * minor revert * revert tests too * Add tabs (#84) * add tabs * make sure it's not computed if not visible * Fix reproduce study (#87) * add tabs * this workaround is worst * bug fix * fix reproduce study * make sure it's not computed if not visible * upgrading gradio dependency (#88) * bgym update (#90) * Workarena TMLR experiments (#89) * new entry * adding llm configs * new journal entries * handling sequntial in VWA (#91) * handling sequntial in VWA * enable comments * format --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * Tmlr workarena (#92) * adding llm configs * new L1 entries * tmp * reformat * adding assistantbench to reproducibility_util.py * gitignore (#97) * Vision fix (#105) * changing content name * Update src/agentlab/llm/llm_utils.py --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * L2 tmlr (#93) * adding llm configs * L2 entries * 
claude L3 * claude vision support * miniwob results * 405b L1 entry * Replacing Dask with Ray (#100) * dask-dependencies * minor * replace with ray * adjust tests and move a few things * markdown report * automatic relaunch * add dependencies * reformat * fix unit-test * catch timeout * fixing bugs and making things work * adress comments and black format * new dependencies viewer * Update benchmark to use visualwebarena instead of webarena * Fix import and uncomment code in get_ray_url.py * Add ignore_dependencies option to Study and _agents_on_benchmark functions * Update load_most_recent method to include contains parameter * Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark * Refactor backend preparation in Study class and improve logging for ignored dependencies * finallly some results with claude on webarena * Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs * black * ensure timeout is int (For the 3rd time?) 
* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function * black * Change parallel backend from "joblib" to "ray" in run_experiments function * Update src/agentlab/experiments/study.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * Update src/agentlab/analyze/inspect_results.py Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * switching to 2 for loops in _agents_on_benchmark (#107) * yet another way to kill timedout jobs (#108) * request is done once and then reused * switched to caching original function bc it doesnt break to tests * added a catch for some openrouter under-the-hood error --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com> Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
* downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * update * black * maybe fixing github workflow ? * make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't 
configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com> * version bump * Patching minor stuff (#69) * fixing sample_std for single experience * making gradio shared server non default * missing requirement for xray * Improve agent xray app (#70) * 0.2.2 Release (#67) * downgrading ubuntu version for github tests (#62) * Llm api update (#59) * getting rid of .invoke() * adding an AbstractChatModel * changing chat_api structure * Reproducibility again (#61) * core functions * switch to dask * removing joblib dependency and adding dask * fixing imports * handles multiple backends * ensure asyncio loop creation * more tests * setting dashboard address to None * minor * Finally found a way to make it work * initial reproducibility files * Seems to be superflus * adding a reproducibility journal * minor update * more robust * adding reproducibility tools * fix white listing * minor * minor * minor * minor * minor fix * more tests * more results yay * disabling this test * update * update * black * maybe fixing github workflow ? 
* make get_git_username great again * trigger change * new browsergym * GPT-4o result (and new comment column) * Seems like there was a change to 4o flags, trying these * minor comment * better xray * minor fix * addming a comment field * new agent * another test with GPT-4o * adding llama3 from openrouter * fix naming * unused import * new summary tools and remove "_args" from columns in results * add Llama * initial code for reproducibility agent * adjust inspect results * infer from benchmark * fix reproducibility agent * prevent the repro_dir to be an index variable * updating repro agent stats * Reproducibility agent * instructions to setup workarena * fixing tests * handles better a few edge cases * default progress function to None * minor formatting * minor * initial commit * refactoring with Study class * refactor to adapt for study class * minor * fix pricy test * fixing tests * tmp * print report * minor fix * refine little details about reproducibility * minor * no need for set_temp anymore * sanity check before running main * minor update * minor * new results with 4o on workarena.l1 * sharing is caring * add llama to main.py * new hournal entry * lamma 3 70B * minor * typo * black fix (wasn't configured) --------- Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com> * version bump --------- Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * Make share=TRue into a environment variable, disabled by default for security * fix floating point issue with std_reward in agent xray * Update src/agentlab/analyze/inspect_results.py * Update src/agentlab/analyze/agent_xray.py --------- Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com> Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com> * added tmlr definitive config (#71) * downgrading gradio version (#77) * Study refactor (#73) * adapting to new Benchmark class * fixing tests * fix tests * typo * not ready for gradio 5 * study 
id and a few fixes * fixing pricy tests --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * adding message class and updating generic agent accordingly (#68) * adding message class and updating generic agent accordingly * updating tests * Reproducibility test before message class * Adding inspect_result.ipynb to reprod white list * Reproducibility test after message class * L1 before message class * L1 after message class * added append as method to the Discussion class, to make it totally similar to a list * changed to_markdown behavior * updated most_basic_agent * updated ReproAgent * Update src/agentlab/analyze/agent_xray.py * format * new journal entry * immutable as default kwarg * removing __add__ and __radd__ * added deprecation warning * updating tests * version bump * Updating generic_agent to fit use BGym's goal_object (#83) * updating generic agent to goal_object * fixing image markdown display * updating tests * fixing intruction BaseMessage * added merge text in discussion * added merge to discussion class * added tests * Minor revert (#86) * minor revert * revert tests too * Add tabs (#84) * add tabs * make sure it's not computed if not visible * Fix reproduce study (#87) * add tabs * this workaround is worst * bug fix * fix reproduce study * make sure it's not computed if not visible * upgrading gradio dependency (#88) * bgym update (#90) * Workarena TMLR experiments (#89) * new entry * adding llm configs * new journal entries * handling sequntial in VWA (#91) * handling sequntial in VWA * enable comments * format --------- Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com> * Tmlr workarena (#92) * adding llm configs * new L1 entries * tmp * reformat * adding assistantbench to reproducibility_util.py * gitignore (#97) * Vision fix (#105) * changing content name * Update src/agentlab/llm/llm_utils.py --------- Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com> * L2 tmlr (#93) * adding llm configs * L2 entries * 
claude L3 * claude vision support * miniwob results * 405b L1 entry * Replacing Dask with Ray (#100) * dask-dependencies * minor * replace with ray * adjust tests and move a few things * markdown report * automatic relaunch * add dependencies * reformat * fix unit-test * catch timeout * fixing bugs and making things work * adress comments and black format * new dependencies viewer * Update benchmark to use visualwebarena instead of webarena * Fix import and uncomment code in get_ray_url.py * Add ignore_dependencies option to Study and _agents_on_benchmark functions * Update load_most_recent method to include contains parameter * Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark * Refactor backend preparation in Study class and improve logging for ignored dependencies * finallly some results with claude on webarena * Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs * black * ensure timeout is int (For the 3rd time?) 
* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function
* black
* Change parallel backend from "joblib" to "ray" in run_experiments function
* Update src/agentlab/experiments/study.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Update src/agentlab/analyze/inspect_results.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* switching to 2 for loops in _agents_on_benchmark (#107)
* yet another way to kill timedout jobs (#108)
* Fix prompt formatting in Observation and add static method to Study class (#110)
* Bug fix (#111)
* Fix prompt formatting in Observation and add static method to Study class
* Update gradio version to 5.5 to fix DataFrame scrolling issue
* Fixing openrouter pricing rate limit (#112)
* Update unit_tests.yml (#101)
* request is done once and then reused
* Patching minor stuff (#69)
* fixing sample_std for single experience
* making gradio shared server non default
* missing requirement for xray
* Improve agent xray app (#70)
* 0.2.2 Release (#67)
* downgrading ubuntu version for github tests (#62)
* Llm api update (#59)
* getting rid of .invoke()
* adding an AbstractChatModel
* changing chat_api structure
* Reproducibility again (#61)
* core functions
* switch to dask
* removing joblib dependency and adding dask
* fixing imports
* handles multiple backends
* ensure asyncio loop creation
* more tests
* setting dashboard address to None
* minor
* Finally found a way to make it work
* initial reproducibility files
* Seems to be superfluous
* adding a reproducibility journal
* minor update
* more robust
* adding reproducibility tools
* fix white listing
* minor
* minor
* minor
* minor
* minor fix
* more tests
* more results yay
* disabling this test
* update
* update
* black
* maybe fixing github workflow ?
* make get_git_username great again
* trigger change
* new browsergym
* GPT-4o result (and new comment column)
* Seems like there was a change to 4o flags, trying these
* minor comment
* better xray
* minor fix
* adding a comment field
* new agent
* another test with GPT-4o
* adding llama3 from openrouter
* fix naming
* unused import
* new summary tools and remove "_args" from columns in results
* add Llama
* initial code for reproducibility agent
* adjust inspect results
* infer from benchmark
* fix reproducibility agent
* prevent the repro_dir to be an index variable
* updating repro agent stats
* Reproducibility agent
* instructions to setup workarena
* fixing tests
* handles better a few edge cases
* default progress function to None
* minor formatting
* minor
* initial commit
* refactoring with Study class
* refactor to adapt for study class
* minor
* fix pricy test
* fixing tests
* tmp
* print report
* minor fix
* refine little details about reproducibility
* minor
* no need for set_temp anymore
* sanity check before running main
* minor update
* minor
* new results with 4o on workarena.l1
* sharing is caring
* add llama to main.py
* new journal entry
* llama 3 70B
* minor
* typo
* black fix (wasn't configured)

---------

Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>

* version bump

---------

Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* Make share=True into an environment variable, disabled by default for security
* fix floating point issue with std_reward in agent xray
* Update src/agentlab/analyze/inspect_results.py
* Update src/agentlab/analyze/agent_xray.py

---------

Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* added tmlr definitive config (#71)
* downgrading gradio version (#77)
* Study refactor (#73)
* adapting to new Benchmark class
* fixing tests
* fix tests
* typo
* not ready for gradio 5
* study id and a few fixes
* fixing pricy tests

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* adding message class and updating generic agent accordingly (#68)
* adding message class and updating generic agent accordingly
* updating tests
* Reproducibility test before message class
* Adding inspect_result.ipynb to reprod white list
* Reproducibility test after message class
* L1 before message class
* L1 after message class
* added append as method to the Discussion class, to make it totally similar to a list
* changed to_markdown behavior
* updated most_basic_agent
* updated ReproAgent
* Update src/agentlab/analyze/agent_xray.py
* format
* new journal entry
* immutable as default kwarg
* removing __add__ and __radd__
* added deprecation warning
* updating tests
* version bump
* Updating generic_agent to fit use BGym's goal_object (#83)
* updating generic agent to goal_object
* fixing image markdown display
* updating tests
* fixing instruction BaseMessage
* added merge text in discussion
* added merge to discussion class
* added tests
* Minor revert (#86)
* minor revert
* revert tests too
* Add tabs (#84)
* add tabs
* make sure it's not computed if not visible
* Fix reproduce study (#87)
* add tabs
* this workaround is worst
* bug fix
* fix reproduce study
* make sure it's not computed if not visible
* upgrading gradio dependency (#88)
* bgym update (#90)
* Workarena TMLR experiments (#89)
* new entry
* adding llm configs
* new journal entries
* handling sequential in VWA (#91)
* handling sequential in VWA
* enable comments
* format

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* Tmlr workarena (#92)
* adding llm configs
* new L1 entries
* tmp
* reformat
* adding assistantbench to reproducibility_util.py
* gitignore (#97)
* Vision fix (#105)
* changing content name
* Update src/agentlab/llm/llm_utils.py

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* L2 tmlr (#93)
* adding llm configs
* L2 entries
* claude L3
* claude vision support
* miniwob results
* 405b L1 entry
* Replacing Dask with Ray (#100)
* dask-dependencies
* minor
* replace with ray
* adjust tests and move a few things
* markdown report
* automatic relaunch
* add dependencies
* reformat
* fix unit-test
* catch timeout
* fixing bugs and making things work
* address comments and black format
* new dependencies viewer
* Update benchmark to use visualwebarena instead of webarena
* Fix import and uncomment code in get_ray_url.py
* Add ignore_dependencies option to Study and _agents_on_benchmark functions
* Update load_most_recent method to include contains parameter
* Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark
* Refactor backend preparation in Study class and improve logging for ignored dependencies
* finally some results with claude on webarena
* Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs
* black
* ensure timeout is int (For the 3rd time?)
* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function
* black
* Change parallel backend from "joblib" to "ray" in run_experiments function
* Update src/agentlab/experiments/study.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Update src/agentlab/analyze/inspect_results.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* switching to 2 for loops in _agents_on_benchmark (#107)
* yet another way to kill timedout jobs (#108)
* request is done once and then reused
* switched to caching original function bc it doesn't break to tests
* added a catch for some openrouter under-the-hood error

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* updating max prompt configs, vision support (#109)
* Cross-product deepcopy fix (#106)

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* slugify study_name (#114)
* Improve timeout handling in task polling logic
* Add method to override max_steps in Study class
* add support for tab visibility in observation flags and update related components
* fix tests
* Fix sorting bug. improve directory content retrieval with summary statistics
* fix test
* black
* Weblinx results (#104)
* adding weblinx results
* adding old weblinx results

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* Max new tokens fix (#118)
* Lower max_new_tokens for OpenAI models
* updating configs

---------

Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* version bump (#119)
* fix format (#120)
* Clean pipeline (#117)
* yet another way to kill timedout jobs
* Improve timeout handling in task polling logic
* Add method to override max_steps in Study class
* add support for tab visibility in observation flags and update related components
* fix tests
* black
* Improve timeout handling in task polling logic
* yet another way to kill timedout jobs (#108)
* Add method to override max_steps in Study class
* add support for tab visibility in observation flags and update related components
* fix tests
* black
* black
* Fix sorting bug. improve directory content retrieval with summary statistics
* fix test
* black
* tmp
* add error report, add cum cost to summary and ray backend by default
* black
* fix test (changing to joblib backend)
* black

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

---------

Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Léo Boisvert <leo.boisvert@hotmail.ca>
No description provided.