Skip to content

Conversation

@TLSDC
Copy link
Collaborator

@TLSDC TLSDC commented Oct 16, 2024

No description provided.

if elem["type"] == "text":
res.append(elem["text"])
elif elem["type"] == "image":
res.append(f"![]({elem['url']})")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow!! does that work?


def __add__(self, other: Union[str, BaseMessage, "Discussion"]):
if isinstance(other, BaseMessage):
res = deepcopy(self)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why should we deepcopy?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

safety for weird cases where disc2 = msg + disc1 would update disc1

Copy link
Collaborator

@recursix recursix Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we useing __add__ or __radd__ at the moment? I'm thinking we should not support this. Most of the use cases would just be add_message... right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah maybe i tried doing something too fancy, i'll change it

else:
raise ValueError(f"Cannot add a {type(other)} to a Discussion.")

def __radd__(self, other: Union[str, BaseMessage, "Discussion"]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because why not?

@TLSDC TLSDC requested a review from recursix October 17, 2024 18:38
@TLSDC
Copy link
Collaborator Author

TLSDC commented Oct 19, 2024

Pricy tests pass at this point I think we could merge @recursix

Comment on lines +13 to +16
ThibaultLSDC,GenericAgent-gpt-4o-mini-2024-07-18,miniwob,0.8.1,2024-10-17_10-13-28,,0.557,0.02,0,625/625,None,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.7,1.39.0,0.2.2,7bba275c004f1f90dfd83eaaab963ab5066e2baf,,0.8.1,None,
ThibaultLSDC,GenericAgent-gpt-4o-mini-2024-07-18,miniwob,0.8.1,2024-10-17_10-50-53,,0.563,0.02,0,625/625,None,Darwin (Darwin Kernel Version 23.6.0: Mon Jul 29 21:13:00 PDT 2024; root:xnu-10063.141.2~1/RELEASE_X86_64),3.12.7,1.39.0,0.2.2,057b7d4a201cc1cd1ebd7bc884f6a91e104c479d,,0.8.1,None,
ThibaultLSDC,GenericAgent-gpt-4o-mini-2024-07-18,workarena.l1,0.4.1,2024-10-17_17-30-43,,0.258,0.024,0,330/330,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.2,7bba275c004f1f90dfd83eaaab963ab5066e2baf,,0.8.1,None,
ThibaultLSDC,GenericAgent-gpt-4o-mini-2024-07-18,workarena.l1,0.4.1,2024-10-17_18-30-28,,0.273,0.025,0,330/330,None,Linux (#66-Ubuntu SMP Fri Aug 30 13:56:20 UTC 2024),3.12.7,1.39.0,0.2.2,8b2b3f39a2bdb9efafad97791536a0b8cff4e708,,0.8.1,None,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparison of before/after performances w/ 4o mini on miniwob and workarena.l1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this with our without the new benchmark class with new miniwob action space?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1st 3rd line are without
2nd 4th are with

Comment on lines +420 to +421
def append(self, message: BaseMessage | dict):
self.add_message(message)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes things even more retro-compatible, leaving many previous functions untouched

Comment on lines 585 to 589
if isinstance(chat_messages, Discussion):
return chat_messages.to_markdown()
messages = []
for i, m in enumerate(chat_messages):
if isinstance(m, BaseMessage): # TODO remove once langchain is deprecated
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Discussion class deprecates a lot code pieces but I figured it might be safer to keep for a while

Comment on lines 256 to 257
elif isinstance(prompt, list):
prompt_str = "\n".join([p["text"] for p in prompt if p["type"] == "text"])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be deprecated

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about we put a warning deprecated?

Comment on lines +585 to 590
if isinstance(chat_messages, Discussion):
return chat_messages.to_markdown()
messages = [] # TODO(ThibaultLSDC) remove this at some point
for i, m in enumerate(chat_messages):
if isinstance(m, BaseMessage): # TODO remove once langchain is deprecated
m = m.content
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Discussion class deprecates a lot code pieces but I figured it might be safer to keep for a while

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For backward compatibility? Perhaps we can wrap backward compatible code in some isolated function (no need to do now).

As long as we're at laeast forward compatible :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does that work with AgentInfo in browsergym. They type won't be Discussion since it's only defined in AgentLab.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't thought of that. I feel like it would be a weird usecase to use only browsergym on traces that were made with Agentlab though

"*/reproducibility_script.py",
"*reproducibility_journal.csv",
"*main.py",
"*inspect_results.ipynb",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's just convenient



class Discussion:
def __init__(self, messages: Union[list[BaseMessage], BaseMessage] = []):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use mutable as default argument.

Super dangerous here.

d1 = Discussion()
d2 = Discussion()

d1.add_message("allo")
d2.add_message("party")
# now d1 and d2 are both of len 2

Copy link
Collaborator

@recursix recursix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super good, juste need to modify the mutable default arg

@TLSDC TLSDC merged commit 98e5a22 into dev Oct 21, 2024
@TLSDC TLSDC deleted the message_class branch October 21, 2024 19:50
TLSDC added a commit that referenced this pull request Nov 7, 2024
* adding message class and updating generic agent accordingly

* updating tests

* Reproducibility test before message class

* Adding inspect_result.ipynb to reprod white list

* Reproducibility test after message class

* L1 before message class

* L1 after message class

* added append as method to the Discussion class, to make it totally similar to a list

* changed to_markdown behavior

* updated most_basic_agent

* updated ReproAgent

* Update src/agentlab/analyze/agent_xray.py

* format

* new journal entry

* immutable as default kwarg

* removing __add__ and __radd__

* added deprecation warning

* updating tests
TLSDC added a commit that referenced this pull request Nov 7, 2024
* Update unit_tests.yml (#101)

* request is done once and then reused

* Patching minor stuff (#69)

* fixing sample_std for single experience

* making gradio shared server non default

* missing requirement for xray

* Improve agent xray app (#70)

* 0.2.2 Release (#67)

* downgrading ubuntu version for github tests (#62)

* Llm api update (#59)

* getting rid of .invoke()

* adding an AbstractChatModel

* changing chat_api structure

* Reproducibility again (#61)

* core functions

* switch to dask

* removing joblib dependency and adding dask

* fixing imports

* handles multiple backends

* ensure asyncio loop creation

* more tests

* setting dashboard address to None

* minor

* Finally found a way to make it work

* initial reproducibility files

* Seems to be superflus

* adding a reproducibility journal

* minor update

* more robust

* adding reproducibility tools

* fix white listing

* minor

* minor

* minor

* minor

* minor fix

* more tests

* more results yay

* disabling this test

* update

* update

* black

* maybe fixing github workflow ?

* make get_git_username great again

* trigger change

* new browsergym

* GPT-4o result (and new comment column)

* Seems like there was a change to 4o flags, trying these

* minor comment

* better xray

* minor fix

* addming a comment field

* new agent

* another test with GPT-4o

* adding llama3 from openrouter

* fix naming

* unused import

* new summary tools and remove "_args" from columns in results

* add Llama

* initial code for reproducibility agent

* adjust inspect results

* infer from benchmark

* fix reproducibility agent

* prevent the repro_dir to be an index variable

* updating repro agent stats

* Reproducibility agent

* instructions to setup workarena

* fixing tests

* handles better a few edge cases

* default progress function to None

* minor formatting

* minor

* initial commit

* refactoring with Study class

* refactor to adapt for study class

* minor

* fix pricy test

* fixing tests

* tmp

* print report

* minor fix

* refine little details about reproducibility

* minor

* no need for set_temp anymore

* sanity check before running main

* minor update

* minor

* new results with 4o on workarena.l1

* sharing is caring

* add llama to main.py

* new hournal entry

* lamma 3 70B

* minor

* typo

* black fix (wasn't configured)

---------

Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>

* version bump

---------

Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* Make share=TRue into a environment variable, disabled by default for security

* fix floating point issue with std_reward in agent xray

* Update src/agentlab/analyze/inspect_results.py

* Update src/agentlab/analyze/agent_xray.py

---------

Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* added tmlr definitive config (#71)

* downgrading gradio version (#77)

* Study refactor (#73)

* adapting to new Benchmark class

* fixing tests

* fix tests

* typo

* not ready for gradio 5

* study id and a few fixes

* fixing pricy tests

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* adding message class and updating generic agent accordingly (#68)

* adding message class and updating generic agent accordingly

* updating tests

* Reproducibility test before message class

* Adding inspect_result.ipynb to reprod white list

* Reproducibility test after message class

* L1 before message class

* L1 after message class

* added append as method to the Discussion class, to make it totally similar to a list

* changed to_markdown behavior

* updated most_basic_agent

* updated ReproAgent

* Update src/agentlab/analyze/agent_xray.py

* format

* new journal entry

* immutable as default kwarg

* removing __add__ and __radd__

* added deprecation warning

* updating tests

* version bump

* Updating generic_agent to fit use BGym's goal_object (#83)

* updating generic agent to goal_object

* fixing image markdown display

* updating tests

* fixing intruction BaseMessage

* added merge text in discussion

* added merge to discussion class

* added tests

* Minor revert (#86)

* minor revert

* revert tests too

* Add tabs (#84)

* add tabs

* make sure it's not computed if not visible

* Fix reproduce study (#87)

* add tabs

* this workaround is worst

* bug fix

* fix reproduce study

* make sure it's not computed if not visible

* upgrading gradio dependency (#88)

* bgym update (#90)

* Workarena TMLR experiments (#89)

* new entry

* adding llm configs

* new journal entries

* handling sequntial in VWA (#91)

* handling sequntial in VWA

* enable comments

* format

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* Tmlr workarena (#92)

* adding llm configs

* new L1 entries

* tmp

* reformat

* adding assistantbench to reproducibility_util.py

* gitignore (#97)

* Vision fix (#105)

* changing content name

* Update src/agentlab/llm/llm_utils.py

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* L2 tmlr (#93)

* adding llm configs

* L2 entries

* claude L3

* claude vision support

* miniwob results

* 405b L1 entry

* Replacing Dask with Ray (#100)

* dask-dependencies

* minor

* replace with ray

* adjust tests and move a few things

* markdown report

* automatic relaunch

* add dependencies

* reformat

* fix unit-test

* catch timeout

* fixing bugs and making things work

* adress comments and black format

* new dependencies viewer

* Update benchmark to use visualwebarena instead of webarena

* Fix import and uncomment code in get_ray_url.py

* Add ignore_dependencies option to Study and _agents_on_benchmark functions

* Update load_most_recent method to include contains parameter

* Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark

* Refactor backend preparation in Study class and improve logging for ignored dependencies

* finallly some results with claude on webarena

* Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs

* black

* ensure timeout is int (For the 3rd time?)

* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function

* black

* Change parallel backend from "joblib" to "ray" in run_experiments function

* Update src/agentlab/experiments/study.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Update src/agentlab/analyze/inspect_results.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* switching to 2 for loops in _agents_on_benchmark (#107)

* yet another way to kill timedout jobs (#108)

* request is done once and then reused

* switched to caching original function bc it doesnt break to tests

* added a catch for some openrouter under-the-hood error

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
gasse added a commit that referenced this pull request Nov 8, 2024
* Update unit_tests.yml (#101)

* request is done once and then reused

* Patching minor stuff (#69)

* fixing sample_std for single experience

* making gradio shared server non default

* missing requirement for xray

* Improve agent xray app (#70)

* 0.2.2 Release (#67)

* downgrading ubuntu version for github tests (#62)

* Llm api update (#59)

* getting rid of .invoke()

* adding an AbstractChatModel

* changing chat_api structure

* Reproducibility again (#61)

* core functions

* switch to dask

* removing joblib dependency and adding dask

* fixing imports

* handles multiple backends

* ensure asyncio loop creation

* more tests

* setting dashboard address to None

* minor

* Finally found a way to make it work

* initial reproducibility files

* Seems to be superflus

* adding a reproducibility journal

* minor update

* more robust

* adding reproducibility tools

* fix white listing

* minor

* minor

* minor

* minor

* minor fix

* more tests

* more results yay

* disabling this test

* update

* update

* black

* maybe fixing github workflow ?

* make get_git_username great again

* trigger change

* new browsergym

* GPT-4o result (and new comment column)

* Seems like there was a change to 4o flags, trying these

* minor comment

* better xray

* minor fix

* addming a comment field

* new agent

* another test with GPT-4o

* adding llama3 from openrouter

* fix naming

* unused import

* new summary tools and remove "_args" from columns in results

* add Llama

* initial code for reproducibility agent

* adjust inspect results

* infer from benchmark

* fix reproducibility agent

* prevent the repro_dir to be an index variable

* updating repro agent stats

* Reproducibility agent

* instructions to setup workarena

* fixing tests

* handles better a few edge cases

* default progress function to None

* minor formatting

* minor

* initial commit

* refactoring with Study class

* refactor to adapt for study class

* minor

* fix pricy test

* fixing tests

* tmp

* print report

* minor fix

* refine little details about reproducibility

* minor

* no need for set_temp anymore

* sanity check before running main

* minor update

* minor

* new results with 4o on workarena.l1

* sharing is caring

* add llama to main.py

* new hournal entry

* lamma 3 70B

* minor

* typo

* black fix (wasn't configured)

---------

Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>

* version bump

---------

Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* Make share=TRue into a environment variable, disabled by default for security

* fix floating point issue with std_reward in agent xray

* Update src/agentlab/analyze/inspect_results.py

* Update src/agentlab/analyze/agent_xray.py

---------

Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* added tmlr definitive config (#71)

* downgrading gradio version (#77)

* Study refactor (#73)

* adapting to new Benchmark class

* fixing tests

* fix tests

* typo

* not ready for gradio 5

* study id and a few fixes

* fixing pricy tests

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* adding message class and updating generic agent accordingly (#68)

* adding message class and updating generic agent accordingly

* updating tests

* Reproducibility test before message class

* Adding inspect_result.ipynb to reprod white list

* Reproducibility test after message class

* L1 before message class

* L1 after message class

* added append as method to the Discussion class, to make it totally similar to a list

* changed to_markdown behavior

* updated most_basic_agent

* updated ReproAgent

* Update src/agentlab/analyze/agent_xray.py

* format

* new journal entry

* immutable as default kwarg

* removing __add__ and __radd__

* added deprecation warning

* updating tests

* version bump

* Updating generic_agent to fit use BGym's goal_object (#83)

* updating generic agent to goal_object

* fixing image markdown display

* updating tests

* fixing intruction BaseMessage

* added merge text in discussion

* added merge to discussion class

* added tests

* Minor revert (#86)

* minor revert

* revert tests too

* Add tabs (#84)

* add tabs

* make sure it's not computed if not visible

* Fix reproduce study (#87)

* add tabs

* this workaround is worst

* bug fix

* fix reproduce study

* make sure it's not computed if not visible

* upgrading gradio dependency (#88)

* bgym update (#90)

* Workarena TMLR experiments (#89)

* new entry

* adding llm configs

* new journal entries

* handling sequntial in VWA (#91)

* handling sequntial in VWA

* enable comments

* format

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* Tmlr workarena (#92)

* adding llm configs

* new L1 entries

* tmp

* reformat

* adding assistantbench to reproducibility_util.py

* gitignore (#97)

* Vision fix (#105)

* changing content name

* Update src/agentlab/llm/llm_utils.py

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* L2 tmlr (#93)

* adding llm configs

* L2 entries

* claude L3

* claude vision support

* miniwob results

* 405b L1 entry

* Replacing Dask with Ray (#100)

* dask-dependencies

* minor

* replace with ray

* adjust tests and move a few things

* markdown report

* automatic relaunch

* add dependencies

* reformat

* fix unit-test

* catch timeout

* fixing bugs and making things work

* adress comments and black format

* new dependencies viewer

* Update benchmark to use visualwebarena instead of webarena

* Fix import and uncomment code in get_ray_url.py

* Add ignore_dependencies option to Study and _agents_on_benchmark functions

* Update load_most_recent method to include contains parameter

* Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark

* Refactor backend preparation in Study class and improve logging for ignored dependencies

* finallly some results with claude on webarena

* Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs

* black

* ensure timeout is int (For the 3rd time?)

* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function

* black

* Change parallel backend from "joblib" to "ray" in run_experiments function

* Update src/agentlab/experiments/study.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Update src/agentlab/analyze/inspect_results.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* switching to 2 for loops in _agents_on_benchmark (#107)

* yet another way to kill timedout jobs (#108)

* request is done once and then reused

* switched to caching original function bc it doesnt break to tests

* added a catch for some openrouter under-the-hood error

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
TLSDC added a commit that referenced this pull request Nov 13, 2024
* downgrading ubuntu version for github tests (#62)

* Llm api update (#59)

* getting rid of .invoke()

* adding an AbstractChatModel

* changing chat_api structure

* Reproducibility again (#61)

* core functions

* switch to dask

* removing joblib dependency and adding dask

* fixing imports

* handles multiple backends

* ensure asyncio loop creation

* more tests

* setting dashboard address to None

* minor

* Finally found a way to make it work

* initial reproducibility files

* Seems to be superflus

* adding a reproducibility journal

* minor update

* more robust

* adding reproducibility tools

* fix white listing

* minor

* minor

* minor

* minor

* minor fix

* more tests

* more results yay

* disabling this test

* update

* update

* black

* maybe fixing github workflow ?

* make get_git_username great again

* trigger change

* new browsergym

* GPT-4o result (and new comment column)

* Seems like there was a change to 4o flags, trying these

* minor comment

* better xray

* minor fix

* addming a comment field

* new agent

* another test with GPT-4o

* adding llama3 from openrouter

* fix naming

* unused import

* new summary tools and remove "_args" from columns in results

* add Llama

* initial code for reproducibility agent

* adjust inspect results

* infer from benchmark

* fix reproducibility agent

* prevent the repro_dir to be an index variable

* updating repro agent stats

* Reproducibility agent

* instructions to setup workarena

* fixing tests

* handles better a few edge cases

* default progress function to None

* minor formatting

* minor

* initial commit

* refactoring with Study class

* refactor to adapt for study class

* minor

* fix pricy test

* fixing tests

* tmp

* print report

* minor fix

* refine little details about reproducibility

* minor

* no need for set_temp anymore

* sanity check before running main

* minor update

* minor

* new results with 4o on workarena.l1

* sharing is caring

* add llama to main.py

* new hournal entry

* lamma 3 70B

* minor

* typo

* black fix (wasn't configured)

---------

Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>

* version bump

* Patching minor stuff (#69)

* fixing sample_std for single experience

* making gradio shared server non default

* missing requirement for xray

* Improve agent xray app (#70)

* 0.2.2 Release (#67)

* downgrading ubuntu version for github tests (#62)

* Llm api update (#59)

* getting rid of .invoke()

* adding an AbstractChatModel

* changing chat_api structure

* Reproducibility again (#61)

* core functions

* switch to dask

* removing joblib dependency and adding dask

* fixing imports

* handles multiple backends

* ensure asyncio loop creation

* more tests

* setting dashboard address to None

* minor

* Finally found a way to make it work

* initial reproducibility files

* Seems to be superflus

* adding a reproducibility journal

* minor update

* more robust

* adding reproducibility tools

* fix white listing

* minor

* minor

* minor

* minor

* minor fix

* more tests

* more results yay

* disabling this test

* update

* update

* black

* maybe fixing github workflow ?

* make get_git_username great again

* trigger change

* new browsergym

* GPT-4o result (and new comment column)

* Seems like there was a change to 4o flags, trying these

* minor comment

* better xray

* minor fix

* addming a comment field

* new agent

* another test with GPT-4o

* adding llama3 from openrouter

* fix naming

* unused import

* new summary tools and remove "_args" from columns in results

* add Llama

* initial code for reproducibility agent

* adjust inspect results

* infer from benchmark

* fix reproducibility agent

* prevent the repro_dir to be an index variable

* updating repro agent stats

* Reproducibility agent

* instructions to setup workarena

* fixing tests

* handles better a few edge cases

* default progress function to None

* minor formatting

* minor

* initial commit

* refactoring with Study class

* refactor to adapt for study class

* minor

* fix pricy test

* fixing tests

* tmp

* print report

* minor fix

* refine little details about reproducibility

* minor

* no need for set_temp anymore

* sanity check before running main

* minor update

* minor

* new results with 4o on workarena.l1

* sharing is caring

* add llama to main.py

* new hournal entry

* lamma 3 70B

* minor

* typo

* black fix (wasn't configured)

---------

Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>

* version bump

---------

Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* Make share=TRue into a environment variable, disabled by default for security

* fix floating point issue with std_reward in agent xray

* Update src/agentlab/analyze/inspect_results.py

* Update src/agentlab/analyze/agent_xray.py

---------

Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* added tmlr definitive config (#71)

* downgrading gradio version (#77)

* Study refactor (#73)

* adapting to new Benchmark class

* fixing tests

* fix tests

* typo

* not ready for gradio 5

* study id and a few fixes

* fixing pricy tests

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* adding message class and updating generic agent accordingly (#68)

* adding message class and updating generic agent accordingly

* updating tests

* Reproducibility test before message class

* Adding inspect_result.ipynb to reprod white list

* Reproducibility test after message class

* L1 before message class

* L1 after message class

* added append as method to the Discussion class, to make it totally similar to a list

* changed to_markdown behavior

* updated most_basic_agent

* updated ReproAgent

* Update src/agentlab/analyze/agent_xray.py

* format

* new journal entry

* immutable as default kwarg

* removing __add__ and __radd__

* added deprecation warning

* updating tests

* version bump

* Updating generic_agent to fit use BGym's goal_object (#83)

* updating generic agent to goal_object

* fixing image markdown display

* updating tests

* fixing intruction BaseMessage

* added merge text in discussion

* added merge to discussion class

* added tests

* Minor revert (#86)

* minor revert

* revert tests too

* Add tabs (#84)

* add tabs

* make sure it's not computed if not visible

* Fix reproduce study (#87)

* add tabs

* this workaround is worst

* bug fix

* fix reproduce study

* make sure it's not computed if not visible

* upgrading gradio dependency (#88)

* bgym update (#90)

* Workarena TMLR experiments (#89)

* new entry

* adding llm configs

* new journal entries

* handling sequntial in VWA (#91)

* handling sequntial in VWA

* enable comments

* format

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* Tmlr workarena (#92)

* adding llm configs

* new L1 entries

* tmp

* reformat

* adding assistantbench to reproducibility_util.py

* gitignore (#97)

* Vision fix (#105)

* changing content name

* Update src/agentlab/llm/llm_utils.py

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* L2 tmlr (#93)

* adding llm configs

* L2 entries

* claude L3

* claude vision support

* miniwob results

* 405b L1 entry

* Replacing Dask with Ray (#100)

* dask-dependencies

* minor

* replace with ray

* adjust tests and move a few things

* markdown report

* automatic relaunch

* add dependencies

* reformat

* fix unit-test

* catch timeout

* fixing bugs and making things work

* adress comments and black format

* new dependencies viewer

* Update benchmark to use visualwebarena instead of webarena

* Fix import and uncomment code in get_ray_url.py

* Add ignore_dependencies option to Study and _agents_on_benchmark functions

* Update load_most_recent method to include contains parameter

* Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark

* Refactor backend preparation in Study class and improve logging for ignored dependencies

* finallly some results with claude on webarena

* Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs

* black

* ensure timeout is int (For the 3rd time?)

* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function

* black

* Change parallel backend from "joblib" to "ray" in run_experiments function

* Update src/agentlab/experiments/study.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Update src/agentlab/analyze/inspect_results.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* switching to 2 for loops in _agents_on_benchmark (#107)

* yet another way to kill timedout jobs (#108)

* Fix prompt formatting in Observation and add static method to Study class (#110)

* Bug fix (#111)

* Fix prompt formatting in Observation and add static method to Study class

* Update gradio version to 5.5 to fix DataFrame scrolling issue

* Fixing openrouter pricing rate limit (#112)

* Update unit_tests.yml (#101)

* request is done once and then reused

* Patching minor stuff (#69)

* fixing sample_std for single experience

* making gradio shared server non default

* missing requirement for xray

* Improve agent xray app (#70)

* 0.2.2 Release (#67)

* downgrading ubuntu version for github tests (#62)

* Llm api update (#59)

* getting rid of .invoke()

* adding an AbstractChatModel

* changing chat_api structure

* Reproducibility again (#61)

* core functions

* switch to dask

* removing joblib dependency and adding dask

* fixing imports

* handles multiple backends

* ensure asyncio loop creation

* more tests

* setting dashboard address to None

* minor

* Finally found a way to make it work

* initial reproducibility files

* Seems to be superflus

* adding a reproducibility journal

* minor update

* more robust

* adding reproducibility tools

* fix white listing

* minor

* minor

* minor

* minor

* minor fix

* more tests

* more results yay

* disabling this test

* update

* update

* black

* maybe fixing github workflow ?

* make get_git_username great again

* trigger change

* new browsergym

* GPT-4o result (and new comment column)

* Seems like there was a change to 4o flags, trying these

* minor comment

* better xray

* minor fix

* addming a comment field

* new agent

* another test with GPT-4o

* adding llama3 from openrouter

* fix naming

* unused import

* new summary tools and remove "_args" from columns in results

* add Llama

* initial code for reproducibility agent

* adjust inspect results

* infer from benchmark

* fix reproducibility agent

* prevent the repro_dir to be an index variable

* updating repro agent stats

* Reproducibility agent

* instructions to setup workarena

* fixing tests

* handles better a few edge cases

* default progress function to None

* minor formatting

* minor

* initial commit

* refactoring with Study class

* refactor to adapt for study class

* minor

* fix pricy test

* fixing tests

* tmp

* print report

* minor fix

* refine little details about reproducibility

* minor

* no need for set_temp anymore

* sanity check before running main

* minor update

* minor

* new results with 4o on workarena.l1

* sharing is caring

* add llama to main.py

* new hournal entry

* lamma 3 70B

* minor

* typo

* black fix (wasn't configured)

---------

Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>

* version bump

---------

Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* Make share=TRue into a environment variable, disabled by default for security

* fix floating point issue with std_reward in agent xray

* Update src/agentlab/analyze/inspect_results.py

* Update src/agentlab/analyze/agent_xray.py

---------

Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* added tmlr definitive config (#71)

* downgrading gradio version (#77)

* Study refactor (#73)

* adapting to new Benchmark class

* fixing tests

* fix tests

* typo

* not ready for gradio 5

* study id and a few fixes

* fixing pricy tests

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* adding message class and updating generic agent accordingly (#68)

* adding message class and updating generic agent accordingly

* updating tests

* Reproducibility test before message class

* Adding inspect_result.ipynb to reprod white list

* Reproducibility test after message class

* L1 before message class

* L1 after message class

* added append as method to the Discussion class, to make it totally similar to a list

* changed to_markdown behavior

* updated most_basic_agent

* updated ReproAgent

* Update src/agentlab/analyze/agent_xray.py

* format

* new journal entry

* immutable as default kwarg

* removing __add__ and __radd__

* added deprecation warning

* updating tests

* version bump

* Updating generic_agent to fit use BGym's goal_object (#83)

* updating generic agent to goal_object

* fixing image markdown display

* updating tests

* fixing intruction BaseMessage

* added merge text in discussion

* added merge to discussion class

* added tests

* Minor revert (#86)

* minor revert

* revert tests too

* Add tabs (#84)

* add tabs

* make sure it's not computed if not visible

* Fix reproduce study (#87)

* add tabs

* this workaround is worst

* bug fix

* fix reproduce study

* make sure it's not computed if not visible

* upgrading gradio dependency (#88)

* bgym update (#90)

* Workarena TMLR experiments (#89)

* new entry

* adding llm configs

* new journal entries

* handling sequntial in VWA (#91)

* handling sequntial in VWA

* enable comments

* format

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* Tmlr workarena (#92)

* adding llm configs

* new L1 entries

* tmp

* reformat

* adding assistantbench to reproducibility_util.py

* gitignore (#97)

* Vision fix (#105)

* changing content name

* Update src/agentlab/llm/llm_utils.py

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* L2 tmlr (#93)

* adding llm configs

* L2 entries

* claude L3

* claude vision support

* miniwob results

* 405b L1 entry

* Replacing Dask with Ray (#100)

* dask-dependencies

* minor

* replace with ray

* adjust tests and move a few things

* markdown report

* automatic relaunch

* add dependencies

* reformat

* fix unit-test

* catch timeout

* fixing bugs and making things work

* adress comments and black format

* new dependencies viewer

* Update benchmark to use visualwebarena instead of webarena

* Fix import and uncomment code in get_ray_url.py

* Add ignore_dependencies option to Study and _agents_on_benchmark functions

* Update load_most_recent method to include contains parameter

* Update load_most_recent method to accept contains parameter and add warning for ignored dependencies in _agents_on_benchmark

* Refactor backend preparation in Study class and improve logging for ignored dependencies

* finallly some results with claude on webarena

* Add warnings for Windows timeouts and clarify parallel backend options; update get_results method to conditionally save outputs

* black

* ensure timeout is int (For the 3rd time?)

* Refactor timeout handling in context manager; update test to reduce avg_step_timeout and rename test function

* black

* Change parallel backend from "joblib" to "ray" in run_experiments function

* Update src/agentlab/experiments/study.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Update src/agentlab/analyze/inspect_results.py

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Refactor logging initialization and update layout configurations in dependency graph plotting; adjust node size and font size for better visualization

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* switching to 2 for loops in _agents_on_benchmark (#107)

* yet another way to kill timedout jobs (#108)

* request is done once and then reused

* switched to caching original function bc it doesnt break to tests

* added a catch for some openrouter under-the-hood error

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>

* updating max prompt configs, vision support (#109)

* Cross-product deepcopy fix (#106)

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* slugify study_name (#114)

* Improve timeout handling in task polling logic

* Add method to override max_steps in Study class

* add support for tab visibility in observation flags and update related components

* fix tests

* Fix sorting bug.
 improve directory content retrieval with summary statistics

* fix test

* black

* Weblinx results (#104)

* adding weblinx results

* adding old weblinx results

---------

Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* Max new tokens fix (#118)

* Lower max_new_tokens for OpenAI models

* updating configs

---------

Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: ThibaultLSDC <thibault.de.chezelles@gmail.com>

* version bump (#119)

* fix format (#120)

* Clean pipeline (#117)

* yet another way to kill timedout jobs

* Improve timeout handling in task polling logic

* Add method to override max_steps in Study class

* add support for tab visibility in observation flags and update related components

* fix tests

* black

* Improve timeout handling in task polling logic

* yet another way to kill timedout jobs (#108)

* Add method to override max_steps in Study class

* add support for tab visibility in observation flags and update related components

* fix tests

* black

* black

* Fix sorting bug.
 improve directory content retrieval with summary statistics

* fix test

* black

* tmp

* add error report, add cum cost to summary and ray backend by default

* black

* fix test (chaing to joblib backend)

* black

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

---------

Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Léo Boisvert <leo.boisvert@hotmail.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants