[PAPER REVIEW] WebArena

SOOH · May 12, 2024


WEBARENA: A REALISTIC WEB ENVIRONMENT FOR BUILDING AUTONOMOUS AGENTS

Background

Existing agents were built and tested in previously constructed environments, that is, environments disconnected from real-world scenarios. This motivated work on a highly realistic and reproducible environment for language-guided agents. Agents are asked to perform tasks in a real web environment across four domains: e-commerce, social forum discussions, collaborative software development, and content management.

On the benchmark tasks, GPT-4 reaches a success rate of 14.41%, while humans reach 78.24%. This gap points to the need for more robust agents. WebArena can be used to measure performance on this kind of real-life task.

The limitations of earlier environments lead to a mismatch between simulated environments and the real world, which in turn results in AI agents that generalize poorly.

WebArena

WEBSITES AS AN ENVIRONMENT FOR AUTONOMOUS AGENTS

a realistic and reproducible web environment designed to facilitate the development of autonomous agents capable of executing tasks.

WebArena is a standalone, self-hostable web environment for building autonomous agents.

Controlling agents through high-level natural language

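$E = \langle S, A, O, T \rangle$, where $S$ is the state space, $A$ the action space, $O$ the observation space, and $T : S \times A \to S$ the deterministic transition function.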

The task is specified as a natural language intent $i$, and each step is only partially observable, i.e. the setting is a POMDP (partially observable Markov decision process).

Taking an action $a_t$ in the current state $s_t$ produces a new state $s_{t+1}$, and the agent then receives the corresponding observation $o_{t+1}$. To measure success once task execution has stopped, a reward function $r(a, s)$ is used. Here $a$ denotes the full sequence of actions and $s$ denotes all states, including the intermediate ones. The reward function evaluates whether the resulting state changes align with what the user's intent expects.
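The loop described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the WebArena API: the two-page "shop", the action names (`goto_product`, `add_to_cart`, `stop`), and the functions `transition`, `observe`, `reward`, `run_episode`, and `scripted_policy` are all hypothetical. It shows a deterministic transition $T$, a partial observation $o_t$, and a reward $r(a, s)$ computed over the whole trajectory after execution stops.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]


def transition(state: State, action: str) -> State:
    """Deterministic T: S x A -> S for a toy two-page shop (hypothetical)."""
    new_state: State = {"page": state["page"], "cart": list(state["cart"])}
    if action == "goto_product":
        new_state["page"] = "product"
    elif action == "add_to_cart" and state["page"] == "product":
        new_state["cart"].append("mug")
    return new_state


def observe(state: State) -> str:
    """Partial observation: the agent sees the page but not the cart."""
    return f"page={state['page']}"


def reward(actions: List[str], states: List[State]) -> float:
    """r(a, s): 1.0 if the final state satisfies the intent, else 0.0."""
    return 1.0 if "mug" in states[-1]["cart"] else 0.0


def run_episode(
    intent: str,
    policy: Callable[[str, str, List[str]], str],
    max_steps: int = 5,
) -> Tuple[List[str], List[State], float]:
    state: State = {"page": "home", "cart": []}
    states, actions = [state], []
    obs = observe(state)
    for _ in range(max_steps):
        # The policy conditions on the intent, the observation, and history.
        action = policy(intent, obs, actions)
        if action == "stop":
            break
        state = transition(state, action)  # s_t, a_t -> s_{t+1}
        obs = observe(state)               # o_{t+1}
        actions.append(action)
        states.append(state)
    return actions, states, reward(actions, states)


def scripted_policy(intent: str, obs: str, history: List[str]) -> str:
    """Hand-written stand-in for an LLM agent."""
    if obs == "page=home":
        return "goto_product"
    if "add_to_cart" not in history:
        return "add_to_cart"
    return "stop"


actions, states, r = run_episode("Put a mug in the cart", scripted_policy)
```

Note that the reward inspects the resulting state rather than comparing the action sequence to a reference, which mirrors the idea of judging whether the state changes match the intent.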
