In-context learning (ICL) is an important emergent capability of LLMs.
Since GPT-3, ICL has been the subject of a growing body of research.
Many applications demand the use of ML models for decision making.
Decision-making agents must possess the ability to explore, not just exploit.
There are also recent papers on in-context reinforcement learning (ICRL).
The paper deploys LLMs to solve multi-armed bandit (MAB) problems
and evaluates their in-context exploratory behavior.
Tested GPT-3.5, GPT-4, and LLaMA2.
Only a single configuration (prompt + model) showed satisfactory exploratory behavior.
The most common failure mode is the suffix failure: the agent fails to select the best arm even once after some initial rounds.
GPT-4 with the basic prompt exhibited suffix failures in over 60% of runs.
The other failure mode is uniform-like behavior, where the LLM selects all arms roughly equally.
The successful configuration:
GPT-4 + enhanced prompt.
A state-of-the-art model can explore robustly if the prompt is designed carefully,
but it may still fail in more complex environments.
In-context bandit learning is hard.
The paper identifies surrogate statistics as diagnostics for long-term exploration failure.
Uses a standard MAB variant: stochastic Bernoulli bandits.
Rewards of arms not chosen by the agent are not revealed, so exploration is necessary to identify the best arm.
Focus on MAB instances where the best arm has a higher mean reward and all other arms share a common lower mean reward; the gap between them controls difficulty.
One setting of the arm count and gap is designated the 'hard' instance,
and another the 'easy' instance.
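A minimal environment sketch of such an instance. The means 0.5 ± gap/2 and the parameter values are assumptions for illustration; the exact hard/easy settings are not reproduced in these notes:

```python
import numpy as np

class BernoulliBandit:
    """Stochastic Bernoulli bandit: one best arm, all other arms share a lower mean.
    The 0.5 +/- gap/2 structure is an illustrative assumption."""

    def __init__(self, n_arms: int, gap: float, rng: np.random.Generator):
        self.rng = rng
        self.means = np.full(n_arms, 0.5 - gap / 2)
        self.best_arm = int(rng.integers(n_arms))
        self.means[self.best_arm] = 0.5 + gap / 2

    def pull(self, arm: int) -> int:
        # Only the reward of the pulled arm is revealed to the agent.
        return int(self.rng.random() < self.means[arm])
```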
Prompt designs vary along five attributes:
scenario
framing
history presentation
requested final answer
reasoning method (whether CoT is requested)
The basic prompt: buttons scenario / neutral framing / raw history / return only the chosen arm / no CoT.
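A hypothetical sketch of how these five attributes could be combined into a prompt. The wording below is illustrative, not the paper's actual prompt text; the suggestive-framing sentence and the summarized-history format are assumptions:

```python
from collections import defaultdict

def build_prompt(history, scenario="buttons", framing="neutral",
                 history_format="raw", answer_format="single_arm",
                 use_cot=False):
    """Assemble a bandit prompt from the five attributes (illustrative wording).
    `history` is a list of (arm, reward) pairs observed so far."""
    lines = []
    lines.append("You are choosing between buttons that yield reward 0 or 1."
                 if scenario == "buttons" else
                 "You are choosing which option to present to a user.")
    if framing == "suggestive":
        lines.append("You need to balance exploration and exploitation.")
    if history_format == "raw":
        lines += [f"Round {t}: chose arm {a}, got reward {r}"
                  for t, (a, r) in enumerate(history)]
    else:  # summarized history: per-arm pull counts and average rewards
        stats = defaultdict(lambda: [0, 0.0])
        for a, r in history:
            stats[a][0] += 1
            stats[a][1] += r
        lines += [f"Arm {a}: pulled {n} times, average reward {s / n:.2f}"
                  for a, (n, s) in sorted(stats.items())]
    if use_cot:
        lines.append("Reason step by step, then state your final choice.")
    lines.append("Answer with the single arm you will pull next."
                 if answer_format == "single_arm" else
                 "Answer with a probability distribution over the arms.")
    return "\n".join(lines)
```

With the default arguments this corresponds to the basic prompt above; passing framing="suggestive", history_format="summarized", use_cot=True approximates the enhanced prompt (an assumption about its exact makeup).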
Baselines: two standard MAB algorithms, UCB and Thompson Sampling (TS),
plus Greedy (which never explores and eventually fails),
with no parameter tuning.
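A minimal sketch of the three untuned baselines against the Bernoulli environment above (textbook versions, not the paper's exact implementations):

```python
import numpy as np

class UCB:
    """UCB1-style index policy, no tuning."""
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.sums = np.zeros(n_arms)

    def select(self, t):
        if (self.counts == 0).any():
            return int(np.argmin(self.counts))          # pull each arm once first
        means = self.sums / self.counts
        bonus = np.sqrt(2 * np.log(t + 1) / self.counts)
        return int(np.argmax(means + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward

class ThompsonSampling:
    """Beta-Bernoulli Thompson sampling with a uniform prior."""
    def __init__(self, n_arms, rng):
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)
        self.rng = rng

    def select(self, t):
        return int(np.argmax(self.rng.beta(self.alpha, self.beta)))

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

class Greedy:
    """Always picks the empirically best arm; never explores, so it can
    lock onto a suboptimal arm forever (the failure mode noted above)."""
    def __init__(self, n_arms):
        self.counts = np.full(n_arms, 1e-9)             # avoid division by zero
        self.sums = np.zeros(n_arms)

    def select(self, t):
        return int(np.argmax(self.sums / self.counts))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward
```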
1,000 replicates for each baseline and each MAB instance.
Time horizon T = 100 for the main LLM experiments.
Far fewer replicates for each LLM configuration and bandit instance.
A single additional experiment on GPT-4 with the basic configuration at a longer horizon (T = 500) as a robustness check.
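A sketch of the replicate loop tying the environment and agents above together; it records which arm was chosen at each round, which is what the surrogate statistics below are computed from. The example parameters are illustrative, not the paper's exact settings:

```python
import numpy as np

def run_replicates(make_agent, n_arms, gap, horizon, n_replicates, seed=0):
    """Run independent replicates; return per-round arm choices and the best arm
    of each replicate. Reuses the BernoulliBandit and agent sketches above."""
    rng = np.random.default_rng(seed)
    choices = np.zeros((n_replicates, horizon), dtype=int)
    best_arms = np.zeros(n_replicates, dtype=int)
    for rep in range(n_replicates):
        env = BernoulliBandit(n_arms, gap, rng)
        agent = make_agent(n_arms, rng)
        for t in range(horizon):
            arm = agent.select(t)
            reward = env.pull(arm)
            agent.update(arm, reward)
            choices[rep, t] = arm
        best_arms[rep] = env.best_arm
    return choices, best_arms

# Illustrative usage (parameter values are placeholders):
# choices, best_arms = run_replicates(lambda k, rng: UCB(k),
#                                     n_arms=5, gap=0.2,
#                                     horizon=100, n_replicates=1000)
```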
in detail
GPT-3.5
GPT-4
GPT-4 (additional robustness check)
LLaMA2
This amounts to many LLM queries for each configuration and MAB instance.
Both exploration failures are less frequent in easier MAB instances
To cover the extremely large space of prompt designs, the experiments keep the per-configuration scale (N and T) small while testing many configurations.
Such small N and T do not provide enough statistical power to distinguish between successful and unsuccessful methods,
so the paper relies on surrogate statistics that can be detected at this moderate scale rather than scaling up.
All but one LLM config failed to converge to the best arm with significant probability
Suffix Failures
Uniform-like failures
The only exception is GPT-4 with the enhanced prompt.
Fig 3: summarizes the main set of experiments (hard MAB instance).
It reports the two surrogate statistics (SuffFailFreq and MinFrac, defined below)
and shows another statistic, GreedyFrac (how similar a method is to Greedy).
Only GPT-4 with the enhanced prompt tracks the baselines TS and UCB.
Most of the LLM configurations exhibit bimodal behavior.
Consistent with this, suffix failures occur frequently,
which suggests a long-term failure to explore.
For an experiment replicate and round t, a suffix failure at t occurs when the best arm is never chosen in rounds [t, T]; SuffFailFreq(t) is the fraction of replicates with a suffix failure at round t (sketched in code below).
The basic configuration on GPT-4 is shown in Fig 1 (top) for T = 500 and in Fig 5 for T = 100.
The bimodal behavior is visible in the left plot.
LLMs have much higher SuffFailFreq than UCB and TS.
Because T = 100 is not long enough, suffix failures are not fully reflected in Fig 5 (right).
In Fig 1, suffix failures translate into large differences in reward at large T.
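A sketch of computing SuffFailFreq from the recorded choices, per the definition above (arrays come from the run_replicates sketch):

```python
import numpy as np

def suff_fail_freq(choices, best_arms, t):
    """Fraction of replicates in which the best arm is never chosen in rounds t..T-1.
    choices: (n_replicates, horizon) array of chosen arms;
    best_arms: (n_replicates,) array of each replicate's best arm."""
    never_best_after_t = (choices[:, t:] != best_arms[:, None]).all(axis=1)
    return float(never_best_after_t.mean())
```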
In Fig 3 (left), three GPT-4 configurations avoid suffix failures,
but two of these show uniform-like failures (a failure to exploit).
For an experiment replicate and round t, let f_a(t) be the fraction of rounds in [1, t] in which arm a is chosen; MinFrac(t) is the minimum of f_a(t) over arms, averaged over replicates (sketched in code below).
For the LLMs, MinFrac(t) does not decrease over time and stays larger than that of the baselines.
The two GPT-4 configurations that avoid suffix failures but show uniform-like failures (BNRND, BSSCD) both use a distributional output (they return a distribution over arms).
Their MinFrac(t) does not decrease over time, whereas the baselines' does,
so at longer horizons T they earn much lower reward than the baselines.
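A sketch of MinFrac under the definition given above (a value that stays near 1/K over time indicates uniform-like behavior):

```python
import numpy as np

def min_frac(choices, n_arms, t):
    """Average over replicates of min_a f_a(t), where f_a(t) is the fraction of
    rounds 1..t in which arm a was chosen. Requires t >= 1."""
    prefix = choices[:, :t]
    counts = np.stack([(prefix == a).sum(axis=1) for a in range(n_arms)], axis=1)
    return float((counts.min(axis=1) / t).mean())
```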
All LLM configurations except GPT-4 with the enhanced prompt exhibit either a suffix failure or a uniform-like failure on the hard MAB instance.
The other experiments show similar results.
summary
GPT-4 performed much better than GPT-3.5
LLaMA 2 performed much worse
all LLMs are sensitive to small changes in the prompt design
GPT-4 with the enhanced prompt:
this configuration was also run on the hard MAB instance under additional conditions as an ablation;
it worked well at the longer horizon,
but showed a non-trivial fraction of suffix failures (Fig 1(b)).
See also Fig 8 and Fig 9.
Per-round decisions with GPT-3.5:
each experiment considers a particular distribution of bandit histories.
Sampled 50 histories of a fixed length from that distribution
and tracked two statistics for each agent.
Data sources: histories generated by uniform-at-random arm choices, and histories generated by UCB and TS.
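A sketch of this per-round evaluation, reusing the agent interface above; for an LLM agent the decision would instead come from a model call on a prompt built from the same history (e.g., via the build_prompt sketch):

```python
def per_round_decisions(agent_factory, histories, n_arms):
    """Replay each sampled history into a fresh agent and record the arm it
    would choose next (one decision per history).
    histories: iterable of lists of (arm, reward) pairs."""
    decisions = []
    for history in histories:
        agent = agent_factory(n_arms)
        for arm, reward in history:
            agent.update(arm, reward)
        decisions.append(agent.select(len(history)))
    return decisions
```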
The per-round performance of both the LLMs and the baselines is very sensitive to the data source.
Some configurations are too greedy and others too uniform,
while other configurations fall within the reasonable range spanned by the baselines even though they failed in the longitudinal experiments.
So it is hard to assess whether LLM agents are too greedy or too uniform based on per-round decisions alone.
Experiment with other prompts
Experiment with few-shot prompting
Train the LLM to use auxiliary tools
The simple MAB setting provides a clean and controllable setup.
Similar failures likely also occur in more complex RL and decision-making settings,
and a solution for MAB may not generalize to those settings.
Even for linear contextual bandits, this approach may not be applicable without substantial intervention.
This paper checks how much capability LLMs have from the perspective of ICRL rather than ICSL (in-context supervised learning). The problem is simple, but given how much research is now being done on LLM agents, it should serve as a useful baseline. Even though it is a simple problem by RL standards, the fact that it is only solvable with GPT-4 plus careful prompting suggests it is not easy to approach with current LLM capabilities.