"optimize the parallel function calling performance of LLMs"
pioneering work in function calling
ReAct Architecture
ReAct Limitation
However, scaling the ReAct approach to more complex applications requires considerable optimization.
This is due to the sequential nature of ReAct,
which executes function calls and reasons about their observations one after another.
limitation and effect 1) agent systems that extend ReAct can suffer latency and cost inefficiencies, due to sequential function calling and repetitive LLM invocations for each reasoning and action step
limitation and effect 2) while dynamic reasoning about observations has benefits in certain cases, concatenating the outcomes of intermediate function calls can disrupt the LLM's execution flow, potentially reducing accuracy (Xu et al., 2023)
Common failure cases include repetitive invocation of the same function, which is also highlighted in the original paper (Yao et al., 2022).
LLMCompiler's inspiration
classical compilers,
where optimizing instruction execution in traditional programming languages has been extensively explored.
key optimization technique in compilers
: identifying instructions that can be executed in parallel and effectively managing their dependencies.
(fig 2)
1. function calling planner (LLMs)
identifies an execution flow.
generates a sequence of tasks and their dependencies
example
Planner's Role
using the LLM's reasoning capability to automatically identify
the necessary tasks,
their input arguments,
their inter-dependencies
essentially forming a directed acyclic graph (DAG) of task dependencies.
uses placeholder variables
If a task depends on a preceding task,
it incorporates a placeholder variable,
such as $1 in Task 3 of Fig. 2,
which is later substituted with the actual output from the preceding task.
in short, it takes natural-language input and decomposes it into tasks
pre-defined prompt
pre-defined prompt that guides it on how to create dependency graphs and ensures correct syntax
users are only required to provide
tool definitions
tool descriptions
argument specifications.
This is essentially the same requirement as other frameworks like ReAct and OpenAI function calling.
optional in-context examples
examples of how the Planner should behave.
examples illustrating expected inter-task dependencies for certain queries.
These examples can help the Planner LLM understand how to use various tools and generate the appropriate dependency graph for incoming inputs in the correct format.
- help the planner to better understand the rules.
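A minimal sketch of what a Planner-emitted plan could look like, assuming a toy `Task` record. The `$N` placeholder convention follows the paper's Fig. 2 example, but the field names and schema here are illustrative, not the paper's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    idx: int                 # task id, referenced by "$idx" placeholders
    tool: str                # name of a user-provided tool
    args: list               # may contain placeholders like "$1"
    deps: list = field(default_factory=list)  # ids of prerequisite tasks

# A plan the Planner might emit for the market-cap query discussed below:
plan = [
    Task(1, "search", ["Microsoft market cap"]),
    Task(2, "search", ["Apple market cap"]),
    Task(3, "math", ["$2 / $1"], deps=[1, 2]),  # $1/$2 filled in at runtime
]

# deps always point to earlier tasks, so the plan forms a DAG
assert all(d < t.idx for t in plan for d in t.deps)
```

Tasks 1 and 2 share no dependencies, so downstream components are free to run them in parallel; Task 3 must wait for both.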
2. task fetching unit
dispatches the function calls in parallel
replaces placeholder variables with the actual outputs from preceding tasks.
inspiration
example
no LLM is used here
This can be implemented with a simple fetching and queuing mechanism without a dedicated LLM.
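A simple fetching-and-substitution mechanism in this spirit might look like the sketch below; the task-dict fields and `$N` placeholder format are assumptions, not the paper's actual interface:

```python
import re

def resolve_args(args, results):
    """Substitute "$N" placeholders with the output of completed task N."""
    return [re.sub(r"\$(\d+)", lambda m: str(results[int(m.group(1))]), a)
            for a in args]

def fetch_ready(tasks, done, results):
    """Return tasks whose dependencies have all completed, with their
    placeholder arguments resolved -- no LLM call is needed for this."""
    return [dict(t, args=resolve_args(t["args"], results))
            for t in tasks
            if t["idx"] not in done and all(d in done for d in t["deps"])]

tasks = [
    {"idx": 1, "tool": "search", "args": ["Microsoft market cap"], "deps": []},
    {"idx": 2, "tool": "search", "args": ["Apple market cap"], "deps": []},
    {"idx": 3, "tool": "math", "args": ["$2 / $1"], "deps": [1, 2]},
]
print(fetch_ready(tasks, done={1, 2}, results={1: 3.0, 2: 3.5}))
# -> only task 3, with args ["3.5 / 3.0"]
```

With `done` empty, the same call would return tasks 1 and 2 together, which is exactly what lets them be dispatched in parallel.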
3. executor
asynchronously executes the tasks
fetched from the Task Fetching Unit,
using the associated functions (tools) in parallel.
since the Task Fetching Unit guarantees that all tasks dispatched to the Executor are independent,
it can simply execute them concurrently.
equipped with user-provided tools
and delegates each task to the associated tool.
each task has dedicated memory to store its intermediate outcomes,
similar to how typical sequential frameworks (e.g., ReAct) aggregate observations into a single prompt.
upon completion of a task, the final results are forwarded as input to the tasks that depend on it.
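A toy asyncio-based executor in this spirit; the `search`/`math` tools are hypothetical stand-ins for user-provided tools, and the market-cap numbers are made up:

```python
import asyncio

# Hypothetical stand-in tools; in practice these are user-provided.
async def search(query):
    await asyncio.sleep(0.01)  # simulate I/O latency
    return {"Microsoft market cap": 3.0, "Apple market cap": 3.5}[query]

async def math(expr):
    return eval(expr)  # toy math tool; never eval untrusted input

async def execute_batch(batch, results):
    """Run one batch of independent tasks concurrently; each task stores
    its output in its own slot of `results` (its dedicated memory)."""
    async def run_one(idx, tool, args):
        results[idx] = await tool(*args)
    await asyncio.gather(*(run_one(*t) for t in batch))

results = {}
# Tasks 1 and 2 are independent, so they run in the same batch.
asyncio.run(execute_batch([(1, search, ["Microsoft market cap"]),
                           (2, search, ["Apple market cap"])], results))
# Task 3 depends on both, so it runs after their outputs are available.
asyncio.run(execute_batch([(3, math, [f"{results[2]} / {results[1]}"])], results))
print(results[3])  # ratio of the two (made-up) market caps
```

`asyncio.gather` is one concrete way to get the concurrent execution described above; thread pools would work equally well for blocking tools.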
User input: “How much does Microsoft’s market cap need to increase to exceed Apple’s market cap?”
The LLM first needs to conduct web searches for both companies’ market caps, followed by a division operation.
existing frameworks (including ReAct)
perform these tasks sequentially.
LLMCompiler
perform these tasks in parallel.
key question
how to automatically determine which tasks are parallelizable and which are interdependent.
so we can orchestrate the execution of the different tasks accordingly.
➡️ achieved via the three components above.
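For a statically known dependency graph, deciding which tasks can run together is plain graph analysis. A small sketch (the `deps` map format is an assumption): group tasks into "waves", where a task's wave is one past the latest wave among its dependencies.

```python
from collections import defaultdict

def parallel_waves(deps):
    """deps: {task_id: [prerequisite ids]}. Returns lists of task ids
    that can execute concurrently, in execution order."""
    wave = {}
    def w(t):
        if t not in wave:
            wave[t] = 1 + max((w(d) for d in deps[t]), default=0)
        return wave[t]
    grouped = defaultdict(list)
    for t in deps:
        grouped[w(t)].append(t)
    return [sorted(grouped[k]) for k in sorted(grouped)]

# Market-cap query: two independent searches, then one dependent math op.
print(parallel_waves({1: [], 2: [], 3: [1, 2]}))  # [[1, 2], [3]]
```

A purely sequential framework would execute all three tasks one by one; the wave grouping shows where the latency savings come from.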
Dynamic Replanning
depending on intermediate results, the execution graph may need to be modified.
just as programs branch, dynamic execution is also possible in LLM function calling.
for simple branching (e.g., if-else statements), the execution flow can be predetermined.
however, complex branching requires recomputing the execution graph based on intermediate results.
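The resulting feedback loop might be sketched as follows; `plan_fn` and `execute_fn` are hypothetical callables standing in for the Planner and the Executor, not the paper's actual API:

```python
def run_with_replanning(plan_fn, execute_fn, query, max_rounds=10):
    """After each round, intermediate results are fed back to the planner,
    which can emit a revised task graph or signal completion (None)."""
    observations = []
    for _ in range(max_rounds):
        graph = plan_fn(query, observations)  # may branch on observations
        if graph is None:                     # planner decides we are done
            return observations
        observations.extend(execute_fn(graph))
    return observations

# Toy demo: the "planner" stops once an observation hits the target.
def plan_fn(query, obs):
    return None if 24 in obs else ["try-next-combination"]

def execute_fn(graph):
    return [24]  # pretend one execution round found a solution

print(run_with_replanning(plan_fn, execute_fn, "Game of 24"))  # [24]
```

This is the shape of loop that Game of 24 (below) exercises: each round's arithmetic results determine whether another plan is needed.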
LLMs: GPT-3.5-turbo, LLaMA-2
baseline: ReAct
Benchmark
Dependency ❌
Dependency ⭕, dependency graph (static)
Dependency ⭕, dependency graph (dynamic)
Game of 24(Yao et al. 2023b)
for evaluating capability in dynamic re-planning
which is achieved through a feedback loop from the Executor back to our Function Calling Planner.
the task is to generate 24 from a set of four numbers using basic arithmetic operations.
Game of 24 requires repeated replanning based on the intermediate results
2× speedup compared to Tree-of-Thoughts
WebShop
the simplest scenario.
using a tool repeatedly for independent tasks
such as conducting parallel searches or analyses to gather information on different topics
embarrassingly parallel patterns
Dataset: HotpotQA, Movie Recommendation
ReAct can only execute these sequentially.
Experimental setups
Accuracy and Latency
Observation 1) ReAct consistently achieves lower accuracy compared to OpenAI parallel function calling and LLMCompiler.
compared to ReAct†, LLMCompiler
Cost
a slightly more complex task.
Fig. 3 (b) and (c)
completing the task requires using two tools
(i.e., search and math tools),
with the second tool’s argument depending on the first tool’s output.
dependency graphs can be determined statically.
Dataset: ParallelQA
ex) “If Texas and Florida were to merge and become one state, as well as California and Michigan, what would be the largest population density among these 2 new states?”
Accuracy and Latency
Tab.1.
LLMCompiler's failure cases
Cost
Tab 2.
this efficiency stems from LLMCompiler's reduced frequency of LLM invocations,
Success Rate and Latency