Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
https://arxiv.org/pdf/2305.04091.pdf
Background
Drawbacks of Zero-shot-CoT
- Calculation errors
- Missing-step errors
  these occur when some intermediate reasoning step(s) are missed out, especially in complex, multi-step reasoning
  → SOL) PS prompting
- Semantic misunderstanding errors
Related concepts
- pre-trained language models (PTMs) ↔ LLMs: unlike PTMs, LLMs are typically served through an API with no access to model parameters, so they cannot be fine-tuned directly
PS Prompting
[ To solve 2: missing-step errors ]
“Let’s first understand the problem and devise a plan to solve the problem. Then, let’s carry out the plan and solve the problem step by step”
- devise a plan to divide the entire task into smaller subtasks
- carry out the subtasks according to the plan
[ To solve 1: calculation errors, and to improve the quality of generated reasoning steps ]
PS+ prompting (with more detailed instructions)
“extract relevant variables and their corresponding numerals”
“calculate intermediate results (pay attention to calculation and commonsense)”
On arithmetic reasoning, PS+ prompting achieves performance comparable to 8-shot CoT prompting.
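As a concrete sketch, the PS+ trigger sentence below is assembled from the instruction fragments quoted above; the wording is paraphrased (the paper's exact phrasing may differ), and the Q/A wrapper follows the answer template (A: [T]):

```python
# Sketch of the PS+ trigger sentence, assembled from the instruction
# fragments quoted above (paraphrased; the paper's exact wording may differ).
PS_PLUS_TRIGGER = (
    "Let's first understand the problem, extract relevant variables and "
    "their corresponding numerals, and devise a plan. Then, let's carry out "
    "the plan, calculate intermediate results (pay attention to calculation "
    "and commonsense), solve the problem step by step, and show the answer."
)

def build_ps_plus_prompt(question: str) -> str:
    """Wrap a question in the Q/A template, with the trigger filling A: [T]."""
    return f"Q: {question}\nA: {PS_PLUS_TRIGGER}"
```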
[ The two steps of Zero-shot PS prompting ]
- Step 1: make an inference using the proposed prompting template → generate the reasoning process and the answer to a problem
    - get the LLM to generate subtasks and accomplish them
      → "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step" in the answer template (A: [T])
    - get the LLM to focus more on calculation and to derive intermediate results correctly
      → "pay attention to calculation" + "extract relevant variables and their corresponding numerals"
    💡 [ hypothesis ] if an LLM leaves out relevant and important variables, it is more likely to miss relevant reasoning steps.
    - to improve the quality of the LLM's generated reasoning steps
      → "calculate intermediate results"
- Step 2: extract the answer for evaluation using the answer-extraction prompt
  → "Therefore, the answer (arabic numerals) is" (the format hint varies with the task; see the sketch below)
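A minimal sketch of the two-step procedure, assuming a generic `complete(prompt)` helper that stands in for whatever LLM API is used (the helper name and conventions are illustrative, not from the paper):

```python
# Minimal sketch of the two inference steps of Zero-shot PS prompting.

PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the "
    "problem. Then, let's carry out the plan and solve the problem step by step."
)
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"

def complete(prompt: str) -> str:
    """Placeholder: plug in an LLM text-completion call of your choice."""
    raise NotImplementedError

def zero_shot_ps(question: str) -> str:
    # Step 1: the PS trigger fills A: [T]; the model generates plan + reasoning.
    reasoning = complete(f"Q: {question}\nA: {PS_TRIGGER}")
    # Step 2: append the generated reasoning plus the answer-extraction
    # trigger, and let the model produce the final answer.
    return complete(
        f"Q: {question}\nA: {PS_TRIGGER} {reasoning}\n{ANSWER_TRIGGER}"
    )
```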
[ Experiments ]
[ Benchmarks ]
PS prompting is evaluated on ten benchmark datasets spanning three categories of reasoning problems.
- Arithmetic Reasoning
- GSM8K dataset : high-quality, linguistically diverse grade-school math word problems created by human problem writers
- SVAMP benchmark : one-unknown arithmetic word problems for students up to grade 4, created by making simple changes to problems from an existing dataset
- MultiArith dataset : math word problems requiring multiple reasoning steps and operations
- AddSub dataset : addition and subtraction arithmetic word problems
- AQuA dataset : algebraic word problems with natural language rationales
- SingleEq dataset : single-equation grade-school algebra word problems with multiple math operations over non-negative rational numbers and one variable
- Commonsense Reasoning
- CSQA benchmark dataset : multiple-choice questions that require different types of commonsense knowledge to obtain the correct answers
- StrategyQA benchmark dataset : questions requiring multi-step reasoning, where the reasoning steps are not given and must be inferred
- Symbolic Reasoning
- Last Letter Concatenation dataset : questions requiring the last letters of words in a name to be concatenated, e.g. "Elon Musk" → "nk" (see the small ground-truth sketch after this list)
- Coin Flip dataset : questions about whether a coin is still heads up after people either flip or do not flip it, following the steps given in the question
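For intuition, a tiny ground-truth function for the last-letter task (my own illustration, not from the paper; the "Elon Musk" example follows the original CoT paper's convention):

```python
def last_letter_concat(name: str) -> str:
    # Concatenate the last letter of each word in the name.
    return "".join(word[-1] for word in name.split())

assert last_letter_concat("Elon Musk") == "nk"
```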
[ Baselines ]
- Zero-shot baselines
    - Zero-shot-CoT ("Let's think step by step")
    - Zero-shot-PoT (uses an LLM, mainly OpenAI Codex, to generate a Python program, then derives the answer by executing the generated program on a Python interpreter; see the sketch after this list)
- Few-shot with manual demonstrations
    - Manual-CoT: creates eight hand-crafted examples as demonstrations
- Few-shot with automatic demonstrations
    - Auto-CoT: automatically selects examples by clustering for diversity and generates their reasoning chains with Zero-shot-CoT to construct demonstrations
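For contrast with PS prompting, here is a sketch of the Zero-shot-PoT idea: the model writes a Python program and the answer comes from executing it. The prompt wording, the `complete` helper, and the `ans` variable convention are all illustrative assumptions, not the paper's exact setup:

```python
def complete(prompt: str) -> str:
    """Placeholder: plug in a code-generation LLM call (e.g., Codex-style)."""
    raise NotImplementedError

def zero_shot_pot(question: str):
    # Ask the model for a Python program that stores its result in `ans`.
    prompt = (
        f"# Question: {question}\n"
        "# Write Python code that computes the answer and stores it in `ans`.\n"
    )
    program = complete(prompt)
    namespace: dict = {}
    exec(program, namespace)  # caution: execute model-written code only in a sandbox
    return namespace["ans"]
```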