Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models
https://arxiv.org/pdf/2305.04091.pdf
Background
Drawbacks of Zero-shot-CoT
- Calculation errors
- Missing-step errors
  these occur when some intermediate reasoning step(s) are missed out, especially in complex, multi-step reasoning
  → SOL) PS prompting
- Semantic misunderstanding errors
Related concepts
- pre-trained language models (PTMs) ↔ LLMs: unlike PTMs, LLMs are typically served through an API with no access to model parameters, so they cannot be fine-tuned directly
PS Prompting
[ To solve 2: missing-step errors ]
“Let’s first understand the problem and devise a plan to solve the problem. Then, let’s carry out the plan and solve the problem step by step”
- devise a plan to divide the entire task into smaller subtasks
- carry out the subtasks according to the plan
[ To solve 1: calculation errors, and to improve the quality of generated reasoning steps ]
PS+ prompting (with more detailed instructions)
“extract relevant variables and their corresponding numerals”
“calculate intermediate results (pay attention to calculation and commonsense)”
On arithmetic reasoning, PS+ prompting achieves performance comparable to 8-shot CoT prompting.
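As a concrete sketch, the PS+ trigger sentence below is assembled from the instruction fragments quoted above; the wording is paraphrased (the paper's exact phrasing may differ), and the Q/A wrapper follows the answer template (A: [T]):

```python
# Sketch of the PS+ trigger sentence, assembled from the instruction
# fragments quoted above (paraphrased; the paper's exact wording may differ).
PS_PLUS_TRIGGER = (
    "Let's first understand the problem, extract relevant variables and "
    "their corresponding numerals, and devise a plan. Then, let's carry out "
    "the plan, calculate intermediate results (pay attention to calculation "
    "and commonsense), solve the problem step by step, and show the answer."
)

def build_ps_plus_prompt(question: str) -> str:
    """Wrap a question in the Q/A template, with the trigger filling A: [T]."""
    return f"Q: {question}\nA: {PS_PLUS_TRIGGER}"
```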
[ The two steps of Zero-shot PS prompting ]
- Step 1: make an inference using the proposed prompting template → generate the reasoning process and the answer to a problem
    - get the LLM to generate subtasks and accomplish them
      → "Let's first understand the problem and devise a plan to solve the problem. Then, let's carry out the plan and solve the problem step by step" in the answer template (A: [T])
    - get the LLM to focus more on calculation and to derive intermediate results correctly
      → "pay attention to calculation" + "extract relevant variables and their corresponding numerals"
    💡 [ hypothesis ] if an LLM leaves out relevant and important variables, it is more likely to miss relevant reasoning steps.
    - to improve the quality of the LLM's generated reasoning steps
      → "calculate intermediate results"
- Step 2: extract the answer for evaluation using the answer-extraction prompt
  → "Therefore, the answer (arabic numerals) is" (the format hint varies with the task; see the sketch below)
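A minimal sketch of the two-step procedure, assuming a generic `complete(prompt)` helper that stands in for whatever LLM API is used (the helper name and conventions are illustrative, not from the paper):

```python
# Minimal sketch of the two inference steps of Zero-shot PS prompting.

PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the "
    "problem. Then, let's carry out the plan and solve the problem step by step."
)
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"

def complete(prompt: str) -> str:
    """Placeholder: plug in an LLM text-completion call of your choice."""
    raise NotImplementedError

def zero_shot_ps(question: str) -> str:
    # Step 1: the PS trigger fills A: [T]; the model generates plan + reasoning.
    reasoning = complete(f"Q: {question}\nA: {PS_TRIGGER}")
    # Step 2: append the generated reasoning plus the answer-extraction
    # trigger, and let the model produce the final answer.
    return complete(
        f"Q: {question}\nA: {PS_TRIGGER} {reasoning}\n{ANSWER_TRIGGER}"
    )
```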
[ Experiments ]
[ Benchmarks ]
PS prompting is evaluated on ten benchmark datasets spanning three categories of reasoning problems.
- Arithmetic Reasoning
- GSM8K dataset : high-quality, linguistically diverse grade-school math word problems created by human problem writers
- SVAMP benchmark : one-unknown arithmetic word problems for students up to grade 4, created by making simple changes to problems from an existing dataset
- MultiArith dataset : math word problems requiring multiple reasoning steps and operations
- AddSub dataset : addition and subtraction arithmetic word problems
- AQuA dataset : algebraic word problems with natural language rationales
- SingleEq dataset : single-equation grade-school algebra word problems with multiple math operations over non-negative rational numbers and one variable
- Commonsense Reasoning
- CSQA benchmark dataset : multiple-choice questions that require different types of commonsense knowledge to obtain the correct answers
- StrategyQA benchmark dataset : questions requiring multi-step reasoning, where the reasoning steps are not given and must be inferred
- Symbolic Reasoning
- Last Letter Concatenation dataset : questions requiring the last letters of words in a name to be concatenated, e.g. "Elon Musk" → "nk" (see the small ground-truth sketch after this list)
- Coin Flip dataset : questions about whether a coin is still heads up after people either flip or do not flip it, following the steps given in the question
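For intuition, a tiny ground-truth function for the last-letter task (my own illustration, not from the paper; the "Elon Musk" example follows the original CoT paper's convention):

```python
def last_letter_concat(name: str) -> str:
    # Concatenate the last letter of each word in the name.
    return "".join(word[-1] for word in name.split())

assert last_letter_concat("Elon Musk") == "nk"
```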
[ Baselines ]
- Zero-shot baselines
    - Zero-shot-CoT ("Let's think step by step")
    - Zero-shot-PoT (uses an LLM, mainly OpenAI Codex, to generate a Python program, then derives the answer by executing the generated program on a Python interpreter; see the sketch after this list)
- Few-shot with manual demonstrations
    - Manual-CoT: creates eight hand-crafted examples as demonstrations
- Few-shot with automatic demonstrations
    - Auto-CoT: automatically selects examples by clustering for diversity and generates their reasoning chains with Zero-shot-CoT to construct demonstrations
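For contrast with PS prompting, here is a sketch of the Zero-shot-PoT idea: the model writes a Python program and the answer comes from executing it. The prompt wording, the `complete` helper, and the `ans` variable convention are all illustrative assumptions, not the paper's exact setup:

```python
def complete(prompt: str) -> str:
    """Placeholder: plug in a code-generation LLM call (e.g., Codex-style)."""
    raise NotImplementedError

def zero_shot_pot(question: str):
    # Ask the model for a Python program that stores its result in `ans`.
    prompt = (
        f"# Question: {question}\n"
        "# Write Python code that computes the answer and stores it in `ans`.\n"
    )
    program = complete(prompt)
    namespace: dict = {}
    exec(program, namespace)  # caution: execute model-written code only in a sandbox
    return namespace["ans"]
```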