evaluating a single aspect
multi-aspect evaluation
Needed supervised training and manual annotation
using an LLM to achieve multi-aspect, customized, and training-free evaluation
task specification
aspect definition
demonstrated samples
GPT calculates how likely the text would be generated under the evaluation protocol
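The protocol pieces above (task specification, aspect definition, demonstrated samples) are combined into one evaluation prompt that conditions the LM. A minimal sketch; `build_eval_prompt` and the template wording are illustrative, not the paper's exact prompts:

```python
# Hypothetical sketch: assemble an evaluation-protocol prompt from a task
# specification, an aspect definition, and few-shot demonstrated samples.
def build_eval_prompt(task_spec, aspect_def, demos, source, hypothesis):
    """Concatenate protocol parts into a single prompt for the LM scorer."""
    parts = [task_spec, aspect_def]
    for demo_src, demo_hyp in demos:  # demonstrated (few-shot) samples
        parts.append(f"Source: {demo_src}\nGenerated: {demo_hyp}")
    # the hypothesis to be scored goes last, so its token likelihoods
    # are conditioned on everything above
    parts.append(f"Source: {source}\nGenerated: {hypothesis}")
    return "\n\n".join(parts)

prompt = build_eval_prompt(
    task_spec="Generate a fluent response for the dialogue.",
    aspect_def="Answer whether the response is fluent.",
    demos=[("Hi, how are you?", "I'm good, thanks!")],
    source="What's the weather like?",
    hypothesis="It is sunny today.",
)
print(prompt)
```

The LM is then asked for the token log-probabilities of the hypothesis segment of this prompt, which is what the score is built from.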
lexical overlap-based
embedding-based
ICL
CoT Reasoning
Zero-shot instruction
BARTScore needs a fine-tuning step
GPTScore > BARTScore
GPTScore(h | d, a, S) = sum_{t=1}^{m} w_t * log p(h_t | h_{<t}, T(d, a, S), theta)
w_t: weight of the token at position t (in this work, all tokens are weighted equally)
T(d, a, S): prompt template that defines the evaluation protocol (task description d, aspect definition a, demonstrated samples S)
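With equal weights w_t = 1/m, the score reduces to the average token log-likelihood of the hypothesis under the prompted LM. A minimal sketch with stubbed log-probabilities; a real implementation would obtain them from an LM API (e.g. the per-token logprobs of a completion):

```python
import math

def gptscore(token_logprobs, weights=None):
    """Weighted sum of per-token log-probabilities of the hypothesis.

    The log-probs are assumed to already be conditioned on the prompt
    template T(d, a, S). With equal weights this is the average token
    log-likelihood, as in the paper.
    """
    m = len(token_logprobs)
    if weights is None:
        weights = [1.0 / m] * m  # all tokens treated equally
    return sum(w * lp for w, lp in zip(weights, token_logprobs))

# Toy values standing in for log p(h_t | h_<t, T(d,a,S), theta).
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
score = gptscore(logprobs)
```

Higher (less negative) scores mean the evaluator finds the text more likely under the protocol, i.e. better on the specified aspect.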
Tasks
Dialogue Response Generation
Text Summarization
Data-to-Text
Machine Translation
37 Datasets
22 Evaluation Aspects
ROUGE
PRISM
BERTScore
MoverScore
DynaEval
BARTScore
GPTScore
sampled 40 samples for each summarization dataset
sampled 100 samples each for dialogue response generation and data-to-text
based on bootstrapping
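Significance of the metric-human correlations is assessed via bootstrapping over the sampled instances. A minimal sketch, assuming illustrative helpers `spearman` and `bootstrap_corr` (tie correction omitted for brevity):

```python
import math
import random
import statistics

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction, for illustration)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = statistics.mean(rx), statistics.mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def bootstrap_corr(metric, human, n_boot=200, seed=0):
    """Resample (metric, human) pairs with replacement and return an
    approximate 95% confidence interval for the correlation."""
    rng = random.Random(seed)
    n = len(metric)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(spearman([metric[i] for i in idx],
                              [human[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

metric = [0.1, 0.4, 0.35, 0.8, 0.7]  # toy metric scores
human = [1, 2, 3, 5, 4]              # toy human ratings
low, high = bootstrap_corr(metric, human)
```

Two metrics are then compared by checking whether their bootstrap intervals overlap.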
Evaluator with instruction significantly improves the performance
GPT3 / FT5 based models + instructions > supervised method
IDM > IST > VAL
IDM > fine-tuned model
the choice of demonstration examples strongly impacts performance
with IDM, small GPT3 family models > large-sized models
GPT3-d01 >> GPT3-d03
GPT3-based models demonstrate stronger generalization ability
IST improved the performance
IDM > IST
GPT3-c01 achieved comparable performance with d01 and d03
demonstration improves the performance
there is an upper bound on the performance gains
with only a few demonstration samples, small models are prone to performance degradation
GPT-3.5 and GPT-4 are not included
the reason why d03 is worse than d01 is unclear, as the model is not open source
API cost issue