[simple LLM] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

James Kim·2024년 4월 28일

LLM의 능력은 강화학습으로 더욱 강해진다고 하지만 , 이를 평가할 지표는 부족하다. MMLU같은 지표는 사람의 perception과 괴리가 있다.

보이는 것은 vicuna , llama2 13B 예시로 둘 다 MMLU에서는 비슷한 성능을 보이지만 human preference로는 차이가 있다. 따라서 보다 robust하고 automated한 평가 방법이 필요하다
그래서 우리는 2가지 bench를 제시하는데
MT-bench
chatbot arena

James Kim

이전 포스트

[simple LLM] Self-Rewarding Language Models

다음 포스트

[simple LLM] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

[simple LLM] Self-Rewarding Language Models

[ASR study] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

0개의 댓글