o1-preview Achieves 97 Points on the 2025 Korean CSAT Language Section!

minsing-jin · November 18, 2024

2025 Korean CSAT LLM Benchmark Evaluation Results

| Rank | Model Name | Raw Score (/100) | Estimated Grade (as of 2024.11.18) |
| --- | --- | --- | --- |
| 🥇 1st | o1-preview | 97 | 1st Grade |
| 🥈 2nd | o1-mini | 78 | 4th Grade |
| 🥉 3rd | gpt-4o | 75 | 4th Grade |
| 4th | gpt-4o-mini | 59 | 5th Grade |
| 5th | gpt-3.5-turbo | 16 | 8th Grade |

(Korean CSAT grades range from 1, the highest, to 9, the lowest.)

o1-preview has achieved a score of 97 on the Korean language section of the 2025 Korean CSAT (College Scholastic Ability Test)! This remarkable performance, with only one question answered incorrectly, demonstrates a significant advancement in LLM capabilities. Evaluating LLMs against the most authoritative Korean language assessment, the CSAT, revealed that the o1-preview model is nearly flawless at understanding the Korean language.

Previously, gpt-4o had averaged a 3rd-grade level, with a top score of 86 points, on the 10-year CSAT leaderboard, highlighting the gap between LLMs and human language proficiency.

However, the o1-preview model scored 88 points on the 2024 CSAT, placing in the 1st grade and demonstrating parity with highly proficient human test-takers. Now, with a score of 97 on the 2025 CSAT, the era when LLMs surpass human language abilities may not be far off.


🪑 Background and Purpose of the Benchmark

The benchmark originated from the Nomadamas Project, which aimed to achieve the highest score in the Korean CSAT language section using LLMs. With the advent of GPT-4 and advancements in prompt engineering, the project sought to tackle the challenging questions created by KICE (Korea Institute for Curriculum and Evaluation).

The Nomadamas project experimented extensively with prompt engineering to find optimal prompts. This year, the focus shifted to comparing the performance of various LLMs, culminating in the creation of the Korean CSAT LLM Leaderboard.

Key Objectives:

  1. Share benchmark data comparing human performance and LLM performance.
  2. Utilize Korea’s most authoritative dataset for evaluating Korean language proficiency, curated by KICE.
  3. Prevent data leakage by using updated, annual CSAT language benchmark datasets.

🧪 Experimental Methods

1️⃣ Dataset Compilation and Parsing

We collected Korean CSAT data from 2015 to 2024, extracting text from the exam PDFs and organizing it into:

  • Questions
  • Passages
  • Answer choices
  • Answer keys

Specific references like [A] or [B] were enclosed in parentheses, while tables or images were replaced with written descriptions.

The JSON files generated were parsed into QA and corpus datasets optimized for AutoRAG.

Details on dataset construction can be found here.
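To make the conversion step above concrete, here is a minimal sketch of turning a parsed exam JSON into AutoRAG-style QA and corpus parquet files. The input field names (`passage`, `question`, `choices`, `answer`) are hypothetical, and the output columns follow the `qid`/`query`/`retrieval_gt`/`generation_gt` and `doc_id`/`contents`/`metadata` layouts that AutoRAG's documentation describes; the actual project code may differ.

```python
import json
import uuid

import pandas as pd

def build_datasets(json_path: str) -> None:
    """Convert parsed CSAT JSON into AutoRAG-style QA/corpus parquet files."""
    with open(json_path, encoding="utf-8") as f:
        items = json.load(f)

    corpus_rows, qa_rows = [], []
    for item in items:
        doc_id = str(uuid.uuid4())
        # Each passage becomes one corpus document.
        corpus_rows.append({"doc_id": doc_id, "contents": item["passage"], "metadata": {}})
        # Each question becomes one QA row; the answer key is the generation ground truth.
        query = item["question"] + "\n" + "\n".join(item["choices"])
        qa_rows.append({
            "qid": str(uuid.uuid4()),
            "query": query,
            "retrieval_gt": [[doc_id]],              # gold passage for this question
            "generation_gt": [str(item["answer"])],  # answer key, e.g. "3"
        })

    pd.DataFrame(corpus_rows).to_parquet("corpus.parquet")
    pd.DataFrame(qa_rows).to_parquet("qa.parquet")

build_datasets("csat_2025_korean.json")  # hypothetical file name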


2️⃣ Benchmarking with AutoRAG

AutoRAG is a tool that automatically optimizes RAG pipelines for specific datasets. It supports:

  • Easy access to various models via YAML configurations.
  • The GOAT function, which enables swapping prompts seamlessly, making it ideal for the CSAT benchmark leaderboard.
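Under those assumptions, launching a benchmark run looks roughly like the sketch below, which uses the `Evaluator` entry point from AutoRAG's quickstart; the file names are hypothetical, not the project's actual paths.

```python
from autorag.evaluator import Evaluator

evaluator = Evaluator(
    qa_data_path="qa.parquet",         # QA dataset built in step 1
    corpus_data_path="corpus.parquet"  # corpus dataset built in step 1
)
# Runs the pipeline described in the YAML config and stores the trial
# results, so swapping models or prompts is a YAML-only change.
evaluator.start_trial("csat_benchmark_config.yaml")
```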

A mini-test feature was also added to allow users to check the performance of models on 2023 CSAT data. Try the mini-test!


3️⃣ Evaluation

The models' answers were evaluated for accuracy against a predefined answer key. Adherence to the standardized answer format was enforced, ensuring objective scoring.
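As an illustration, a scoring step in this spirit can be a few lines of Python. The single-digit answer format (one option number from 1 to 5) and the helper functions below are assumptions for the sketch, not the project's actual code.

```python
import re

def extract_choice(model_output: str) -> int | None:
    """Pull the first standalone option number (1-5) out of a model answer."""
    match = re.search(r"\b([1-5])\b", model_output)
    return int(match.group(1)) if match else None

def score(outputs: list[str], answer_key: list[int], points: list[int]) -> int:
    """Sum the point values of questions whose extracted choice matches the key."""
    return sum(
        p for out, gold, p in zip(outputs, answer_key, points)
        if extract_choice(out) == gold
    )

# Example: three questions worth 2, 3, and 2 points; two answered correctly.
print(score(["The answer is 3.", "5", "I choose 2"], [3, 4, 2], [2, 3, 2]))  # -> 4
```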


4️⃣ Scoring and Leaderboard Composition

Leaderboard rankings were determined based on the average standardized scores, reflecting each year's difficulty.
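For context, the CSAT Korean language section's standard score rescales each year's raw-score z-score to a mean of 100 and a standard deviation of 20, which is what lets scores from exams of different difficulty be compared. A minimal sketch of that conversion follows; the per-year mean and SD below are placeholders, not real statistics.

```python
def standard_score(raw: float, year_mean: float, year_sd: float) -> float:
    """CSAT Korean-section standard score: z-score rescaled to mean 100, SD 20."""
    z = (raw - year_mean) / year_sd
    return 20 * z + 100

# Placeholder statistics for one hypothetical exam year:
print(round(standard_score(raw=97, year_mean=65.0, year_sd=18.0)))  # -> 136
```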


Future Directions

  1. Ongoing Benchmarking: With sufficient GPU resources and inference budgets, more models will be benchmarked and added to the leaderboard.
  2. Annual Updates: If resources allow, a new dataset and benchmark will be created every year to assess LLM performance on the latest CSAT questions.
0개의 댓글