Rank | Model Name | Raw Score | Estimated Grade (1st = best; as of 2025.11.18) |
---|---|---|---|
🥇 1st | o1-preview | 97 | 1st Grade |
🥈 2nd | o1-mini | 78 | 4th Grade |
🥉 3rd | gpt-4o | 75 | 4th Grade |
4th | gpt-4o-mini | 59 | 5th Grade |
5th | gpt-3.5-turbo | 16 | 8th Grade |
o1-preview has achieved a score of 97 on the Korean Language section of the 2025 Korean CSAT (College Scholastic Ability Test)! This remarkable performance, with only one question wrong, marks a significant advance in LLM capabilities. Evaluating LLMs against the most authoritative Korean language assessment, the CSAT, revealed that the o1-preview model is nearly perfect at understanding the Korean language.
Previously, gpt-4o had achieved an average of a 3rd grade and a top score of 86 points on the 10-year CSAT leaderboard, highlighting the gap between LLMs and human language proficiency.
However, o1-preview scored 88 points on the 2024 CSAT, placing in the 1st grade and demonstrating parity with highly proficient human performance. Its score of 97 on the 2025 CSAT suggests that the era in which LLMs surpass human language abilities may not be far off.
The benchmark originated from the Nomadamas Project, which aimed to achieve the highest score in the Korean CSAT language section using LLMs. With the advent of GPT-4 and advancements in prompt engineering, the project sought to tackle the challenging questions created by KICE (Korea Institute for Curriculum and Evaluation).
The Nomadamas team experimented with extensive prompt engineering to find optimal prompts. This year, the focus shifted to comparing the performance of various LLMs, culminating in the creation of the Korean CSAT LLM Leaderboard.
We collected Korean CSAT data from 2015 to 2024, extracting the text from the CSAT PDFs and organizing it into structured JSON files.
Specific references like [A] or [B] were enclosed in parentheses, while tables or images were replaced with written descriptions.
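The bracket-reference substitution described above can be sketched as a small regex pass (the function name and pattern here are illustrative, not the project's actual preprocessing code):

```python
import re

def normalize_passage(text: str) -> str:
    """Replace bracketed references like [A] or [B] with (A), (B).

    Hypothetical helper mirroring the preprocessing step described above;
    table/image replacement with written descriptions was done separately.
    """
    return re.sub(r"\[([A-Z])\]", r"(\1)", text)
```

For example, `normalize_passage("Compare [A] with [B].")` yields `"Compare (A) with (B)."`.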
The JSON files generated were parsed into QA and corpus datasets optimized for AutoRAG.
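As a rough sketch of that parsing step, the items can be split into a corpus set (passages) and a QA set (questions linked back to their passage). The field names below are hypothetical; AutoRAG's actual dataset schema may differ:

```python
def build_datasets(csat_items):
    """Split parsed CSAT JSON items into QA and corpus records.

    Assumes (hypothetically) each item carries a reading passage plus its
    questions and answer keys; each question keeps a pointer to the passage
    it is asked about, which is what a RAG retriever must recover.
    """
    qa, corpus = [], []
    for i, item in enumerate(csat_items):
        doc_id = f"doc_{i}"
        corpus.append({"doc_id": doc_id, "contents": item["passage"]})
        for q in item["questions"]:
            qa.append({
                "query": q["question"],
                "retrieval_gt": [doc_id],          # ground-truth passage
                "generation_gt": str(q["answer"]),  # ground-truth answer
            })
    return qa, corpus
```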
Details on dataset construction can be found here.
AutoRAG is a tool that automatically optimizes RAG pipelines for specific datasets by evaluating candidate modules at each stage of the pipeline and selecting the best-performing combination.
A mini-test feature was also added so that users can check model performance on the 2023 CSAT data. Try the mini-test!
The models' answers were evaluated for accuracy against a predefined answer key, and adherence to a standardized answer format was enforced to ensure objective scoring.
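A minimal sketch of that scoring logic, assuming answers are the choice digits 1–5 (identifiers and data shapes are illustrative, not the leaderboard's actual code):

```python
import re

ANSWER_RE = re.compile(r"^[1-5]$")  # CSAT multiple-choice answers are 1-5

def score_responses(responses, answer_key, points):
    """Sum the points for correctly answered questions.

    A response that violates the single-digit answer format is counted
    wrong, enforcing the standardized format described above.
    """
    total = 0
    for qid, correct in answer_key.items():
        resp = responses.get(qid, "").strip()
        if ANSWER_RE.match(resp) and resp == correct:
            total += points[qid]
    return total
```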
Leaderboard rankings were determined based on the average standardized scores, reflecting each year's difficulty.
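The CSAT Korean section reports standard scores on a scale with mean 100 and standard deviation 20, so the per-year standardization can be sketched as follows (assuming that scale; the leaderboard's exact method may differ):

```python
def standard_score(raw, year_mean, year_std):
    # CSAT Korean-section standard score: mean 100, standard deviation 20
    return 100 + 20 * (raw - year_mean) / year_std

def leaderboard_score(raw_by_year, stats_by_year):
    """Average the per-year standardized scores.

    stats_by_year maps year -> (mean, std) of that year's raw scores,
    so a high raw score in an easy year counts for less than the same
    raw score in a hard year.
    """
    scores = [
        standard_score(raw, *stats_by_year[year])
        for year, raw in raw_by_year.items()
    ]
    return sum(scores) / len(scores)
```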