Large-scale Korean legal AI benchmark
pre-trained LCUBE (decoder-only, based on GPT-2)
Three-tiered court system (District, High, and Supreme Courts)
rooted in the civil law system (vs. the common law system)
Structure of Korean Precedent
meta information
gist of claim from plaintiffs in a civil case
ruling
reasoning
facts
claims
reasoning
decisions
The Redaction Process
Precedent Disclosure Status
Document Images and PDF precedents are available
Preprocessing pipeline
JSON format
fact + gist of claim + degrees of claim acceptance (record sketch below)
claim acceptance degree
Level 1 (rejection / partial approval / full approval)
Level 2 (13 categories)
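A minimal sketch of what one preprocessed LJP-Civil record might look like, shown as a Python dict for the JSON; the field names are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical shape of one preprocessed LJP-Civil record.
# Field names are illustrative assumptions, not the dataset's real schema.
record = {
    "facts": "...",                               # facts section of the precedent
    "gist_of_claim": "...",                       # plaintiff's claim in the civil case
    "claim_acceptance_lv1": "partial approval",   # rejection / partial approval / full approval
    "claim_acceptance_lv2": "...",                # one of the 13 fine-grained categories
}
```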
mt5-small + prompt-tuning for parsing the expression (money provider / receiver / amount / litigation cost)
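A rough sketch of parsing such an expression into structured fields with mt5-small; the prompt string, the example expression, and the target output format are assumptions, and for simplicity this shows ordinary generation with the base checkpoint rather than the paper's prompt-tuning setup.

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

tokenizer = MT5Tokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Hypothetical input/output format for extracting money provider / receiver /
# amount / litigation cost from a claim or ruling expression.
expression = "피고는 원고에게 10,000,000원을 지급하라. 소송비용은 피고가 부담한다."
inputs = tokenizer("parse: " + expression, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
# Assumed target format, e.g.
# "provider: 피고 | receiver: 원고 | amount: 10000000 | litigation cost: 피고"
print(tokenizer.decode(out[0], skip_special_tokens=True))
```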
Nvidia A6000, RTX3090 or RTX6000
lr 3e-5 to 1e-4
batch 8 to 60, AdamW
fine-tuning experiments with error bars were repeated 3 times
google/mt5-small for fine-tuning
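A minimal fine-tuning sketch using the reported ranges (AdamW, lr 3e-5 to 1e-4, batch size 8 to 60, 3 repeated runs); the exact per-task values, the epoch count, and the dataset handling are assumptions.

```python
from transformers import MT5ForConditionalGeneration, Trainer, TrainingArguments

# Hyperparameters follow the reported ranges; exact per-task values are assumptions.
args = TrainingArguments(
    output_dir="mt5-small-ljp",
    learning_rate=3e-5,              # searched between 3e-5 and 1e-4
    per_device_train_batch_size=8,   # batch sizes 8 to 60 depending on the task
    num_train_epochs=3,              # epoch count is an assumption
    optim="adamw_torch",             # AdamW optimizer
    seed=0,                          # each experiment repeated 3 times with different seeds
)

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

def run(train_dataset, eval_dataset):
    # train_dataset / eval_dataset: tokenized task datasets (preparation not shown)
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    return trainer.evaluate()
```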
GPT-2 from scratch (LCUBE), Modu and Wiki corpora
byte-level BPE
50K for base and 100K for medium
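A rough sketch of the pre-training setup: a byte-level BPE tokenizer trained from scratch and a GPT-2 model initialized from a fresh config; the corpus file paths and every config value other than the vocabulary size are assumptions.

```python
import os
from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2Config, GPT2LMHeadModel

# Byte-level BPE trained on the pre-training text files (paths are placeholders).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["precedent.txt", "modu.txt", "wiki.txt"],
                vocab_size=50_000,          # 50K for base, 100K for medium
                min_frequency=2)
os.makedirs("lcube-tokenizer", exist_ok=True)
tokenizer.save_model("lcube-tokenizer")

# Decoder-only GPT-2 trained from scratch; sizes other than vocab_size are
# GPT-2 defaults here, not necessarily LCUBE's exact configuration.
config = GPT2Config(vocab_size=50_000)
model = GPT2LMHeadModel(config)
```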
compared KoGPT2 and LCUBE
Case Name, Statute, LJP-Civil: Exact Match
LJP-Criminal: F1 of individual fields
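One plausible way to compute the two metrics; the (field, value) representation for LJP-Criminal is an assumption about how the individual fields are scored.

```python
def exact_match(preds, golds):
    # Fraction of predictions identical to the gold answer string.
    return sum(p.strip() == g.strip() for p, g in zip(preds, golds)) / len(golds)

def field_f1(pred, gold):
    # F1 over individual (field, value) pairs, e.g. fine / imprisonment terms
    # in LJP-Criminal; field names and this exact formulation are assumptions.
    pred_pairs, gold_pairs = set(pred.items()), set(gold.items())
    tp = len(pred_pairs & gold_pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_pairs), tp / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)
```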
pre-training with the Precedent Corpus only also performed well for domain adaptation
in the summarization task, LCUBE doesn't have an advantage over other models
the first large-scale Korean legal AI benchmark and legal language model LCUBE
only considered precedents from the first-level (district) courts
didn't use the plaintiffs' and defendants' claims
difficult to separate the claims from the reasoning sections without error
didn't consider many important legal applications of AI