Introducing a method for detecting LLM-generated text in the zero-shot setting (no training samples from the LLM source)
outperforms all competing detectors at detecting ChatGPT-generated text
Because of its zero-shot nature, it can spot multiple different LLMs with high accuracy
Prior work (e.g., Turnitin) has focused heavily on ChatGPT
More sophisticated actors use a wide range of LLMs beyond just ChatGPT
Binoculars works by viewing text through two lenses
compute the log perplexity of the text in question using an "observer LLM"
compute all the next-token predictions that a "performer LLM" would make and measure their perplexity according to the observer
If the string was written by a machine, the two perplexities should be similar
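A toy numeric illustration of this two-lens comparison; the numbers below are invented for illustration, not real model outputs:

```python
# Invented numbers illustrating the two-lens intuition
cross_logppl       = 3.2  # log PPL of the performer's predictions, scored by the observer
obs_logppl_machine = 3.0  # observer's log PPL on machine-written text: similar -> machine
obs_logppl_human   = 4.7  # observer's log PPL on human-written text: much higher -> human
```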
2. The LLM Detection Landscape
Spam filtering, fake-news analysis, etc. → all benefit from signals that quantify whether text is human- or machine-generated
With the rise of Transformer models, preemptive mechanisms (recording or watermarking all generated text) became impractical
Post-hoc detection approaches without cooperation from model owners
Fine-tune a pretrained backbone for the binary classification task (with adversarial training, abstention)
Train a linear classifier on top of frozen learned features, which allows including commercial API outputs as training data
Using statistical signatures that are characteristic of machine-generated text
requires little or no training data
easily adapted to newer model families
based on perplexity, perplexity curvature, log rank, intrinsic dimensionality of generated text, or n-gram features (a minimal perplexity-threshold sketch follows this list)
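A minimal sketch of the simplest such signature, a perplexity-threshold detector; GPT-2 and the threshold value here are illustrative choices, not from the paper:

```python
# Naive perplexity-threshold baseline (illustrative model and threshold)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gpt2_log_ppl(text: str) -> float:
    """Average negative log-likelihood per token under GPT-2."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()  # HF returns the mean NLL over predicted tokens

def naive_detect(text: str, threshold: float = 3.5) -> str:
    # Machine text tends to be unsurprising (low log PPL) to an LLM observer
    return "machine" if gpt2_log_ppl(text) < threshold else "human"
```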
Detection has limitations
Fully general-purpose models of language would be, by definition, impossible to detect
Given sufficiently many examples, text from a model that is merely close to this optimum remains technically detectable
In practice, the relative success of detection provides evidence that current language models are imperfect representations of human writing (Detectable!)
How do we appropriately and thoroughly evaluate detectors?
accuracy on test sets and classifier AUC are not well-suited to the high-stakes question of detection
Only detectors with low false-positive rates truly reduce harm
detectors are often only evaluated on relatively easy datasets that are reflective of their training data
3. Binoculars: How it works
perplexity and cross-perplexity (how surprising the next-token predictions of one model are to another model)
3.1 Background and Notation
string s
tokenizer T; x = T(s) is the list of token indices
x_i: the i-th token ID
vocab V = {1, 2, ..., n}
language model M
L: the number of tokens in s
$$M(T(s)) = M(x) = Y, \quad \text{where } Y_{ij} = P(v_j \mid x_{0:i-1}) \text{ for all } j \in V$$
Define logPPL as the average negative log-likelihood of all tokens in the given sequence
$$\log \mathrm{PPL}_M(s) = -\frac{1}{L} \sum_{i=1}^{L} \log\left(Y_{i,x_i}\right)$$
This logPPL intuitively measures how surprising a string is to a language model
Since this quantity is the training loss, models tend to score their own outputs as unsurprising
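A direct transcription of the logPPL formula above, as a sketch assuming a Hugging Face causal LM; model and tokenizer are placeholders for any such pair:

```python
import torch
import torch.nn.functional as F

def log_ppl(model, tokenizer, s: str) -> float:
    """logPPL_M(s): average negative log-likelihood of the tokens of s."""
    x = tokenizer(s, return_tensors="pt")["input_ids"]              # token IDs x
    with torch.no_grad():
        logits = model(x).logits                                    # one row per position
    log_Y = F.log_softmax(logits[:, :-1], dim=-1)                   # log Y_ij = log P(v_j | x_0:i-1)
    targets = x[:, 1:]                                              # the tokens actually observed
    token_ll = log_Y.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log Y_i,x_i
    # The first token has no preceding context, so the average runs over L-1 positions
    return -token_ll.mean().item()
```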
Define Cross-Perplexity as the average per-token cross-entropy between the outputs of two models
$$\log \mathrm{X\text{-}PPL}_{M_1,M_2}(s) = -\frac{1}{L} \sum_{i=1}^{L} M_1(s)_i \cdot \log\left(M_2(s)_i\right)$$
where $M_k(s)_i$ is model $k$'s next-token probability vector at position $i$ and $\cdot$ denotes the dot product
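The cross-perplexity admits the same kind of transcription; this sketch assumes both models share one tokenizer (true of the Falcon pair mentioned later), and the function name is my own:

```python
import torch
import torch.nn.functional as F

def log_x_ppl(model1, model2, tokenizer, s: str) -> float:
    """logX-PPL_M1,M2(s): average per-token cross-entropy between M1 and M2."""
    x = tokenizer(s, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        p1     = F.softmax(model1(x).logits[:, :-1], dim=-1)        # M1(s)_i
        log_p2 = F.log_softmax(model2(x).logits[:, :-1], dim=-1)    # log M2(s)_i
    # Dot product M1(s)_i . log M2(s)_i at each position, averaged and negated
    return -(p1 * log_p2).sum(dim=-1).mean().item()
```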
3.2 What Makes Detection Hard? A Primer on the Capybara Problem
LLMs tend to generate text that is unsurprising to an LLM
Because humans write differently from machines, human text has higher PPL according to an LLM observer
This intuition breaks down in the face of hand-crafted prompts
prompt "1, 2, 3, " results in "4, 5, 6" which has very low PPL
But the prompt like "Can you wirte a few sentences about a capybara that is an astrophysicist?" will yield a response that seems more strange → High PPL ("capybara", "astrophysicist")
Without access to the prompt, LLM detection seems difficult, and naive perplexity-based detection fails
3.3 Our Detection Score
Binoculars solves the capybara problem by providing a mechanism for estimating the baseline PPL induced by the prompt
Motivation
LMs generate low-PPL text relative to humans → PPL-threshold classifier
Capybara problem → prompt matters → Cross-PPL
Cross-PPL measures how surprising the tokens are relative to the baseline PPL of an LLM acting on the same string
Expect the next-token choices of humans to have even higher PPL than those of the machine → normalize the observed PPL by the expected PPL of a machine acting on the same text
$$B_{M_1,M_2}(s) = \frac{\log \mathrm{PPL}_{M_1}(s)}{\log \mathrm{X\text{-}PPL}_{M_1,M_2}(s)}$$
The numerator is simple PPL (how surprising a string is to M1)
The denominator measures how surprising the token predictions of M2 are when observed by M1
Expect human text to diverge from M1 more than M2 diverges from M1
The Binoculars score B is a general mechanism that captures a statistical signature of machine text
It is also capable of detecting generic machine text generated by a third model altogether
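An end-to-end sketch putting the score together, reusing log_ppl and log_x_ppl from the sketches above; the Falcon pairing matches the one noted below, while the threshold value is purely illustrative and would be tuned on held-out data:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Observer M1 and performer M2 share the Falcon tokenizer
tok       = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
observer  = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b")           # M1
performer = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")  # M2

def binoculars_score(s: str) -> float:
    # B = logPPL_M1(s) / logX-PPL_M1,M2(s)
    return log_ppl(observer, tok, s) / log_x_ppl(observer, performer, tok, s)

THRESHOLD = 0.90  # illustrative; in practice chosen on validation data
text = "A capybara that is an astrophysicist spends its nights mapping distant galaxies."
label = "machine" if binoculars_score(text) < THRESHOLD else "human"
```

Lower scores mean the observer finds the text about as unsurprising as the performer's own predictions, i.e. machine-like; higher scores indicate human-like divergence.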
Connection to other approaches
Contrastive Decoding: generates high-quality text by maximizing the difference between a weak and a strong model
Speculative Decoding: uses weaker models to plan completions
Both work by pairing a strong model with a very weak model
But Binoculars works well when pairing two very similar models (Falcon-7B as M1 and Falcon-7B-instruct as M2)