When LLM judges evaluate models, numbers can lie
It’s a common pain: you ask a Large Language Model to act as a judge and score outputs, then take the judge’s pass rate as the truth. But every LLM judge carries its own biases and error modes — and those biases can skew your evaluation. judgy is a compact Python library that helps you correct for those biases and estimate the true success rate, with bootstrap confidence intervals so you know how uncertain your estimate is.
What It Does
judgy provides a focused toolkit to move from raw LLM-judged pass/fail predictions to a bias-corrected estimate of your system’s true pass rate. It addresses two practical problems:
- LLM judges make systematic classification errors (false positives or false negatives).
- Raw observed pass rates on unlabeled data reflect judge behavior, not necessarily the true system performance.
Core capabilities
- Estimate judge True Positive Rate (TPR) and True Negative Rate (TNR) from a labeled test set.
- Apply a statistical correction (Rogan–Gladen style) to recover a point estimate of the true pass rate.
- Use bootstrap resampling to produce confidence intervals around the corrected estimate.
- Utilities to generate synthetic test/unlabeled data and research-style plots for sensitivity analyses.
Note: The method assumes your LLM judge is better than random (TPR + TNR > 1). If not, the correction is invalid.
Who It’s For
This library is designed for:
- Machine learning engineers and researchers who use LLMs as automatic evaluators.
- Evaluation teams that want principled, reproducible correction for judge bias.
- Data scientists exploring sensitivity of evaluation results to judge accuracy or label set size.
Skill-wise, a reader should be comfortable running Python 3.8+ code and interpreting basic statistical outputs. The APIs are intentionally simple for quick integration into evaluation pipelines.
How It Works
The implementation follows a clear, three-step approach (documented in the README and implemented in `src/judgy/core.py`):
1. Judge accuracy estimation: compute the judge's TPR (sensitivity) and TNR (specificity) on a human-labeled test set.
2. Bias correction: apply the Rogan–Gladen-style correction formula to transform the observed pass rate on the unlabeled set into an estimate of the true pass rate:

   θ̂ = (p_obs + TNR - 1) / (TPR + TNR - 1)

   where p_obs is the judge's observed pass rate on unlabeled data.
3. Uncertainty quantification: use bootstrap resampling over the test set to produce a distribution of corrected estimates and derive confidence intervals (the library defaults to 20,000 bootstrap iterations, configurable).
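To make the three steps concrete, here is a minimal from-scratch sketch in plain `numpy`. It is not judgy's implementation (that lives in `src/judgy/core.py`); the function name, the percentile bootstrap, and the clipping to [0, 1] are illustrative choices.

```python
import numpy as np

def corrected_pass_rate(test_labels, test_preds, unlabeled_preds,
                        n_boot=20_000, seed=0):
    """Illustrative sketch of the three steps; not judgy's own implementation."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(test_labels)
    preds = np.asarray(test_preds)
    unlabeled = np.asarray(unlabeled_preds)
    p_obs = unlabeled.mean()  # judge's observed pass rate on unlabeled data

    def estimate(lab, prd):
        # Step 1: judge accuracy from the labeled test set.
        if (lab == 1).sum() == 0 or (lab == 0).sum() == 0:
            return np.nan                      # resample lost a class; skip it
        tpr = prd[lab == 1].mean()             # sensitivity
        tnr = (1 - prd[lab == 0]).mean()       # specificity
        if tpr + tnr <= 1:
            return np.nan                      # judge no better than random
        # Step 2: Rogan-Gladen-style correction of the observed pass rate.
        theta = (p_obs + tnr - 1) / (tpr + tnr - 1)
        return float(np.clip(theta, 0.0, 1.0))

    theta_hat = estimate(labels, preds)

    # Step 3: bootstrap over the test set for a 95% percentile interval.
    idx = rng.integers(0, len(labels), size=(n_boot, len(labels)))
    boot = [estimate(labels[i], preds[i]) for i in idx]
    lower, upper = np.nanpercentile(boot, [2.5, 97.5])
    return theta_hat, lower, upper
```

judgy's actual entry point is the single `estimate_success_rate` call shown under Getting Started below; this sketch is only meant to make the arithmetic transparent.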
Technical architecture and files
- `src/judgy/core.py`: the main estimator, covering input validation, TPR/TNR computation, the correction, and the bootstrap CI logic.
- `src/judgy/plotting.py`: plotting helpers that produce research-style sensitivity figures (optional; requires the `plotting` extra).
- `src/judgy/synthetic.py`: synthetic data generators used in experiments and examples.
- `examples/example.py`: a demonstration script that runs sensitivity experiments and generates plots.
- `Makefile` and `pyproject.toml`: packaging, build, test, and developer workflows.
Dependencies are minimal: `numpy` is required; `matplotlib` is optional and only needed for plotting (installed with the `[plotting]` extra).
Getting Started
Basic installation and development setup are provided in the project. Use one of the following installation options depending on your needs.
Install from PyPI (simple)
pip install judgy
Development install from the repository (with plotting and dev deps)
# Clone the repository
git clone https://github.com/ai-evals-course/judgy.git
cd judgy
# Install with development and plotting extras
pip install -e .[dev,plotting]
Quick usage (Python)
Once installed you can estimate a corrected pass rate with a few lines (the example mirrors the README):
import numpy as np
from judgy import estimate_success_rate

# Human (ground-truth) labels for the test set: 1 = pass, 0 = fail
test_labels = [1, 1, 0, 0, 1, 0, 1, 0]
# Judge predictions on those same test items
test_preds = [1, 0, 0, 1, 1, 0, 1, 0]
# Judge predictions on unlabeled data (no human labels available)
unlabeled_preds = [1, 1, 0, 1, 0, 1, 0, 1]

theta_hat, lower, upper = estimate_success_rate(
    test_labels=test_labels,
    test_preds=test_preds,
    unlabeled_preds=unlabeled_preds
)

print(f"Estimated true pass rate: {theta_hat:.3f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
Key Features
- Bias-corrected point estimates of pass rates using observed judge behavior (TPR/TNR).
- Bootstrap confidence intervals to capture uncertainty from finite labeled test sets.
- Research-oriented plotting and sensitivity experiments (TPR/TNR sweeps, label-size analysis) implemented in `plotting.py`; a rough sketch of such a sweep appears after this list.
- Synthetic data utilities to reproduce experiments and demo scenarios, and to benchmark the methodology under controlled conditions.
- Testing and packaging — comprehensive test suite (pytest) and Makefile targets for build/test/release workflows.
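As a rough illustration of the kind of TPR/TNR sensitivity sweep the plotting and synthetic utilities are built for, the sketch below simulates judges of varying accuracy with plain `numpy` and passes their predictions to `estimate_success_rate`. It deliberately does not use judgy's own `synthetic` or `plotting` modules, whose exact APIs are not reproduced here; the simulation parameters are arbitrary.

```python
import numpy as np
from judgy import estimate_success_rate

rng = np.random.default_rng(42)
true_rate = 0.70                    # assumed true pass rate of the system
n_test, n_unlabeled = 200, 2000

def simulate_judge(truth, tpr, tnr):
    """Corrupt ground-truth outcomes according to the judge's error rates."""
    u = rng.random(len(truth))
    return np.where(truth == 1, (u < tpr).astype(int), (u > tnr).astype(int))

for acc in [0.95, 0.90, 0.80, 0.70]:          # symmetric judge: TPR = TNR = acc
    test_truth = rng.binomial(1, true_rate, n_test)
    unlabeled_truth = rng.binomial(1, true_rate, n_unlabeled)
    test_preds = simulate_judge(test_truth, acc, acc)
    unlabeled_preds = simulate_judge(unlabeled_truth, acc, acc)

    theta_hat, lo, hi = estimate_success_rate(
        test_labels=test_truth.tolist(),
        test_preds=test_preds.tolist(),
        unlabeled_preds=unlabeled_preds.tolist(),
    )
    raw = unlabeled_preds.mean()
    print(f"TPR=TNR={acc:.2f}  raw={raw:.3f}  corrected={theta_hat:.3f}  "
          f"95% CI [{lo:.3f}, {hi:.3f}]")
```

With an accurate judge the raw and corrected rates nearly coincide; as judge accuracy drops, the raw rate drifts away from the simulated 0.70 while the corrected estimate should stay close to it, at the cost of a wider interval.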
Why It’s Worth Trying
If you rely on LLM judges to evaluate model outputs, the uncorrected pass rate can mislead decisions about model quality or deployment readiness. judgy gives you a lightweight, principled way to correct those numbers and to report uncertainty.
Practical reasons to adopt it:
- Quick integration: a single function call yields a corrected estimate and CI.
- Transparent methodology: the correction formula and bootstrap approach are explicit and easy to audit.
- Useful for planning: plotting and label-size experiments help decide how many human labels you need for acceptable confidence; a quick sketch of this kind of check follows below.
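As a quick sketch of that planning exercise, the loop below simulates a pool of human labels (hypothetical data; substitute your own labels and judge predictions) and tracks how the confidence interval tightens as more labels are used:

```python
import numpy as np
from judgy import estimate_success_rate

rng = np.random.default_rng(0)
truth = rng.binomial(1, 0.7, 200)                                 # human labels
test_preds = np.where(rng.random(200) < 0.9, truth, 1 - truth)    # ~90%-accurate judge
unlabeled_truth = rng.binomial(1, 0.7, 1000)
unlabeled_preds = np.where(rng.random(1000) < 0.9,
                           unlabeled_truth, 1 - unlabeled_truth)

for n_labels in [25, 50, 100, 200]:
    theta_hat, lo, hi = estimate_success_rate(
        test_labels=truth[:n_labels].tolist(),
        test_preds=test_preds[:n_labels].tolist(),
        unlabeled_preds=unlabeled_preds.tolist(),
    )
    print(f"{n_labels:4d} labels -> estimate {theta_hat:.3f}, 95% CI width {hi - lo:.3f}")
```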
Community & Project Health
The repository includes a full test suite (`tests/`), packaging metadata (`pyproject.toml`), and a permissive MIT license. Dynamic community metrics (GitHub stars, contributor count, issue activity) are not covered here, so check the live GitHub page below for up-to-date statistics.
GitHub Link
All sources, examples and contribution instructions are available in the project repository. The official repository for this project is at:
https://github.com/ai-evals-course/judgy
Final Thoughts
judgy is a practical, well-tested tool for anyone who uses LLMs as automated judges. It bridges the gap between what an LLM reports and what is likely true about your system, helping you produce more honest, defensible evaluations. The included plotting and synthetic data utilities make it easy to study sensitivity to judge errors and to plan how many human labels to collect.
If you’re assessing models with LLM-based judges, try the PyPI package or run the example scripts from the cloned repo to see how bias correction changes your conclusions.