LLM Output Comparator
Compare 2–4 LLM outputs side by side with diff, metrics, and Markdown / JSON rendering
Tokyo is the capital of Japan. It is the most populous metropolitan area in the world.
Tokyo is the capital city of Japan. Greater Tokyo is the world’s most populous metropolitan area.
| Metric | GPT-4.1 | Claude 4.7 |
|---|---|---|
| Tokens (GPT-4.1) | — | — |
| Chars | 86 | 97 |
| Words | 16 | 17 |
| Sentences | 2 | 2 |
| Code blocks | 0 | 0 |
| Bullet points | 0 | 0 |
| Jaccard similarity (vs col 1) | 1.00 | 0.69 |
| Contains refusal | No | No |
| Parses as JSON | No | No |
Metrics are surface-level (token counts, word overlap). They are not a quality ranking — judge semantics yourself.
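The Jaccard row in the table can be approximated with a few lines of Python. This is a minimal sketch, not the tool's actual implementation: it assumes similarity is computed over sets of lowercased words, tokenized as runs of word characters, which happens to reproduce the sample values. The names `word_set` and `jaccard` are hypothetical.

```python
import re


def word_set(text: str) -> set[str]:
    # Assumed tokenization: lowercase, then runs of letters, digits, or underscores.
    # Under this rule the curly apostrophe splits "world’s" into "world" and "s".
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    # Jaccard similarity: size of the intersection divided by size of the union.
    sa, sb = word_set(a), word_set(b)
    if not sa and not sb:
        return 1.0  # convention chosen here for two empty texts
    return len(sa & sb) / len(sa | sb)


col1 = "Tokyo is the capital of Japan. It is the most populous metropolitan area in the world."
col2 = "Tokyo is the capital city of Japan. Greater Tokyo is the world’s most populous metropolitan area."

print(f"{jaccard(col1, col1):.2f}")  # 1.00 (a text compared with itself)
print(f"{jaccard(col1, col2):.2f}")  # 0.69 (11 shared words out of 16 distinct)
```

Because the score only looks at which words appear, two answers with the same vocabulary in a different order still score 1.00, which is one reason the metric says nothing about quality.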
Frequently Asked Questions
What does the LLM Output Comparator do?
Paste 2 to 4 model outputs (GPT, Claude, Gemini, your own fine-tune) and get them side by side with length, similarity, and diff highlighting. It answers the everyday question "did changing prompt X or swapping to model Y actually make the answer better, or just different?"
Should I compare 2, 3, or 4 outputs at once?
- **2**: A/B test, e.g. old prompt vs new prompt, or GPT-4.1 vs Claude Opus 4.7
- **3**: adds a "control", e.g. base model, current prompt, proposed prompt
- **4**: multi-model bake-off across a provider lineup

More than 4 becomes hard to read; at that point export to a spreadsheet and evaluate structurally.
What metrics are shown?
Per output: character count, word count, sentence count, approximate token count, number of code blocks and bullet points, refusal detection (looks for "I cannot" / "I'm sorry" patterns), whether the text parses as JSON, and pairwise Jaccard similarity (word overlap) between all outputs. For JSON outputs, a field-level diff highlights which keys differ rather than just flagging the whole blob as different.
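A rough sketch of how a few of these per-output metrics could be computed, assuming simple regex rules. The refusal pattern covers only the two phrases quoted above, and `surface_metrics`, `parses_as_json`, and `REFUSAL_RE` are hypothetical names, not the tool's API. Token counts are left out because they depend on the model's tokenizer.

```python
import json
import re

# Assumed refusal phrases, limited to the two examples named in the text.
# The character class accepts both straight and curly apostrophes.
REFUSAL_RE = re.compile(r"\bI cannot\b|\bI['’]m sorry\b", re.IGNORECASE)


def parses_as_json(text: str) -> bool:
    # True when the whole string is valid JSON (note: a bare number also counts).
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def surface_metrics(text: str) -> dict:
    return {
        "chars": len(text),
        "words": len(re.findall(r"\w+", text)),  # assumed word rule
        "contains_refusal": bool(REFUSAL_RE.search(text)),
        "parses_as_json": parses_as_json(text),
    }


sample = "Tokyo is the capital of Japan. It is the most populous metropolitan area in the world."
print(surface_metrics(sample))
# {'chars': 86, 'words': 16, 'contains_refusal': False, 'parses_as_json': False}

print(surface_metrics("I'm sorry, but I cannot help with that.")["contains_refusal"])  # True
```

Phrase matching like this is cheap but blunt: it flags an answer that merely quotes "I cannot", so treat the refusal column as a hint to read the output, not a verdict.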
How is JSON-level diff different from text diff?
Text diff compares characters, so key ordering, whitespace, and trailing commas all show up as changes. JSON diff parses both sides, walks the object tree, and reports "field `user.age` changed 30 → 31" regardless of formatting. Use it when comparing structured `tool_call` outputs or function-calling responses.
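The tree walk described above can be sketched as a small recursive function. This is an illustration under stated assumptions, not the tool's code: `json_diff` is a hypothetical name, the message format imitates the example in the text, and arrays are compared position by position.

```python
import json
from typing import Any, Iterator


def json_diff(left: Any, right: Any, path: str = "") -> Iterator[str]:
    """Yield one message per field that differs between two parsed JSON values."""
    if isinstance(left, dict) and isinstance(right, dict):
        # Visit the union of keys so key order in the source text is irrelevant.
        for key in sorted(left.keys() | right.keys()):
            child = f"{path}.{key}" if path else key
            if key not in right:
                yield f"field `{child}` removed"
            elif key not in left:
                yield f"field `{child}` added"
            else:
                yield from json_diff(left[key], right[key], child)
    elif isinstance(left, list) and isinstance(right, list):
        # Assumption: arrays are compared by index, not by matching elements.
        for i in range(max(len(left), len(right))):
            child = f"{path}[{i}]"
            if i >= len(right):
                yield f"field `{child}` removed"
            elif i >= len(left):
                yield f"field `{child}` added"
            else:
                yield from json_diff(left[i], right[i], child)
    elif left != right:
        # Scalars, or values whose types differ (e.g. object vs array).
        yield f"field `{path or '$'}` changed {json.dumps(left)} → {json.dumps(right)}"


a = json.loads('{"user": {"name": "Ada", "age": 30}, "tags": ["x"]}')
b = json.loads('{\n  "tags": ["x"],\n  "user": {"age": 31, "name": "Ada"}\n}')

for line in json_diff(a, b):
    print(line)
# field `user.age` changed 30 → 31
```

The two inputs differ in key order and whitespace, which a character-level diff would report as changes throughout, while the parsed comparison reports only the one changed value. One caveat of this sketch: Python treats `1`, `1.0`, and `True` as equal, so a stricter version would also compare types.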
Is my data sent anywhere?
No. All comparisons, tokenization, and Markdown / JSON rendering happen in the browser. You can paste full customer conversations or internal evaluation data without it leaving your machine.
How is this different from promptfoo or OpenAI Evals?
promptfoo and Evals are batch frameworks: you define a dataset, assertions, and graders, then run thousands of cases. This tool is the opposite end — inspecting two to four specific outputs by hand when you are debugging, not benchmarking. Use them together: eyeball here, scale up in promptfoo once you know what to look for.