Question 1

What does the LLM Output Comparator do?

Accepted Answer

Paste 2 to 4 model outputs (GPT, Claude, Gemini, your own fine-tune) and get them side by side with length, similarity, and diff highlighting. It answers the everyday question "did changing prompt X or swapping to model Y actually make the answer better, or just different?"

Question 2

Should I compare 2, 3, or 4 outputs at once?

Accepted Answer

- **2**: A/B test — old prompt vs new prompt, or GPT-4.1 vs Claude Opus 4.7
- **3**: adds a "control" — e.g. base model, current prompt, proposed prompt
- **4**: multi-model bake-off across a provider lineup

More than 4 becomes hard to read; at that point export to a spreadsheet and evaluate structurally.

Question 3

What metrics are shown?

Accepted Answer

Per output: character count, word count, approximate token count, refusal detection (looks for "I cannot" / "I'm sorry" patterns), and pairwise Jaccard similarity between all outputs. For JSON outputs, a field-level diff highlights which keys differ rather than just flagging the whole blob as different.
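As a rough illustration of two of these metrics, the sketch below computes pairwise Jaccard similarity and a simple refusal check. It is not the tool's actual code: the lowercased word-set tokenization, the phrase list in `REFUSAL_RE`, and all function names are assumptions made for the example.

```python
import re
from itertools import combinations

# Assumed refusal phrases; the tool's real pattern list is not documented here.
REFUSAL_RE = re.compile(r"\b(i cannot|i can't|i'm sorry|i am sorry)\b", re.IGNORECASE)


def word_set(text: str) -> set[str]:
    """Lowercased word tokens (an assumed tokenization for the similarity score)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Size of the shared word set divided by the size of the combined word set."""
    sa, sb = word_set(a), word_set(b)
    if not sa and not sb:
        return 1.0  # two empty outputs are treated as identical
    return len(sa & sb) / len(sa | sb)


def contains_refusal(text: str) -> bool:
    """True if any assumed refusal phrase appears; curly apostrophes are normalized."""
    return REFUSAL_RE.search(text.replace("\u2019", "'")) is not None


def pairwise_jaccard(outputs: list[str]) -> dict[tuple[int, int], float]:
    """Similarity for every unordered pair of outputs, keyed by their indices."""
    return {
        (i, j): jaccard(outputs[i], outputs[j])
        for i, j in combinations(range(len(outputs)), 2)
    }
```

Because the score is computed over word sets, word order and repetition are ignored: two outputs that use the same vocabulary in a different order score 1.00, so the number is best read alongside the diff view rather than on its own.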

Question 4

How is JSON-level diff different from text diff?

Accepted Answer

Text diff compares characters and gets thrown off by key ordering, whitespace, and trailing commas. JSON diff parses both sides, walks the object tree, and reports "field `user.age` changed 30 → 31" regardless of formatting. Use it when comparing structured tool_call outputs or function-calling responses.
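The tree walk described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: the names `diff_values` and `json_diff`, the dotted-path format, and the index-by-index handling of arrays are all choices made for the example.

```python
import json
from typing import Any, Iterator


def diff_values(left: Any, right: Any, path: str = "") -> Iterator[str]:
    """Yield one line per differing field, addressed by a dotted path."""
    if isinstance(left, dict) and isinstance(right, dict):
        # Visit the union of keys so key order in the source text is irrelevant.
        for key in sorted(left.keys() | right.keys()):
            child = f"{path}.{key}" if path else key
            if key not in left:
                yield f"field `{child}` added: {json.dumps(right[key])}"
            elif key not in right:
                yield f"field `{child}` removed: {json.dumps(left[key])}"
            else:
                yield from diff_values(left[key], right[key], child)
    elif isinstance(left, list) and isinstance(right, list):
        # Arrays are compared position by position (an assumed policy).
        for i in range(max(len(left), len(right))):
            child = f"{path}[{i}]"
            if i >= len(left):
                yield f"field `{child}` added: {json.dumps(right[i])}"
            elif i >= len(right):
                yield f"field `{child}` removed: {json.dumps(left[i])}"
            else:
                yield from diff_values(left[i], right[i], child)
    elif isinstance(left, bool) != isinstance(right, bool) or left != right:
        # The bool check keeps JSON `true` from comparing equal to the number 1.
        label = path or "(root)"
        yield f"field `{label}` changed {json.dumps(left)} → {json.dumps(right)}"


def json_diff(a: str, b: str) -> list[str]:
    """Parse both sides, then report field-level differences."""
    return list(diff_values(json.loads(a), json.loads(b)))
```

For example, `json_diff('{"user": {"name": "Ann", "age": 30}}', '{"user": {"age": 31, "name": "Ann"}}')` returns a single entry for `user.age`, even though the key order differs between the two inputs. Both sides must parse as JSON for this to work; `json.loads` raises an error on input that does not.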

Question 5

Is my data sent anywhere?

Accepted Answer

No. All comparisons, tokenization, and Markdown / JSON rendering happen in the browser. You can paste full customer conversations or internal evaluation data without it leaving your machine.

Question 6

How is this different from promptfoo or OpenAI Evals?

Accepted Answer

promptfoo and Evals are batch frameworks: you define a dataset, assertions, and graders, then run thousands of cases. This tool is the opposite end — inspecting two to four specific outputs by hand when you are debugging, not benchmarking. Use them together: eyeball here, scale up in promptfoo once you know what to look for.

| Metric | GPT-4.1 | Claude 4.7 |
| --- | --- | --- |
| Tokens (GPT-4.1) | — | — |
| Chars | 86 | 97 |
| Words | 16 | 17 |
| Sentences | 2 | 2 |
| Code blocks | 0 | 0 |
| Bullet points | 0 | 0 |
| Jaccard similarity (vs col 1) | 1.00 | 0.69 |
| Contains refusal | No | No |
| Parses as JSON | No | No |

