Andrew Ginns

🧜‍♀️ Merbench - LLM Evaluation

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.

Each model is tested across three difficulty levels, with a limit of five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both an understanding of Mermaid syntax and effective tool usage.
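The validate-and-retry cycle described above can be sketched roughly as follows. This is a minimal illustration, not the Merbench harness itself: `run_test_case`, `ValidationResult`, `toy_generate`, and `toy_validate` are hypothetical names, the model and the MCP validator are stood in for by plain callables, and the toy validator only checks for a diagram header rather than parsing real Mermaid.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # attempt limit per test case, as stated above


@dataclass
class ValidationResult:
    ok: bool
    error: Optional[str] = None  # validator's error message, if any


def run_test_case(
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], ValidationResult],
    task: str,
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, int]:
    """Return (succeeded, attempts_used) for one test case."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diagram = generate(task, feedback)  # model writes or repairs the diagram
        result = validate(diagram)          # stand-in for the validation tool
        if result.ok:
            return True, attempt
        feedback = result.error             # error text guides the next attempt
    return False, max_attempts


# Toy stand-ins (assumptions for illustration only).
def toy_validate(diagram: str) -> ValidationResult:
    if diagram.startswith(("graph ", "flowchart ")):
        return ValidationResult(ok=True)
    return ValidationResult(ok=False, error="missing diagram type header")


def toy_generate(task: str, feedback: Optional[str]) -> str:
    body = "    A --> B"
    # First try forgets the header; after feedback it adds one.
    return body if feedback is None else "flowchart TD\n" + body


print(run_test_case(toy_generate, toy_validate, "draw A to B"))  # → (True, 2)
```

In this sketch a test case counts as a success only if some attempt within the limit validates, which is one plausible reading of "final success rate".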

Evaluation Summary

Total Evaluation Runs: 1,159
Models Evaluated: 25
Test Cases: 3

Providers Tested

Amazon, Google, OSS
Data updated: Oct 17, 2025
What do these metrics mean?
Success Rate: The percentage of successful Mermaid diagram generations out of all runs.
Avg Cost/Run: The average cost in USD to generate one diagram, based on provider pricing.
Price/Success: The effective cost for each successful diagram, calculated as (Avg Cost / Success Rate).
Avg Duration: The average time in seconds taken to generate a diagram.
Avg Tokens: The average number of tokens (input + output) used per run.
Runs: The total number of times this model was run in the evaluation.
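As a rough illustration of how these metrics relate, the sketch below aggregates a list of per-run records into the leaderboard columns. The `Run` record, the `summarize` function, and the sample values are hypothetical and are not taken from the Merbench code or data. Price/Success is returned as `None` when no run succeeded, which corresponds to the N/A entries in the leaderboard.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class Run:
    success: bool      # did the run end with a valid diagram?
    cost_usd: float    # cost of this run in USD
    duration_s: float  # wall-clock time in seconds
    tokens: int        # input + output tokens


def summarize(runs: Sequence[Run]) -> Dict[str, Optional[float]]:
    """Aggregate per-run records into leaderboard-style metrics."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    success_rate = sum(r.success for r in runs) / n
    avg_cost = sum(r.cost_usd for r in runs) / n
    return {
        "success_rate": success_rate,
        "avg_cost": avg_cost,
        # Avg Cost / Success Rate; undefined when nothing succeeded.
        "price_per_success": avg_cost / success_rate if success_rate > 0 else None,
        "avg_duration": sum(r.duration_s for r in runs) / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
        "runs": n,
    }


# Made-up sample: two of four runs succeed.
sample = [
    Run(True, 0.02, 30.0, 9000),
    Run(False, 0.03, 40.0, 12000),
    Run(False, 0.01, 20.0, 6000),
    Run(True, 0.04, 50.0, 13000),
]
stats = summarize(sample)
print(f"{stats['success_rate']:.1%}  ${stats['avg_cost']:.4f}  ${stats['price_per_success']:.4f}")
```

Because the published columns are rounded, dividing a displayed Avg Cost/Run by a displayed Success Rate may differ from the displayed Price/Success in the last digit.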

Model Leaderboard

Rank | Model | Success Rate | Avg Cost/Run | Price/Success | Avg Duration | Avg Tokens | Runs | Provider
1 | gemini-2.5-flash-preview-09-2025 | 31.1% | $0.0206 | $0.0661 | 32.33s | 22,980.822 | 45 | Google
2 | gemini-2.5-pro-preview-06-05 | 29.4% | $0.0383 | $0.1302 | 36.84s | 8,111.882 | 51 | Google
3 | gemini-2.5-pro-preview-05-06 | 26.7% | $0.1308 | $0.4904 | 49.85s | 19,753.911 | 45 | Google
4 | gemini-2.5-pro-preview-03-25 | 22.9% | $0.1133 | $0.4942 | 57.17s | 16,393.313 | 48 | Google
5 | gemini-2.5-pro | 20.0% | $0.0544 | $0.2722 | 32.94s | 14,255.511 | 45 | Google
6 | gemini-2.5-flash | 13.3% | $0.0128 | $0.0957 | 10.15s | 6,990.467 | 45 | Google
7 | qwen3-30b-a3b-thinking-2507-mlx | 10.3% | $0.0017 | $0.0168 | 92.27s | 8,166.795 | 39 | OSS
8 | seed-oss-36b-instruct-mlx | 6.3% | $0.0009 | $0.0150 | 396.96s | 3,053.438 | 16 | OSS
9 | gemini-2.5-flash-preview-05-20 | 5.0% | $0.0101 | $0.2014 | 9.75s | 5,771.55 | 60 | Google
10 | gemini-2.5-flash-lite-preview-06-17 | 5.0% | $0.0008 | $0.0163 | 4.40s | 4,974.583 | 60 | Google
11 | gemini-2.5-flash-preview-04-17 | 4.4% | $0.0233 | $0.5237 | 24.15s | 10,492.711 | 45 | Google
12 | gemini-2.5-flash-lite | 3.3% | $0.0013 | $0.0382 | 5.90s | 9,506.689 | 90 | Google
13 | us.amazon.nova-premier-v1:0 | 3.3% | $0.0356 | $1.0692 | 63.19s | 9,528.967 | 60 | Amazon
14 | gpt-oss-20b | 2.2% | $0.0002 | $0.0111 | 47.90s | 3,896.022 | 45 | OSS
15 | us.amazon.nova-pro-v1:0 | 0.0% | $0.0008 | N/A | 49.53s | 678.15 | 60 | Amazon
16 | us.amazon.nova-micro-v1:0 | 0.0% | $0.0001 | N/A | 18.83s | 1,783.85 | 60 | Amazon
17 | us.amazon.nova-lite-v1:0 | 0.0% | $0.0002 | N/A | 24.54s | 2,799.317 | 60 | Amazon
18 | gemini-2.0-flash | 0.0% | $0.0003 | N/A | 4.21s | 1,325.667 | 60 | Google
19 | google/gemma-3-27b | 0.0% | $0.0007 | N/A | 120.44s | 6,954.467 | 45 | OSS
20 | qwen3-coder-30b-a3b-instruct-mlx | 0.0% | $0.0005 | N/A | 21.61s | 4,383.356 | 45 | OSS
21 | qwen/qwen3-30b-a3b-2507 | 0.0% | $0.0006 | N/A | 23.04s | 4,277.978 | 45 | OSS
22 | magistral-small-2509-mlx | 0.0% | $0.0042 | N/A | 582.19s | 4,438.333 | 15 | OSS
23 | llama-xlam-2-70b-fc-r | 0.0% | $0.0027 | N/A | 238.06s | 8,591.267 | 15 | OSS
24 | gemini-2.5-flash-lite-preview-09-2025 | 0.0% | $0.0011 | N/A | 5.68s | 5,687.822 | 45 | Google
25 | xlam-2-32b-fc-r | 0.0% | $0.0004 | N/A | 171.86s | 8,210.2 | 15 | OSS

Charts: Performance vs Efficiency Trade-offs, Performance by Difficulty Level, Token Usage Breakdown, Failure Analysis by Reason

Last updated: October 17, 2025 at 03:18 PM UTC