Merbench - Andrew Ginns

Getting LLMs to consistently nail the Mermaid diagram syntax can be... an adventure.

Merbench evaluates an LLM's ability to autonomously write and debug Mermaid syntax. The agent can access an MCP server that validates its code and provides error feedback, guiding it towards a correct solution.

Each model is tested across three difficulty levels, with a limit of five attempts per test case. Performance is measured by the final success rate, averaged over complete runs, which reflects both the model's grasp of Mermaid syntax and how effectively it uses the validation tool.
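The validate-and-retry behaviour described above can be pictured as a simple loop. This is a minimal sketch, not Merbench's actual harness: `generate`, `validate`, `toy_generate` and `toy_validate` are hypothetical stand-ins for the model call and the MCP validation tool, and the toy validator only checks a prefix rather than parsing real Mermaid. It also assumes that one "attempt" means one generate-then-validate round.

```python
from typing import Callable, Optional, Tuple

MAX_ATTEMPTS = 5  # per-test-case limit stated in the text


def run_test_case(
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[bool, int]:
    """Return (succeeded, attempts_used) for a single test case."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        diagram = generate(task, feedback)  # hypothetical model call
        ok, error = validate(diagram)       # hypothetical stand-in for the MCP validator
        if ok:
            return True, attempt
        feedback = error                    # feed the error back into the next attempt
    return False, max_attempts


# Toy stand-ins so the sketch runs on its own (assumptions, not real Mermaid parsing).
def toy_validate(diagram: str) -> Tuple[bool, str]:
    if diagram.startswith("graph "):
        return True, ""
    return False, "expected a diagram type declaration such as 'graph TD'"


def toy_generate(task: str, feedback: Optional[str]) -> str:
    # The first try omits the header; once it sees feedback it adds one.
    return "graph TD; A-->B" if feedback else "A-->B"


print(run_test_case(toy_generate, toy_validate, "draw A to B"))  # prints (True, 2)
```

A model that never produces valid output uses up all five attempts and the test case counts as a failure.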

What do these metrics mean?

Success Rate: The percentage of successful Mermaid diagram generations out of all runs.
Avg Cost/Run: The average cost in USD to generate one diagram, based on provider pricing.
Price/Success: The effective cost of each successful diagram, calculated as Avg Cost/Run divided by Success Rate. Shown as N/A when a model has no successful runs.
Avg Duration: The average time in seconds taken to generate a diagram.
Avg Tokens: The average number of tokens (input + output) used per run.
Runs: The total number of times this model was run in the evaluation.
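As a rough illustration of how the metrics above relate to each other, the sketch below aggregates a list of per-run records. The `Run` record and its field names are assumptions made for this example, not Merbench's data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    # Hypothetical per-run record; field names are illustrative only.
    success: bool
    cost_usd: float
    duration_s: float
    tokens: int  # input + output tokens


def summarize(runs: List[Run]) -> dict:
    """Aggregate per-run records into the leaderboard metrics."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    success_rate = sum(r.success for r in runs) / n
    avg_cost = sum(r.cost_usd for r in runs) / n
    # Price/Success is undefined (N/A) when no run succeeded.
    price_per_success: Optional[float] = (
        avg_cost / success_rate if success_rate > 0 else None
    )
    return {
        "success_rate": success_rate,
        "avg_cost": avg_cost,
        "price_per_success": price_per_success,
        "avg_duration": sum(r.duration_s for r in runs) / n,
        "avg_tokens": sum(r.tokens for r in runs) / n,
        "runs": n,
    }


runs = [
    Run(success=True, cost_usd=0.02, duration_s=10.0, tokens=100),
    Run(success=False, cost_usd=0.04, duration_s=20.0, tokens=300),
]
summary = summarize(runs)
print(f"{summary['success_rate']:.1%}")        # prints 50.0%
print(f"{summary['price_per_success']:.4f}")   # prints 0.0600
```

The same arithmetic applied to the top leaderboard row gives $0.0206 / 0.311, or roughly $0.066, which agrees with the listed $0.0661 up to rounding of the displayed values.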

Model Leaderboard

Rank	Model	Success Rate ↓	Avg Cost/Run	Price/Success	Avg Duration	Avg Tokens	Runs	Provider
1	gemini-2.5-flash-preview-09-2025	31.1%	$0.0206	$0.0661	32.33s	22,980.822	45	Google
2	gemini-2.5-pro-preview-06-05	29.4%	$0.0383	$0.1302	36.84s	8,111.882	51	Google
3	gemini-2.5-pro-preview-05-06	26.7%	$0.1308	$0.4904	49.85s	19,753.911	45	Google
4	gemini-2.5-pro-preview-03-25	22.9%	$0.1133	$0.4942	57.17s	16,393.313	48	Google
5	gemini-2.5-pro	20.0%	$0.0544	$0.2722	32.94s	14,255.511	45	Google
6	gemini-2.5-flash	13.3%	$0.0128	$0.0957	10.15s	6,990.467	45	Google
7	qwen3-30b-a3b-thinking-2507-mlx	10.3%	$0.0017	$0.0168	92.27s	8,166.795	39	OSS
8	seed-oss-36b-instruct-mlx	6.3%	$0.0009	$0.0150	396.96s	3,053.438	16	OSS
9	gemini-2.5-flash-preview-05-20	5.0%	$0.0101	$0.2014	9.75s	5,771.55	60	Google
10	gemini-2.5-flash-lite-preview-06-17	5.0%	$0.0008	$0.0163	4.40s	4,974.583	60	Google
11	gemini-2.5-flash-preview-04-17	4.4%	$0.0233	$0.5237	24.15s	10,492.711	45	Google
12	gemini-2.5-flash-lite	3.3%	$0.0013	$0.0382	5.90s	9,506.689	90	Google
13	us.amazon.nova-premier-v1:0	3.3%	$0.0356	$1.0692	63.19s	9,528.967	60	Amazon
14	gpt-oss-20b	2.2%	$0.0002	$0.0111	47.90s	3,896.022	45	OSS
15	us.amazon.nova-pro-v1:0	0.0%	$0.0008	N/A	49.53s	678.15	60	Amazon
16	us.amazon.nova-micro-v1:0	0.0%	$0.0001	N/A	18.83s	1,783.85	60	Amazon
17	us.amazon.nova-lite-v1:0	0.0%	$0.0002	N/A	24.54s	2,799.317	60	Amazon
18	gemini-2.0-flash	0.0%	$0.0003	N/A	4.21s	1,325.667	60	Google
19	google/gemma-3-27b	0.0%	$0.0007	N/A	120.44s	6,954.467	45	OSS
20	qwen3-coder-30b-a3b-instruct-mlx	0.0%	$0.0005	N/A	21.61s	4,383.356	45	OSS
21	qwen/qwen3-30b-a3b-2507	0.0%	$0.0006	N/A	23.04s	4,277.978	45	OSS
22	magistral-small-2509-mlx	0.0%	$0.0042	N/A	582.19s	4,438.333	15	OSS
23	llama-xlam-2-70b-fc-r	0.0%	$0.0027	N/A	238.06s	8,591.267	15	OSS
24	gemini-2.5-flash-lite-preview-09-2025	0.0%	$0.0011	N/A	5.68s	5,687.822	45	Google
25	xlam-2-32b-fc-r	0.0%	$0.0004	N/A	171.86s	8,210.2	15	OSS


Dashboard panels on the original page (charts not reproduced here): Evaluation Summary, Providers Tested, Performance vs Efficiency Trade-offs, Performance by Difficulty Level, Token Usage Breakdown, Failure Analysis by Reason.