(20Qs) English Audio to Text Benchmark | Gemini 3 Pro, GPT‑5.2, Llama 4, DeepSeek v3.2

(Updated Jan 2025)
This page reports benchmark results for eight English speech-to-text workflows.
English Benchmark
Every system transcribes the same English audio. We then compare each system's transcript to a reference answer and give it a score between 0 and 1.
A higher score means the transcript is closer to the reference text, which usually indicates a more accurate system.
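
The page does not say which comparison metric produces the 0-to-1 score. As a rough illustration only, the sketch below uses Python's standard-library `difflib.SequenceMatcher` over normalized word tokens. The function names `normalize` and `similarity_score` are hypothetical, and the benchmark's actual scorer may work differently.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text, drop punctuation, and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity_score(hypothesis: str, reference: str) -> float:
    """Return a 0-to-1 similarity between a transcript and its reference.

    Assumption: SequenceMatcher.ratio() stands in for the benchmark's
    real metric, which the page does not specify.
    """
    hyp, ref = normalize(hypothesis), normalize(reference)
    if not hyp and not ref:
        return 1.0  # two empty transcripts are treated as identical
    return difflib.SequenceMatcher(None, hyp, ref, autojunk=False).ratio()


if __name__ == "__main__":
    reference = "Please water the crops early in the morning."
    print(similarity_score("please water the crops early in the morning", reference))
    print(similarity_score("please water the crop in the morning", reference))
```

Normalizing case and punctuation first keeps the score focused on the words themselves, so a transcript is not penalized for a missing full stop.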

Ranking

Workflow	Accuracy (mean score)	Latency (median, s)
GPT-4o Audio	0.93	7.75
GPT-Realtime	0.93	7.70
GPT‑5.2	0.91	5.99
Gemini 3 Pro	0.86	10.16
Llama 4	0.91	5.77
DeepSeek v3.2	0.88	6.17
Gemini 3 Flash	0.89	7.77
GPT‑4.1	0.92	6.02
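
The rows above are not listed in score order. A minimal sketch of one way to turn them into a ranking follows: sort by mean accuracy, highest first, and break ties with the lower median latency. The tie-break rule and the names `Result` and `rank` are assumptions made for this example, not something the page defines.

```python
from typing import NamedTuple


class Result(NamedTuple):
    workflow: str
    accuracy: float   # mean score, 0 to 1
    latency_s: float  # median latency in seconds


# Values copied from the table above.
RESULTS = [
    Result("GPT-4o Audio", 0.93, 7.75),
    Result("GPT-Realtime", 0.93, 7.70),
    Result("GPT-5.2", 0.91, 5.99),
    Result("Gemini 3 Pro", 0.86, 10.16),
    Result("Llama 4", 0.91, 5.77),
    Result("DeepSeek v3.2", 0.88, 6.17),
    Result("Gemini 3 Flash", 0.89, 7.77),
    Result("GPT-4.1", 0.92, 6.02),
]


def rank(results: list[Result]) -> list[Result]:
    """Sort by accuracy (descending), then by latency (ascending).

    Assumption: latency is used only as a tie-breaker.
    """
    return sorted(results, key=lambda r: (-r.accuracy, r.latency_s))


if __name__ == "__main__":
    for position, r in enumerate(rank(RESULTS), start=1):
        print(f"{position}. {r.workflow}: {r.accuracy:.2f} accuracy, {r.latency_s:.2f} s")
```

If latency matters more than a small accuracy gap for your application, swap the order of the sort key or use a weighted combination instead.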

With this benchmark, you can:

See which system gets the best score
Compare different models and pipelines side by side
Choose the best system for your app, research, or product
Download all results for deeper analysis
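
The table headers indicate that accuracy is a mean score and latency is a median. Assuming the downloaded results hold one row per question and workflow, the same aggregation could be reproduced along these lines. The row layout and the rows themselves are hypothetical, invented only to show the calculation, and are not benchmark data.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical per-question rows: (workflow, score, run_time_seconds).
# These numbers are invented for illustration and are not benchmark data.
ROWS = [
    ("workflow_a", 0.90, 6.0),
    ("workflow_a", 1.00, 5.0),
    ("workflow_a", 0.80, 9.0),
    ("workflow_b", 0.70, 4.0),
    ("workflow_b", 0.90, 4.5),
]


def aggregate(rows):
    """Return {workflow: (mean score, median run time in seconds)}."""
    scores = defaultdict(list)
    times = defaultdict(list)
    for workflow, score, run_time in rows:
        scores[workflow].append(score)
        times[workflow].append(run_time)
    return {w: (mean(scores[w]), median(times[w])) for w in scores}


if __name__ == "__main__":
    for workflow, (avg_score, med_time) in aggregate(ROWS).items():
        print(f"{workflow}: mean score {avg_score:.2f}, median run time {med_time:.2f} s")
```

Using the median for run time keeps one unusually slow request from dominating the latency figure, while the mean score reflects every question equally.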

