Our general bulk evaluator for comparing AI-generated copilot answers against a collection of golden answers.
https://storage.googleapis.com/dara-c1b52.appspot.com/daras_ai/media/05b27832-46ab-11f0-b9d7-02420a0001b8/evaluator-9.csv
Upload or link to a CSV or Google Sheet that contains your sample input data. For example, for Copilot, this would be sample questions; for Art QR Code, it would be pairs of image descriptions and URLs. Remember to include header names in your CSV too.
GPT-4o (openai)
GPT-4.1 (openai)
GPT-4.1 Mini (openai)
GPT-4.1 Nano (openai)
GPT-4.5 (openai)
o4-mini (openai)
o3 (openai)
o3-mini (openai)
o1 (openai)
GPT-4o-mini (openai)
GPT-4 Turbo with Vision (openai) [Redirects to GPT-4o (openai)]
GPT-4 Turbo (openai) [Redirects to GPT-4o (openai)]
ChatGPT (openai) [Redirects to GPT-4o-mini (openai)]
DeepSeek R1
Llama 4 Maverick Instruct
Llama 4 Scout Instruct
Llama 3.3 70B
Llama 3.2 90B + Vision (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Llama 3.2 11B + Vision (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Llama 3.2 3B (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Llama 3.2 1B (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Llama 3.1 405B (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Llama 3.1 70B (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Llama 3.1 8B (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Llama 3 70B (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Llama 3 8B (Meta AI) [Redirects to Llama 4 Maverick Instruct]
Pixtral Large 24/11
Mistral Large 24/11
Mistral Small 25/01
Mixtral 8x7b Instruct v0.1 [Deprecated] [Redirects to Mistral Small 25/01]
Gemma 2 9B (Google)
Gemma 7B (Google) [Redirects to Gemma 2 9B (Google)]
Gemini 2.5 Pro (Google)
Gemini 2.5 Flash (Google)
Gemini 2 Flash Lite (Google)
Gemini 2 Flash (Google)
Gemini 1.5 Flash (Google) [Redirects to Gemini 2 Flash (Google)]
Gemini 1.5 Pro (Google) [Redirects to Gemini 2.5 Pro (Google)]
Claude 4 Sonnet (Anthropic)
Claude 4 Opus (Anthropic)
Claude 3.7 Sonnet (Anthropic)
Claude 3.5 Sonnet (Anthropic) [Redirects to Claude 3.7 Sonnet (Anthropic)]
Claude 3 Opus (Anthropic) [Redirects to Claude 3.7 Sonnet (Anthropic)]
Claude 3 Sonnet (Anthropic) [Redirects to Claude 3.7 Sonnet (Anthropic)]
Claude 3 Haiku (Anthropic) [Redirects to Claude 3.7 Sonnet (Anthropic)]
Llama 3 Groq 70b Tool Use [Deprecated] [Redirects to GPT-4o-mini (openai)]
Llama 3 Groq 8b Tool Use [Deprecated] [Redirects to GPT-4o-mini (openai)]
Specify custom LLM prompts to calculate metrics that evaluate each row of the input data. The output should be a JSON object mapping the metric names to values. The columns dictionary can be used to reference the spreadsheet columns.
columns
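As a rough illustration of the idea above, the sketch below fills an evaluation prompt from one spreadsheet row and parses a JSON reply into a metrics dictionary. Everything specific here is an assumption made for the example: the column names (question, golden_answer, answer), the prompt wording, the metric names, and the use of Python's str.format as a stand-in for whatever templating the platform actually applies. The model reply is hard-coded rather than fetched from an LLM.

```python
import json

# Hypothetical spreadsheet row; the column names are assumptions for this sketch.
columns = {
    "question": "What is the capital of France?",
    "golden_answer": "Paris",
    "answer": "The capital of France is Paris.",
}

# Hypothetical evaluation prompt. Doubled braces are literal braces in str.format.
PROMPT_TEMPLATE = (
    "Compare the answer to the golden answer.\n"
    "Question: {question}\n"
    "Golden answer: {golden_answer}\n"
    "Answer: {answer}\n"
    'Respond with a JSON object such as {{"score": 0.0, "rationale": "..."}}.'
)


def build_prompt(row: dict) -> str:
    """Fill the template with values from one spreadsheet row."""
    return PROMPT_TEMPLATE.format(**row)


def parse_metrics(llm_output: str) -> dict:
    """Parse the model reply and check it is a JSON object of metrics."""
    metrics = json.loads(llm_output)
    if not isinstance(metrics, dict):
        raise ValueError("expected a JSON object mapping metric names to values")
    return metrics


prompt = build_prompt(columns)

# Stand-in for a real model reply; no LLM is called in this sketch.
sample_reply = '{"score": 1.0, "rationale": "Matches the golden answer."}'
metrics = parse_metrics(sample_reply)
```

Validating that the reply is a JSON object before using it keeps malformed model output from silently turning into missing metric values later.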
Aggregate using one or more operations. Uses pandas.
mean
median
min
max
sum
cumsum
prod
cumprod
std
var
first
last
count
cumcount
nunique
rank
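The operation names listed above match pandas method names. A minimal sketch of how such aggregations behave on a column of per-row metric values follows; the score column and its values are made up for illustration, and this is not necessarily how the evaluator applies them internally.

```python
import pandas as pd

# Hypothetical per-row metric values produced by an evaluation prompt.
scores = pd.DataFrame({"score": [1.0, 0.5, 0.0, 1.0]})

# Reducing operations collapse the column to one value per operation.
summary = scores["score"].agg(["mean", "median", "min", "max", "count", "nunique"])

# Cumulative operations such as cumsum return one value per row instead.
running_total = scores["score"].cumsum().tolist()
```

Note the difference between reducers (mean, median, min, max, sum, prod, std, var, first, last, count, nunique), which yield a single number, and row-wise operations (cumsum, cumprod, cumcount, rank), which yield a value for every row.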
⚙️ Settings
Run cost = 90 credits
ℹ️ Details
Which AI model actually works best for your needs? Upload your own data and evaluate any Gooey.AI workflow, LLM or AI model against any other. Great for large data sets, AI model evaluation, task automation, …
Gooey.AI's Copilot Builder is the best chatbot builder anywhere, combining your choice of LLMs (GPT4o/4.1, Gemini2.5, Claude4, Mixtral or LLaMA4), knowledge docs from any link or doc/PDF (with table …
Transcribe mp3s, WhatsApp voice, YouTube videos in 1000+ languages with Meta’s MMS/Seamless M4T, OpenAI's GPT4o Audio LLM, Whisper v2/v3, Azure, Google, GhanaNLP, AI4Bharat & Bhashini ASR models. Optionally …
We've built the best Retrieval Augmented Generation (RAG) as-a-Service anywhere - now with page-level citations! Absorb tables, PDFs, docs, links, videos or audio clips and use our synthetic data maker to …