Examples: Eval

Our general bulk evaluator compares AI-generated copilot answers against a collection of golden answers.


This recipe is used with https://gooey.ai/bulk to evaluate the latest private and open-source speech recognition models (from Google, Meta, OpenAI and others). It takes a CSV file of golden (i.e. human-provided) translations and compares them against a set of AI-created translations, scoring each one from 0 to 1. It then takes the mean of the scores for each model to determine which model performed best.
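The final step described above (averaging per-sample scores and picking the model with the highest mean) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the recipe's actual implementation: the model names, the score values, and the `best_model` helper are all hypothetical.

```python
from statistics import mean


def best_model(scores_by_model):
    """Return the model with the highest mean score, plus all the means.

    scores_by_model maps a model name to its list of per-sample
    scores, each expected to lie between 0 and 1.
    """
    # Skip models with no scores so mean() never sees an empty list.
    means = {
        name: mean(scores)
        for name, scores in scores_by_model.items()
        if scores
    }
    if not means:
        raise ValueError("no scores to compare")
    winner = max(means, key=means.get)
    return winner, means


# Hypothetical per-sample scores for two made-up models.
scores_by_model = {
    "model_a": [0.9, 0.8, 1.0],
    "model_b": [0.6, 0.7, 0.5],
}

winner, means = best_model(scores_by_model)
print(winner, means)
```

A plain arithmetic mean treats every sample equally, so a model that fails badly on a few samples can still rank first if it does well on the rest.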


A bulk evaluator workflow that compares AI-generated answers (copilot responses) to a set of golden reference answers. It requires two input data columns: "input_prompt" (the question or task) and "reference_answer" (the ideal response). The workflow uses custom evaluation prompts to compare outputs, scoring them for accuracy and penalizing hallucinations. It then aggregates the results into an overall performance metric for your AI answers.
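As a rough sketch of the input shape and the aggregation step described above, the snippet below checks a CSV for the two required columns and averages per-row scores. Several things here are assumptions for illustration only: the `ai_answer` column name, the sample rows, and the `exact_match` scorer, which is a trivial stand-in for the workflow's prompt-based evaluation and does not reproduce it.

```python
import csv
import io

# Column names taken from the workflow description.
REQUIRED_COLUMNS = ("input_prompt", "reference_answer")


def load_rows(csv_text):
    """Parse CSV text and verify that the required columns are present."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return list(reader)


def exact_match(answer, reference):
    """Hypothetical stand-in scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def aggregate(rows, score_fn, answer_column="ai_answer"):
    """Average the per-row scores into a single overall metric."""
    scores = [score_fn(row[answer_column], row["reference_answer"]) for row in rows]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sample data; "ai_answer" is an assumed column name.
sample_csv = """input_prompt,reference_answer,ai_answer
What is 2+2?,4,4
Capital of France?,Paris,Lyon
"""

rows = load_rows(sample_csv)
overall = aggregate(rows, exact_match)
print(overall)
```

Passing the scorer in as a function keeps the aggregation independent of how each answer is judged, so a stricter or more lenient comparison can be swapped in without touching the rest.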


9mo ago

88 runs

Here we compare the top 5 ASR models on a set of Telugu samples. The speech output was created with https://gooey.ai/bulk/?example_id=nrkx2u17


2y ago

308 runs

Here we compare the top 3 ASR models on a set of Kannada samples. The speech output was created with https://gooey.ai/bulk/?example_id=m8c3mb98


2y ago

Here we compare the top 6 ASR models on a set of Hindi samples. The speech translations were created with https://gooey.ai/bulk/?example_id=ueki9up0.


2y ago