Gates Foundation

gatesfoundation

A workspace for the Gates Foundation DPI, FairForward and Gooey teams, focused on evals for low-resource languages. It is also the home of our agriculture advisory work, e.g. https://gooey.ai/ageval

332 Public Workflows
8 Members

💬

Share settings updated

2mo ago

Public

A bulk evaluator workflow that compares AI-generated answers (copilot responses) to a set of golden reference answers. Requires input data columns: "input_prompt" (the question/task) and "reference_answer" (the ideal response). The workflow uses custom evaluation prompts to compare outputs, scoring them for accuracy and penalizing hallucinations. Aggregates results to provide an overall performance metric for your AI answers.
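As a rough illustration of the flow described above, the sketch below scores each generated answer against its reference answer and averages the scores. The column names "input_prompt" and "reference_answer" come from the description. The "generated_answer" field, the function names, the sample rows, and the token-overlap F1 scorer are all assumptions: the scorer is a simple stand-in for the workflow's custom evaluation prompts, which are not shown on this page.

```python
from collections import Counter
from statistics import mean


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 in [0, 1]; a stand-in for the prompt-based judge."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)  # drops when the answer adds extra text
    recall = overlap / len(ref)      # drops when the answer omits reference text
    return 2 * precision * recall / (precision + recall)


def evaluate_rows(rows: list[dict]) -> dict:
    """Score every row, then aggregate into one overall metric."""
    scores = [token_f1(r["generated_answer"], r["reference_answer"]) for r in rows]
    return {"scores": scores, "mean_score": mean(scores) if scores else 0.0}


# Made-up rows, only to show the expected input shape.
rows = [
    {"input_prompt": "question one",
     "reference_answer": "plant maize at the start of the rains",
     "generated_answer": "plant maize at the start of the rains"},
    {"input_prompt": "question two",
     "reference_answer": "water tomatoes twice a week",
     "generated_answer": "water tomatoes daily and add salt"},
]
result = evaluate_rows(rows)
```

Token overlap only rewards shared wording, so it cannot detect a fluent but wrong answer the way a prompt-based judge can. It is used here only to keep the example self-contained.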

(Updated Jan 2026)
This page shows a test of many Swahili (Kiswahili) speech‑to‑text systems and, in some cases, Swahili → English translation pipelines.
We use the same Swahili audio clips for every system. Then we compare each system’s text output to a reference answer and give it a score between 0 and 1.
A higher score means the system is closer to the reference text and usually more accurate.
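The page does not say which metric produces the score between 0 and 1. One common choice for speech-to-text output is 1 minus the word error rate, clipped at 0. The sketch below assumes that choice purely for illustration; `word_error_rate` and `accuracy_score` are hypothetical helper names, not part of the benchmark.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by the reference length."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from output
                            curr[j - 1] + 1,      # extra word in output
                            prev[j - 1] + cost))  # substitution or exact match
        prev = curr
    return prev[-1] / len(ref)


def accuracy_score(hypothesis: str, reference: str) -> float:
    """Map WER to a score in [0, 1], where 1 means an exact match."""
    return max(0.0, 1.0 - word_error_rate(hypothesis, reference))
```

For pipelines that translate Swahili into English, a word-level metric like this is often too strict, because a correct translation can use different words than the reference. A meaning-based comparison may be a better fit there.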

No.  Workflow                         Accuracy (Mean)  Median Latency (s)
0    GPT-4oAudio                      0.50             5.49
1    GPT-Realtime                     0.45             5.13
2    Jacaranda + GPT-5.1              0.94             4.05
3    Jacaranda + Gemini 3 Pro         0.96             8.84
4    Jacaranda + GPT-5.1 + Goog MT    0.91             4.13
5    Omni + GPT-5.1 + GoogMT          0.92             4.87
6    Omni + Gemini 3 Pro              0.96             8.54
7    Omni + Gemini 3 Pro + GoogM      0.96             9.12
8    Gemini 3 Pro                     0.92             9.80
9    Jacaranda + Gemini 3 Flash       0.92             5.32
10   Jacaranda + GPT-4.1              0.89             3.62
11   Gemini 3 Flash                   0.85             5.95

On this page you can:

  • See which Swahili system or pipeline gets the best score
  • Compare different Swahili ASR and Swahili→English models side by side
  • Choose the best system for your app, call center, research, or product
  • Download all results for deeper analysis and custom reporting
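To make the "choose the best system" step concrete, here is a minimal sketch that picks the most accurate workflow under a latency budget. The numbers are copied from the Swahili table on this page; the function name `best_within_budget` and the idea of a fixed latency budget are assumptions for illustration.

```python
# (workflow, mean accuracy, median latency in seconds) from the Swahili table
SWAHILI_RESULTS = [
    ("GPT-4oAudio", 0.50, 5.49),
    ("GPT-Realtime", 0.45, 5.13),
    ("Jacaranda + GPT-5.1", 0.94, 4.05),
    ("Jacaranda + Gemini 3 Pro", 0.96, 8.84),
    ("Jacaranda + GPT-5.1 + Goog MT", 0.91, 4.13),
    ("Omni + GPT-5.1 + GoogMT", 0.92, 4.87),
    ("Omni + Gemini 3 Pro", 0.96, 8.54),
    ("Omni + Gemini 3 Pro + GoogM", 0.96, 9.12),
    ("Gemini 3 Pro", 0.92, 9.80),
    ("Jacaranda + Gemini 3 Flash", 0.92, 5.32),
    ("Jacaranda + GPT-4.1", 0.89, 3.62),
    ("Gemini 3 Flash", 0.85, 5.95),
]


def best_within_budget(results, max_latency_s):
    """Return the most accurate workflow whose median latency fits the budget."""
    eligible = [r for r in results if r[2] <= max_latency_s]
    if not eligible:
        return None
    # Highest accuracy wins; ties go to the lower latency.
    return max(eligible, key=lambda r: (r[1], -r[2]))


fast_pick = best_within_budget(SWAHILI_RESULTS, 5.0)
relaxed_pick = best_within_budget(SWAHILI_RESULTS, 10.0)
```

With a 5-second budget this selects Jacaranda + GPT-5.1. With a 10-second budget three workflows tie at 0.96, and the tie-break picks Omni + Gemini 3 Pro because it is the fastest of the three.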

(20 questions, updated Jan 2025)

This page shows a test of many Hindi speech‑to‑text systems and, in some cases, Hindi → English translation pipelines.

We use the same Hindi audio clips for every system. Then we compare each system’s text output to a reference answer and give it a score between 0 and 1.
A higher score means the system is closer to the reference text and usually more accurate.

No.  Workflow                      Accuracy (mean score)  Latency (median, s)
0    GPT4oAudio                    0.77                   7.33
1    GPTRealtime                   0.73                   6.64
2    GPT5.1                        0.92                   5.69
3    GPT4.1                        0.90                   5.67
4    Gemini 3 Pro                  0.96                   9.88
5    Gemini 3 Flash                0.92                   7.58
6    Sarvam.AI                     0.55                   5.98
7    Omnilingual+GPT5-mini         0.96                   9.00
8    Omnilingual+Gemini 3 Pro      0.93                   10.14
9    Omnilingual+Gemini 3 Flash    0.91                   7.60
10   MMS+GoogMT+GPT4.1             0.91                   4.94

On this page you can:

  • See which Hindi system or pipeline gets the best score
  • Compare different Hindi ASR and Hindi→English models side by side
  • Choose the best system for your app, call center, research, or product
  • Download all results for deeper analysis and custom reporting
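When accuracy and latency both matter, a side-by-side comparison can be narrowed to the workflows that no other workflow beats on both measures at once (a Pareto front). The sketch below applies that idea to the Hindi numbers on this page. The function name `pareto_front` is hypothetical, and treating latency as seconds is an assumption carried over from the other tables.

```python
# (workflow, mean accuracy, median latency) from the Hindi table
HINDI_RESULTS = [
    ("GPT4oAudio", 0.77, 7.33),
    ("GPTRealtime", 0.73, 6.64),
    ("GPT5.1", 0.92, 5.69),
    ("GPT4.1", 0.90, 5.67),
    ("Gemini 3 Pro", 0.96, 9.88),
    ("Gemini 3 Flash", 0.92, 7.58),
    ("Sarvam.AI", 0.55, 5.98),
    ("Omnilingual+GPT5-mini", 0.96, 9.00),
    ("Omnilingual+Gemini 3 Pro", 0.93, 10.14),
    ("Omnilingual+Gemini 3 Flash", 0.91, 7.60),
    ("MMS+GoogMT+GPT4.1", 0.91, 4.94),
]


def pareto_front(results):
    """Keep workflows that no other workflow matches or beats on both measures."""
    front = []
    for name, acc, lat in results:
        dominated = any(
            o_acc >= acc and o_lat <= lat and (o_acc > acc or o_lat < lat)
            for _, o_acc, o_lat in results
        )
        if not dominated:
            front.append((name, acc, lat))
    # Fastest first, so the list reads from "quick" to "most accurate".
    return sorted(front, key=lambda r: r[2])


front = pareto_front(HINDI_RESULTS)
```

On these numbers the front has three entries: MMS+GoogMT+GPT4.1, GPT5.1, and Omnilingual+GPT5-mini. Every other workflow is slower than one of these without being more accurate.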

(25 questions, updated December 2025)

This page shows a test of many Kikuyu speech‑to‑text and Kikuyu→English systems.

We use the same Kikuyu audio for every system. Then we compare each system’s text to a reference answer and give it a score between 0 and 1.
A higher score means the system is closer to the reference text and usually more accurate.

Ranking Table

#    Workflow                            Accuracy (Mean)  Median Latency (s)
0    GPT‑Realtime                        0.05             5.06
1    SunbirdV2 + Goog MT + Gem3Pro       0.78             12.38
2    SunbirdV2 + GPT5.1                  0.57             4.85
3    SunbirdV2 + Gem3Pro                 0.83             11.96
4    Omni + Gem3pro                      0.78             13.84
5    Omni + Goog MT + Gem3Pro            0.74             12.75
6    Omni + GPT5.1                       0.18             9.73
7    Gemini 3 Pro                        0.81             12.02
8    Gemini 3 Flash                      0.38             7.80
9    SunbirdV2 + Gem3Flash               0.75             7.23
10   Meta MMS + GPT4.1 + GhanaNLP MT     0.56             3.69

You can use this page to:

  • See which system gets the best score
  • Compare different models and pipelines side by side
  • Choose the best system for your app, research, or product
  • Download all results for deeper analysis
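The ranking table reports a mean accuracy and a median latency per workflow. If the downloaded results contain one row per audio clip, those two summary numbers could be recomputed along the lines below. The column names `workflow`, `score`, and `latency_s`, the function name `summarize`, and the sample rows are all assumptions; the real export format is not described on this page.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median


def summarize(csv_text: str) -> dict:
    """Return {workflow: (mean score, median latency)} from per-clip rows."""
    scores = defaultdict(list)
    latencies = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row["workflow"]].append(float(row["score"]))
        latencies[row["workflow"]].append(float(row["latency_s"]))
    return {
        name: (round(mean(scores[name]), 2), round(median(latencies[name]), 2))
        for name in scores
    }


# Made-up per-clip rows, only to show the shape of the calculation.
SAMPLE = """workflow,score,latency_s
system_a,1.0,4.0
system_a,0.5,6.0
system_a,0.0,5.0
system_b,1.0,9.0
system_b,1.0,11.0
"""
summary = summarize(SAMPLE)
```

Using the median for latency keeps one unusually slow request from distorting the summary, which is a plausible reason the table reports it instead of the mean.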

(Updated Jan 2025)
This page shows a test of many English speech‑to‑text systems.
We use the same English audio for every system. Then we compare each system’s text to a reference answer and give it a score between 0 and 1.
A higher score means the system is closer to the reference text and usually more accurate.

Ranking

Workflow          Accuracy (mean score)  Latency (median, s)
GPT-4o Audio      0.93                   7.75
GPT-Realtime      0.93                   7.70
GPT‑5.2           0.91                   5.99
Gemini 3 Pro      0.86                   10.16
Llama 4           0.91                   5.77
DeepSeek 3.2      0.88                   6.17
Gemini 3 Flash    0.89                   7.77
GPT‑4.1           0.92                   6.02

With this benchmark, you can:

  • See which system gets the best score
  • Compare different models and pipelines side by side
  • Choose the best system for your app, research, or product
  • Download all results for deeper analysis

(25 questions, updated 10 Jan 2025)
This page shows a test of many Kinyarwanda speech‑to‑text systems.
For each system, we play the same Kinyarwanda audio and capture its text output.
We then compare that text to a trusted reference answer and give a score between 0 and 1.

A higher score means the system output is closer to the reference text and usually more accurate.

No.  Workflow                      Accuracy (mean)  Latency (median, s)
0    GPT‑Realtime                  0.04             5.58
1    Mbaza+GPT‑5.1                 0.92             3.46
2    Mbaza+Gemini 3 Pro            0.95             8.98
3    Sunbird+GPT‑5.1               0.63             4.28
4    Mbaza+GPT‑5.1+GMT             0.87             3.25
5    Omnilingual+GPT‑5.1           0.84             4.61
6    Omnilingual+Gemini 3 Pro      0.95             9.97
7    Gemini 3 Pro                  0.90             9.54
8    Omnilingual+Gemini 3 Flash    0.92             6.06
9    Mbaza+Gemini 3 Flash          0.91             5.04

You can use these scores to:

  • See which system gets the best score
  • Compare different models and pipelines side by side
  • Choose the best system for your app, research, or product
  • Download all results for deeper analysis
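One way to compare pipelines side by side is to group them by their first component and keep the best pipeline in each group. The sketch below does this for the Kinyarwanda numbers on this page. It assumes that the text before the first "+" names the speech-to-text front end and that single-model rows form their own group; the page does not state this naming rule. The function name `best_per_front_end` is hypothetical.

```python
# (workflow, mean accuracy, median latency) from the Kinyarwanda table.
# Names use ASCII hyphens and the spelling "Mbaza" throughout.
KINYARWANDA_RESULTS = [
    ("GPT-Realtime", 0.04, 5.58),
    ("Mbaza+GPT-5.1", 0.92, 3.46),
    ("Mbaza+Gemini 3 Pro", 0.95, 8.98),
    ("Sunbird+GPT-5.1", 0.63, 4.28),
    ("Mbaza+GPT-5.1+GMT", 0.87, 3.25),
    ("Omnilingual+GPT-5.1", 0.84, 4.61),
    ("Omnilingual+Gemini 3 Pro", 0.95, 9.97),
    ("Gemini 3 Pro", 0.90, 9.54),
    ("Omnilingual+Gemini 3 Flash", 0.92, 6.06),
    ("Mbaza+Gemini 3 Flash", 0.91, 5.04),
]


def best_per_front_end(results):
    """Map each first pipeline component to its highest-accuracy workflow."""
    best = {}
    for name, acc, lat in results:
        front_end = name.split("+")[0].strip()
        current = best.get(front_end)
        if current is None or acc > current[1]:
            best[front_end] = (name, acc, lat)
    return best


best = best_per_front_end(KINYARWANDA_RESULTS)
```

On these numbers, the Mbaza and Omnilingual groups both peak at 0.95 when paired with Gemini 3 Pro, while the only Sunbird pipeline scores 0.63.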
