Benchmark Dashboard
Evaluation pipeline results for ReasonAI retrieval and generation.
Model Configuration
Retrieval Judges
Score article relevance (1-100) per prompt. Consensus = median. k=16.
| Name | Provider | Model | Concurrency |
|---|---|---|---|
| haiku-4.5 | anthropic | claude-haiku-4-5-20251001 |
3 |
| gemini-3.1-lite | gemini-3.1-flash-lite-preview |
5 | |
| gpt-5-nano | openai | gpt-5-nano |
10 |
E2E Judges
Score responses on 5 dimensions (factuality, topicality, clarity, helpfulness, examples). Max 15, pass ≥ 12.
| Name | Provider | Model | Concurrency |
|---|---|---|---|
| claude-sonnet | anthropic | claude-sonnet-4-6 |
3 |
| gemini-pro | gemini-2.5-pro |
3 | |
| gpt-4o | openai | gpt-4o |
3 |
Embedding Models
4 embedders × 3 strategies = 12 configs.
Embedders
nomic-768gemini-1536openai-1536qwen3-4096
Strategies
raw
enriched
summary