Benchmark Dashboard

Evaluation pipeline results for ReasonAI retrieval and generation.

Prompts: 130
Topics: 75

Model Configuration

Retrieval Judges

Judges score article relevance (1-100) per prompt. Consensus = median of the judges' scores. k=16.
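As a minimal sketch of the median consensus described above: the judge names come from the table in this section, but the scores and the `consensus` helper are hypothetical and only illustrate the aggregation step, not the pipeline's actual code.

```python
from statistics import median

# Hypothetical relevance scores (1-100) from each retrieval judge
# for a single prompt/article pair. The values are made up.
scores = {
    "haiku-4.5": 82,
    "gemini-3.1-lite": 74,
    "gpt-5-nano": 90,
}

def consensus(judge_scores):
    """Return the median of the per-judge relevance scores."""
    return median(judge_scores.values())

print(consensus(scores))  # 82
```

With three judges the median is simply the middle score, so a single outlier judge cannot move the consensus on its own.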

Name             Provider   Model                          Concurrency
haiku-4.5        anthropic  claude-haiku-4-5-20251001      3
gemini-3.1-lite  google     gemini-3.1-flash-lite-preview  5
gpt-5-nano       openai     gpt-5-nano                     10

E2E Judges

Score responses on 5 dimensions (factuality, topicality, clarity, helpfulness, examples). Max 15, pass ≥ 12.
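The pass rule above can be sketched as follows. The per-dimension range of 0-3 is an assumption inferred from 5 dimensions and a maximum of 15; the function names and the sample scores are hypothetical.

```python
DIMENSIONS = ("factuality", "topicality", "clarity", "helpfulness", "examples")
MAX_PER_DIMENSION = 3   # assumption: 15 max / 5 dimensions
PASS_THRESHOLD = 12     # from the dashboard: pass >= 12

def total_score(scores):
    """Sum the five dimension scores after validating them."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dim in DIMENSIONS:
        if not 0 <= scores[dim] <= MAX_PER_DIMENSION:
            raise ValueError(f"{dim} out of range: {scores[dim]}")
    return sum(scores[dim] for dim in DIMENSIONS)

def passes(scores):
    """A response passes when its total reaches the threshold."""
    return total_score(scores) >= PASS_THRESHOLD

# Hypothetical scores from one judge for one response.
example = {"factuality": 3, "topicality": 3, "clarity": 2,
           "helpfulness": 2, "examples": 2}
print(total_score(example), passes(example))  # 12 True
```

Under this assumed range, a response can lose at most three points in total and still pass.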

Name           Provider   Model              Concurrency
claude-sonnet  anthropic  claude-sonnet-4-6  3
gemini-pro     google     gemini-2.5-pro     3
gpt-4o         openai     gpt-4o             3

Embedding Models

4 embedders × 3 strategies = 12 configs.

Embedders: nomic-768, gemini-1536, openai-1536, qwen3-4096
Strategies: raw, enriched, summary
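The 12 configs are the cross product of the embedders and strategies listed above. A short sketch of enumerating them, where the `embedder/strategy` label format is a hypothetical naming choice rather than the pipeline's own:

```python
from itertools import product

EMBEDDERS = ["nomic-768", "gemini-1536", "openai-1536", "qwen3-4096"]
STRATEGIES = ["raw", "enriched", "summary"]

# Pair every embedder with every strategy: 4 x 3 = 12 configs.
# The "embedder/strategy" label is an illustrative format only.
configs = [f"{embedder}/{strategy}"
           for embedder, strategy in product(EMBEDDERS, STRATEGIES)]

print(len(configs))  # 12
print(configs[0])    # nomic-768/raw
```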
