Testing Costs & Transparency
We believe in radical transparency. Here's exactly how much it cost to test every prompt on this site with real AI models.
Model Performance & Value Analysis
Based on 722 successful outputs across 80 prompts
Claude Sonnet 4.5
Highest quality scores, mid-range pricing, most consistent outputs. Best overall choice.
ChatGPT (GPT-5)
50% more expensive than Claude for slightly lower quality. Popular but not best value.
Gemini 2.5 Flash
46x cheaper than GPT-5! Lower quality but incredible value for quick drafts.
Total Cost by AI Model
Cost by Category
Testing Methodology
Every prompt on Nonprofit.ai undergoes rigorous testing before publication:
- Scenario Generation: AI creates 3 realistic nonprofit contexts (small community org, mid-size professional org, large established org)
- Multi-Model Execution: Each scenario is tested with ChatGPT (GPT-5), Claude (Sonnet 4.5), and Gemini 2.5 Flash
- AI Evaluation: Outputs are scored for tone, completeness, usefulness, accuracy, and authenticity
- Quality Threshold: We aim for 8.0+ average scores across all outputs
This means each prompt generates 9 AI outputs (3 scenarios × 3 models), all evaluated for quality. Total: 729 AI API calls with 722 successful outputs (99.0% success rate).
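To make the pipeline concrete, here is a minimal Python sketch of the 3 scenarios × 3 models flow. The function names and return values are illustrative assumptions, not our production code; the real model calls and the AI evaluation step are replaced with placeholder stubs.

```python
from statistics import mean

MODELS = ["gpt-5", "claude-sonnet-4.5", "gemini-2.5-flash"]
CRITERIA = ["tone", "completeness", "usefulness", "accuracy", "authenticity"]

def generate_scenarios(prompt: str) -> list[dict]:
    # Placeholder: in practice an AI model drafts three realistic nonprofit contexts.
    return [{"org": org, "prompt": prompt}
            for org in ("small community org", "mid-size professional org", "large established org")]

def run_model(model: str, scenario: dict) -> str:
    # Placeholder for a real API call to the given model.
    return f"[{model} output for a {scenario['org']}]"

def evaluate_output(output: str) -> dict[str, float]:
    # Placeholder for the AI evaluation step: one score per criterion.
    return {criterion: 8.0 for criterion in CRITERIA}

def test_prompt(prompt: str) -> float:
    """Run 3 scenarios x 3 models = 9 outputs and return the average quality score."""
    scores = []
    for scenario in generate_scenarios(prompt):
        for model in MODELS:
            output = run_model(model, scenario)
            evaluation = evaluate_output(output)
            scores.append(mean(evaluation.values()))
    return round(mean(scores), 1)  # shown as the "Tested X/10" badge; we aim for 8.0+

print(test_prompt("Draft a donor thank-you letter"))
```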
Most Expensive to Test
Most Cost-Efficient
Why We Share This
Most AI prompt libraries don't actually test their prompts. They write them, maybe try them once, and publish. We spent $26.30 in API costs to validate every single prompt with multiple AI models and realistic scenarios.
This transparency shows our commitment to quality. When you see "Tested 8.8/10" on a prompt page, that score comes from real AI evaluations of real outputs - not marketing claims.
All test results, evaluation scores, and example outputs are available on each prompt page. You can see exactly what ChatGPT, Claude, and Gemini generated for each scenario.
Quality Score Distribution by Model
How often each model produces outputs in different quality ranges
[Score distribution histograms: Claude Sonnet 4.5, ChatGPT (GPT-5), Gemini 2.5 Flash]
Key Insight: Claude produces the largest share of 9-10 scores (31.8%, versus 20.8% for GPT-5 and 1.6% for Gemini), indicating that it consistently delivers excellent outputs. GPT-5 is good but rarely exceptional, while Gemini is reliable but less likely to produce outstanding results.
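As an illustration of how these distributions are tallied, here is a small sketch that bins per-output scores into quality ranges. The score list is an invented placeholder, not our test data.

```python
from collections import Counter

def quality_bin(score: float) -> str:
    # Map a 0-10 quality score to a coarse range.
    if score >= 9.0:
        return "9-10"
    if score >= 8.0:
        return "8-8.9"
    if score >= 7.0:
        return "7-7.9"
    return "below 7"

# Invented placeholder scores for one model (not real data).
scores = [9.1, 8.4, 8.8, 7.6, 9.3, 8.1, 6.9, 8.7]

counts = Counter(quality_bin(s) for s in scores)
for label in ("9-10", "8-8.9", "7-7.9", "below 7"):
    share = counts.get(label, 0) / len(scores)
    print(f"{label}: {share:.0%} of outputs")
```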
How to Read Test Results on Prompt Pages
Quality Badge
The "Tested X/10" badge shows the average quality score across all 9 outputs. Scores of 8.0+ indicate high-quality, reliable prompts.
Example Outputs
Each prompt page shows real outputs from 3 scenarios. Click between GPT-5, Claude, and Gemini tabs to compare how different models respond.
Evaluation Details
Expand "AI Evaluation Details" to see scores for tone, completeness, usefulness, accuracy, plus strengths and weaknesses of each output.
Dual-Evaluator Experiment
To address concerns about evaluator bias (Claude evaluating its own outputs), we conducted an experiment: We re-evaluated a random sample of 45 outputs using Gemini 2.5 Flash as a second evaluator.
Results
- Correlation: 0.149 - very low agreement between Claude and Gemini evaluations (see the sketch after this list)
- Gemini gave 8.5 to 87% of outputs (39/45) - regardless of actual quality
- Claude showed wide score distribution (3.2 to 9.2) - proper discrimination
- Gemini failed to discriminate - anchored on 8.5 as safe middle score
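The agreement check in the first bullet can be reproduced with a few lines of Python. The paired scores below are invented placeholders rather than our actual sample, so the printed numbers will differ from those above; the point is the shape of the computation.

```python
from statistics import correlation  # Pearson correlation coefficient (Python 3.10+)

# Invented placeholder scores for the same outputs from both evaluators (not real data).
claude_scores = [3.2, 6.4, 7.1, 8.8, 9.2, 5.5, 8.1, 7.7, 9.0, 6.9]
gemini_scores = [8.5, 8.5, 8.5, 7.0, 8.5, 8.5, 8.5, 8.5, 8.5, 8.5]

# Low agreement between evaluators shows up as a correlation near zero.
r = correlation(claude_scores, gemini_scores)

# An evaluator anchored on a "safe" score barely discriminates between outputs.
anchored_share = sum(score == 8.5 for score in gemini_scores) / len(gemini_scores)

print(f"Pearson correlation: {r:.3f}")
print(f"Share of Gemini scores at exactly 8.5: {anchored_share:.0%}")
```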
Conclusion
Gemini proved to be an unreliable evaluator, giving nearly identical scores to outputs of varying quality. Claude's evaluations show proper score distribution (see histograms above), indicating it's actually judging quality differences rather than defaulting to safe scores.
Decision: We continue using Claude as the single evaluator. While this introduces potential bias toward Claude's own outputs, the bias is minimal (0.19 points), and dual evaluation with a poor evaluator would hurt the methodology rather than help it.
This experiment demonstrates our commitment to methodological rigor - we question our own processes and validate our assumptions with data.