Testing Costs & Transparency

We believe in radical transparency. Here's exactly how much it cost to test every prompt on this site with real AI models.

Total Investment: $26.30 in AI API costs
Prompts Tested: 80 (100% coverage)
AI Outputs: 722 successful (729 total generated)
Avg Per Prompt: $0.32 (3 scenarios × 3 models)

Model Performance & Value Analysis

Based on 722 successful outputs across 80 prompts

Note on Evaluator Bias: All outputs are evaluated by Claude Sonnet 4.5. This may introduce slight bias toward Claude's own outputs. However, the difference is minimal (Claude: 8.42/10 vs GPT-5: 8.23/10 = 0.19 points). We tested dual-evaluation with Gemini - see results below.
🏆 BEST VALUE: Claude Sonnet 4.5

Avg Quality: 8.42/10
Avg Cost: $0.044
Cost per Point: $0.0052
Consistency: High (±1.56)

Highest quality scores, mid-range pricing, most consistent outputs. Best overall choice.

HIGHEST COST: ChatGPT (GPT-5)

Avg Quality: 8.23/10
Avg Cost: $0.064
Cost per Point: $0.0078
Consistency: Medium (±1.66)

Costs about 50% more per quality point than Claude, with slightly lower average quality. Popular, but not the best value.

💰 BEST BUDGET: Gemini 2.5 Flash

Avg Quality: 7.71/10
Avg Cost: $0.0014
Cost per Point: $0.0002
Consistency: Medium (±1.64)

46x cheaper than GPT-5! Lower quality but incredible value for quick drafts.
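
The cost-per-point figures above are just average cost divided by average quality score; here is a minimal sketch of that arithmetic in Python, using the rounded averages from the cards above:

```python
# Cost per quality point = average cost per output / average quality score.
# Numbers are the rounded averages reported above.
models = {
    "Claude Sonnet 4.5": {"avg_quality": 8.42, "avg_cost": 0.044},
    "ChatGPT (GPT-5)": {"avg_quality": 8.23, "avg_cost": 0.064},
    "Gemini 2.5 Flash": {"avg_quality": 7.71, "avg_cost": 0.0014},
}

for name, m in models.items():
    print(f"{name}: ${m['avg_cost'] / m['avg_quality']:.4f} per quality point")
# Claude Sonnet 4.5: $0.0052 per quality point
# ChatGPT (GPT-5): $0.0078 per quality point
# Gemini 2.5 Flash: $0.0002 per quality point
```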

Total Cost by AI Model

ChatGPT (GPT-5): $15.46 (58.8%) - most expensive per output; quality scores slightly below Claude's
Claude (Sonnet 4.5): $10.49 (39.9%) - mid-range cost, excellent at following instructions
Gemini 2.5 Flash: $0.35 (1.3%) - lowest cost, fast responses, good quality

Cost by Category

Fundraising: $6.88 (26.2%)
Communications: $5.80 (22.1%)
Events: $4.83 (18.4%)
Programs: $3.62 (13.8%)
Board: $2.93 (11.1%)
Operations: $2.24 (8.5%)

Testing Methodology

Every prompt on Nonprofit.ai undergoes rigorous testing before publication:

  1. Scenario Generation: AI creates 3 realistic nonprofit contexts (small community org, mid-size professional org, large established org)
  2. Multi-Model Execution: Each scenario is tested with ChatGPT (GPT-5), Claude (Sonnet 4.5), and Gemini 2.5 Flash
  3. AI Evaluation: Outputs are scored for tone, completeness, usefulness, accuracy, and authenticity
  4. Quality Threshold: We aim for 8.0+ average scores across all outputs

This means each prompt generates 9 AI outputs (3 scenarios × 3 models), all evaluated for quality. Total: 729 AI API calls with 722 successful outputs (99.0% success rate).
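
To make that loop concrete, here is a minimal sketch of the pipeline in Python. The stub functions are hypothetical placeholders standing in for the real API calls and evaluation prompts, not the site's actual test harness.

```python
import random

MODELS = ["gpt-5", "claude-sonnet-4.5", "gemini-2.5-flash"]
QUALITY_THRESHOLD = 8.0  # target average score per prompt


def generate_scenarios(prompt: str, n: int = 3) -> list[str]:
    # Placeholder: the real step asks a model for realistic nonprofit contexts
    # (small community org, mid-size professional org, large established org).
    return [f"{prompt} | scenario {i + 1}" for i in range(n)]


def run_model(model: str, prompt: str, scenario: str) -> str | None:
    # Placeholder: the real step calls the model's API; None models a failed call.
    return f"[{model}] draft for: {scenario}"


def evaluate_output(output: str) -> float:
    # Placeholder: the real step scores tone, completeness, usefulness,
    # accuracy, and authenticity with an AI evaluator.
    return round(random.uniform(6.0, 9.5), 1)


def test_prompt(prompt: str) -> dict:
    scores = []
    for scenario in generate_scenarios(prompt, n=3):   # 3 scenarios
        for model in MODELS:                           # x 3 models = 9 outputs
            output = run_model(model, prompt, scenario)
            if output is None:                         # skip failed API calls
                continue
            scores.append(evaluate_output(output))
    avg = sum(scores) / len(scores)
    return {"avg_score": round(avg, 2), "passes": avg >= QUALITY_THRESHOLD}


print(test_prompt("thank-you-in-kind-donation"))
```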

Most Expensive to Test

Long-form prompts require more AI tokens:
event-planning-timeline $1.20
event-run-of-show $0.89
event-concept-brainstorm $0.83

Most Cost-Efficient

Shorter, focused prompts cost less:
thank-you-in-kind-donation $0.06
impact-stat-visualization $0.08
event-invitation-copy $0.08

Why We Share This

Most AI prompt libraries don't actually test their prompts. They write them, maybe try them once, and publish. We spent $26.30 in API costs to validate every single prompt with multiple AI models and realistic scenarios.

This transparency shows our commitment to quality. When you see "Tested 8.8/10" on a prompt page, that score comes from real AI evaluations of real outputs - not marketing claims.

All test results, evaluation scores, and example outputs are available on each prompt page. You can see exactly what ChatGPT, Claude, and Gemini generated for each scenario.

Quality Score Distribution by Model

How often each model produces outputs in different quality ranges

Claude Sonnet 4.5

9-10 (Excellent) 31.8%
8-8.9 (Good) 56.1%
7-7.9 (Fair) 5.0%
Below 7 (Poor) 7.1%

ChatGPT (GPT-5)

9-10 (Excellent) 20.8%
8-8.9 (Good) 62.5%
7-7.9 (Fair) 8.3%
Below 7 (Poor) 8.4%

Gemini 2.5 Flash

9-10 (Excellent) 1.6%
8-8.9 (Good) 58.0%
7-7.9 (Fair) 25.9%
Below 7 (Poor) 14.4%

Key Insight: Claude produces the most 9-10 scores (31.8%, versus 20.8% for GPT-5 and 1.6% for Gemini), indicating it most consistently delivers excellent outputs. GPT-5 is reliably good but reaches the excellent range less often, while Gemini clusters in the good-to-fair range and rarely produces outstanding results.
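
The ranges in these histograms come from bucketing each output's evaluation score; here is a minimal sketch of that grouping, using illustrative scores rather than the real evaluation data:

```python
from collections import Counter

def bucket(score: float) -> str:
    # Same ranges as the histograms above.
    if score >= 9.0:
        return "9-10 (Excellent)"
    if score >= 8.0:
        return "8-8.9 (Good)"
    if score >= 7.0:
        return "7-7.9 (Fair)"
    return "Below 7 (Poor)"

# Illustrative scores only, not real evaluation data.
scores = [9.2, 8.7, 8.4, 7.8, 6.5, 8.9, 9.0, 8.1]
counts = Counter(bucket(s) for s in scores)
for label, n in counts.most_common():
    print(f"{label}: {100 * n / len(scores):.1f}%")
```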

How to Read Test Results on Prompt Pages

🎯 Quality Badge

The "Tested X/10" badge shows the average quality score across all 9 outputs. Scores of 8.0+ indicate high-quality, reliable prompts.

📊 Example Outputs

Each prompt page shows real outputs from 3 scenarios. Click between GPT-5, Claude, and Gemini tabs to compare how different models respond.

Evaluation Details

Expand "AI Evaluation Details" to see scores for tone, completeness, usefulness, accuracy, plus strengths and weaknesses of each output.

Dual-Evaluator Experiment

To address concerns about evaluator bias (Claude evaluating its own outputs), we ran an experiment: we re-evaluated a random sample of 45 outputs using Gemini 2.5 Flash as a second evaluator.

Results

  • Correlation: 0.149 - Very low agreement between Claude and Gemini evaluations (a calculation sketch follows after this list)
  • Gemini gave 8.5 to 87% of outputs (39/45) - regardless of actual quality
  • Claude showed wide score distribution (3.2 to 9.2) - proper discrimination
  • Gemini failed to discriminate - anchored on 8.5 as safe middle score
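
The correlation figure is a standard Pearson correlation over the 45 paired scores. A minimal sketch of that calculation, with illustrative numbers rather than the actual experiment data:

```python
from statistics import correlation  # Pearson's r, Python 3.10+

# Illustrative paired scores only, not the actual 45-output sample.
claude_scores = [3.2, 6.8, 7.5, 8.1, 8.4, 8.8, 9.0, 9.2]
gemini_scores = [8.5, 8.5, 8.0, 8.5, 8.5, 8.5, 9.0, 8.5]  # anchored near 8.5

r = correlation(claude_scores, gemini_scores)
print(f"Pearson correlation: {r:.3f}")  # low r = evaluators rank outputs differently
```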

Conclusion

Gemini proved to be an unreliable evaluator, giving nearly identical scores to outputs of varying quality. Claude's evaluations show proper score distribution (see histograms above), indicating it's actually judging quality differences rather than defaulting to safe scores.

Decision: We continue using Claude as the single evaluator. While this introduces potential bias toward Claude's outputs, the measured difference is small (0.19 points), and dual evaluation with an unreliable second evaluator would weaken the methodology rather than strengthen it.

This experiment demonstrates our commitment to methodological rigor - we question our own processes and validate our assumptions with data.