CrucibleBench


Statistical Testing Framework for AI Research

A comprehensive statistical testing framework designed specifically for AI/ML research in Elixir. CrucibleBench provides rigorous statistical tests, effect size measures, power analysis, and publication-ready reporting.

Features

  • Parametric Tests: t-tests (independent, paired), ANOVA
  • Non-Parametric Tests: Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis
  • Effect Sizes: Cohen's d, Hedges' g, Glass's delta, eta-squared, omega-squared
  • Power Analysis: A priori and post-hoc power calculations
  • Confidence Intervals: Bootstrap and analytical methods
  • Experiment DSL: High-level API for A/B tests, ablation studies, hyperparameter sweeps
  • Export Formats: Markdown, LaTeX, HTML for publication

Design Principles

  1. Statistical Rigor: All implementations follow established statistical methods
  2. Interpretability: Every result includes effect sizes and practical significance
  3. Reproducibility: Complete audit trails for research reproducibility
  4. Peer-Review Ready: Publication-quality output suitable for academic papers

Installation

Add crucible_bench to your list of dependencies in mix.exs:

def deps do
  [
    {:crucible_bench, "~> 0.4.0"}
  ]
end

Or install from GitHub:

def deps do
  [
    {:crucible_bench, github: "North-Shore-AI/crucible_bench"}
  ]
end

Stage Contract

CrucibleBench.Stage implements the Crucible.Stage behaviour from crucible_framework.

Options

  • :tests - Statistical tests to run (default: [:ttest])
  • :alpha - Significance level (default: 0.05)
  • :confidence_level - Confidence level (default: 0.95)
  • :bootstrap_iterations - Bootstrap iterations (default: 1000)
  • :data_source - Data source (:outputs, :metrics, or {:custom, fun} where fun extracts samples from the context)

Schema Introspection

# Get stage schema
schema = CrucibleBench.Stage.describe(%{})
# => %{
#   __schema_version__: "1.0.0",
#   name: :bench,
#   description: "Statistical benchmarking and hypothesis testing",
#   required: [],
#   optional: [:tests, :alpha, :confidence_level, :bootstrap_iterations, :data_source],
#   types: %{...},
#   defaults: %{tests: [:ttest], alpha: 0.05, ...},
#   __extensions__: %{bench: %{...}}
# }

Pipeline Integration

CrucibleBench v0.4.0+ provides CrucibleBench.Stage for seamless integration with crucible_framework pipelines:

# In your pipeline configuration
context = %{
  experiment: %{
    reliability: %{
      stats: %CrucibleIR.Reliability.Stats{
        tests: [:ttest, :bootstrap],
        alpha: 0.05,
        confidence_level: 0.95,
        bootstrap_iterations: 2000
      }
    }
  },
  outputs: [0.85, 0.87, 0.84, 0.86, 0.88]
}

# Run statistical analysis
{:ok, updated_context} = CrucibleBench.Stage.run(context)

# Access results
updated_context.bench.tests
# => %{
#   ttest: %{test_type: :ttest, ...},
#   bootstrap: %{test_type: :bootstrap, confidence_interval: {0.84, 0.88}, ...}
# }

updated_context.bench.summary
# => %{n: 5, mean: 0.86, sd: 0.0141, median: 0.86}

Advanced Stage Configuration

The Stage supports multiple data layouts for different test types:

# Two-group comparison (t-test, Mann-Whitney)
context = %{
  experiment: %{reliability: %{stats: stats_config}},
  control: [0.72, 0.68, 0.75, 0.71, 0.69],
  treatment: [0.78, 0.73, 0.81, 0.76, 0.74]
}

{:ok, ctx} = CrucibleBench.Stage.run(context)
ctx.bench.tests.ttest
# => %{
#   test_type: :ttest,
#   statistic: -3.42,
#   p_value: 0.0089,
#   significant: true,
#   effect_size: %{cohens_d: -2.16, interpretation: "large"},
#   confidence_interval: {-0.095, -0.019}
# }

# Multi-group comparison (ANOVA, Kruskal-Wallis)
context = %{
  experiment: %{
    reliability: %{
      stats: %CrucibleIR.Reliability.Stats{
        tests: [:anova],
        alpha: 0.05
      }
    }
  },
  groups: [
    [0.89, 0.91, 0.88, 0.90, 0.92],  # Model A
    [0.87, 0.89, 0.86, 0.88, 0.90],  # Model B
    [0.84, 0.86, 0.83, 0.85, 0.87]   # Model C
  ]
}

{:ok, ctx} = CrucibleBench.Stage.run(context)
ctx.bench.tests.anova.effect_size.eta_squared
# => 0.72 (large effect)

# Paired comparison (paired t-test, Wilcoxon)
context = %{
  experiment: %{reliability: %{stats: stats_config}},
  before: [0.72, 0.68, 0.75, 0.71, 0.69],
  after: [0.78, 0.73, 0.81, 0.76, 0.74]
}

{:ok, ctx} = CrucibleBench.Stage.run(context)
# Automatically uses paired t-test

Metrics Merging

The Stage automatically merges statistical results into context.metrics:

{:ok, ctx} = CrucibleBench.Stage.run(context)

ctx.metrics.bench_n           # Sample size
ctx.metrics.bench_mean        # Mean value
ctx.metrics.bench_sd          # Standard deviation
ctx.metrics.bench_median      # Median value
ctx.metrics.bench_ttest_p_value  # P-value from t-test (if run)

This enables downstream pipeline stages to access statistical summaries directly.
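For instance, a downstream stage can gate the pipeline on the merged metrics. A minimal sketch (the QualityGate module and its thresholds are hypothetical, not part of the framework):

defmodule MyPipeline.QualityGate do
  # Hypothetical downstream stage: reads the merged bench metrics and gates on them
  def run(ctx) do
    if ctx.metrics.bench_mean >= 0.8 and ctx.metrics.bench_n >= 5 do
      {:ok, ctx}
    else
      {:error, "bench metrics below quality threshold"}
    end
  end
end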

Inspect-AI Eval Logs

CrucibleBench can adapt EvalEx results into inspect-ai-style eval logs for downstream analysis:

metrics = [
  %{accuracy: 1.0},
  %{accuracy: 0.0},
  %{accuracy: 1.0}
]

result = EvalEx.Result.new("inspect_evals/gsm8k", :testset, metrics, 3, 120)

log = CrucibleBench.EvalLog.from_eval_result(result, scorer_name: "llm_judge")

scores = CrucibleBench.EvalLog.Extract.eval_log_scores_dict(log)
stderr = CrucibleBench.EvalLog.Extract.eval_log_headline_stderr(log)

Using IR Configuration

You can also pass CrucibleIR.Reliability.Stats directly to comparison functions:

config = %CrucibleIR.Reliability.Stats{
  alpha: 0.01,
  confidence_level: 0.99,
  tests: [:ttest]
}

control = [0.72, 0.68, 0.75, 0.71, 0.69]
treatment = [0.78, 0.73, 0.81, 0.76, 0.74]

result = CrucibleBench.compare(control, treatment, config)
# Uses alpha=0.01 and 99% confidence interval

Quick Start

Compare Two Groups

# Compare control vs treatment groups
control = [0.72, 0.68, 0.75, 0.71, 0.69]
treatment = [0.78, 0.73, 0.81, 0.76, 0.74]

result = CrucibleBench.compare(control, treatment)
# => %CrucibleBench.Result{
#   test: :welch_t_test,
#   p_value: 0.0024,
#   effect_size: %{cohens_d: 1.25, interpretation: "large"},
#   confidence_interval: {0.02, 0.14}
# }

Paired Comparison

# Before/after measurements
before = [0.72, 0.68, 0.75, 0.71, 0.69]
after = [0.78, 0.73, 0.81, 0.76, 0.74]

result = CrucibleBench.compare_paired(before, after)

Compare Multiple Groups

# Compare 3+ groups with ANOVA
gpt4 = [0.89, 0.91, 0.88, 0.90, 0.92]
claude = [0.87, 0.89, 0.86, 0.88, 0.90]
gemini = [0.84, 0.86, 0.83, 0.85, 0.87]

result = CrucibleBench.compare_multiple([gpt4, claude, gemini])

Effect Size Analysis

# Calculate Cohen's d
effect = CrucibleBench.effect_size(control, treatment)
# => %{
#   cohens_d: 1.25,
#   interpretation: "large",
#   mean1: 0.71,
#   mean2: 0.764
# }

Confidence Intervals

# Calculate 95% CI for mean
data = [0.85, 0.87, 0.84, 0.86, 0.88]
ci = CrucibleBench.confidence_interval(data, :mean)
# => %{interval: {0.8432, 0.8768}, method: :analytical}

# Bootstrap CI for median
ci = CrucibleBench.confidence_interval(data, :median, method: :bootstrap)

Power Analysis

# A priori: Calculate required sample size
result = CrucibleBench.power_analysis(:t_test,
  analysis_type: :a_priori,
  effect_size: 0.5,    # Medium effect
  alpha: 0.05,
  power: 0.80          # 80% power
)
# => %{n_per_group: 64, recommendation: "Collect at least 64 samples per group..."}

# Post-hoc: Calculate achieved power
result = CrucibleBench.power_analysis(:t_test,
  analysis_type: :post_hoc,
  effect_size: 0.5,
  n_per_group: 30,
  alpha: 0.05
)
# => %{power: 0.548, recommendation: "Marginal power..."}

High-Level Experiment DSL

A/B Testing

result = CrucibleBench.experiment(:ab_test,
  control: control_scores,
  treatment: treatment_scores,
  name: "Prompt Engineering Test"
)

# Comprehensive output includes:
# - Statistical significance
# - Effect size with interpretation
# - Power analysis
# - Recommendations

Ablation Study

result = CrucibleBench.experiment(:ablation,
  baseline: [0.85, 0.87, 0.84, 0.86, 0.88],
  without_component: [0.78, 0.76, 0.79, 0.77, 0.75],
  component_name: "Ensemble Voting"
)

# Shows performance drop and component importance

Hyperparameter Sweep

result = CrucibleBench.experiment(:hyperparameter_sweep,
  configurations: [config_a, config_b, config_c],
  labels: ["Config A", "Config B", "Config C"],
  correction_method: :holm # or :bonferroni, :benjamini_hochberg
)

# Identifies best configuration with pairwise comparisons
# Pairwise p-values are adjusted using the chosen correction method

Assumption Checks (Normality & Variance)

# Normality
alias CrucibleBench.Stats.{NormalityTests, VarianceTests}

# Normality
NormalityTests.quick_check(data)          # fast skew/kurtosis screen
NormalityTests.assess_normality(data)     # Shapiro-Wilk + skew/kurtosis with recommendation

# Variance equality
VarianceTests.levene_test([g1, g2, g3])   # robust Brown-Forsythe (median-centered)
VarianceTests.f_test(g1, g2)              # classic F-test (assumes normality)
VarianceTests.quick_check(g1, g2)         # fast variance ratio heuristic
  • Use normality/variance checks to choose between parametric and non-parametric tests.
  • Constant or near-constant data is handled safely (no crashes).
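
For example, these checks can drive manual test selection (compare/3 already does this automatically when check_assumptions: true). A sketch assuming assess_normality/1 returns a map with a boolean :normal? field (the field name is an assumption, not confirmed here):

alias CrucibleBench.Stats.NormalityTests

check1 = NormalityTests.assess_normality(group1)
check2 = NormalityTests.assess_normality(group2)

# :normal? field assumed for illustration
test = if check1.normal? and check2.normal?, do: :t_test, else: :mann_whitney
result = CrucibleBench.compare(group1, group2, test: test)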

Multiple Comparison Control

alias CrucibleBench.Stats.MultipleComparisons

p_values = [0.01, 0.03, 0.04, 0.20]

# Adjust p-values
MultipleComparisons.correct(p_values, method: :holm)
MultipleComparisons.correct(p_values, method: :benjamini_hochberg, fdr_level: 0.10)

# Boolean rejections (uses the same alpha/FDR level)
MultipleComparisons.reject(p_values, method: :bonferroni)
  • Hyperparameter sweeps automatically apply corrections (:holm default); set correction_method: and optional fdr_level: to change behavior.
  • Exports include original and adjusted p-values plus significance under the chosen correction.

Export Results

Markdown

markdown = CrucibleBench.Export.to_markdown(result)
IO.puts(markdown)

LaTeX

latex = CrucibleBench.Export.to_latex(result)
# Generates LaTeX table for academic papers

HTML

html = CrucibleBench.Export.to_html(result)
# Generates styled HTML report

Experiment Reports

report = CrucibleBench.Export.experiment_to_markdown(ab_result)
# Comprehensive markdown report with interpretations

Statistical Tests Reference

Parametric Tests

  • Independent t-test (CrucibleBench.Stats.TTest.test/3): Compare 2 independent groups
  • Welch's t-test (CrucibleBench.Stats.TTest.test/3): Compare 2 groups with unequal variances
  • Paired t-test (CrucibleBench.Stats.PairedTTest.test/3): Compare 2 related groups
  • One-way ANOVA (CrucibleBench.Stats.ANOVA.one_way/2): Compare 3+ independent groups

Non-Parametric Tests

  • Mann-Whitney U (CrucibleBench.Stats.MannWhitney.test/3): Non-parametric alternative to the t-test
  • Wilcoxon signed-rank (CrucibleBench.Stats.Wilcoxon.test/3): Non-parametric alternative to the paired t-test
  • Kruskal-Wallis (CrucibleBench.Stats.KruskalWallis.test/2): Non-parametric alternative to ANOVA
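
These modules can also be called directly when you want a specific test rather than automatic selection. A minimal sketch (the options argument is assumed to default to [], per the arities above):

group1 = [0.72, 0.68, 0.75, 0.71, 0.69]
group2 = [0.78, 0.73, 0.81, 0.76, 0.74]

# Direct module call; opts passed explicitly as []
result = CrucibleBench.Stats.MannWhitney.test(group1, group2, [])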

Effect Sizes

  • Cohen's d (CrucibleBench.Stats.EffectSize.cohens_d/2): Standardized mean difference
  • Hedges' g (CrucibleBench.Stats.EffectSize.hedges_g/2): Bias-corrected Cohen's d
  • Glass's delta (CrucibleBench.Stats.EffectSize.glass_delta/2): Uses the control group SD only
  • Eta-squared (included in ANOVA results): Proportion of variance explained

Effect Size Interpretation

Based on Cohen (1988):

Cohen's d:

  • < 0.2: Negligible
  • 0.2 - 0.5: Small
  • 0.5 - 0.8: Medium
  • > 0.8: Large

Eta-squared (η²):

  • < 0.01: Negligible
  • 0.01 - 0.06: Small
  • 0.06 - 0.14: Medium
  • > 0.14: Large
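
These thresholds are straightforward to encode. A standalone helper mirroring the Cohen's d table above (illustrative only, not part of the library API):

defmodule EffectSizeGuide do
  # Mirrors the Cohen's d thresholds above; not a CrucibleBench module
  def interpret_cohens_d(d) do
    cond do
      abs(d) < 0.2 -> "negligible"
      abs(d) < 0.5 -> "small"
      abs(d) < 0.8 -> "medium"
      true -> "large"
    end
  end
end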

Module Structure

lib/crucible_bench/
├── bench.ex                          # Main API
├── result.ex                         # Result struct
├── stats.ex                          # Core statistics
├── analysis.ex                       # High-level analysis
├── experiment.ex                     # Experiment DSL
├── export.ex                         # Export/reporting
└── stats/
    ├── t_test.ex                     # Independent t-test
    ├── paired_t_test.ex              # Paired t-test
    ├── anova.ex                      # ANOVA
    ├── mann_whitney.ex               # Mann-Whitney U
    ├── wilcoxon.ex                   # Wilcoxon signed-rank
    ├── kruskal_wallis.ex             # Kruskal-Wallis
    ├── effect_size.ex                # Effect size measures
    ├── confidence_interval.ex        # CI calculations
    ├── power.ex                      # Power analysis
    ├── multiple_comparisons.ex       # p-value corrections (FWER/FDR)
    ├── normality_tests.ex            # Shapiro-Wilk + diagnostics
    ├── variance_tests.ex             # Levene, F-test, variance heuristics
    └── distributions.ex              # Probability distributions

Examples

See examples/basic_usage.exs for comprehensive examples covering:

  1. Independent samples t-test
  2. Paired t-test
  3. One-way ANOVA
  4. Effect size analysis
  5. Confidence intervals
  6. Power analysis
  7. A/B test experiment
  8. Ablation study
  9. Hyperparameter sweep
  10. Result export

Run examples:

mix run examples/basic_usage.exs

Testing

Run the test suite:

mix test

Run specific tests:

mix test test/bench_test.exs
mix test test/stats_test.exs
mix test test/effect_size_test.exs

Best Practices for AI Research

1. Always Report Effect Sizes

P-values alone don't tell the full story. Always include effect sizes:

result = CrucibleBench.compare(control, treatment)
IO.puts("P-value: #{result.p_value}")
IO.puts("Effect size: #{result.effect_size.cohens_d} (#{result.effect_size.interpretation})")

2. Check Statistical Power

Ensure your study has adequate power:

power = CrucibleBench.power_analysis(:t_test,
  analysis_type: :post_hoc,
  effect_size: observed_effect,
  n_per_group: n,
  alpha: 0.05
)

if power.power < 0.8 do
  IO.puts("Warning: Underpowered study! #{power.recommendation}")
end

3. Use Confidence Intervals

CIs provide more information than p-values:

result = CrucibleBench.compare(group1, group2)
{lower, upper} = result.confidence_interval
IO.puts("95% CI: [#{lower}, #{upper}]")

4. Consider Practical Significance

Statistical significance ≠ practical significance:

if result.p_value < 0.05 and abs(result.effect_size.cohens_d) < 0.2 do
  IO.puts("Statistically significant but negligible effect size")
end

5. Use Experiment DSL for Complex Studies

The experiment DSL automates best practices:

result = CrucibleBench.experiment(:ab_test,
  control: control,
  treatment: treatment,
  name: "My Experiment"
)

# Automatically includes:
# - Appropriate test selection
# - Effect size calculation
# - Power analysis
# - Recommendations

Common Use Cases in AI Research

Compare Model Performance

model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.88, 0.90, 0.89, 0.91, 0.87]

result = CrucibleBench.compare(model_a_scores, model_b_scores)
effect = CrucibleBench.effect_size(model_a_scores, model_b_scores)

Test Prompt Engineering

baseline_prompt = [0.72, 0.68, 0.75, 0.71, 0.69]
optimized_prompt = [0.78, 0.73, 0.81, 0.76, 0.74]

result = CrucibleBench.experiment(:ab_test,
  control: baseline_prompt,
  treatment: optimized_prompt,
  name: "Prompt Optimization"
)

Evaluate Architecture Changes

baseline = [0.85, 0.87, 0.84, 0.86, 0.88]
new_arch = [0.88, 0.90, 0.89, 0.91, 0.87]

result = CrucibleBench.compare(baseline, new_arch)
markdown = CrucibleBench.Export.to_markdown(result)
File.write!("results.md", markdown)

Ablation Studies

full_system = [0.85, 0.87, 0.84, 0.86, 0.88]
without_cache = [0.78, 0.76, 0.79, 0.77, 0.75]

result = CrucibleBench.experiment(:ablation,
  baseline: full_system,
  without_component: without_cache,
  component_name: "Response Cache"
)

Limitations

  • Sample Size: Most tests assume n ≥ 30 for asymptotic properties. Use bootstrap methods for smaller samples (see the sketch after this list).
  • Normality: Parametric tests assume normality. Bench automatically suggests non-parametric alternatives when assumptions are violated.
  • Independence: All tests assume independent observations. Use appropriate designs for repeated measures.
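
As the Sample Size note suggests, prefer the bootstrap CI when n is small:

# n = 8 is well below the asymptotic comfort zone, so bootstrap the CI
small_sample = [0.81, 0.84, 0.79, 0.83, 0.80, 0.85, 0.78, 0.82]
ci = CrucibleBench.confidence_interval(small_sample, :mean, method: :bootstrap)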

References

Statistical Methods

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge.
  • Welch, B. L. (1947). The generalization of "Student's" problem when several different population variances are involved. Biometrika, 34(1-2), 28-35.
  • Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583-621.

AI Research Statistics

  • Dror, R., et al. (2018). The hitchhiker's guide to testing statistical significance in natural language processing. Proceedings of ACL.
  • Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30.

Advanced Features

Bootstrap Confidence Intervals

For small samples or non-normal data, use bootstrap methods:

# Bootstrap CI for median (robust to outliers)
data = [0.85, 0.87, 0.84, 0.86, 0.88, 0.83, 0.89, 0.85]
ci = CrucibleBench.confidence_interval(data, :median,
  method: :bootstrap,
  iterations: 10000
)
# => %{interval: {0.835, 0.875}, method: :bootstrap, bootstrap_distribution: %{...}}

Multiple Effect Size Measures

# Compare different effect size calculations
alias CrucibleBench.Stats

cohens_d = Stats.EffectSize.cohens_d(group1, group2)
hedges_g = Stats.EffectSize.hedges_g(group1, group2)  # Bias-corrected
glass_delta = Stats.EffectSize.glass_delta(group1, group2)  # Control SD only

IO.puts("Cohen's d: #{cohens_d.cohens_d}")
IO.puts("Hedges' g: #{hedges_g.hedges_g}")
IO.puts("Glass's Δ: #{glass_delta.glass_delta}")

Power Analysis Curves

Calculate power for different sample sizes:

effect_size = 0.5
for n <- [20, 30, 50, 100] do
  power = CrucibleBench.power_analysis(:t_test,
    analysis_type: :post_hoc,
    effect_size: effect_size,
    n_per_group: n,
    alpha: 0.05
  )
  IO.puts("n=#{n}: power=#{Float.round(power.power * 100, 1)}%")
end

Complete API Reference

Core Functions

CrucibleBench.compare(group1, group2, opts \\ [])

Compares two independent groups with automatic test selection.

Options:

  • :test - Force specific test (:t_test, :welch_t_test, :mann_whitney)
  • :confidence_level - CI level (default: 0.95)
  • :check_assumptions - Test normality (default: true)
  • :alternative - :two_sided, :less, :greater

Returns: CrucibleBench.Result struct
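
For example, combining several of these options:

# Force Welch's t-test with a one-sided alternative and a 99% CI
result = CrucibleBench.compare(control, treatment,
  test: :welch_t_test,
  alternative: :greater,
  confidence_level: 0.99
)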

CrucibleBench.compare_paired(group1, group2, opts \\ [])

Compares paired/related groups.

Options: Same as compare/3

CrucibleBench.compare_multiple(groups, opts \\ [])

Compares 3+ groups with ANOVA or Kruskal-Wallis.

Options:

  • :test - Force :anova or :kruskal_wallis
  • :check_assumptions - Test normality (default: true)

CrucibleBench.effect_size(group1, group2, opts \\ [])

Calculates Cohen's d effect size.

CrucibleBench.confidence_interval(data, statistic, opts \\ [])

Calculates confidence intervals.

Statistics: :mean, :median, :variance, etc.
Methods: :analytical, :bootstrap

CrucibleBench.power_analysis(test_type, opts \\ [])

Power analysis calculations.

Analysis types: :a_priori, :post_hoc
Required: :effect_size and :alpha, plus :power (a priori) or :n_per_group (post hoc)

Experiment DSL

CrucibleBench.experiment(:ab_test, opts)

Options:

  • :control - Control group data
  • :treatment - Treatment group data
  • :name - Experiment name

CrucibleBench.experiment(:ablation, opts)

Options:

  • :baseline - Full system performance
  • :without_component - Performance without component
  • :component_name - Name of removed component

CrucibleBench.experiment(:hyperparameter_sweep, opts)

Options:

  • :configurations - List of performance arrays
  • :labels - Configuration names

Export Functions

CrucibleBench.Export.to_markdown(result)

CrucibleBench.Export.to_latex(result)

CrucibleBench.Export.to_html(result)

CrucibleBench.Export.experiment_to_markdown(experiment_result)

Integration Examples

With Phoenix LiveView

defmodule StatsLive do
  use Phoenix.LiveView

  def handle_event("run_test", %{"control" => control, "treatment" => treatment}, socket) do
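    # Note: if these values arrive as form-encoded strings, parse them to numbers first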
    result = CrucibleBench.compare(control, treatment)
    markdown = CrucibleBench.Export.to_markdown(result)

    {:noreply, assign(socket, result: result, markdown: markdown)}
  end
end

Research Workflow Integration

defmodule ResearchPipeline do
  def run_experiment(control_data, treatment_data, metadata) do
    # 1. Run statistical test
    result = CrucibleBench.compare(control_data, treatment_data)

    # 2. Check power
    power_analysis = CrucibleBench.power_analysis(:t_test,
      analysis_type: :post_hoc,
      effect_size: abs(result.effect_size.cohens_d),
      n_per_group: length(control_data),
      alpha: 0.05
    )

    # 3. Generate report
    report = CrucibleBench.Export.experiment_to_markdown(%{
      experiment_type: :ab_test,
      name: metadata.name,
      significant?: result.p_value < 0.05,
      p_value: result.p_value,
      effect_size: result.effect_size,
      power: power_analysis.power,
      # ... other fields
    })

    # 4. Save results
    File.write!("results/#{metadata.name}.md", report)

    {:ok, result, power_analysis}
  end
end

Benchmark Integration

defmodule BenchmarkRunner do
  def run_benchmarks(models, dataset) do
    results = for {name, model} <- models do
      scores = Enum.map(dataset, &model.predict/1)
      {name, scores}
    end

    # Statistical comparison of all models
    score_lists = Enum.map(results, fn {_name, scores} -> scores end)
    comparison = CrucibleBench.compare_multiple(score_lists)

    # Pairwise comparisons
    pairwise = for i <- 0..(length(results)-2),
                   j <- (i+1)..(length(results)-1) do
      {name_i, scores_i} = Enum.at(results, i)
      {name_j, scores_j} = Enum.at(results, j)

      result = CrucibleBench.compare(scores_i, scores_j)
      %{comparison: "#{name_i} vs #{name_j}",
        p_value: result.p_value,
        effect_size: result.effect_size.cohens_d}
    end

    %{omnibus: comparison, pairwise: pairwise}
  end
end

Performance Considerations

Memory Usage

  • Bootstrap methods with high iteration counts (>10,000) may consume significant memory
  • For large datasets, consider using analytical methods when assumptions are met
  • Effect size calculations are O(n) in sample size

Computational Complexity

  • t-test: O(n). Fast for any n.
  • ANOVA: O(k × n), where k is the number of groups.
  • Bootstrap CI: O(iterations × n). Expensive for high precision.
  • Mann-Whitney: O(n²). Slow for large n (>1000).
  • Kruskal-Wallis: O(n log n). Better scaling.

Optimization Tips

# Use analytical methods when possible
ci = CrucibleBench.confidence_interval(data, :mean, method: :analytical)

# Reduce bootstrap iterations for faster results
ci = CrucibleBench.confidence_interval(data, :median,
  method: :bootstrap,
  iterations: 1000  # Instead of default 10000
)

# Cache repeated analyses, e.g. with the Memoize hex package
defmodule CachedStats do
  use Memoize
  defmemo power_analysis(opts), do: CrucibleBench.power_analysis(:t_test, opts)
end

Troubleshooting

Common Issues

Non-significant results despite large differences

# Check if you have enough power
result = CrucibleBench.compare(group1, group2)
power = CrucibleBench.power_analysis(:t_test,
  analysis_type: :post_hoc,
  effect_size: abs(result.effect_size.cohens_d),
  n_per_group: length(group1),
  alpha: 0.05
)

if power.power < 0.8 do
  IO.puts("Underpowered study! Need larger sample size.")
end

Assumption violations

# Check normality
result = CrucibleBench.compare(group1, group2, check_assumptions: true)
# If normality test fails, consider non-parametric tests

# Or manually check
skew1 = CrucibleBench.Stats.skewness(group1)
kurt1 = CrucibleBench.Stats.kurtosis(group1)

Outliers affecting results

# Use robust statistics
median_ci = CrucibleBench.confidence_interval(data, :median, method: :bootstrap)
# Compare with mean-based results

Error Messages

  • "Need at least 2 groups": compare_multiple/2 requires 2+ groups
  • "Unknown test: xyz": Invalid test type specified
  • "Sample size too small": Some tests require minimum n (e.g., normality tests)

Research Methodology

Best Practices Checklist

  • Power Analysis: Calculate required sample size before data collection
  • Effect Sizes: Always report alongside p-values
  • Assumptions: Test normality, homogeneity of variance
  • Multiple Testing: Apply corrections for multiple comparisons
  • Confidence Intervals: Report CIs, not just p-values
  • Replication: Design studies for reproducibility

Common Research Scenarios

Pre-registered Analysis Plan

# Define analysis plan before data collection
analysis_plan = %{
  primary_test: :welch_t_test,
  alpha: 0.05,
  power_target: 0.80,
  effect_size_estimate: 0.5,
  required_n: 64  # From a priori power analysis
}

# Execute plan
result = CrucibleBench.compare(group1, group2, test: analysis_plan.primary_test)

Exploratory Data Analysis

# Multiple effect sizes for robustness
alias CrucibleBench.Stats

effect_sizes = [
  CrucibleBench.effect_size(group1, group2),
  Stats.EffectSize.hedges_g(group1, group2),
  Stats.EffectSize.glass_delta(group1, group2)
]

# Sensitivity analysis with different tests
results = [
  CrucibleBench.compare(group1, group2, test: :welch_t_test),
  CrucibleBench.compare(group1, group2, test: :mann_whitney)
]

Meta-analysis Preparation

# Calculate effect sizes for meta-analysis
studies = [
  {study1_control, study1_treatment, "Study 1"},
  {study2_control, study2_treatment, "Study 2"}
]

alias CrucibleBench.Stats

meta_data = for {control, treatment, name} <- studies do
  effect = CrucibleBench.effect_size(control, treatment)
  result = CrucibleBench.compare(control, treatment)

  %{
    study: name,
    cohens_d: effect.cohens_d,
    variance: Stats.effect_size_variance(effect.cohens_d, length(control) + length(treatment)),
    n: length(control) + length(treatment)
  }
end

Contributing

Development Setup

# Clone and setup
git clone https://github.com/North-Shore-AI/crucible_bench.git
cd crucible_bench
mix deps.get

# Run tests
mix test

# Run examples
mix run examples/basic_usage.exs
mix run examples/advanced_usage.exs

# Generate docs
mix docs

Code Standards

  • Modules: Follow Elixir naming conventions
  • Functions: Clear, descriptive names with comprehensive documentation
  • Tests: Unit tests for all public functions, property-based tests where applicable
  • Documentation: Complete @doc and @moduledoc with examples

Adding New Tests

# 1. Implement test in appropriate stats module
defmodule CrucibleBench.Stats.NewTest do
  def test(group1, group2, opts \\ []) do
    # Implementation
    # Return CrucibleBench.Result struct
  end
end

# 2. Add to Analysis module
def compare_groups(group1, group2, opts) do
  # ... existing logic
  test_to_use = if new_condition, do: :new_test, else: existing_logic

  case test_to_use do
    :new_test -> NewTest.test(group1, group2, opts)
    # ... other cases
  end
end

# 3. Add comprehensive tests
test "new test handles various inputs" do
  # Test cases
end

Reporting Issues

Please include:

  • Elixir/Erlang versions
  • Sample data that reproduces the issue
  • Expected vs actual behavior
  • Full error messages and stack traces

License

MIT License - see LICENSE file for details

Changelog

v0.2.0

  • Complete statistical testing framework with parametric and non-parametric coverage using accurate distribution functions
  • Expanded effect size suite with paired measures, eta/omega squared, and rank-biserial correlation plus interpretation guidance
  • Analytical and bootstrap confidence intervals and power analysis with actionable recommendations
  • High-level helpers for automatic test selection and experiment DSL for A/B tests, ablations, and hyperparameter sweeps
  • Publication-ready exports to Markdown, LaTeX, and HTML with standardized result metadata

v0.1.0

  • Initial release with comprehensive statistical testing framework
  • Support for parametric and non-parametric tests
  • Effect size calculations and power analysis
  • Bootstrap confidence intervals
  • Experiment DSL for common research patterns
  • Export to Markdown, LaTeX, and HTML formats
  • Complete documentation and examples