What does Pearson's chi-squared goodness-of-fit test decide?

It decides whether observed category counts O_i are consistent with a hypothesized distribution that produces expected counts E_i (uniform, Mendelian 9:3:3:1, binomial, etc.). The statistic is χ² = Σ_i (O_i − E_i)² / E_i with df = k − r − 1 (k categories, r parameters estimated from the data); for a fully specified null (r = 0) this reduces to df = k − 1. If χ² ≥ χ²_α(df), the critical value, we reject the null. This tool covers k = 4 categories and the uniform null E_i = N/4.

Why is the 5% critical value 7.815 for df = 3?

The chi-squared density is f(χ²|df) = (χ²)^(df/2−1) e^(−χ²/2) / (2^(df/2) Γ(df/2)). For df = 3 the right-tail integral equals 0.05 at χ² ≈ 7.8147 (standard tables list χ²_0.05(3) = 7.815). The tool finds this value numerically (a Wilson–Hilferty starting approximation refined by Newton iteration) and shows the green accept and red reject regions accordingly. For reference: df=1 → 3.841, df=2 → 5.991, df=4 → 9.488. The critical value increases with df.
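The 7.8147 figure can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the tool's own routine: it swaps the Wilson–Hilferty/Newton scheme for the closed-form upper tail that exists for df = 3 plus plain bisection. The function names `chi2_sf_df3` and `critical_value_df3` are illustrative.

```python
import math

def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability P(X >= x) for a chi-squared variable with df = 3."""
    if x <= 0.0:
        return 1.0
    # Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

def critical_value_df3(alpha: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Solve chi2_sf_df3(x) = alpha by bisection (the tail is strictly decreasing)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf_df3(mid) > alpha:
            lo = mid   # tail still too heavy: the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

crit_5pct = critical_value_df3(0.05)
print(round(crit_5pct, 4))  # 7.8147
```

Bisection is slower than Newton iteration but needs no derivative and no starting guess, which keeps the sketch short.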

How does sample size N change the result?

When you change a slider, N = ΣO_i and E_i = N/4 both shift. For identical proportions, χ² grows linearly with N: a (6,5,4,5) sample (N=20) gives χ² = 0.4, but the same proportions at N=1000 give χ² = 20, a clear rejection. Larger N increases statistical power and lets you detect tiny deviations. Conversely, very small N (rule of thumb: any E_i < 5) makes the chi-squared approximation unreliable; switch to an exact test (the exact multinomial test for a one-way table) in that regime.
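The linear growth with N is easy to check numerically. The sketch below holds the proportions at (0.30, 0.25, 0.20, 0.25) and scales the total; `chi2_uniform` is a hypothetical helper name, and the uniform null E_i = N/k is the one described above.

```python
def chi2_uniform(observed):
    """Pearson's statistic against the uniform null E_i = N / k."""
    n = sum(observed)
    expected = n / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

proportions = (0.30, 0.25, 0.20, 0.25)
for n in (20, 100, 1000):
    counts = [round(p * n) for p in proportions]
    print(n, counts, round(chi2_uniform(counts), 3))
```

The three rows give χ² = 0.4, 2.0 and 20.0: multiplying N by 50 multiplies the statistic by 50 while the proportions stay put.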

Where is goodness-of-fit testing actually used?

Common examples: (1) genetics — testing F2 phenotype ratios against Mendel's 9:3:3:1; (2) testing whether a die or random-number generator follows a uniform distribution; (3) verifying that call-center arrivals are uniform across days; (4) quality control to monitor whether the mix of defect types stays stable; (5) marketing A/B tests on equal click rates. In CAE/engineering, applications include mesh-quality histograms, uniformity tests for Monte-Carlo random streams, and defect-mode share monitoring on production lots.

Chi-Squared Goodness-of-Fit Test Simulator — Pearson's Test

Parameters

Observed count O_1

Observed count O_2

Observed count O_3

Observed count O_4

Expected counts use the uniform null: E_i = (O_1+O_2+O_3+O_4)/4. Defaults (30, 25, 20, 25) give χ² = 2.000, df = 3, 5% critical = 7.815 and the verdict 'Accept H₀'.

Results

χ² statistic

Degrees of freedom df

5% critical value

Decision

Observed vs expected counts

Blue bars are the observed counts O_i; red bars are the expected counts E_i (constant under the uniform null). Each term of χ² is the squared blue–red gap divided by E_i.

Chi-squared distribution f(χ²|df=3) and rejection region

x axis: χ² ∈ [0, 20]; y axis: probability density. Green = accept region, red = reject region (χ² ≥ 7.815), yellow marker = current χ². When the marker enters red, we reject H₀.

Theory & Key Formulas

For $k$ categories with observed counts $O_i$ and expected counts $E_i$, Pearson's chi-squared statistic is

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

Goodness-of-fit degrees of freedom are $df = k - 1$ (or $df = k - r - 1$ when $r$ parameters are estimated). Under the uniform null the expected count is $E_i = N/k$ with $N = \sum O_i$.

We reject $H_0$ when $\chi^2 \ge \chi^2_\alpha(df)$. For $df = 3$ the 5% upper-tail value is $\chi^2_{0.05}(3) \approx 7.815$. With defaults $(30,25,20,25)$ we get $\chi^2 = 25/25 + 0 + 25/25 + 0 = 2.000 < 7.815$, so $H_0$ is not rejected (the tool displays this verdict as 'Accept H₀').
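The decision rule above can be sketched in Python as follows. The names `GofResult`, `pearson_gof` and `CRITICAL_5PCT` are illustrative and not taken from the tool; the lookup table holds only the 5% critical values quoted on this page, so the sketch handles df = 1 to 4 and nothing else.

```python
from dataclasses import dataclass

# 5% upper-tail critical values quoted in the text, keyed by df.
CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

@dataclass
class GofResult:
    statistic: float
    df: int
    critical: float
    reject: bool

def pearson_gof(observed, expected, estimated_params=0):
    """Pearson's statistic with df = k - r - 1 and a 5% table lookup."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - estimated_params - 1
    critical = CRITICAL_5PCT[df]  # KeyError for df outside the quoted table
    return GofResult(statistic, df, critical, statistic >= critical)

observed = [30, 25, 20, 25]
n = sum(observed)
result = pearson_gof(observed, [n / 4] * 4)
print(result)
```

For the defaults this reports a statistic of 2.0 with df = 3 and `reject=False`; passing `[50, 25, 0, 25]` flips the flag.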

What is the Chi-Squared Goodness-of-Fit Test Simulator

🙋

If I roll a four-sided die 100 times and the faces (1,2,3,4) come up (30, 25, 20, 25) times, can I really call it 'fair'? It does feel a bit off.

🎓

Pearson's chi-squared goodness-of-fit test answers exactly that. Under 'fair' the expected count is E_i = 100/4 = 25 for every category. The statistic χ² = Σ(O−E)²/E sums squared deviations divided by E, here (5²+0²+5²+0²)/25 = 50/25 = 2.000. With df = k−1 = 3 the 5% critical value of the chi-squared distribution is 7.815, and 2.000 < 7.815, so we cannot reject H₀ — the deviation is consistent with chance.

🙋

What if the data were heavily skewed, say (50, 25, 0, 25)?

🎓

Try the sliders. E is still 25; χ² = 25²/25 + 0 + 25²/25 + 0 = 50, way above 7.815. The verdict flips to 'Reject H₀' because a statistic this large has p ≈ 8 × 10⁻¹¹ under fair odds. The yellow marker on the chi-squared plot jumps into the red rejection region.

🙋

How big a deal is the sample size?

🎓

Huge. Hold the proportions at (0.30, 0.25, 0.20, 0.25). Then χ² scales linearly with N: (6,5,4,5) at N=20 gives χ²=0.4 (no concern), (30,25,20,25) at N=100 gives χ²=2.0, but the same proportions at N=1000 give χ²=20, a clear rejection. Power grows with N, but very small N (any E_i<5) breaks the chi-squared approximation; switch to an exact multinomial test in that regime.

🙋

Why is df = 3 and not 4 when there are 4 categories?

🎓

Because N = ΣO_i is fixed, the four counts cannot all vary independently — only three can. So df = k−1 = 3. If you also estimate r extra parameters from the data, df drops to k−r−1. With our uniform null no parameter is estimated, so df = 3 is correct, and that pins the critical value at 7.815.

FAQ

A common rule of thumb requires E_i ≥ 5 for every category for the chi-squared approximation to be reliable. When that fails you have three options: (1) merge adjacent categories to reduce k (and df), (2) collect more data to raise N, or (3) switch to an exact test (for a one-way table, the exact multinomial test; Fisher's exact test is its contingency-table counterpart). The G-test (likelihood-ratio statistic) is an alternative statistic, but it relies on the same large-sample approximation. This tool has fixed k = 4 and E_i = N/4, so O_i ≥ 1 in every category guarantees E_i ≥ 1, and a total count N ≥ 20 satisfies the E_i ≥ 5 guideline.
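For small totals an exact test avoids the chi-squared approximation altogether. Below is a minimal sketch of an exact multinomial test under the uniform null, assuming outcomes are ranked by Pearson's statistic (ranking by outcome probability is another common convention). The function name `exact_multinomial_pvalue` is illustrative, and the brute-force enumeration is only practical for small N and k.

```python
import math
from itertools import product

def exact_multinomial_pvalue(observed):
    """Exact p-value under the uniform null, ranking outcomes by Pearson's statistic."""
    k = len(observed)
    n = sum(observed)
    expected = n / k

    def stat(counts):
        return sum((c - expected) ** 2 / expected for c in counts)

    def prob(counts):
        # Multinomial probability with equal cell probabilities 1/k.
        coef = math.factorial(n)
        for c in counts:
            coef //= math.factorial(c)
        return coef / k ** n

    observed_stat = stat(observed)
    p_value = 0.0
    # Enumerate the first k-1 counts; the last one is fixed by the total n.
    for head in product(range(n + 1), repeat=k - 1):
        last = n - sum(head)
        if last < 0:
            continue
        counts = head + (last,)
        # Small tolerance so permutations of the same counts are treated as ties.
        if stat(counts) >= observed_stat - 1e-12:
            p_value += prob(counts)
    return p_value

print(exact_multinomial_pvalue([6, 1, 2, 3]))  # N = 12, so every E_i = 3 < 5
```

With N = 12 and k = 4 there are only 455 possible outcomes, so the loop finishes instantly; the cost grows quickly as N rises.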

The statistic χ² = Σ(O−E)²/E is the same, but the purpose and the way E is computed differ. A goodness-of-fit test asks whether a one-way categorical distribution matches a hypothesized distribution; E comes directly from theory (the uniform null here). A test of independence asks whether two categorical variables are independent in a two-way contingency table; E_ij = (row_i total × col_j total) / grand total, with df = (r−1)(c−1). For example, gender × smoking is an independence test, while die-face frequencies use the goodness-of-fit form.
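To make the contrast concrete, here is a short sketch of the independence-test bookkeeping: expected counts from the margins and df = (r−1)(c−1). The 2 × 2 table is hypothetical data invented for illustration, and the function names are not from the tool.

```python
def expected_counts(table):
    """Expected counts E_ij = row_i total * col_j total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]

def independence_chi2(table):
    """Pearson's statistic and df = (rows - 1) * (cols - 1) for a two-way table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Hypothetical 2 x 2 counts (rows: gender, columns: smoker / non-smoker).
table = [[20, 30], [30, 20]]
statistic, df = independence_chi2(table)
print(expected_counts(table), statistic, df)
```

Here E is derived from the data's own margins, whereas in the goodness-of-fit case it is dictated by the hypothesized distribution.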

The four stat cards summarize χ², df, the 5% critical value, and the verdict. Comparing χ² to the critical value gives the same accept/reject information as comparing p to 0.05. Internally the code computes p exactly: χ² = 2.000 → p ≈ 0.572, χ² = 7.815 → p ≈ 0.05, χ² = 11.345 → p ≈ 0.01. To verify, use scipy.stats.chi2.sf(x, df) in Python or pchisq(q, df, lower.tail = FALSE) in R.
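If neither SciPy nor R is at hand, the upper-tail probability for integer df can be sketched with the standard library alone. This is not the tool's internal code; it assumes integer df and uses the recurrence that steps df up by 2 from the closed-form base cases df = 1 and df = 2. The name `chi2_sf` is chosen here to mirror SciPy's `sf`.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X >= x) for integer df, via the df -> df + 2 recurrence."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        tail, nu = math.erfc(math.sqrt(half)), 1   # df = 1 base case
    else:
        tail, nu = math.exp(-half), 2              # df = 2 base case
    while nu < df:
        # Q(x | nu + 2) = Q(x | nu) + (x/2)^(nu/2) * exp(-x/2) / Gamma(nu/2 + 1)
        tail += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return tail

for stat in (2.000, 7.815, 11.345):
    print(stat, round(chi2_sf(stat, 3), 3))
```

The loop prints 0.572, 0.05 and 0.01 for the three statistics listed above.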

Useful applications include: (1) validating mesh-quality histograms (aspect ratio, skew angle) against an expected distribution, (2) testing whether a Monte-Carlo random-number generator (Mersenne Twister, Halton, etc.) really produces a uniform sample, (3) monitoring whether the share of defect modes — crack vs void vs dimension fault — stays stable across production lots, and (4) augmenting SPC control charts to detect non-random defect patterns. Pearson's χ² is computationally cheap and trivial to implement in Excel, so it is a daily tool for quality engineers.

Real-world applications

Mendelian-ratio testing in genetics: The classic use case asks whether observed F2 phenotypes follow the 9:3:3:1 ratio. With observed pea-cross counts (312, 110, 102, 36; total 560) and expected counts (315, 105, 105, 35) we get χ² = 9/315 + 25/105 + 9/105 + 1/35 = 8/21 ≈ 0.381, df = 3, p ≈ 0.94, an excellent fit. Goodness-of-fit reporting is the standard in genetics journals.
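The arithmetic for a non-uniform null can be checked with exact fractions. The sketch below uses the counts quoted above; `expected_from_ratio` is a hypothetical helper name.

```python
from fractions import Fraction

def expected_from_ratio(total, ratio):
    """Split a total count according to an integer ratio such as 9:3:3:1."""
    parts = sum(ratio)
    return [Fraction(total * r, parts) for r in ratio]

observed = [312, 110, 102, 36]
expected = expected_from_ratio(sum(observed), (9, 3, 3, 1))
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print([int(e) for e in expected])  # expected counts per phenotype
print(chi2, float(chi2))           # exact fraction and its decimal value
```

Working in `Fraction` avoids rounding noise in the individual terms; the statistic comes out at about 0.38, far below the 7.815 threshold for df = 3.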

Uniformity testing of random-number generators (CAE Monte Carlo): For probabilistic FEM and reliability analysis, draws from Mersenne Twister or Halton sequences are binned and a chi-squared test checks uniformity. Extending this tool's 4 bins to 10 gives df = 9 and a 5% critical value of 16.92. Heavyweight test suites such as DIEHARD and TestU01 still use χ² as a building block alongside many other statistics.
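A 10-bin uniformity check of this kind might look like the following sketch. It assumes samples in [0, 1) and equal-width bins; the seed, the sample size and the function name `uniformity_chi2` are arbitrary choices for illustration, and a single χ² test is far weaker than the dedicated suites mentioned above.

```python
import random

def uniformity_chi2(samples, bins):
    """Bin samples from [0, 1) into equal-width bins and return (counts, chi2)."""
    counts = [0] * bins
    for u in samples:
        counts[min(int(u * bins), bins - 1)] += 1
    expected = len(samples) / bins
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return counts, chi2

rng = random.Random(12345)           # CPython's random module uses Mersenne Twister
samples = [rng.random() for _ in range(10_000)]
counts, chi2 = uniformity_chi2(samples, bins=10)

CRITICAL_5PCT_DF9 = 16.92            # value quoted in the text for df = 9
print(counts, round(chi2, 3), "reject" if chi2 >= CRITICAL_5PCT_DF9 else "do not reject")
```

A sound generator should still produce a rejection in roughly 5% of runs at this level, so one failure on its own is not evidence of a defect.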

Manufacturing quality control: Defects in a shift are sorted into {dimensional, surface, assembly, other}. Comparing observed counts against an expected mix from historical data (40:25:20:15, say) flags shifts in defect patterns. A rejection prompts a root-cause investigation — material change, process drift, operator drift. Used alongside SPC charts, it is a routine QC tool.

Marketing / A/B testing: An e-commerce page rotates four banner designs and observes click counts (45, 38, 52, 41). The null 'all four designs have equal click rate' is tested with χ². Rejection means at least one design differs significantly; multiple comparison procedures and confidence intervals then identify which one and how much.
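As a first look at which design contributes most, Pearson residuals (O − E)/√E can be listed alongside the statistic. This is a minimal sketch using the click counts above and assuming every banner received the same number of impressions, which the text does not state; it is not a substitute for a formal multiple-comparison procedure.

```python
import math

clicks = [45, 38, 52, 41]             # click counts from the text
expected = sum(clicks) / len(clicks)  # equal click rate, assuming equal impressions

# Pearson residuals: signed contribution of each design; their squares sum to chi-squared.
residuals = [(c - expected) / math.sqrt(expected) for c in clicks]
chi2 = sum(r * r for r in residuals)

for design, r in enumerate(residuals, start=1):
    print(f"design {design}: residual {r:+.3f}")
print(f"chi2 = {chi2:.3f}")
```

For these counts χ² = 110/44 = 2.5, below 7.815, so the equal-click-rate null is not rejected and no follow-up comparison is called for.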

Common misconceptions and caveats

The most common mistake is treating 'larger χ² = larger effect' at face value. χ² grows roughly linearly with sample size N, so χ² = 8 at N = 10 means something very different from χ² = 8 at N = 10000 — the latter merely 'detected' a tiny deviation. Always pair p-values with an effect size such as Cramér's V or Cohen's w to characterize the magnitude of the discrepancy. This tool spans small-to-medium totals (4 to 400), making the N effect easy to feel.
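Cohen's w for a goodness-of-fit test is w = √(χ²/N), which removes the dependence on sample size. A short sketch, with `cohens_w` as an illustrative name:

```python
import math

def cohens_w(chi2: float, n: int) -> float:
    """Cohen's w = sqrt(chi2 / N), an effect size that does not grow with N."""
    return math.sqrt(chi2 / n)

# Same statistic, very different effect sizes.
for n, chi2 in ((10, 8.0), (10_000, 8.0)):
    print(n, round(cohens_w(chi2, n), 3))

# Same proportions (0.30, 0.25, 0.20, 0.25) at three sample sizes: w stays constant.
for n, chi2 in ((20, 0.4), (100, 2.0), (1000, 20.0)):
    print(n, round(cohens_w(chi2, n), 4))
```

The first loop gives w ≈ 0.894 versus w ≈ 0.028 for the same χ² = 8; the second gives w ≈ 0.1414 at every N, even though χ² ranges from 0.4 to 20.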

Next, never forget the independence assumption. Pearson's χ² requires independent observations. Repeated measures on the same subject (before/after) need McNemar's test or Cochran's Q; clustered data (same family, machine, or production lot) inflate apparent significance, so χ² will look 'too significant' if the cluster effect is ignored.

Finally, beware the arbitrariness of category count k and binning. When discretizing a continuous variable into bins for a goodness-of-fit test, the bin width and bin count change χ² substantially. Sturges's rule and the Freedman–Diaconis rule provide data-driven defaults, but always document the binning choice when reporting results. With k = 4 fixed in this tool you do not face this issue, but it is a common blind spot in practice.
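Both binning rules are short enough to sketch directly. The version below assumes Sturges's rule in the form k = ⌈log₂ n⌉ + 1 and the Freedman–Diaconis width h = 2·IQR·n^(−1/3); quartiles come from `statistics.quantiles` with its default method, so other quartile conventions can shift the result slightly. The function names are illustrative.

```python
import math
import statistics

def sturges_bins(n: int) -> int:
    """Sturges's rule: k = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n)) + 1

def freedman_diaconis_bins(data) -> int:
    """Freedman-Diaconis rule: bin width h = 2 * IQR * n^(-1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    width = 2.0 * (q3 - q1) / len(data) ** (1.0 / 3.0)
    if width <= 0.0:
        raise ValueError("IQR is zero; the rule gives no usable bin width")
    return max(1, math.ceil((max(data) - min(data)) / width))

data = list(range(1, 101))  # placeholder data: the integers 1..100
print(sturges_bins(len(data)), freedman_diaconis_bins(data))
```

The two rules disagree even on this simple input (8 bins versus 5), which is why the chosen rule belongs in the report.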
