Chi-Squared Goodness-of-Fit Test Simulator — Pearson's Test
Adjust the four observed counts O_i and run Pearson's chi-squared goodness-of-fit test against a uniform-distribution null in real time. The tool reports χ² = Σ(O−E)²/E, df, the 5% critical value, and the accept/reject decision, with histogram and chi-squared distribution visualization.
Parameters
Observed count O_1
Observed count O_2
Observed count O_3
Observed count O_4
Expected counts use the uniform null: E_i = (O_1+O_2+O_3+O_4)/4. Defaults (30, 25, 20, 25) give χ² = 2.000, df = 3, 5% critical = 7.815 and the verdict 'Accept H₀'.
Results
χ² statistic
Degrees of freedom df
5% critical value
Decision
Observed vs expected counts
Blue bars are the observed counts O_i; red bars are the expected counts E_i (constant under the uniform null). Squared blue–red gaps drive each term of χ².
Chi-squared distribution f(χ²|df=3) and rejection region
x axis: χ² ∈ [0, 20]; y axis: probability density. Green = accept region, red = reject region (χ² ≥ 7.815), yellow marker = current χ². When the marker enters red, we reject H₀.
Theory & Key Formulas
For $k$ categories with observed counts $O_i$ and expected counts $E_i$, Pearson's chi-squared statistic is
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
Goodness-of-fit degrees of freedom are $df = k - 1$ (or $df = k - r - 1$ when $r$ parameters are estimated). Under the uniform null the expected count is $E_i = N/k$ with $N = \sum O_i$.
We reject $H_0$ when $\chi^2 \ge \chi^2_\alpha(df)$. For $df = 3$ the 5% upper-tail value is $\chi^2_{0.05}(3) \approx 7.815$. With defaults $(30,25,20,25)$ we get $\chi^2 = 25/25 + 0 + 25/25 + 0 = 2.000 \lt 7.815$, so $H_0$ is not rejected (the tool labels this outcome 'Accept H₀').
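The calculation above can be reproduced in a few lines. This is a minimal sketch, not the simulator's own code; the function name `pearson_chi2_uniform` and the constant name are illustrative assumptions.

```python
def pearson_chi2_uniform(observed):
    """Return (chi2, df) for Pearson's goodness-of-fit test against a uniform null."""
    n = sum(observed)
    k = len(observed)
    expected = n / k  # E_i = N / k under the uniform null
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    return chi2, k - 1

CRITICAL_5PCT_DF3 = 7.815  # 5% upper-tail value for df = 3, as quoted in the text

chi2, df = pearson_chi2_uniform([30, 25, 20, 25])
reject = chi2 >= CRITICAL_5PCT_DF3
print(f"chi2 = {chi2:.3f}, df = {df}, reject H0: {reject}")
# prints: chi2 = 2.000, df = 3, reject H0: False
```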
What is the Chi-Squared Goodness-of-Fit Test Simulator
🙋
If I roll a four-sided die 100 times and faces (1, 2, 3, 4) come up (30, 25, 20, 25) times, can I really call it 'fair'? It does feel a bit off.
🎓
Pearson's chi-squared goodness-of-fit test answers exactly that. Under 'fair' the expected count is E_i = 100/4 = 25 for every category. The statistic χ² = Σ(O−E)²/E sums squared deviations divided by E, here (5²+0²+5²+0²)/25 = 50/25 = 2.000. With df = k−1 = 3 the 5% critical value of the chi-squared distribution is 7.815, and 2.000 < 7.815, so we cannot reject H₀ — the deviation is consistent with chance.
🙋
What if the data were heavily skewed, say (50, 25, 0, 25)?
🎓
Try the sliders. E is still 25; χ² = 25²/25 + 0 + 25²/25 + 0 = 50, way above 7.815. The verdict flips to 'Reject H₀' because a skew this extreme would occur under fair odds with p ≈ 8 × 10⁻¹¹. The yellow marker on the chi-squared plot jumps into the red rejection region.
🙋
How big a deal is the sample size?
🎓
Huge. Hold the proportions at (0.30, 0.25, 0.20, 0.25). Then χ² scales linearly with N: (6,5,4,5) at N=20 gives χ²=0.4 (no concern), (30,25,20,25) at N=100 gives χ²=2.0, but the same proportions at N=1000 give χ²=20, a clear rejection. Power grows with N, but at very small N (any E_i<5) the chi-squared approximation becomes unreliable; in that regime use an exact multinomial test or a simulated (Monte Carlo) p-value. The G-test is not a remedy here, because it relies on the same large-sample approximation.
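The linear scaling is easy to check numerically. The sketch below multiplies the base counts (6, 5, 4, 5) by an integer factor so the proportions stay fixed; the helper name `chi2_uniform` is an illustrative assumption.

```python
def chi2_uniform(observed):
    """Pearson chi-squared statistic against a uniform null."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

base = (6, 5, 4, 5)  # proportions 0.30, 0.25, 0.20, 0.25 at N = 20
results = {}
for factor in (1, 5, 50):  # N = 20, 100, 1000
    counts = [c * factor for c in base]
    results[sum(counts)] = chi2_uniform(counts)
    print(sum(counts), counts, round(results[sum(counts)], 3))
# chi2 is 0.4 at N = 20, 2.0 at N = 100 and 20.0 at N = 1000
```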
🙋
Why is df = 3 and not 4 when there are 4 categories?
🎓
Because N = ΣO_i is fixed, the four counts cannot all vary independently — only three can. So df = k−1 = 3. If you also estimate r extra parameters from the data, df drops to k−r−1. With our uniform null no parameter is estimated, so df = 3 is correct, and that pins the critical value at 7.815.
FAQ
A common rule of thumb requires E_i ≥ 5 for every category for the chi-squared approximation to be reliable. When that fails you have three options: (1) merge adjacent categories to reduce k (and df), (2) collect more data to raise N, or (3) switch to an exact multinomial test (the goodness-of-fit counterpart of Fisher's exact test) or a Monte Carlo p-value. The G-test (likelihood-ratio statistic) is a useful alternative statistic, but it relies on the same large-sample approximation and so does not fix the small-count problem. This tool has fixed k = 4, so E_i = N/4; a total of at least 20 satisfies the E_i ≥ 5 guideline.
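When expected counts are too small for the approximation, one simulation-based option is to estimate the p-value by drawing many samples under the uniform null. The following is a rough sketch under stated assumptions: the function names, the seed, the number of simulations, and the sample (4, 1, 0, 3) are all made up for illustration.

```python
import random

def chi2_uniform(counts):
    """Pearson chi-squared statistic against a uniform null."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def monte_carlo_p(observed, n_sims=2000, seed=0):
    """Estimate P(chi2_sim >= chi2_obs) by simulating counts under the uniform null."""
    rng = random.Random(seed)
    k, n = len(observed), sum(observed)
    chi2_obs = chi2_uniform(observed)
    hits = 0
    for _ in range(n_sims):
        sim = [0] * k
        for _ in range(n):
            sim[rng.randrange(k)] += 1  # each category equally likely
        if chi2_uniform(sim) >= chi2_obs - 1e-12:  # tolerance guards float ties
            hits += 1
    return (hits + 1) / (n_sims + 1)  # add-one keeps the estimate above zero

observed = [4, 1, 0, 3]  # hypothetical small sample, E_i = 2
p_mc = monte_carlo_p(observed)
print(chi2_uniform(observed), p_mc)
```

The estimate varies slightly with the seed and the number of simulations, so it is best reported together with `n_sims`.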
The statistic χ² = Σ(O−E)²/E is the same, but the purpose and the way E is computed differ. A goodness-of-fit test asks whether a one-way categorical distribution matches a hypothesized distribution; E comes directly from theory (the uniform null here). A test of independence asks whether two categorical variables are independent in a two-way contingency table; E_ij = (row_i total × col_j total) / grand total, with df = (r−1)(c−1). For example, gender × smoking is an independence test, while die-face frequencies use the goodness-of-fit form.
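To make the difference in how E is computed concrete, here is a small sketch of the independence-test version. The 2×2 table is invented for illustration, and `independence_chi2` is a hypothetical helper name.

```python
def independence_chi2(table):
    """Return (expected, chi2, df) for a two-way contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    # E_ij = (row_i total * col_j total) / grand total
    expected = [[r * c / grand for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return expected, chi2, df

table = [[20, 30], [30, 20]]  # made-up counts
expected, chi2, df = independence_chi2(table)
print(expected, chi2, df)
# prints: [[25.0, 25.0], [25.0, 25.0]] 4.0 1
```

With df = 1 the 5% critical value is 3.841, so this invented table would lead to rejection.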
The four stat cards summarize χ², df, the 5% critical value, and the verdict. Comparing χ² to the critical value gives the same accept/reject information as comparing p to 0.05. Internally the code computes p exactly: χ² = 2.000 → p ≈ 0.572, χ² = 7.815 → p ≈ 0.05, χ² = 11.345 → p ≈ 0.01. To verify, use scipy.stats.chi2.sf(x, df) or R's pchisq(q, df, lower.tail = FALSE).
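For df = 3 specifically, the upper-tail probability has a closed form, so the quoted p-values can be checked with the standard library alone. This is a sketch for verification, not the tool's internal implementation; `chi2_sf_df3` is a hypothetical name, and the formula applies to df = 3 only.

```python
import math

def chi2_sf_df3(x):
    """Upper-tail probability P(X >= x) for a chi-squared variable with df = 3."""
    if x <= 0:
        return 1.0
    # For df = 3: sf(x) = erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

for x in (2.000, 7.815, 11.345):
    print(x, round(chi2_sf_df3(x), 3))
```

For other degrees of freedom, `scipy.stats.chi2.sf(x, df)` covers the general case.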
Useful applications include: (1) validating mesh-quality histograms (aspect ratio, skew angle) against an expected distribution, (2) testing whether a Monte-Carlo random-number generator (Mersenne Twister, Halton, etc.) really produces a uniform sample, (3) monitoring whether the share of defect modes — crack vs void vs dimension fault — stays stable across production lots, and (4) augmenting SPC control charts to detect non-random defect patterns. Pearson's χ² is computationally cheap and trivial to implement in Excel, so it is a daily tool for quality engineers.
Real-world applications
Mendelian-ratio testing in genetics: The classic use case asks whether observed F2 phenotypes follow the 9:3:3:1 ratio. With observed pea-cross counts (312, 110, 102, 36; total 560) and expected (315, 105, 105, 35) we get χ² = 9/315 + 25/105 + 9/105 + 1/35 ≈ 0.381, df = 3, p ≈ 0.94, an excellent fit. Goodness-of-fit tests of this kind are routinely reported in genetics.
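The same arithmetic with a non-uniform null can be sketched as follows, using the counts and the 9:3:3:1 ratio from the example above. Variable names are illustrative.

```python
observed = [312, 110, 102, 36]
ratio = [9, 3, 3, 1]

n = sum(observed)
# Expected counts are the total split in proportion to the hypothesized ratio
expected = [n * r / sum(ratio) for r in ratio]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1  # no parameters estimated from the data

print(expected, round(chi2, 3), df)
# prints: [315.0, 105.0, 105.0, 35.0] 0.381 3
```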
Uniformity testing of random-number generators (CAE Monte Carlo): For probabilistic FEM and reliability analysis, draws from Mersenne Twister or Halton sequences are binned and a chi-squared test checks uniformity. Extending this tool's 4 bins to 10 gives df = 9 and a 5% critical value of 16.92. Heavyweight test suites such as DIEHARD and TestU01 still use χ² as a building block alongside many other statistics.
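A 10-bin uniformity check might look like the sketch below. It uses Python's `random` module, whose core generator is the Mersenne Twister; the seed, the sample size, and the function name `uniformity_chi2` are arbitrary assumptions made for repeatability.

```python
import random

def uniformity_chi2(samples, bins):
    """Bin samples from [0, 1) into equal-width bins and return (counts, chi2)."""
    counts = [0] * bins
    for u in samples:
        counts[min(int(u * bins), bins - 1)] += 1
    expected = len(samples) / bins
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return counts, chi2

rng = random.Random(12345)   # arbitrary seed, chosen only for repeatability
n, bins = 10_000, 10
counts, chi2 = uniformity_chi2([rng.random() for _ in range(n)], bins)

CRITICAL_5PCT_DF9 = 16.919   # 5% critical value for df = 9
verdict = "reject" if chi2 >= CRITICAL_5PCT_DF9 else "do not reject"
print(counts, round(chi2, 3), verdict)
```

A single run is only one test at the 5% level, so a sound generator will still be flagged about one time in twenty.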
Manufacturing quality control: Defects in a shift are sorted into {dimensional, surface, assembly, other}. Comparing observed counts against an expected mix from historical data (40:25:20:15, say) flags shifts in defect patterns. A rejection prompts a root-cause investigation — material change, process drift, operator drift. Used alongside SPC charts, it is a routine QC tool.
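Marketing / A/B testing: An e-commerce page rotates four banner designs and observes click counts (45, 38, 52, 41). Provided each design receives the same number of impressions, the null 'all four designs have equal click rate' is tested with χ²: E_i = 176/4 = 44 and χ² = (1 + 36 + 64 + 9)/44 = 2.5 < 7.815, so these counts do not reject the null. When a rejection does occur, it means at least one design differs significantly; multiple comparison procedures and confidence intervals then identify which one and by how much.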
Common misconceptions and caveats
The most common mistake is treating 'larger χ² = larger effect' at face value. χ² grows roughly linearly with sample size N, so χ² = 8 at N = 10 means something very different from χ² = 8 at N = 10000 — the latter merely 'detected' a tiny deviation. Always pair p-values with an effect size such as Cramér's V or Cohen's w to characterize the magnitude of the discrepancy. This tool spans small-to-medium totals (4 to 400), making the N effect easy to feel.
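For a goodness-of-fit test, Cohen's w is simply the square root of χ²/N, which removes the dependence on sample size. The sketch below applies it to the two χ² = 8 cases mentioned above; the function name is an illustrative assumption.

```python
import math

def cohens_w(chi2, n):
    """Effect size w = sqrt(chi2 / N) for a goodness-of-fit test."""
    return math.sqrt(chi2 / n)

for n in (10, 10_000):
    print(n, round(cohens_w(8.0, n), 3))
# prints: 10 0.894
#         10000 0.028
```

By Cohen's conventional benchmarks (0.1 small, 0.3 medium, 0.5 large), the first is a large discrepancy and the second is negligible, even though the χ² values are identical.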
Next, never forget the independence assumption. Pearson's χ² requires independent observations. Repeated measures on the same subject (before/after) need McNemar's test or Cochran's Q; clustered data (same family, machine, or production lot) inflate apparent significance, so χ² will look 'too significant' if the cluster effect is ignored.
Finally, beware the arbitrariness of category count k and binning. When discretizing a continuous variable into bins for a goodness-of-fit test, the bin width and bin count change χ² substantially. Sturges's rule and the Freedman–Diaconis rule provide data-driven defaults, but always document the binning choice when reporting results. With k = 4 fixed in this tool you do not face this issue, but it is a common blind spot in practice.
Enter observed counts O₁, O₂, O₃, O₄ in the sliders (integer values, typically 10–100 per category). These represent frequency data from your sample.
The simulator assumes equal expected frequencies: E_i = (ΣO_i)/4 for each category under the null hypothesis H₀.
Read χ² = Σ(O_i − E_i)²/E_i, degrees of freedom df = 3, and compare χ² against the 5% critical value (7.815). If χ² ≥ 7.815, reject H₀; otherwise fail to reject.
Worked Example
A manufacturing process produces four product variants. Observed counts in one shift: O₁=28, O₂=32, O₃=25, O₄=35 (total N=120). Expected count per variant: E_i=120/4=30. χ² = (28−30)²/30 + (32−30)²/30 + (25−30)²/30 + (35−30)²/30 = 0.133 + 0.133 + 0.833 + 0.833 = 1.933. With df=3, the 5% critical value is 7.815. Since 1.933 < 7.815, we fail to reject H₀: the observed counts are consistent with a uniform distribution at the 5% significance level, which is not the same as proving that the distribution is uniform.
Practical Notes
Minimum expected frequency rule: E_i ≥ 5 in each category. If any E_i < 5, combine categories or collect more data; otherwise the chi-squared approximation, and with it the reported p-value, becomes unreliable.
Use this test for quality control, genetic ratios (Mendelian inheritance), or survey response distributions. Not suitable for continuous data without binning.
Critical value 7.815 applies only when df=3. If you add/remove categories, df changes and critical value must be updated from χ² tables.
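If the number of categories changes, the critical value can be taken from a small lookup table instead of being hard-coded. The sketch below uses standard 5% upper-tail values for df = 1 to 10; the names `CRITICAL_5PCT` and `decide_uniform` and the five-category counts are illustrative assumptions.

```python
# Standard 5% upper-tail critical values of the chi-squared distribution
CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
                 6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307}

def decide_uniform(observed):
    """Return (chi2, df, reject) for a uniform null with k = len(observed) categories."""
    k = len(observed)
    df = k - 1
    expected = sum(observed) / k
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    return chi2, df, chi2 >= CRITICAL_5PCT[df]

print(decide_uniform([30, 25, 20, 25, 25]))  # hypothetical five-category counts
# prints: (2.0, 4, False)
```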