Principal Component Analysis (PCA) Simulator — Eigendecomposition for 2D Data

Visualize Principal Component Analysis on correlated 2D data. Compute principal axes and explained-variance ratios from the covariance-matrix eigendecomposition as you vary correlation and standard deviations.

Parameters
Correlation ρ
Std deviation σ_x
Std deviation σ_y
Sample size N (points)

Data are generated by a fixed-seed linear congruential generator (LCG) followed by a Box-Muller transform, so the same parameters always produce the same point cloud.
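The generation scheme can be sketched as follows. This is a minimal illustration, not the simulator's actual code: the LCG constants (the Numerical Recipes values), the seed, and the function names `make_lcg` and `generate_points` are assumptions, since the page does not state them.

```python
import math

def make_lcg(seed=12345):
    """Return a function producing uniform floats strictly inside (0, 1).

    Assumption: multiplier and increment are the Numerical Recipes constants;
    the simulator's real constants and seed are not given in the text.
    """
    state = seed

    def next_uniform():
        nonlocal state
        state = (1664525 * state + 1013904223) % 2**32
        # Shift by one so the result is never exactly 0 or 1 (log(0) is undefined).
        return (state + 1) / (2**32 + 1)

    return next_uniform

def generate_points(rho, sigma_x, sigma_y, n, seed=12345):
    """Generate n correlated Gaussian points; identical inputs give identical output."""
    uniform = make_lcg(seed)
    points = []
    for _ in range(n):
        u1, u2 = uniform(), uniform()
        # Box-Muller: two uniforms become two independent standard normals.
        radius = math.sqrt(-2.0 * math.log(u1))
        z1 = radius * math.cos(2.0 * math.pi * u2)
        z2 = radius * math.sin(2.0 * math.pi * u2)
        # Mix z1 and z2 so that x and y have correlation rho.
        x = sigma_x * z1
        y = sigma_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        points.append((x, y))
    return points

cloud = generate_points(rho=0.8, sigma_x=2.0, sigma_y=1.0, n=40)
```

Because the generator state is re-created from the seed on every call, moving a slider and moving it back reproduces the same cloud.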

Results
PC1 eigenvalue λ₁
PC2 eigenvalue λ₂
PC1 explained variance
PC1 axis angle
Scatter plot with principal axes

Grey dots = data points / White × = mean / Red arrow = PC1 axis (length √λ₁) / Blue arrow = PC2 axis / Green ellipse = 1σ covariance ellipse

Theory & Key Formulas

Given 2D data X = {(x_i, y_i)}, first center it by subtracting the mean and form the sample covariance matrix.

2×2 covariance matrix; σ_xx, σ_yy are variances, σ_xy is the covariance:

$$C = \begin{bmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy} \end{bmatrix}, \quad \sigma_{xy} = \frac{1}{n-1}\sum_i (x_i-\bar{x})(y_i-\bar{y})$$
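A small sketch of this step in plain Python, assuming the data is a list of `(x, y)` tuples; the helper name `covariance_matrix` is hypothetical.

```python
def covariance_matrix(points):
    """Return (s_xx, s_xy, s_yy) of the 2x2 sample covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Centering and the 1/(n-1) normalization follow the formula above.
    s_xx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return s_xx, s_xy, s_yy

# Three points lying exactly on y = 2x: variances 1 and 4, covariance 2.
print(covariance_matrix([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]))
```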

Closed-form 2×2 eigenvalues; T = trace, D = determinant:

$$\lambda_{1,2} = \frac{T \pm \sqrt{T^2 - 4D}}{2}, \quad T = \sigma_{xx}+\sigma_{yy}, \quad D = \sigma_{xx}\sigma_{yy} - \sigma_{xy}^2$$

Explained variance ratio of the i-th principal component:

$$r_i = \frac{\lambda_i}{\lambda_1 + \lambda_2}$$
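
As a worked check of these formulas, take the population values for ρ = 0.8, σ_x = 2, σ_y = 1, using the standard identities σ_xx = σ_x², σ_yy = σ_y² and σ_xy = ρ σ_x σ_y (sample values from the simulator will differ slightly):

$$\sigma_{xx} = 4, \quad \sigma_{yy} = 1, \quad \sigma_{xy} = 0.8 \cdot 2 \cdot 1 = 1.6$$

$$T = 4 + 1 = 5, \quad D = 4 \cdot 1 - 1.6^2 = 1.44$$

$$\lambda_{1,2} = \frac{5 \pm \sqrt{25 - 5.76}}{2} = \frac{5 \pm 4.386}{2} \;\Rightarrow\; \lambda_1 \approx 4.69, \quad \lambda_2 \approx 0.31$$

$$r_1 = \frac{4.69}{4.69 + 0.31} \approx 0.94$$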

Each eigenvector $v_i$ satisfies $(C - \lambda_i I) v_i = 0$. The eigenvector $v_1$ belonging to the larger eigenvalue $\lambda_1$ defines PC1, the direction of maximum variance; $v_2$ is orthogonal to it and defines PC2.
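
The closed-form eigenvalues and the PC1 axis angle can be computed together, as in the sketch below. The function name `pca_2x2` is hypothetical, and the angle uses the standard identity θ = ½·atan2(2σ_xy, σ_xx − σ_yy) for a symmetric 2×2 matrix, which may or may not be how the simulator computes it.

```python
import math

def pca_2x2(s_xx, s_xy, s_yy):
    """Eigenvalues, unit eigenvectors and PC1 angle (degrees) of a 2x2 covariance matrix."""
    trace = s_xx + s_yy
    det = s_xx * s_yy - s_xy ** 2
    # Guard against a tiny negative discriminant caused by rounding.
    root = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam1 = (trace + root) / 2.0
    lam2 = (trace - root) / 2.0
    # Angle of the PC1 axis measured from the x-axis.
    # In the degenerate case (s_xy = 0, s_xx = s_yy) atan2(0, 0) returns 0,
    # which is an arbitrary choice: every direction is an eigenvector there.
    theta = 0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))  # v1 rotated by 90 degrees
    return lam1, lam2, v1, v2, math.degrees(theta)

# Population covariance for rho = 0.8, sigma_x = 2, sigma_y = 1.
lam1, lam2, v1, v2, angle = pca_2x2(4.0, 1.6, 1.0)
print(f"lambda1={lam1:.2f} lambda2={lam2:.2f} angle={angle:.1f} deg")
```

For these inputs the eigenvalues come out near 4.69 and 0.31, with the PC1 axis tilted roughly 23° above the x-axis.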

What is the PCA Simulator?

🙋
I keep hearing about PCA but what is it actually doing? I just see two arrows on a scatter plot…
🎓
In a nutshell, PCA finds the direction along which the data spreads the most. Set the correlation ρ slider above to 0.8 — the cloud stretches into an ellipse pointing up-right. That stretching direction is the red arrow, the first principal component (PC1). It is the axis of maximum variance in the data.
🙋
And the blue arrow?
🎓
That is the second principal component, orthogonal to the first. In 2D there are only two. Check the explained variance card: with ρ=0.8, σ_x=2, σ_y=1, PC1 alone captures roughly 94% of the total variance. That is the idea behind dimensionality reduction: when the variables are strongly correlated, even 100-dimensional data can often be summarized by the top 2-3 components, which may explain more than 80% of the variance.
🙋
When I set correlation to 0, the arrows snap to the x and y axes. Is that automatic?
🎓
Yes. At ρ=0 with σ_x ≠ σ_y, the covariance matrix is diagonal, so the eigenvectors coincide with the standard axes. PC1 is whichever axis has the larger variance — so with σ_x=2 and σ_y=1, PC1 points along the x-axis. But if you set σ_x = σ_y and ρ=0, the matrix becomes a scalar multiple of identity and the principal directions become degenerate — they cannot be uniquely determined.
🙋
Reducing N to 10 makes the arrows and ellipse jitter quite a bit.
🎓
Nice observation. That is sampling noise. For ρ=0.8, σ_x=2, σ_y=1 the true eigenvalues are λ₁≈4.69 and λ₂≈0.31, but the sample estimates wobble at small N. A common rule of thumb is to have at least 10× as many samples as variables. Push N up to 200 and the sample estimates settle much closer to the theoretical values.
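
The effect of sample size can be reproduced outside the simulator with a short experiment. This sketch uses Python's built-in `random` module rather than the simulator's LCG, so the individual numbers will not match the page; the function name `sample_eigenvalues` and the seed are assumptions.

```python
import math
import random

def sample_eigenvalues(n, rho=0.8, sigma_x=2.0, sigma_y=1.0, seed=0):
    """Draw n correlated Gaussian points and return the sample (lambda1, lambda2)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(sigma_x * z1)
        ys.append(sigma_y * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    mx, my = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    s_yy = sum((y - my) ** 2 for y in ys) / (n - 1)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    trace, det = s_xx + s_yy, s_xx * s_yy - s_xy ** 2
    root = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    return (trace + root) / 2.0, (trace - root) / 2.0

# Compare against the theoretical values 4.69 and 0.31 as N grows.
for n in (10, 40, 200, 5000):
    lam1, lam2 = sample_eigenvalues(n)
    print(f"N={n:5d}  lambda1={lam1:.2f}  lambda2={lam2:.2f}")
```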

Frequently Asked Questions

PCA is used in image recognition (eigenfaces), natural language processing (latent semantic analysis), finance (factor extraction from stock returns), biology (gene expression visualization), quality control (multivariate control charts), and machine learning preprocessing (dimensionality reduction, denoising). Whenever the data are high-dimensional and the variables are correlated, PCA is one of the first methods to try. It has over a century of history and remains one of the most basic multivariate analysis techniques.
PCA by eigendecomposition of the covariance matrix and PCA by singular value decomposition (SVD) are mathematically equivalent. Decomposing the centered data matrix X_c = U Σ V^T yields the principal directions as the columns of V and the eigenvalues as the diagonal of Σ²/(n-1). In practice the SVD-based implementation is preferred: it avoids forming the covariance matrix and is numerically more stable, especially when the number of variables p exceeds the sample size n. scikit-learn's PCA uses SVD internally.
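
The equivalence follows from substituting the SVD into the covariance matrix. This sketch assumes X_c is the n × p centered data matrix with one sample per row, and uses $U^T U = I$:

$$C = \frac{1}{n-1} X_c^T X_c = \frac{1}{n-1} V \Sigma^T U^T U \Sigma V^T = V \left( \frac{\Sigma^T \Sigma}{n-1} \right) V^T$$

The right-hand side is an eigendecomposition of C: the columns of V are its eigenvectors, and each eigenvalue is $\lambda_i = s_i^2 / (n-1)$, where $s_i$ is the i-th singular value.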
Whitening means dividing each principal component score by √λ_i so that every component has unit variance. The result behaves like uncorrelated white noise. It is used as preprocessing for Independent Component Analysis (ICA), Gaussian processes, and neural network input normalization. However, dividing by the square root of a very small eigenvalue amplifies noise and is numerically unstable, so components below a threshold are discarded, or a small regularization term is added (ε-whitening).
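
For the 2D case, whitening can be sketched in plain Python as below. The function name `whiten_2d` and the default value of `eps` are assumptions chosen for illustration; `eps` plays the role of the small regularization term mentioned above.

```python
import math

def whiten_2d(points, eps=1e-12):
    """Project centered 2D points onto PC1/PC2 and scale each score to unit variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    s_xx = sum(x * x for x, _ in centered) / (n - 1)
    s_yy = sum(y * y for _, y in centered) / (n - 1)
    s_xy = sum(x * y for x, y in centered) / (n - 1)
    trace, det = s_xx + s_yy, s_xx * s_yy - s_xy ** 2
    root = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam1 = (trace + root) / 2.0
    lam2 = max((trace - root) / 2.0, 0.0)  # clamp rounding noise below zero
    theta = 0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy)
    c, s = math.cos(theta), math.sin(theta)
    whitened = []
    for x, y in centered:
        score1 = c * x + s * y    # coordinate along PC1
        score2 = -s * x + c * y   # coordinate along PC2
        # eps keeps the division finite when an eigenvalue is close to zero.
        whitened.append((score1 / math.sqrt(lam1 + eps),
                         score2 / math.sqrt(lam2 + eps)))
    return whitened

data = [(0.0, 0.0), (2.0, 1.0), (4.0, 3.0), (1.0, 2.0), (5.0, 4.0), (3.0, 1.0)]
white = whiten_2d(data)
```

After whitening, the sample covariance of `white` is approximately the identity matrix.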
When the variables have different units or scales, standardize first: convert each variable to zero mean and unit variance before applying PCA. For example, when mixing height, weight, and age, whichever variable has the largest variance in its chosen units would dominate PC1; simply recording height in mm instead of cm multiplies its variance by 100. After standardization all variables contribute on equal footing. This is sometimes called "PCA on the correlation matrix" as opposed to "PCA on the covariance matrix". Standardization may be unnecessary when all variables share the same physical units.
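
In the 2D setting of this page, the effect of standardization can be worked out with the same trace and determinant formula. After standardizing both variables, the covariance matrix becomes the correlation matrix (ρ here denotes the correlation of the data being analyzed):

$$R = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}, \quad T = 2, \quad D = 1 - \rho^2$$

$$\lambda_{1,2} = \frac{2 \pm \sqrt{4 - 4(1 - \rho^2)}}{2} = 1 \pm |\rho|, \quad r_1 = \frac{1 + |\rho|}{2}$$

For ρ = 0.8 this gives $r_1 = 0.9$, independent of σ_x and σ_y, and for ρ > 0 the PC1 axis lies at 45°. Compare this with the roughly 94% obtained on the unstandardized data with σ_x = 2, σ_y = 1, where the larger x variance pulls PC1 toward the x-axis.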

Real-World Applications

Image recognition (Eigenfaces): Proposed by Turk and Pentland in the early 1990s, Eigenfaces is one of the best-known PCA applications. Treating face images as high-dimensional vectors and applying PCA reveals dominant variation patterns (lighting, hairstyle, expression). Recognition is performed by projecting a query face onto the principal components and measuring its distance to registered faces. It was a standard method before deep learning.

Gene expression visualization: PCA on the tens-of-thousands-of-dimensional expression vectors from microarrays or RNA-seq produces 2-3D scatter plots of samples (patients, tissues, conditions). It is the standard "first plot" in bioinformatics papers — useful for cancer subtype clustering, drug response comparison, batch effect diagnosis, and more.

Financial factor models: PCA on the covariance matrix of stock returns typically gives a PC1 reflecting overall market movement, PC2 capturing sector-relative movement, PC3 capturing size effects, and so on. This forms the basis of portfolio optimization and risk modeling.

Quality control and anomaly detection: Apply PCA to multivariate sensor data from a manufacturing process and keep only the top components as a "normal state" model. When new data arrive, large values of the Hotelling T² statistic (variance-scaled distance within the retained PC subspace) or the Q statistic (squared residual left in the discarded components) flag anomalies. This approach is used in production in semiconductor and chemical plants.
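
A toy 2D version of this monitoring scheme, keeping only PC1 as the normal-state model, might look like the sketch below. The function names `fit_pca_monitor` and `t2_and_q` and the training data are made up for illustration; real deployments use many more variables and derive alarm thresholds from the training data, which is not shown here.

```python
import math

def fit_pca_monitor(points):
    """Fit a one-component 'normal state' model to 2D training data."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    s_xx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    trace, det = s_xx + s_yy, s_xx * s_yy - s_xy ** 2
    lam1 = (trace + math.sqrt(max(trace * trace - 4.0 * det, 0.0))) / 2.0
    theta = 0.5 * math.atan2(2.0 * s_xy, s_xx - s_yy)
    return {"mean": (mx, my), "v1": (math.cos(theta), math.sin(theta)), "lam1": lam1}

def t2_and_q(model, point):
    """Return (T2, Q) for a new observation under the one-component model."""
    dx = point[0] - model["mean"][0]
    dy = point[1] - model["mean"][1]
    score1 = dx * model["v1"][0] + dy * model["v1"][1]
    t2 = score1 ** 2 / model["lam1"]                  # scaled distance along PC1
    q = max(dx * dx + dy * dy - score1 ** 2, 0.0)     # squared residual off PC1
    return t2, q

# Hypothetical training data lying close to a line through the origin.
train = [(-2.0, -1.1), (-1.0, -0.4), (0.0, 0.1), (1.0, 0.5), (2.0, 0.9)]
model = fit_pca_monitor(train)
print(t2_and_q(model, (4.0, 2.0)))    # far along the normal direction: large T2, small Q
print(t2_and_q(model, (-1.0, 2.0)))   # off the normal direction: small T2, large Q
```

The two statistics are complementary: T² catches unusually large excursions along the modeled direction, while Q catches observations that break the correlation structure itself.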

Common Pitfalls and Notes

The most common misconception is to assume principal components must have physical meaning. PCs are defined purely mathematically as directions of maximum variance and do not necessarily admit a clean interpretation. Sometimes PC1 of a customer survey conveniently represents "overall satisfaction" and PC2 a "price vs quality tradeoff", but often they are not interpretable. Always check the loadings (coefficients on the original variables) to understand which linear combination each PC represents.

A second frequent failure mode is applying PCA to nonlinear structure. PCA is a linear method, so data arranged on a curved manifold (e.g. a Swiss roll) cannot be captured by straight principal axes. The simulator shows a related limitation: when ρ approaches 0 and σ_x ≈ σ_y the cloud is nearly circular and the principal direction becomes ambiguous. For ring-shaped or complex clustered data, consider Kernel PCA, t-SNE, UMAP, or autoencoders. "PCA then ML" is an excellent baseline but is not universal.
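
The ring-shaped case is easy to check numerically. In this sketch (the function name `explained_variance_pc1` is hypothetical) the points lie on a perfect circle, so the data is effectively one-dimensional along the curve, yet PCA splits the variance evenly between the two axes.

```python
import math

def explained_variance_pc1(points):
    """Explained-variance ratio of PC1 for a list of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    s_xx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    trace, det = s_xx + s_yy, s_xx * s_yy - s_xy ** 2
    root = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    lam1, lam2 = (trace + root) / 2.0, (trace - root) / 2.0
    return lam1 / (lam1 + lam2)

# 360 evenly spaced points on the unit circle.
n = 360
ring = [(math.cos(2.0 * math.pi * k / n), math.sin(2.0 * math.pi * k / n))
        for k in range(n)]
ratio = explained_variance_pc1(ring)
print(f"PC1 explained variance on a ring: {ratio:.3f}")
```

A ratio close to 0.5 means neither axis is preferred, so projecting onto PC1 discards half the variance without revealing the ring.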

Finally, do not underestimate the gap between sample and population values. Even in this simulator, eigenvalues computed from N=40 samples deviate by a few percent from the theoretical values (λ₁=4.69, λ₂=0.31), and the deviation grows further at N=10. In practice, when deciding how many components to retain or whether an observed explained variance is statistically significant, bootstrap or cross-validation should be used to obtain confidence intervals for the eigenvalues. PCA on small samples should report uncertainty alongside the point estimates.
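
A percentile bootstrap for λ₁ can be sketched as follows. The function names `top_eigenvalue` and `bootstrap_ci`, the number of resamples, and the seed are assumptions for illustration; other bootstrap variants (for example bias-corrected intervals) may be more appropriate in practice.

```python
import math
import random

def top_eigenvalue(points):
    """Largest eigenvalue of the 2x2 sample covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    s_xx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    trace, det = s_xx + s_yy, s_xx * s_yy - s_xy ** 2
    return (trace + math.sqrt(max(trace * trace - 4.0 * det, 0.0))) / 2.0

def bootstrap_ci(points, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for lambda1."""
    rng = random.Random(seed)
    n = len(points)
    # Resample n points with replacement and recompute lambda1 each time.
    estimates = sorted(
        top_eigenvalue([points[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo_idx = max(round((1.0 - level) / 2.0 * n_boot), 0)
    hi_idx = min(round((1.0 + level) / 2.0 * n_boot) - 1, n_boot - 1)
    return estimates[lo_idx], estimates[hi_idx]
```

Calling `bootstrap_ci(points)` on an N=40 sample returns a `(low, high)` pair that can be reported next to the point estimate `top_eigenvalue(points)`; the interval widens noticeably as N shrinks.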