1D Linear Autoencoder Simulator — Compression Equivalent to PCA

Compress a 10-dim signal through a K-dim bottleneck and decode it linearly. After SGD training a linear AE behaves like PCA on the top K principal components.

Parameters
Bottleneck dim K
Learning rate η
Iterations (steps)
Noise σ

Input dim D = 10 fixed. The dataset is a 3-dim intrinsic signal plus noise, with N = 50 samples for training.
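A dataset of this kind can be sketched in a few lines of NumPy. This is an illustration only: the page does not say how the simulator draws its data, so the orthonormal mixing matrix, the unit-variance latent coordinates, the default `noise_sigma`, and the function name `make_dataset` are all assumptions.

```python
import numpy as np

def make_dataset(n_samples=50, input_dim=10, intrinsic_dim=3, noise_sigma=0.1, seed=0):
    """Return an (n_samples, input_dim) array: a low-dim signal embedded linearly, plus noise."""
    rng = np.random.default_rng(seed)
    # Assumed mixing matrix: orthonormal columns spanning the signal subspace.
    basis, _ = np.linalg.qr(rng.normal(size=(input_dim, intrinsic_dim)))
    latent = rng.normal(size=(n_samples, intrinsic_dim))          # intrinsic coordinates
    signal = latent @ basis.T                                     # embed into R^D
    noise = noise_sigma * rng.normal(size=(n_samples, input_dim)) # isotropic Gaussian noise
    X = signal + noise
    return X - X.mean(axis=0)                                     # centre the data, as PCA assumes

X = make_dataset()
print(X.shape)
```

With `noise_sigma=0.0` the centred data has rank 3, which is what makes a K = 3 bottleneck sufficient.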

Results
Bottleneck dim K
Final reconstruction MSE
Explained variance (1 − MSE/Var)
Compression ratio K/D
Encode / bottleneck / reconstruction and learning curve

Top: input x (blue) / Middle: bottleneck z / Bottom: reconstruction x̂ (orange, overlaid on input) / Right: iterations vs MSE
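The three result figures can be computed as below. This is a sketch that assumes MSE and Var are both averaged over all samples and coordinates; the page gives only the formula 1 − MSE/Var, so the exact normalisation used by the simulator is not known, and `reconstruction_metrics` is a hypothetical helper name.

```python
import numpy as np

def reconstruction_metrics(X, X_hat, k):
    """Return (mse, explained_variance, compression_ratio) for a reconstruction X_hat of X."""
    mse = np.mean((X - X_hat) ** 2)              # averaged over samples and coordinates
    var = np.mean((X - X.mean(axis=0)) ** 2)     # total variance on the same scale
    return mse, 1.0 - mse / var, k / X.shape[1]

X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
mean_only = np.tile(X.mean(axis=0), (3, 1))      # always predicting the mean explains nothing
print(reconstruction_metrics(X, X, k=1))         # perfect reconstruction
print(reconstruction_metrics(X, mean_only, k=1))
```

A perfect reconstruction gives an explained variance of 1, and a reconstruction that always outputs the data mean gives 0.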

Theory & Key Formulas

A 1D linear autoencoder compresses an input $x \in \mathbb{R}^D$ with the encoder matrix $W_e \in \mathbb{R}^{K \times D}$ into a low-dimensional latent code $z$, then decodes it back through $W_d \in \mathbb{R}^{D \times K}$.

Encoding, decoding, and the reconstruction loss:

$$z = W_e\,x, \qquad \hat{x} = W_d\,z = W_d\,W_e\,x$$ $$L = \lVert x - \hat{x} \rVert^2$$

SGD updates (learning rate $\eta$):

$$W_e \leftarrow W_e + 2\eta\,W_d^{\top}(x - \hat{x})\,x^{\top}$$ $$W_d \leftarrow W_d + 2\eta\,(x - \hat{x})\,z^{\top}$$
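The update rules above translate almost line by line into NumPy. The following is a minimal sketch, not the simulator's code: the synthetic data, the small random initialisation, and the values of `eta`, `n_steps` and `sigma` are assumptions chosen so that the run is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 10, 3, 50                       # input dim, bottleneck dim, sample count (from the text)
eta, n_steps, sigma = 0.01, 4000, 0.1     # illustrative values, not the simulator's defaults

# Synthetic data: 3-dim signal in a random orthonormal basis plus isotropic noise.
basis, _ = np.linalg.qr(rng.normal(size=(D, 3)))
X = rng.normal(size=(N, 3)) @ basis.T + sigma * rng.normal(size=(N, D))
X -= X.mean(axis=0)

# Small random initialisation (an assumption; all-zero weights would give zero gradients).
W_e = 0.1 * rng.normal(size=(K, D))
W_d = 0.1 * rng.normal(size=(D, K))

def dataset_mse(W_e, W_d):
    """Mean squared reconstruction error over the whole dataset."""
    return np.mean((X - X @ W_e.T @ W_d.T) ** 2)

mse_history = [dataset_mse(W_e, W_d)]
for step in range(n_steps):
    x = X[rng.integers(N)]                # draw one training sample
    z = W_e @ x                           # encode
    err = x - W_d @ z                     # residual x - x_hat
    grad_e = np.outer(W_d.T @ err, x)     # W_d^T (x - x_hat) x^T, computed with the old W_d
    grad_d = np.outer(err, z)             # (x - x_hat) z^T
    W_e += 2 * eta * grad_e
    W_d += 2 * eta * grad_d
    mse_history.append(dataset_mse(W_e, W_d))

explained = 1 - mse_history[-1] / np.mean(X ** 2)
print(f"MSE {mse_history[0]:.4f} -> {mse_history[-1]:.4f}, explained variance {explained:.3f}")
```

Both gradients are computed before either matrix is changed, so the encoder update uses the same $W_d$ that produced $\hat{x}$.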

At convergence, $W_d W_e \approx V V^{\top}$, where the columns of $V \in \mathbb{R}^{D \times K}$ are the top $K$ principal directions of the zero-mean data, so the linear AE acts as an orthogonal projection onto the leading PCA subspace.
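The target projection $V V^{\top}$ can be built directly from an SVD of the centred data, which gives a reference to compare trained weights against. The sketch below uses made-up data (random mixing matrix, noise level 0.1) and checks two standard properties: the projection is idempotent, and its residual equals the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3
# Hypothetical data: rank-3 signal plus a little noise, then centred.
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(50, 10))
X -= X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:K].T                       # D x K, orthonormal columns
P = V @ V.T                        # the projection that W_d W_e should approach

residual = np.sum((X - X @ P) ** 2)
print(np.allclose(P @ P, P))                       # a projection applied twice changes nothing
print(np.isclose(residual, np.sum(s[K:] ** 2)))    # error = discarded singular values squared
```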

What is the 1D Linear Autoencoder Simulator?

🙋
An autoencoder is just a network that copies its input to its output, right? What is interesting about that?
🎓
Roughly, the trick is the narrow "bottleneck" layer in the middle. For example, we train the network to compress a 10-dim input into just 3 dims and then expand it back to 10. If the data really used all 10 degrees of freedom we could not recover it; but if the data actually lies close to a thin 3-dim subspace, the reconstruction is nearly perfect. Press "Retrain" with K = 3 in the simulator and, with the noise kept low, you should see the explained variance go above 99%.
🙋
I see! When I drop K to 1 the explained variance plummets. Does that mean cramming 3-dim information into a single number is impossible?
🎓
Exactly. A single number can keep only the variance along the first principal direction; the variance along the other two directions is discarded and shows up as reconstruction error (two thirds of the signal variance if the three components are equally strong). For a linear AE trained with MSE, the optimal reconstruction is provably the projection onto the top K principal components, so a linear AE is essentially PCA.
🙋
Wait, if it equals PCA, why bother with an autoencoder at all?
🎓
Great question. Linearly, yes, PCA is enough. But the moment you insert non-linear activations (ReLU, tanh, etc.) in the encoder and decoder, you can compress along a curved manifold. MNIST digits, for instance, live on a thin curved manifold inside a 784-dim space that a flat subspace cannot capture. In practice deep autoencoders are used for image compression, anomaly detection, denoising and generative modelling (VAEs).
🙋
When I raise the noise slider the explained variance falls even with K = 3. What is happening?
🎓
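That is the "noise floor". The variance of the true 3-dim component is fixed, while added noise raises the total variance. A K = 3 autoencoder can capture only the 3-dim signal subspace, so the noise in the remaining D − K = 7 directions stays as reconstruction error. For isotropic noise this floor is of order $\sigma^2$: roughly $(D-K)\,\sigma^2$ summed over coordinates, or $(D-K)/D \cdot \sigma^2$ when averaged per coordinate. Conversely, this means the autoencoder is discarding the noise that lies outside the signal subspace, which is the intuition behind using autoencoders for denoising.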
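The noise floor from the last answer can be checked numerically. The sketch below replaces SGD with the closed-form SVD projection and uses far more samples than the simulator's 50 so that sampling error is small; the isotropic Gaussian noise and the per-coordinate averaging of the MSE are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N, sigma = 10, 3, 20000, 0.5     # large N (not the simulator's 50) to reduce sampling error

basis, _ = np.linalg.qr(rng.normal(size=(D, K)))
X = rng.normal(size=(N, K)) @ basis.T + sigma * rng.normal(size=(N, D))
X -= X.mean(axis=0)

# Optimal rank-K linear reconstruction via SVD, standing in for a converged linear AE.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:K].T
X_hat = X @ V @ V.T

mse_per_entry = np.mean((X - X_hat) ** 2)
predicted = (D - K) / D * sigma ** 2   # noise left in the D - K discarded directions
print(f"measured {mse_per_entry:.4f}, predicted {predicted:.4f}")
```

The measured value sits close to $(D-K)/D \cdot \sigma^2$ because the noise inside the 3-dim subspace is reproduced by the reconstruction and therefore does not count as error against the noisy input.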

Frequently Asked Questions

What happens if the learning rate is too large?
Too large a learning rate makes gradient descent overshoot the loss valley, causing the MSE to grow or oscillate. The squared-error loss grows without bound as the weights run away, so an extreme η can diverge and produce NaN. In this simulator η = 0.05 is the default; values above about 0.3 may show diverging behaviour. If training becomes unstable, bring η back to the 0.01–0.05 range.
How many iterations are needed?
SGD approaches the optimum through many small, noisy steps, so too few iterations leave the weights far from it. In this simulator, reducing iterations to 10 makes the explained variance drop sharply even with K = 3. About 500–1000 iterations is a reasonable target; check that the learning curve has flattened. A linear AE also has a closed-form SVD solution, but here we train by SGD to mirror deep-learning practice.
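Do the learned weights equal the principal components?
Not necessarily. The optimum of a linear AE pins down only the subspace spanned by the top K principal components; within that subspace any invertible change of basis (rotation, scaling or shear) is allowed, as long as W_d applies the inverse. Many different (W_e, W_d) pairs therefore realise the same projection. To recover the actual principal components you must post-process, for example by orthonormalising the columns of W_d and then diagonalising the covariance of the data projected onto that basis.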
What happens if K is larger than the intrinsic dimension?
Increasing K beyond 3 (to 5 or 9) lowers the reconstruction MSE only slightly: there is only 3-dim structure plus noise, so the extra dimensions merely start absorbing noise. An over-large bottleneck can lead to "memorising the noise" and hurt generalisation to new data. This is one reason why keeping K small acts as regularisation in autoencoder design.
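On the question of whether learned weights match the principal components, the ambiguity and one possible post-processing route can be sketched as follows. Everything here is illustrative: the "trained" weights are simulated by scrambling the PCA solution with a hand-picked invertible matrix `A`, and the QR-then-eigendecomposition recipe is one option rather than a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 200, 10, 3
latent = rng.normal(size=(N, K)) * np.array([3.0, 2.0, 1.0])   # distinct variances
X = latent @ rng.normal(size=(K, D)) + 0.1 * rng.normal(size=(N, D))
X -= X.mean(axis=0)

# Reference PCA directions from the SVD of the centred data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:K].T

# Stand-in for trained weights: same projection, scrambled by an invertible A (det = 7).
A = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0]])
W_e = A @ V.T
W_d = V @ np.linalg.inv(A)
print(np.allclose(W_d @ W_e, V @ V.T), np.allclose(W_e, V.T))  # same projection, different factors

# Step 1: orthonormal basis of the decoder's column space.
Q, _ = np.linalg.qr(W_d)
# Step 2: diagonalise the covariance of the data expressed in that basis.
Y = X @ Q
eigvals, U = np.linalg.eigh(Y.T @ Y / (N - 1))   # eigenvalues in ascending order
components = Q @ U[:, ::-1]                      # columns sorted by decreasing variance

# Each recovered column matches a PCA direction up to sign.
alignment = np.abs(np.sum(components * V, axis=0))
print(alignment)
```

The orthonormalisation step matters because a general `A` is not a rotation, so the rows of `W_e` are neither orthogonal nor unit length.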

Real-world Applications

Anomaly detection: When an autoencoder is trained only on normal data, normal inputs reconstruct with low error and anomalies stand out by their high error. This pattern is widely used for vibration data on production lines, medical imaging and network logs — situations where labelled anomalies are scarce. The "intrinsic dimension plus noise" structure illustrated here is the foundation.
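A reconstruction-error detector of this kind can be sketched with the linear model alone. The data, the choice of a 99th-percentile threshold, and the way anomalies are generated (isotropic points that ignore the subspace) are all assumptions made for illustration; the SVD projection stands in for a trained linear AE.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, sigma = 10, 3, 0.1
basis, _ = np.linalg.qr(rng.normal(size=(D, K)))     # hypothetical "normal" subspace

# Normal data lies near the K-dim subspace; anomalies are isotropic and ignore it.
normal_train = rng.normal(size=(500, K)) @ basis.T + sigma * rng.normal(size=(500, D))
normal_test = rng.normal(size=(200, K)) @ basis.T + sigma * rng.normal(size=(200, D))
anomalies = rng.normal(size=(20, D))

# Fit on normal data only.
mean = normal_train.mean(axis=0)
_, _, Vt = np.linalg.svd(normal_train - mean, full_matrices=False)
V = Vt[:K].T

def reconstruction_error(X):
    """Squared distance of each row from its projection onto the learned subspace."""
    Xc = X - mean
    return np.sum((Xc - Xc @ V @ V.T) ** 2, axis=1)

# Threshold: 99th percentile of the training errors (an illustrative choice).
threshold = np.percentile(reconstruction_error(normal_train), 99)
false_alarm_rate = np.mean(reconstruction_error(normal_test) > threshold)
detection_rate = np.mean(reconstruction_error(anomalies) > threshold)
print("false alarms:", false_alarm_rate, "detected:", detection_rate)
```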

Image and video compression: Learned image compression with deep autoencoders is being studied as a successor to JPEG/MPEG. The encoder turns an image into a compact latent representation that the decoder restores. The linear AE is the starting point: at its optimum it performs the same subspace projection as PCA compression, which also underlies the Eigenface method for face recognition.

Representation learning and feature extraction: Autoencoders learn meaningful features from large unlabelled corpora and are then used as a pre-training step for downstream classification or regression. Modern self-supervised models such as BERT and GPT inherit, in a broad sense, the autoencoder idea of "predicting parts of the input".

Foundation for generative models (VAE, diffusion): Variational autoencoders (VAEs) extend the idea by giving the latent space a probabilistic structure so that new samples can be generated. Recent diffusion models are also closely connected to autoencoder-style thinking in a latent space. The linear AE is the most basic building block under all of these.

Common Misconceptions and Pitfalls

The most common misconception is to assume that "autoencoders can magically compress any information". A linear bottleneck of width K can represent only a K-dim subspace, so any K below the intrinsic dimension necessarily loses information. Setting K = 1 or K = 2 in this simulator immediately shows the explained variance collapse. "Estimating the intrinsic dimension of the data" is the proper starting point for autoencoder design.
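How far the explained variance falls for small K depends on how the variance is spread across the principal components. A small sketch, with two made-up eigenvalue spectra and a hypothetical helper `pca_explained_variance`:

```python
import numpy as np

def pca_explained_variance(eigenvalues, k):
    """Fraction of total variance kept by the top-k principal components."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest first
    return lam[:k].sum() / lam.sum()

equal = [1.0, 1.0, 1.0]       # three equally strong components (hypothetical)
skewed = [4.0, 1.0, 0.25]     # one dominant component (hypothetical)
for k in (1, 2, 3):
    print(k, round(pca_explained_variance(equal, k), 3),
          round(pca_explained_variance(skewed, k), 3))
```

With equal components, K = 1 keeps one third of the variance; with a dominant first component it keeps considerably more, so the size of the collapse is a property of the data and not only of K.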

The next pitfall is to think of linear AEs and PCA as separate methods. As shown above, the optimum of a linear AE trained with MSE coincides with the projection onto the top K principal components of PCA. So "why use a linear AE" comes down to its pedagogical role as a gateway to non-linear AEs and to the scalability of SGD on huge datasets. For data small enough to decompose directly, SVD-based PCA is usually sufficient.

Finally, "pushing reconstruction error to zero is always desirable" is dangerous. With noisy data, driving the error to zero forces the network to memorise the noise and ruins generalisation. If you keep σ at 0.5 and raise K in this simulator, the error keeps shrinking but "noise memorisation" sets in. In practice you must choose the bottleneck dimension, learning rate and iterations carefully, and monitor reconstruction error on a held-out validation set.
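The "noise memorisation" effect can be made visible in a simulation where the clean signal is known. The sketch below sweeps K using the closed-form PCA projection in place of SGD and reports two held-out errors: against the noisy input and against the noise-free signal. The data model and sample sizes are assumptions, and the second error is available only because the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_train, N_test, sigma = 10, 50, 5000, 0.5
basis, _ = np.linalg.qr(rng.normal(size=(D, 3)))      # hypothetical signal subspace

def sample(n):
    """Return (clean, noisy) arrays of shape (n, D)."""
    clean = rng.normal(size=(n, 3)) @ basis.T
    return clean, clean + sigma * rng.normal(size=(n, D))

_, X_train = sample(N_train)
clean_test, X_test = sample(N_test)

# Fit the principal directions on the noisy training set only.
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)

err_vs_input, err_vs_clean = [], []
for K in range(1, D):
    V = Vt[:K].T
    X_hat = (X_test - mean) @ V @ V.T + mean          # reconstruct held-out noisy inputs
    err_vs_input.append(np.mean((X_test - X_hat) ** 2))
    err_vs_clean.append(np.mean((clean_test - X_hat) ** 2))
    print(K, round(err_vs_input[-1], 4), round(err_vs_clean[-1], 4))
```

For a linear projection the error against the noisy input keeps falling as K grows, even on held-out data, because each extra direction reproduces a little more of the input. The error against the clean signal is what rises again once K exceeds the intrinsic dimension, since the extra directions pass noise through to the output.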