Information Theory
Mutual Information (MI) Estimation Simulator
An interactive playground for I(X;Y), the information-theoretic measure of dependence that captures both linear and non-linear relations. Compare four estimators — KSG (k-NN), histogram binning, KDE and the closed-form Gaussian solution — and watch their bias and confidence intervals respond live to sample size N and correlation ρ.
Parameters
Sample size N
Draws from the bivariate Gaussian
Correlation ρ
Pearson correlation between X and Y. ±1 = full dependence
Estimator
Algorithm used to estimate MI from samples
Neighbours k (KSG)
Number of neighbours for KSG. 3-5 is typical
Bins (Binning)
Per-axis bin count for histogram binning
Joint entropy baseline
Reference H(X,Y) scale in nats
Results
Exact MI (nats)
Exact MI (bits)
Estimated MI
Estimator bias
95% CI width
Relative error (%)
Scatter (X, Y) and joint distribution
Samples drawn from a bivariate Gaussian (μ=0, σ=1, correlation ρ) with marginal histograms on each axis. Moving the ρ slider tilts the cloud and changes MI.
Definition of mutual information for continuous variables, and the closed-form solution for the bivariate Gaussian. Use log₂ for bits, ln for nats. I = 0 iff X, Y are independent; I → ∞ for full dependence.
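Written out, the two formulas this caption describes take the standard form below. This is a reconstruction from the textbook definitions, using the same p(x,y), p(x), p(y) and ρ that appear elsewhere on this page; the base of the logarithm sets the unit (ln for nats, log₂ for bits).

```latex
I(X;Y) = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy
       = D_{\mathrm{KL}}\bigl(p(x,y)\,\|\,p(x)\,p(y)\bigr)

% Bivariate Gaussian with correlation \rho:
I(X;Y) = -\tfrac{1}{2}\,\log\bigl(1-\rho^{2}\bigr)
```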
The original Kraskov-Stögbauer-Grassberger (2004) k-NN estimator. ψ is the digamma function and n_x, n_y are marginal neighbour counts. This tool uses a Gaussian-calibrated closed form for bias and variance.
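For reference, the first KSG estimator from the 2004 paper is usually written as follows. Here ε(i)/2 is the max-norm distance from point i to its k-th nearest neighbour in the joint (x, y) space, and n_x(i), n_y(i) count the other points whose x (respectively y) coordinate lies strictly within that distance of point i.

```latex
\hat{I}_{\mathrm{KSG}}(X;Y) = \psi(k) + \psi(N)
  - \frac{1}{N}\sum_{i=1}^{N}
    \Bigl[\psi\bigl(n_x(i)+1\bigr) + \psi\bigl(n_y(i)+1\bigr)\Bigr]
```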
Mutual Information (MI) Estimation — Information Theory & Feature Selection
🙋
Is "mutual information" basically a fancier version of the correlation coefficient? When does plain ρ actually fall short?
🎓
Nice place to start. Pearson's ρ only sees straight-line relationships. Take Y = X² with X symmetric around zero: Y is large whether X is strongly negative or strongly positive, so the positive and negative contributions cancel and the correlation comes out as zero. Yet Y is clearly determined by X, right? Mutual information I(X;Y) catches exactly those non-linear ties. Formally it's the KL divergence between the joint p(x,y) and the product p(x)p(y): "how far is the joint from being independent?" Linear or non-linear, discrete or continuous, MI doesn't care, and that's its real power.
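The Y = X² case is easy to check numerically. The sketch below is an illustration only, not the simulator's code: it uses a simple histogram plug-in estimate (the helper `histogram_mi`, the bin count of 16 and the noise level are all assumptions made for the demo) and compares it with Pearson's ρ on the same samples.

```python
import numpy as np

def histogram_mi(x, y, bins=16):
    """Plug-in MI estimate in nats from an equal-width 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()                # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y, shape (1, bins)
    nz = pxy > 0                               # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
y = x**2 + 0.1 * rng.standard_normal(5000)     # Y is (almost) a function of X

rho = np.corrcoef(x, y)[0, 1]
mi = histogram_mi(x, y)
print(f"Pearson rho = {rho:+.3f}, histogram MI = {mi:.3f} nats")
```

The correlation should land close to zero while the MI estimate stays clearly positive, which is the gap the dialogue describes.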
🙋
Then why does most applied work still use correlation instead of MI?
🎓
Simple answer: MI is hard to estimate. ρ is one formula away, but MI requires you to estimate the joint p(x,y) from finite data. A naive histogram estimate picks up a positive bias of roughly (B−1)²/(2N) for B bins per axis, so it blows up as you add bins; that is the classical Miller-Madow bias term. Kraskov, Stögbauer and Grassberger fixed much of this in 2004 with KSG, which uses the k-th nearest-neighbour distance around each point and cuts the bias sharply. Switch "Estimator" on the left and you'll see the same ρ = 0.6 produces noticeably different estimates and biases.
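How the histogram bias grows with the bin count can be seen on independent data, where the true MI is zero and whatever the estimator returns is pure bias. The sketch below is a minimal experiment under stated assumptions: it draws independent uniform samples (so every bin is populated), averages the plug-in estimate over repeated trials, and compares it with the first-order term (B−1)²/(2N) for B bins per axis. The function names and trial counts are illustrative choices.

```python
import numpy as np

def histogram_mi(x, y, bins):
    """Plug-in MI estimate (nats) from an equal-width 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mean_mi_independent(n_samples, bins, trials=200, seed=0):
    """Average plug-in MI over many independent (x, y) draws; true MI is 0."""
    rng = np.random.default_rng(seed)
    estimates = [
        histogram_mi(rng.random(n_samples), rng.random(n_samples), bins)
        for _ in range(trials)
    ]
    return float(np.mean(estimates))

N = 2000
for B in (5, 10, 20):
    observed = mean_mi_independent(N, B)
    predicted = (B - 1) ** 2 / (2 * N)
    print(f"B={B:2d}  observed bias={observed:.4f}  (B-1)^2/(2N)={predicted:.4f}")
```

The first-order term is only an approximation; it tends to be most accurate when each cell holds a reasonable number of samples.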
🙋
Right, the KSG → binning switch immediately makes the bias balloon. With k = 5 fixed and N going from 500 to 5000 the confidence interval shrinks a lot. Can researchers just keep cranking N up?
🎓
In theory yes, but in practice the fields where MI is most needed are exactly the ones where N is tiny. fMRI gives you a few hundred time points per session, gene expression maybe a few dozen samples. So "how do I get a good MI estimate from very few samples" is a research field of its own. The standard toolkit is: (1) KSG with k = 3-5, (2) shuffle tests for significance, (3) bootstrap CIs. Look at the "Estimation error vs N" chart: error falls roughly as 1/√N.
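A percentile bootstrap is one simple way to attach a confidence interval to an MI estimate and to see the 1/√N shrinkage. The sketch below is an assumption-laden illustration: to keep it short it uses the Gaussian plug-in estimator −½·ln(1 − r²) rather than KSG, and the helper names (`gaussian_mi`, `bootstrap_ci`, `sample_pair`) are made up for this example.

```python
import numpy as np

def gaussian_mi(x, y):
    """Gaussian plug-in MI (nats): -0.5 * ln(1 - r^2) with r the sample correlation."""
    r = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - r**2)

def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample (x_i, y_i) pairs with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = gaussian_mi(x[idx], y[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def sample_pair(n, rho, rng):
    """Draw n points from a standard bivariate Gaussian with correlation rho."""
    x = rng.standard_normal(n)
    y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    return x, y

rng = np.random.default_rng(1)
widths = {}
for n in (500, 5000):
    x, y = sample_pair(n, 0.6, rng)
    lo, hi = bootstrap_ci(x, y)
    widths[n] = hi - lo
    print(f"N={n:5d}  95% CI = [{lo:.3f}, {hi:.3f}]  width = {widths[n]:.3f}")
```

Going from N = 500 to N = 5000 should shrink the interval by a factor in the neighbourhood of √10 ≈ 3.2, in line with the 1/√N behaviour mentioned above.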
🙋
Could you spell out where MI really earns its keep? I've heard it's used for feature selection in machine learning.
🎓
The canonical example is mRMR (minimum Redundancy Maximum Relevance). Picture picking 20 genes out of 1000 to predict a disease. The algorithm greedily picks features with high MI to the target Y and low MI to already-selected features. Doing the same with correlation only removes linear redundancy; MI also strips redundancy from "Xj contains Xi²" type non-linear duplication. Independent Component Analysis (ICA) is literally defined as minimising MI between recovered signals. Then there's neuroscience spike-information, gene regulatory networks, multimodal image registration, channel capacity — anywhere information theory shows up, MI shows up too.
🙋
One last thing. Even with independent variables (ρ = 0) the estimate isn't quite zero. Bug?
🎓
Not a bug — it's baked into MI estimation. With finite samples, p(x,y) will randomly look a bit non-independent, and that pushes the estimate positive. We call it finite-sample bias. So saying "MI = 0.01, hence dependent" is risky. Instead, build a null distribution by shuffling X, estimating MI again, repeating many times, and compare your raw value against the 95th percentile. Papers reporting MI usually publish both raw and shuffle-corrected values.
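The shuffle procedure described here can be sketched in a few lines. This is a minimal version under assumptions: the histogram estimator, the bin count and the helper names (`histogram_mi`, `shuffle_test`) are illustrative, and any other estimator with the same `(x, y)` call shape could be passed in instead.

```python
import numpy as np

def histogram_mi(x, y, bins=8):
    """Plug-in MI estimate (nats) from an equal-width 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def shuffle_test(x, y, estimator=histogram_mi, n_shuffles=200, seed=0):
    """Compare the raw MI against a null built by permuting x."""
    rng = np.random.default_rng(seed)
    raw = estimator(x, y)
    # Permuting x destroys any x-y dependence but keeps both marginals intact.
    null = np.array([estimator(rng.permutation(x), y) for _ in range(n_shuffles)])
    threshold = float(np.quantile(null, 0.95))
    return {
        "raw": raw,
        "null_mean": float(null.mean()),
        "threshold_95": threshold,
        "corrected": raw - float(null.mean()),   # shuffle-corrected value
        "significant": bool(raw > threshold),
    }

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)
y_dep = 0.6 * x + 0.8 * rng.standard_normal(1000)   # correlation 0.6
y_ind = rng.standard_normal(1000)                   # independent of x

print("dependent:  ", shuffle_test(x, y_dep))
print("independent:", shuffle_test(x, y_ind))
```

For the independent pair the raw value is still positive, but the corrected value sits near zero; for the dependent pair the raw value should clear the 95th percentile of the null comfortably.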
Frequently Asked Questions
Pearson's ρ captures only the linear part of the dependence between two variables, so for a relation such as Y = X² it returns ρ ≈ 0. Mutual information I(X;Y) measures the gap between the joint density p(x,y) and the product of marginals p(x)p(y), so it detects any dependence — linear or non-linear, discrete or continuous. The two coincide one-to-one only in the special case of a bivariate Gaussian, where I = -½·log(1-ρ²). For arbitrary distributions, MI is far more general: I = 0 ⇔ independent, I → ∞ for full dependence.
KSG (Kraskov-Stögbauer-Grassberger, 2004) is a k-nearest-neighbour estimator with very low bias even at small sample sizes — the default choice in practice, typically with k = 3-5. Histogram binning is fast but suffers from the curse of dimensionality: the bin count grows exponentially with dimension and bias explodes. KDE has small bias but high variance and is sensitive to bandwidth choice. Rule of thumb: use KSG for 1-2D, binning for visualisation, KDE when you need a smooth density estimate as a by-product.
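To make the KSG recipe concrete, here is a compact brute-force sketch for two scalar variables. It is not the simulator's implementation: it builds full N×N distance matrices (so it only suits a few thousand points; a k-d tree would be the usual choice for larger N), and it evaluates the digamma function only at positive integers via harmonic numbers, which is all the estimator needs. The function name `ksg_mi` is an assumption for this example.

```python
import numpy as np

def ksg_mi(x, y, k=5):
    """KSG (algorithm 1) MI estimate in nats for 1-D x and y, O(N^2) memory."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    dx = np.abs(x[:, None] - x[None, :])          # pairwise distances in x
    dy = np.abs(y[:, None] - y[None, :])          # pairwise distances in y
    dz = np.maximum(dx, dy)                       # max-norm in the joint space
    np.fill_diagonal(dz, np.inf)                  # a point is not its own neighbour
    eps = np.partition(dz, k - 1, axis=1)[:, k - 1]   # distance to k-th neighbour
    # Count points strictly inside that distance in each marginal;
    # the "- 1" removes the point itself (distance 0).
    nx = np.sum(dx < eps[:, None], axis=1) - 1
    ny = np.sum(dy < eps[:, None], axis=1) - 1
    # digamma at a positive integer m: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, n + 1))))
    def psi(m):
        return -np.euler_gamma + harmonic[m - 1]
    return float(psi(k) + psi(n) - np.mean(psi(nx + 1) + psi(ny + 1)))

rng = np.random.default_rng(3)
rho = 0.6
x = rng.standard_normal(1000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(1000)

estimate = ksg_mi(x, y, k=5)
exact = -0.5 * np.log(1 - rho**2)
print(f"KSG estimate = {estimate:.3f} nats, exact = {exact:.3f} nats")
```

Unlike the histogram approach there is no bin width to choose; the neighbour count k is the only tuning parameter.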
(1) Feature selection — the mRMR (minimum Redundancy Maximum Relevance) algorithm greedily picks features with high MI to the target and low MI to already-selected features. (2) Independent Component Analysis (ICA) is defined by minimising MI between recovered signals. (3) Neuroscience: MI between spikes and stimulus quantifies how many bits a neuron transmits. (4) Gene regulatory network inference, (5) channel capacity in communications, and (6) multimodal image registration all use MI as a core metric.
MI estimators almost always carry a positive finite-sample bias. The classical example is the Miller-Madow term for histogram binning, approximately (B−1)²/(2N) (B = bins per axis, N = samples), which scales with the square of the bin count. KSG cuts the bias dramatically but does not eliminate it; a residual of O(1/N) remains. Even for genuinely independent variables (true MI = 0), the estimate is slightly positive. The standard workaround is a shuffle test: shuffle X randomly, recompute MI, repeat many times to obtain a null distribution, and compare your raw estimate against its 95th percentile.
Real-World Applications
Feature selection and dimensionality reduction: In tabular-data machine learning, MI is widely used as a preprocessing step to pick a few dozen features from thousands. The canonical example is Peng et al.'s mRMR (2005), which maximises MI to the target Y while minimising MI between already-selected features, producing a feature set that is both predictive and non-redundant. scikit-learn's mutual_info_classif / mutual_info_regression are KSG-based and can be plugged in directly.
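The greedy mRMR loop can be sketched as follows. This is a simplified illustration, not Peng et al.'s reference implementation: it scores each candidate as relevance minus mean redundancy, and it uses a plain histogram MI estimate as a stand-in (a k-NN based estimator could be swapped in). The names `histogram_mi` and `mrmr` and the synthetic data are assumptions for the demo.

```python
import numpy as np

def histogram_mi(x, y, bins=10):
    """Plug-in MI estimate (nats) from an equal-width 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mrmr(features, target, n_select, bins=10):
    """Greedy selection: maximise MI(feature; target) - mean MI(feature; selected)."""
    n_features = features.shape[1]
    relevance = np.array(
        [histogram_mi(features[:, j], target, bins) for j in range(n_features)]
    )
    selected = [int(np.argmax(relevance))]        # start with the most relevant
    while len(selected) < n_select:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [histogram_mi(features[:, j], features[:, s], bins) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

rng = np.random.default_rng(5)
n = 2000
a = rng.standard_normal(n)
b = rng.standard_normal(n)
features = np.column_stack([
    a,                                   # 0: informative
    a + 0.05 * rng.standard_normal(n),   # 1: near-duplicate of feature 0
    b,                                   # 2: informative, independent of 0
    rng.standard_normal(n),              # 3: pure noise
])
target = a + 0.7 * b + 0.3 * rng.standard_normal(n)

selected = mrmr(features, target, n_select=2)
print("selected feature indices:", selected)
```

Ranking by relevance alone would pick both copies of the same signal (features 0 and 1); the redundancy penalty steers the second pick to feature 2 instead.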
Neuroscience and neural decoding: Measuring the MI in bits between a visual stimulus and the firing rate of a V1 neuron quantifies "how many bits of stimulus information this neuron transmits". In Shannon terms, you are measuring its encoding capacity. The Strong & Bialek group's 1998 paper used MI to show that the H1 neuron in the fly visual system carries tens of bits per second.
Multimodal image registration: Aligning images from different modalities (e.g. CT and MRI) is done by maximising MI between pixel intensities under candidate translations and rotations (Maes et al. 1997, Viola & Wells 1997). Absolute intensities differ across modalities, but "the same tissue appears as the same intensity pair" is a statistical dependence MI captures cleanly — far more robust than cross-correlation.
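A toy version of MI-based registration, under heavy simplifying assumptions: the "images" are synthetic random arrays, the second modality is a non-monotone intensity remap of the first, and the only transformation searched is an integer horizontal shift with wrap-around. Real registration pipelines also handle rotations, interpolation and sub-pixel optimisation. The helper `image_mi` is a name invented for this sketch.

```python
import numpy as np

def image_mi(a, b, bins=16):
    """MI (nats) between the pixel intensities of two equally sized images."""
    counts, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(6)
fixed = rng.random((64, 64))                       # stand-in for modality A
true_shift = 5
# Modality B: same structure, different intensity mapping, shifted sideways.
moving = np.roll(np.abs(fixed - 0.5), true_shift, axis=1)

candidates = list(range(-10, 11))
scores = [image_mi(fixed, np.roll(moving, -s, axis=1)) for s in candidates]
best_shift = candidates[int(np.argmax(scores))]
print("recovered shift:", best_shift)
```

The remap |v − 0.5| is chosen so that intensities in the two images are nearly uncorrelated even when aligned, yet each intensity in one image still pairs consistently with an intensity in the other, which is the dependence MI picks up.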
Gene regulatory network inference: ARACNE (Margolin et al. 2006) computes MI between every gene pair from expression data and uses the Data Processing Inequality to remove indirect correlations, reconstructing a regulatory network. The advantage over linear correlation is that it surfaces non-linear co-expression patterns invisible to ρ.
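The Data Processing Inequality step can be illustrated on a precomputed MI matrix. The sketch below is a simplified reading of the idea, not the ARACNE code: for every triplet of genes it marks the weakest of the three edges for removal, with a hypothetical `tolerance` parameter to avoid cutting edges on near-ties. The MI values in the example are made up.

```python
from itertools import combinations

import numpy as np

def dpi_prune(mi, tolerance=0.0):
    """Return a boolean adjacency matrix after removing, in each triplet,
    the edge with the smallest MI (treated as an indirect interaction)."""
    n = mi.shape[0]
    keep = ~np.eye(n, dtype=bool)                 # start fully connected
    for i, j, k in combinations(range(n), 3):
        triangle = [(i, j), (i, k), (j, k)]
        values = [mi[a, b] for a, b in triangle]
        weakest = int(np.argmin(values))
        next_weakest = min(v for idx, v in enumerate(values) if idx != weakest)
        # Only cut when the weakest edge is clearly below the other two.
        if values[weakest] < next_weakest * (1.0 - tolerance):
            a, b = triangle[weakest]
            keep[a, b] = keep[b, a] = False
    return keep

# Illustrative chain g0 -> g1 -> g2: the g0-g2 dependence is indirect.
mi = np.array([
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.4],
    [0.2, 0.4, 0.0],
])
network = dpi_prune(mi)
print(network.astype(int))
```

In the example the g0-g2 edge is dropped because it is weaker than both g0-g1 and g1-g2, leaving the chain.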
Common Misconceptions and Pitfalls
The biggest pitfall is treating raw MI values as directly comparable. MI is reported in nats or bits, but the systematic bias of an estimator depends strongly on the algorithm, the sample size and the dimensionality. If KSG (k = 5, N = 1000) gives MI = 0.20 nats for X-Y and binning (B = 20) gives MI = 0.30 nats for X-Z, you cannot conclude "Z depends more on X than Y does" — the biases are different in both sign and magnitude. Always compare with the same estimator, the same N, and ideally with a shuffle-corrected value (I_raw − mean I_shuffle).
Next, the jump from "MI is high" to "there is a causal link". MI is symmetric — I(X;Y) = I(Y;X) — and cannot distinguish X → Y, Y → X, or a hidden confounder Z driving both. For directionality in time series you need Transfer Entropy (Schreiber 2000) or Granger causality. Writing "MI was high, therefore X causes Y" in a paper is a reliable way to attract a reviewer's red pen. State explicitly: MI measures dependence, not direction.
Finally, "high-dimensional MI just works". Estimating MI between two variables of dimension d converges at the slow rate O(N^{-1/d}) even with KSG; once d > 10, even thousands of samples will not give a reliable value — a textbook case of the curse of dimensionality. Real-world recipes: (1) reduce dimension first with PCA or an autoencoder, (2) decompose into pairwise MI, or (3) use a neural MI estimator such as MINE (Belghazi et al. 2018). Do not naively compute "MI between two 100-dimensional vectors".
How to Use
Set the sample size (numSamplesMI: 100–10,000) to control how many points are drawn for MI estimation
Adjust the correlation coefficient (correlationCoef: 0–0.99) to set the strength of linear dependence between X and Y
Run the simulation to compute the Exact MI in nats and bits, and compare it against the Estimated MI, its bias and the 95% confidence interval width
Worked Example
For a bivariate Gaussian with N = 5,000 samples, correlation ρ = 0.75 and the k-NN (KSG) estimator with k = 5: the exact MI is −½·ln(1 − 0.75²) = 0.413 nats (0.596 bits). With the reported estimator bias of −0.016 nats the estimate lands near 0.397 nats, a relative error of about 3.9%, with a 95% CI width of 0.084 nats. Histogram binning with 20 bins shows a larger relative error (8.3%) but runs faster and is a natural fit for discrete data.
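The exact value for any ρ follows directly from the closed form −½·ln(1 − ρ²), so it is easy to check by hand or in code. The function name below is an illustrative choice.

```python
import math

def exact_gaussian_mi(rho):
    """Exact MI of a bivariate Gaussian with correlation rho, as (nats, bits)."""
    nats = -0.5 * math.log(1.0 - rho**2)
    bits = nats / math.log(2.0)        # 1 bit = ln(2) nats
    return nats, bits

nats, bits = exact_gaussian_mi(0.75)
print(f"{nats:.3f} nats, {bits:.3f} bits")   # 0.413 nats, 0.596 bits
```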
Practical Notes
k-NN estimators exhibit lower bias for continuous distributions; use k=3–8 for N<2,000 and k=10–15 for N>5,000 to balance bias against variance
Histogram binning becomes severely biased in high-dimensional problems because most bins end up empty; reserve it for univariate/bivariate exploratory analysis only
Exact MI assumes known joint distribution; Estimated MI quantifies real-world degradation when using finite samples or noisy observations in feature selection pipelines
95% CI width >0.1 nats signals insufficient sample size; increase N or accept wider uncertainty bounds in information bottleneck applications