Pooling Layer Simulator — CNN

Explore the pooling layer of a CNN (convolutional neural network). Switch between max pooling and average pooling, change the pool size, stride and input feature map, and the output size, operation count and sliding-window animation update in real time — so you can grasp how a pooling layer downsamples a feature map.

Parameters
Pooling type: how the representative value of each window is taken
Pool size k: side length of the k×k window
Stride s: pixels the window moves per step
Input feature-map size N×N (px): side of the input map the pooling is applied to
Input pattern: shape of the input feature map being pooled
Results
Output size
Downsampling factor
Output elements
Learnable parameters
Operation count
Pooling effect
Pooling scan animation (input → output)

Left is the input feature map, right is the pooled output. The yellow box is the k×k pooling window; it slides by the stride and computes each output cell in turn.

Output size vs pool size
Middle row: input vs output
Theory & Key Formulas

$$\text{outN}=\left\lfloor\frac{N-k}{s}\right\rfloor+1$$

Side length of the output feature map. N: input size, k: pool size, s: stride. The output is outN×outN.
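As a minimal sketch of the formula above, the helper below computes outN with integer floor division. The function name `pool_output_size` is an illustrative choice rather than part of the simulator, and the sketch assumes a square input with no padding.

```python
def pool_output_size(n: int, k: int, s: int) -> int:
    """Side length of the pooled map: floor((n - k) / s) + 1, assuming no padding."""
    if k < 1 or s < 1:
        raise ValueError("pool size and stride must be positive")
    if k > n:
        raise ValueError("pool size cannot exceed the input size")
    # Integer floor division implements the floor in the formula.
    return (n - k) // s + 1


if __name__ == "__main__":
    print(pool_output_size(32, 2, 2))   # 16
    print(pool_output_size(224, 3, 2))  # 111
    print(pool_output_size(7, 2, 2))    # 3
```

The last call shows the floor at work: a 7×7 input with a 2×2 window and stride 2 yields 3×3, not 3.5×3.5.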

$$y_{max}=\max_{(i,j)\in\text{window}}x_{ij},\qquad y_{avg}=\frac{1}{k^2}\sum_{(i,j)}x_{ij}$$

Max pooling outputs the maximum in the window; average pooling outputs the mean. k is the pool size, s the stride, and a pooling layer has zero learnable parameters.
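The two formulas can be sketched in plain Python as follows. This is an illustrative implementation on nested lists, not the simulator's own code; the name `pool2d` and its `mode` argument are assumptions made for this example, and a square input without padding is assumed.

```python
from typing import List

Grid = List[List[float]]


def pool2d(x: Grid, k: int, s: int, mode: str = "max") -> Grid:
    """Slide a k x k window with stride s over a square map x (no padding)."""
    n = len(x)
    out_n = (n - k) // s + 1
    out: Grid = []
    for r in range(out_n):
        row = []
        for c in range(out_n):
            # Collect the k*k values covered by the window at output cell (r, c).
            window = [x[r * s + i][c * s + j] for i in range(k) for j in range(k)]
            if mode == "max":
                row.append(max(window))
            elif mode == "avg":
                row.append(sum(window) / (k * k))
            else:
                raise ValueError("mode must be 'max' or 'avg'")
        out.append(row)
    return out


if __name__ == "__main__":
    x = [
        [1, 3, 2, 0],
        [5, 6, 1, 2],
        [0, 2, 9, 4],
        [1, 1, 3, 8],
    ]
    print(pool2d(x, k=2, s=2, mode="max"))  # [[6, 2], [2, 9]]
    print(pool2d(x, k=2, s=2, mode="avg"))  # [[3.75, 1.25], [1.0, 6.0]]
```

Note that the function holds no state between calls: there is nothing to train, only k, s and the mode to choose.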

What is a Pooling Layer?

🙋
In CNN diagrams, a "pooling layer" usually comes right after a convolution layer. What does it actually do?
🎓
In short, it is the "shrink the feature map" stage. A convolution layer finds edges and textures in the image and produces a feature map, but if you keep it at full size the width and height are too big and the deeper layers blow up in compute and memory. So a pooling layer collapses each k×k window into one representative value and downsamples the map. Set the "stride" slider on the left to 2 and you will see the output become half the width and height.
🙋
Why not just throw pixels away? Why can I choose between "max pooling" and "average pooling"?
🎓
Good question. There are two ways to collapse a window into one value. Max pooling keeps only the "strongest value" in the window. In the feature map after a convolution, a large value is strong evidence that "an edge or texture is here", so taking the maximum keeps the prominent features intact. That is the most common choice in CNNs. Average pooling takes the mean of the window, so the feature is smoothed and blurred. Compare the two output profiles in the chart on the upper right and in the scan animation.
🙋
Does the pooling layer get smarter through training, like a convolution layer?
🎓
No, and this is the key point: a pooling layer has no learnable parameters at all. That is why "Learnable parameters" on the left stays at 0. The "take the maximum" or "take the average" operation is fixed; there are no weights that change through backpropagation. That contrasts with a convolution layer, which learns the values in its kernels. The only settings a pooling layer has are the hyperparameters pool size k and stride s, which the designer chooses in advance.
🙋
If the layer doesn't learn anything, why bother inserting it?
🎓
There are two big reasons. One is compute reduction: a stride-2 pooling halves the width and height, so the area drops to a quarter, and the compute and memory of the deeper layers shrink dramatically. The other is translation invariance. With max pooling, even if a feature shifts by a pixel or two, the output is unchanged as long as its peak stays inside the same window. So a dog is still recognized stably even when it is slightly displaced in the image. LeNet for handwritten digits and the famous AlexNet on ImageNet both insert pooling after convolution.
🙋
Is pooling still used in modern models?
🎓
The way it is used has shifted a bit. Classic CNNs stacked "convolution + max pooling" as a single block over and over. Recent networks such as ResNet often replace mid-network downsampling with a "stride-2 convolution" and use fewer pooling layers in between. But "global average pooling", which collapses each channel's whole feature map into a single value at the end of the network, is still a standard. Pooling keeps changing form, but it remains a basic building block of CNNs.

Frequently Asked Questions

A pooling layer downsamples a feature map inside a CNN (convolutional neural network). It is inserted after a convolution layer and slides a small k×k window over the feature map with stride s, replacing each window with a single representative value. This shrinks the spatial size and greatly reduces the computation and memory of the deeper layers. The output size is outN = floor((N−k)/s)+1.
Max pooling outputs only the maximum value in each window. It keeps the strongest activation — the most salient evidence that a feature is present — so it preserves prominent features such as edges and textures, and it is the most common choice in CNNs. Average pooling outputs the mean of all pixels in the window, so it smooths the feature, blending the whole window. It is used when you want an overall smooth response, or in the global average pooling of a final layer.
A pooling layer has no learnable parameters. It is a fixed operation, "take the maximum" or "take the average", with no weights optimized during training (learnable parameters = 0). This is a key difference from convolution layers and fully connected layers, which do learn weights. The only settings a pooling layer has are the pool size k and the stride s, which are hyperparameters chosen in advance by the designer and do not change during training.
Pooling gives the network a degree of translation invariance: when the input shifts slightly, the pooled output barely changes. With max pooling, even if a feature moves by one or two pixels, the output stays the same as long as the feature's peak remains inside the same window. This lets the network recognize an object even when it is slightly displaced in the image. The invariance only spans about the window size, however; large shifts need other tricks such as data augmentation.

Real-World Applications

Image-recognition AI (CNNs): The pooling layer is a basic building block used widely in image AI for object detection, face recognition, medical-image diagnosis and self-driving. LeNet for handwritten digits (with average-style subsampling), and AlexNet and VGG that dominated ImageNet (with max pooling), were all built by repeatedly stacking a "convolution layer + pooling layer" block. By shrinking the spatial size in stages, pooling lets the network widen its receptive field and abstract features from local edges to broader object parts and overall shapes.

Cutting compute and memory: Carrying a high-resolution feature map all the way into the deep layers makes compute and memory explode. A single stride-2 pooling cuts the feature-map area to a quarter, greatly lightening the operations and memory of every later layer. This makes it possible to train deep networks within a limited GPU memory budget, and makes inference on edge devices such as smartphones practical.

Global average pooling: In recent CNNs it is standard to average the entire feature map (H×W) into one value with "global average pooling" at the final stage of the network. Used in place of a fully connected layer, it removes a huge number of parameters while suppressing overfitting and making the correspondence between feature maps and classes clearer. Class-activation-map (CAM) visualization is also built on global average pooling.
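A minimal sketch of global average pooling on nested lists is shown below. The function name `global_average_pool` and the C×H×W list layout are assumptions made for illustration; real frameworks operate on tensors, but the arithmetic is the same: one mean per channel.

```python
from typing import List


def global_average_pool(feature_maps: List[List[List[float]]]) -> List[float]:
    """Reduce a C x H x W stack of feature maps to C values, one mean per channel."""
    pooled = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled


if __name__ == "__main__":
    maps = [
        [[1.0, 2.0], [3.0, 4.0]],  # channel 0, mean 2.5
        [[0.0, 0.0], [0.0, 8.0]],  # channel 1, mean 2.0
    ]
    print(global_average_pool(maps))  # [2.5, 2.0]
```

Because the output length equals the channel count regardless of H and W, the classifier that follows needs far fewer weights than one fed with the flattened map.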

Signal processing and time-series data: The pooling idea applies not only to images but also to one-dimensional audio and sensor signals. Thinning out a time series with 1-D max or average pooling makes speech-recognition and anomaly-detection networks robust to shifts along the time axis and lighter to compute. Downsampling is a way of thinking shared across engineering, akin to the multigrid coarsening of CAE results.
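The 1-D case can be sketched the same way. The helper `pool1d` below is a hypothetical name for this example; it thins a short made-up signal with non-overlapping windows and assumes no padding.

```python
from typing import List


def pool1d(signal: List[float], k: int, s: int, mode: str = "max") -> List[float]:
    """Pool a 1-D sequence with window length k and stride s (no padding)."""
    out_len = (len(signal) - k) // s + 1
    out = []
    for t in range(out_len):
        window = signal[t * s : t * s + k]
        if mode == "max":
            out.append(max(window))
        elif mode == "avg":
            out.append(sum(window) / k)
        else:
            raise ValueError("mode must be 'max' or 'avg'")
    return out


if __name__ == "__main__":
    samples = [1.0, 5.0, 2.0, 2.0, 0.0, 4.0, 8.0, 6.0]
    print(pool1d(samples, k=2, s=2, mode="max"))  # [5.0, 2.0, 4.0, 8.0]
    print(pool1d(samples, k=2, s=2, mode="avg"))  # [3.0, 2.0, 2.0, 7.0]
```

Eight samples become four, so every later layer processes half as many time steps.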

Common Misconceptions and Pitfalls

The most common one is the misconception that "pooling learns". A pooling layer has no learnable weights at all, and its learnable-parameter count is always 0. During backpropagation, the gradient simply flows by a fixed rule — max pooling routes the gradient only to the pixel that held the maximum, average pooling spreads it evenly across the window — and the layer itself never gets smarter. What gets smarter are the weighted convolution and fully connected layers. Understand pooling as a purely fixed downsampling operation.
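The fixed gradient rule described above can be sketched as a backward pass. This is an illustrative implementation, not taken from any framework: the name `pool_backward` is hypothetical, a square input without padding is assumed, and ties in max pooling go to the first maximum found.

```python
from typing import List

Grid = List[List[float]]


def pool_backward(x: Grid, grad_out: Grid, k: int, s: int, mode: str) -> Grid:
    """Gradient w.r.t. the input of a k x k, stride-s pooling layer (no padding)."""
    n = len(x)
    grad_in = [[0.0] * n for _ in range(n)]
    for r, grad_row in enumerate(grad_out):
        for c, g in enumerate(grad_row):
            cells = [(r * s + i, c * s + j) for i in range(k) for j in range(k)]
            if mode == "max":
                # Only the position that held the maximum receives gradient.
                # Ties go to the first maximum (an assumption of this sketch).
                i_max, j_max = max(cells, key=lambda p: x[p[0]][p[1]])
                grad_in[i_max][j_max] += g
            elif mode == "avg":
                # Average pooling shares the gradient equally across the window.
                for i, j in cells:
                    grad_in[i][j] += g / (k * k)
            else:
                raise ValueError("mode must be 'max' or 'avg'")
    return grad_in


if __name__ == "__main__":
    x = [[1.0, 3.0], [5.0, 6.0]]
    upstream = [[4.0]]
    print(pool_backward(x, upstream, 2, 2, "max"))  # [[0.0, 0.0], [0.0, 4.0]]
    print(pool_backward(x, upstream, 2, 2, "avg"))  # [[1.0, 1.0], [1.0, 1.0]]
```

Nothing in this function is updated by the gradient; it only passes the gradient back to the layer below.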

Next, overlooking the floor (truncation) in the output-size formula. The output size is outN = floor((N−k)/s)+1, with a floor. Depending on the combination of input size N, pool size k and stride s, a few pixels at the edge of the input may fall into no pooling window and be ignored. If you miss this, the output size you assumed and the actual size diverge, leading to shape-mismatch errors in the following layers. Designs that add padding to align the size are common.
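To check a configuration for ignored edge pixels before wiring up the next layer, a small helper like the one below can be used. The name `uncovered_edge` is hypothetical, and the sketch assumes a square input with no padding.

```python
def uncovered_edge(n: int, k: int, s: int) -> int:
    """Number of trailing rows (and columns) that no pooling window touches."""
    out_n = (n - k) // s + 1
    # The last window starts at (out_n - 1) * s and spans k pixels.
    covered = (out_n - 1) * s + k
    return n - covered


if __name__ == "__main__":
    print(uncovered_edge(32, 2, 2))   # 0: every pixel is inside some window
    print(uncovered_edge(7, 2, 2))    # 1: the last row and column are ignored
    print(uncovered_edge(224, 3, 2))  # 1
```

A non-zero result means the naive estimate N/s would not match the real output side, which is where shape-mismatch errors tend to come from.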

Finally, overrating translation invariance. The invariance pooling provides is only for small shifts on the order of the window size. It cannot, on its own, handle an object that moves far across the image, or rotates and scales. To be robust to large displacements or pose changes, you need data augmentation that increases training samples (translation, rotation, flipping) or a more global mechanism. Pooling is "insurance that absorbs small shifts", not a magic spell that grants universal invariance.
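The limits of this invariance are easy to see on a toy input. The sketch below uses hypothetical helpers (`max_pool2d`, `one_hot_map`) and a 4×4 map with a single active pixel, pooled with a 2×2 window and stride 2.

```python
def max_pool2d(x, k, s):
    """Max pooling over a square nested-list map (no padding)."""
    n = len(x)
    out_n = (n - k) // s + 1
    return [
        [max(x[r * s + i][c * s + j] for i in range(k) for j in range(k))
         for c in range(out_n)]
        for r in range(out_n)
    ]


def one_hot_map(n, row, col):
    """An n x n map of zeros with a single 1 at (row, col)."""
    grid = [[0] * n for _ in range(n)]
    grid[row][col] = 1
    return grid


if __name__ == "__main__":
    base = max_pool2d(one_hot_map(4, 0, 0), 2, 2)
    same_window = max_pool2d(one_hot_map(4, 1, 1), 2, 2)
    across_edge = max_pool2d(one_hot_map(4, 1, 2), 2, 2)
    far_away = max_pool2d(one_hot_map(4, 2, 2), 2, 2)

    print(base)         # [[1, 0], [0, 0]]
    print(same_window)  # [[1, 0], [0, 0]]  shift absorbed
    print(across_edge)  # [[0, 1], [0, 0]]  one pixel over a window border
    print(far_away)     # [[0, 0], [0, 1]]  large shift
```

The third case is worth noting: even a one-pixel shift changes the output when it crosses a window border, so the invariance is approximate even for small displacements.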

How to Use

  1. Set the pooling window size (2×2 or 3×3) with the poolSizeNum slider; VGG-16 uses 2×2 windows with stride 2, while ResNet's stem uses a 3×3 window with stride 2.
  2. Set the stride (from 1 up to the pool size) with strideNum; stride 2 with a 2×2 window halves the width and height, so the output has a quarter of the input's elements.
  3. Enter the input feature-map size with gridNNum (e.g., 28×28 as in MNIST, or 64×64 for a custom dataset) to calculate the output dimensions.
  4. Select the pooling operation: max pooling keeps the highest activation in each window (the usual choice in classic ImageNet classifiers such as AlexNet and VGG), while average pooling smooths the feature map.
  5. Observe Output size, Downsampling factor, and Operation count in real time as parameters change.

Worked Example

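Input: 32×32 feature maps (after a convolution layer in a CIFAR-10 pipeline), 2×2 max pooling, stride 2. Calculation: outN = floor((32 − 2) / 2) + 1 = 16, so the output is 16×16. Downsampling factor = 4 by area (1024 → 256 elements per channel, or 2 per side). Counting one operation per window element, each channel takes 16 × 16 × 4 = 1024 operations; a 2×2 maximum strictly needs only 3 comparisons per window, so this count is a slight upper bound. With 64 channels, Operation count = 1024 × 64 = 65,536. No learnable parameters are added (pooling is parameter-free). Result: the feature map passed to the next convolution layer has 75% fewer elements, which cuts that layer's compute by roughly the same fraction when its channel counts are unchanged.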
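The arithmetic of this example can be reproduced with a few lines of Python. The function `pooling_stats` and its dictionary keys are hypothetical names, and the operation count assumes one operation per window element, which is how the 16 × 16 × 4 figure above is obtained.

```python
def pooling_stats(n: int, k: int, s: int, channels: int) -> dict:
    """Output side, area factor and operation count for k x k, stride-s pooling."""
    out_n = (n - k) // s + 1
    # Assumption: one operation per element read inside each window.
    ops_per_channel = out_n * out_n * k * k
    return {
        "output_side": out_n,
        "area_factor": (n * n) / (out_n * out_n),
        "ops_per_channel": ops_per_channel,
        "total_ops": ops_per_channel * channels,
        "learnable_params": 0,  # pooling has no weights
    }


if __name__ == "__main__":
    stats = pooling_stats(n=32, k=2, s=2, channels=64)
    for name, value in stats.items():
        print(f"{name}: {value}")
    # output_side: 16
    # area_factor: 4.0
    # ops_per_channel: 1024
    # total_ops: 65536
    # learnable_params: 0
```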

Practical Notes

  1. With 3×3 pooling on a 224×224 feature map (the usual ImageNet input resolution), stride 1 gives a 222×222 output, while stride 2 gives 111×111, about a quarter of the elements; this matters for memory-constrained deployment on edge devices.
  2. Average pooling passes gradient to every input in the window (each receives 1/k² of the output gradient), while max pooling routes it only to the position that held the maximum, so the input gradient of a max-pooling layer is sparse.
  3. Output elements = output_width × output_height × channels. For the first pooling layer of VGG-16, 224×224×64 = 3,211,264 input elements become 112×112×64 = 802,816, a 75% reduction in the activations the next layer has to process.
  4. Non-square windows (e.g., 2×3) on rectangular inputs need a separate calculation per axis: output_h = floor((input_h − pool_h) / stride_h) + 1 and output_w = floor((input_w − pool_w) / stride_w) + 1.
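The per-axis calculation and the element counts in these notes can be checked with a short sketch. The helper `pooled_shape` is a hypothetical name, no padding is assumed, and the 48×64 input in the second call is a made-up example.

```python
from typing import Tuple


def pooled_shape(
    h: int, w: int, pool_h: int, pool_w: int,
    stride_h: int, stride_w: int, channels: int,
) -> Tuple[int, int, int]:
    """Output (height, width, channels) of pooling; each axis is floored separately."""
    out_h = (h - pool_h) // stride_h + 1
    out_w = (w - pool_w) // stride_w + 1
    return out_h, out_w, channels  # pooling never changes the channel count


if __name__ == "__main__":
    # First VGG-16 pooling layer: 224x224x64 with a 2x2 window and stride 2.
    out_h, out_w, c = pooled_shape(224, 224, 2, 2, 2, 2, 64)
    print(out_h, out_w, c, out_h * out_w * c)  # 112 112 64 802816

    # Non-square 2x3 window on a rectangular 48x64 input with 16 channels.
    print(pooled_shape(48, 64, 2, 3, 2, 3, 16))  # (24, 21, 16)
```

In the second call the width does not divide evenly: (64 − 3) / 3 is about 20.3, which floors to 20 before the +1.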