Microphone Array Beamforming (Delay-and-Sum) Simulator

Q: Why does spatial aliasing occur?

When the spacing d exceeds half a wavelength (λ/2), a "grating lobe" as strong as the main lobe can appear in another direction, and the array can no longer separate the target from that direction. The alias-free upper frequency of the array is f_max = c/(2d). For d = 5 cm and c = 343 m/s, only frequencies up to 3.43 kHz can be beamformed without aliasing at every steering angle. Broadband audio designs typically use nested sub-arrays with different spacings.

Design a uniform linear microphone array that emphasises sound from one direction by aligning and summing the channels (delay-and-sum beamforming). Sliders for N, spacing, frequency and steering angle update the beamwidth, array gain, interferer rejection and spatial-aliasing limit in real time, so you can build intuition for the audio front ends of smart speakers and conference systems.

Parameters

Number of microphones N (elements)

Number of microphone elements in the array

Microphone spacing d

Center-to-center spacing between adjacent microphones. Spatial aliasing can occur once d exceeds λ/2

Signal frequency f

Acoustic frequency under analysis. Speech sits in 100 Hz – 8 kHz

Sound speed c (m/s)

Varies with temperature (≈343 m/s at 20°C)

Steering angle θ_s

Direction of the source you want to emphasise. 0° is broadside

Interferer angle θ_i

Direction of the unwanted source (noise / competing talker)

Results

Wavelength λ (m)

Aperture L (m)

HPBW (°)

Array gain (dB)

Alias-free f_max (Hz)

Interferer rejection (dB)

Array layout and wavefront — live animation

N microphones sit on a line; the green arrow is the target direction, the red arrow is the interferer, and the lower polar plot is the resulting beam pattern (lobe shape).

Beam pattern |B(θ)| (dB)

HPBW vs frequency

Theory & Key Formulas

$$y(t) = \sum_{n=0}^{N-1} w_n\, x_n\!\left(t - \tau_n\right),\qquad \tau_n = \frac{n\,d\,\sin\theta_s}{c}$$

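Delay-and-sum beamformer. τ_n is the delay applied to the n-th microphone, θ_s is the steering direction, c is the speed of sound. The delays align the target-direction signal across the channels; uniform weights w_n = 1/N then average the aligned channels, so the target adds coherently with unit gain.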
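The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the simulator's own code: it assumes far-field plane waves, a sign convention in which microphone n hears the wavefront n·d·sin(θ)/c seconds earlier than microphone 0, and a fractional delay applied as a phase ramp in the FFT domain (which makes the delay circular, so it suits periodic test tones or windowed blocks). The names `simulate_plane_wave`, `delay_and_sum` and `rms` are hypothetical.

```python
import numpy as np

def simulate_plane_wave(n_mics, n_samples, fs, f, d, theta_deg, c=343.0):
    """Return an (n_mics, n_samples) array: a tone arriving from theta_deg."""
    t = np.arange(n_samples) / fs
    # Assumed convention: mic n hears the wave n*d*sin(theta)/c seconds early.
    advance = np.arange(n_mics) * d * np.sin(np.radians(theta_deg)) / c
    return np.sin(2 * np.pi * f * (t[None, :] + advance[:, None]))

def delay_and_sum(x, fs, d, theta_s_deg, c=343.0):
    """Delay channel n by tau_n = n*d*sin(theta_s)/c, then average (w_n = 1/N)."""
    n_mics, n_samples = x.shape
    tau = np.arange(n_mics) * d * np.sin(np.radians(theta_s_deg)) / c
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(x, axis=1)
    # A delay of tau seconds is a phase ramp exp(-j*2*pi*f*tau) per frequency bin.
    delayed = spectra * np.exp(-2j * np.pi * freqs[None, :] * tau[:, None])
    return np.fft.irfft(delayed.mean(axis=0), n=n_samples)

def rms(y):
    """Root-mean-square level of a real signal."""
    return float(np.sqrt(np.mean(y ** 2)))

# 8 mics, 5 cm spacing, 1 kHz tone from +30 degrees (100 full cycles per block).
x = simulate_plane_wave(n_mics=8, n_samples=1600, fs=16000, f=1000.0,
                        d=0.05, theta_deg=30.0)
on_target = delay_and_sum(x, fs=16000, d=0.05, theta_s_deg=30.0)
off_target = delay_and_sum(x, fs=16000, d=0.05, theta_s_deg=-20.0)
print(f"steered at the source: rms = {rms(on_target):.3f}")
print(f"steered away from it:  rms = {rms(off_target):.3f}")
```

Steering at the source should return the tone at full level, while steering elsewhere leaves only the residue set by the beam pattern at that angle offset.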

$$\text{HPBW} \approx \frac{0.89\,\lambda}{L\,\cos\theta_s},\qquad L = (N-1)\,d$$

Half-power beamwidth of a uniform linear array, in radians. λ is the wavelength and L the aperture length. The beam grows wider at lower frequencies, with shorter apertures, or at large steering angles away from broadside.
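To get a feel for the numbers, the beamwidth formula can be evaluated directly. The helper `hpbw_deg` below is a hypothetical name, and the 8-microphone, 5 cm array is only an example configuration. Because the formula is a narrow-beam approximation, the values it returns for very low frequencies (where the result approaches or exceeds 90°) should be read as "the array is nearly omnidirectional" rather than as a precise width.

```python
import math

def hpbw_deg(n_mics, d, f, theta_s_deg=0.0, c=343.0):
    """Approximate half-power beamwidth in degrees: 0.89*lambda / (L*cos(theta_s))."""
    wavelength = c / f
    aperture = (n_mics - 1) * d          # L = (N-1)*d, as defined above
    hpbw_rad = 0.89 * wavelength / (aperture * math.cos(math.radians(theta_s_deg)))
    return math.degrees(hpbw_rad)

# Example sweep for an assumed 8-mic array with 5 cm spacing, steered broadside.
for f in (500.0, 1000.0, 2000.0, 3000.0):
    print(f"f = {f:6.0f} Hz  HPBW ~ {hpbw_deg(8, 0.05, f):6.1f} deg")
```

For this example array at 2 kHz the formula gives roughly 25°, and steering to 60° doubles that figure because cos 60° = 0.5.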

$$|B(\theta)| = \left|\frac{\sin(N u)}{N\sin u}\right|,\quad u = \frac{\pi d}{\lambda}\bigl(\sin\theta - \sin\theta_s\bigr),\quad f_{\mathrm{alias}} = \frac{c}{2d}$$

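Array factor (beam pattern) and the spatial-aliasing upper frequency. Once d exceeds λ/2 (that is, f > c/(2d)), grating lobes as strong as the main lobe can enter the scan range, first for steering angles near endfire, and they cannot be distinguished from the target direction.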
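The array factor is easy to evaluate numerically. The sketch below assumes that "interferer rejection" means the pattern level at θ_i relative to the main-lobe peak, expressed as a positive number of dB; the simulator may define it differently. The function names are hypothetical.

```python
import numpy as np

def array_factor(theta_deg, n_mics, d, f, theta_s_deg=0.0, c=343.0):
    """|B(theta)| = |sin(N u) / (N sin u)| with u = (pi d / lambda)(sin theta - sin theta_s)."""
    theta = np.atleast_1d(np.asarray(theta_deg, dtype=float))
    u = np.pi * d * f / c * (np.sin(np.radians(theta)) - np.sin(np.radians(theta_s_deg)))
    num = np.sin(n_mics * u)
    den = n_mics * np.sin(u)
    pattern = np.ones_like(u)            # limit value 1 wherever sin(u) -> 0
    regular = np.abs(den) > 1e-12
    pattern[regular] = np.abs(num[regular] / den[regular])
    return pattern

def rejection_db(theta_i_deg, n_mics, d, f, theta_s_deg=0.0, c=343.0):
    """Assumed definition: attenuation of a source at theta_i relative to the main lobe."""
    b = array_factor(theta_i_deg, n_mics, d, f, theta_s_deg, c)[0]
    return float(-20.0 * np.log10(max(b, 1e-12)))

def alias_free_fmax(d, c=343.0):
    """Spatial-aliasing limit f_alias = c / (2 d)."""
    return c / (2.0 * d)

# Example: 8 mics, 5 cm spacing, 1 kHz, beam at +30 degrees, interferer at -20 degrees.
print(f"rejection: {rejection_db(-20.0, 8, 0.05, 1000.0, theta_s_deg=30.0):.1f} dB")
print(f"alias-free up to {alias_free_fmax(0.05):.0f} Hz")
```

Sweeping `theta_deg` over −90° to 90° with `array_factor` reproduces the kind of curve shown in the beam-pattern plot.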

Microphone Array Beamforming — Delay-and-Sum

🙋

When you say "OK Google" to a smart speaker, somehow it picks up just the voice from the right direction. How does it know which way the talker is?

🎓

Good question — that is called microphone-array beamforming. On the top of an Amazon Echo you can actually count six or seven tiny holes; each one is a microphone. Sound arriving from a given direction reaches every mic with a slightly different time offset. If you compensate for that offset and then sum the channels, the wave from that direction lines up in phase and gets louder, while waves from other directions partially cancel. That is the simplest "delay-and-sum" beamformer.

🙋

So more microphones should give a sharper beam, right? When I increase N with the slider, the beam really does get narrower.

🎓

Right. HPBW ≈ 0.89·λ/L, so the longer the aperture L = (N−1)d the narrower the beam. But the array gain only grows as 10·log10(N): +9 dB at 8 mics, +12 dB at 16, +15 dB at 32. Four times more elements only buys you 6 dB. That is why even high-end conferencing arrays usually stop at 8–16 elements; beyond that the calibration cost is no longer worth the gain.

🙋

Can't I just open up the spacing d to make a really long aperture without adding mics?

🎓

That is the trap. Once d goes above λ/2 you hit spatial aliasing: a "grating lobe" pops up in another direction with the same strength as the main lobe, and you can no longer tell where the sound is coming from. Try setting f = 4000 Hz while keeping d = 5 cm — λ = 343/4000 ≈ 8.6 cm so λ/2 ≈ 4.3 cm, and d = 5 cm is already over the limit. The verdict turns red. The upper-frequency limit is f_alias = c/(2d).

🙋

What do designers do when they need a full 100 Hz – 8 kHz speech band, then?

🎓

They use nested sub-arrays with different spacings — wide spacing (say d = 15 cm) for low frequencies, narrow spacing (say d = 2 cm) for high — and switch which mics are used per band. On top of that, more advanced beamformers (MVDR / Capon, GSC, and lately neural beamformers) adapt the weights to actively null out interferers. But the intuition is built on delay-and-sum, so play with the sliders here and watch the beam fatten or develop grating lobes — once that feel is in your hands the adaptive methods click much faster.
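The band-switching idea in the last answer can be sketched as a simple lookup: for each frequency band, use the widest spacing whose alias-free limit c/(2d) still covers the top of the band. The 15 cm and 2 cm spacings echo the dialog; the 6 cm middle tier, the band edges and the name `pick_spacing` are assumptions made for illustration only.

```python
def alias_free_fmax(d, c=343.0):
    """Spatial-aliasing limit f_alias = c / (2 d)."""
    return c / (2.0 * d)

def pick_spacing(f_hi, spacings, c=343.0):
    """Widest spacing whose alias-free limit still covers f_hi, or None if none does."""
    usable = [d for d in spacings if alias_free_fmax(d, c) >= f_hi]
    return max(usable) if usable else None

# Hypothetical nested design and band split (not taken from any real product).
spacings = [0.15, 0.06, 0.02]                      # metres
bands = [(100.0, 1000.0), (1000.0, 2500.0), (2500.0, 8000.0)]
for f_lo, f_hi in bands:
    d = pick_spacing(f_hi, spacings)
    print(f"{f_lo:6.0f} to {f_hi:6.0f} Hz: use d = {d * 100:.0f} cm "
          f"(alias-free up to {alias_free_fmax(d):.0f} Hz)")
```

Choosing the widest usable spacing per band keeps the aperture, and therefore the beamwidth, as favourable as the aliasing limit allows.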

Frequently Asked Questions

It is the simplest spatial filter: each microphone signal is delayed by τ_n = n·d·sin(θ_s)/c so that a wave arriving from the target direction θ_s aligns across the array, then summed and scaled by 1/N. Sources from θ_s add coherently, while signals from other directions partially cancel. It is the lowest-complexity beamformer and is widely used as the first stage in smart speakers and conference microphones.

For a uniform linear array the half-power beamwidth of the main lobe is approximately HPBW ≈ 0.89·λ/(L·cosθ_s) [rad], where L=(N−1)d is the aperture length and θ_s is the steering direction. The beam is wider for longer wavelengths (lower frequency), shorter apertures, and steering directions far from broadside, because of the 1/cosθ_s factor.

Against spatially white noise, an N-element delay-and-sum beamformer has a theoretical maximum gain of 10·log10(N) dB: +9.0 dB for N = 8, +12.0 dB for N = 16, +15.0 dB for N = 32, so every 4× increase in elements adds only 6 dB. Phase-matching error, microphone sensitivity spread and reverberation typically cost 1–3 dB in practice, so reaching more than 12 dB of effective gain requires calibration and environmental control.
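The 10·log10(N) figure can be checked with a short experiment. The sketch below assumes ideal conditions: the noise is independent across microphones with equal variance, and the target passes with unit gain, so the SNR gain equals the noise reduction from averaging. The function names are hypothetical.

```python
import numpy as np

def array_gain_db(n_mics):
    """Theoretical white-noise gain of a uniformly weighted N-element array."""
    return float(10.0 * np.log10(n_mics))

def simulated_gain_db(n_mics, n_samples=100_000, seed=0):
    """Noise power per channel divided by noise power after averaging, in dB."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_mics, n_samples))   # independent per microphone
    averaged = noise.mean(axis=0)                      # uniform weights w_n = 1/N
    return float(10.0 * np.log10(noise.var() / averaged.var()))

for n in (2, 4, 8, 16, 32):
    print(f"N = {n:2d}: theory {array_gain_db(n):5.2f} dB, "
          f"simulated {simulated_gain_db(n):5.2f} dB")
```

Correlated noise, as in a reverberant room or with closely spaced microphones at low frequencies, violates the independence assumption and yields less gain than this ideal figure.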

Real-World Applications

Smart speakers and voice assistants: Amazon Echo, Google Home and Apple HomePod use 6–7 element circular arrays or 2-element stereo configurations. Once a "wake word" is detected from any direction, the device steers a beam towards that direction, suppresses background noise and competing talkers, and feeds the cleaned signal into the ASR engine. A common two-stage pattern is delay-and-sum for coarse steering followed by MVDR or neural enhancement.

Tele-conferencing and meeting-room systems: Microsoft Teams Rooms, Zoom Rooms, Polycom and Logitech Rally devices use 4–16 element linear or circular arrays to track the active talker. Even built-in laptop mics run a simple two-channel differential array to lift the user's voice above background noise. In reverberant rooms designers deliberately keep HPBW around 20–40°, since beams that are too narrow tend to lose the talker.

In-car voice recognition and hearing aids: Cars steer a beam at the driver's mouth so engine noise, road noise and passenger speech are reduced. A 2–4 mic array near the steering wheel is typical. Hearing aids place 2 mics on each ear and offer a "directional mode" that emphasises sound from in front of the user — mostly in the 1–4 kHz consonant band, where wavelengths are short enough to give useful directivity at the small inter-mic spacing.

Sonar, source localisation and drone detection: Underwater sonar and acoustic cameras for drone detection use arrays of dozens to hundreds of elements. Scanning the beamformer output over all directions and picking the peak gives a direction-of-arrival estimate. High-resolution DoA algorithms such as MUSIC and ESPRIT build on the same inter-microphone delay model as delay-and-sum, so the intuition built here transfers directly to those advanced methods.
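The "scan and pick the peak" step can be sketched for a single narrowband frequency. This is an illustrative model, not a production localiser: it assumes complex narrowband snapshots, one far-field source, spatially white noise, and the same sign convention as the delay formula τ_n = n·d·sin(θ)/c. The names `steering_vector` and `scan_doa` are hypothetical.

```python
import numpy as np

def steering_vector(n_mics, d, f, theta_deg, c=343.0):
    """Per-microphone phase of a plane wave from theta_deg at frequency f."""
    tau = np.arange(n_mics) * d * np.sin(np.radians(theta_deg)) / c
    return np.exp(2j * np.pi * f * tau)

def scan_doa(snapshots, d, f, scan_deg, c=343.0):
    """Steer a delay-and-sum beam to every angle in scan_deg and return the peak."""
    n_mics = snapshots.shape[0]
    steer = np.stack([steering_vector(n_mics, d, f, th, c) for th in scan_deg])
    outputs = steer.conj() @ snapshots / n_mics        # (n_angles, n_snapshots)
    power = np.mean(np.abs(outputs) ** 2, axis=1)      # steered response power
    return float(scan_deg[int(np.argmax(power))]), power

# Simulated scene: 8 mics, 4 cm spacing, 3 kHz source at +25 degrees, mild noise.
rng = np.random.default_rng(1)
n_mics, d, f, true_deg = 8, 0.04, 3000.0, 25.0
noise = 0.1 * (rng.standard_normal((n_mics, 100))
               + 1j * rng.standard_normal((n_mics, 100)))
snapshots = steering_vector(n_mics, d, f, true_deg)[:, None] + noise
scan_deg = np.arange(-90.0, 90.5, 0.5)
estimate, power = scan_doa(snapshots, d, f, scan_deg)
print(f"true direction {true_deg:.1f} deg, estimated {estimate:.1f} deg")
```

The resolution of this scan is limited by the beamwidth of the array, which is the gap that subspace methods such as MUSIC aim to close.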

Common Misconceptions and Pitfalls

The first trap is assuming "more microphones automatically means a much sharper beam". While HPBW ≈ 0.89·λ/L does shrink in inverse proportion to the aperture L, array gain only grows as 10·log10(N), or +6 dB per 4× increase. Pack more microphones into the same aperture and the inter-channel correlation rises, so each added microphone contributes little independent information. Phase mismatch and sensitivity spread also accumulate, so in practice arrays larger than 32 elements are rarely worth their calibration burden, and 8–16 elements is a common sweet spot.

The second trap is pushing d all the way to λ/2 to maximise resolution. At the design frequency it does maximise aperture, but the moment an unexpected high-frequency component (a harmonic or non-stationary noise burst) appears the array sees a grating lobe and reports a false direction. For broadband speech it is safer to keep d around 0.6–0.7 of the λ/2 limit and to nest sub-arrays of different spacings. Sweep f upward in this tool and you will see the verdict turn red the instant aliasing starts.
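Where a grating lobe lands can be computed from the periodicity of the array factor: copies of the main lobe sit at sin θ = sin θ_s + m·λ/d for nonzero integers m, and only those with |sin θ| ≤ 1 are physically visible. The sketch below uses that relation; `grating_lobe_angles` is a hypothetical helper. It also shows that d ≤ λ/2 is the worst-case (any steering angle) criterion: slightly above the limit, a broadside beam can still be clean while a beam steered far off broadside already has a grating lobe.

```python
import math

def grating_lobe_angles(f, d, theta_s_deg, c=343.0, max_order=10):
    """Angles (degrees) of visible grating lobes for a beam steered to theta_s_deg."""
    ratio = (c / f) / d                                # lambda / d
    s0 = math.sin(math.radians(theta_s_deg))
    angles = []
    for m in range(-max_order, max_order + 1):
        if m == 0:
            continue                                   # m = 0 is the main lobe itself
        s = s0 + m * ratio
        if abs(s) <= 1.0:                              # inside the visible region
            angles.append(math.degrees(math.asin(s)))
    return sorted(angles)

# d = 5 cm at 4 kHz is above the c/(2d) = 3.43 kHz limit.
for theta_s in (0.0, 30.0, 60.0, 90.0):
    lobes = grating_lobe_angles(4000.0, 0.05, theta_s)
    shown = ", ".join(f"{a:.1f}" for a in lobes) if lobes else "none visible"
    print(f"steer {theta_s:4.0f} deg -> grating lobes at: {shown}")
```

A design that must steer over the full ±90° range therefore needs the λ/2 margin, whereas a fixed broadside beam tolerates somewhat wider spacing.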

The third trap is believing delay-and-sum alone is enough. Delay-and-sum is optimal for a single point source at a known DoA in spatially white noise, but in reverberant rooms or with moving interferers, reflections leak into the beam and the SNR improvement saturates. Real systems combine delay-and-sum with adaptive MVDR / Capon, GSC, or deep neural beamformers (Conv-TasNet style). Treat the numbers from this tool as an upper bound: if you need more, you need adaptive processing.
