You've owned this ticket. The VPN's up. Pings are 18ms. Web works. Email works. Then someone tries to download a 50MB attachment and it hangs at 4%. The user calls again. "The internet is slow, but only on the VPN."
It's not bandwidth. It's not the WAN. It's not the user's WiFi. It's the headers. By the end of this article you'll have the math, the diagnosis, and the fix — in that order — and you'll never misdiagnose this class of problem again.
★ THE LIE THAT IS 1500 BYTES
1500 is the wire MTU of standard Ethernet. That's it. It's not a guarantee, it's not a target, and it absolutely doesn't survive a tunnel.
The 14-byte Ethernet header sits outside that 1500, not inside it. Inside live 20 bytes of IPv4 header plus 20 bytes of TCP header, leaving 1460 bytes of TCP payload — that's your TCP MSS on a vanilla LAN. Fine on the LAN. Not fine the moment you tunnel.
The myth: "we have GbE everywhere, 1500 is universal, MTU isn't a 2026 problem." The reality: every tunnel header has to fit inside the next-hop MTU, which is still 1500. Something has to give. Either the inner payload shrinks, or your packets fragment, or your packets vanish silently. Spoiler: in 2026, they vanish silently.
Every layer eats bytes. Every byte eaten comes off the inner TCP MSS. You can't grow 1500 bytes — you can only divide them up differently.
★ WHAT'S ACTUALLY EATING YOUR PAYLOAD
This is the section to bookmark. Real overhead numbers for the tunnel types you'll meet in the wild.
| TUNNEL TYPE | OVERHEAD | EFFECTIVE MTU | TCP MSS TO CLAMP |
|---|---|---|---|
| IPsec ESP tunnel mode (AES-GCM) | ~62 B | 1438 | 1398 |
| IPsec ESP + NAT-T (UDP 4500) | ~78 B | 1422 | 1382 |
| GRE | 24 B | 1476 | 1436 |
| GRE over IPsec | ~86 B | 1414 | 1374 |
| WireGuard (UDP) | 60 B | 1440 | 1400 |
| OpenVPN UDP (AES-GCM) | ~50 B | 1450 | 1410 |
| FortiClient SSL-VPN (TLS/TCP) | TCP-in-TCP | tunnel-mtu 1273 | use FGT setting |
| PPPoE (residential underlay) | −8 B | subtract 8 | subtract 8 |
The point isn't to memorize a magic number. It's to calculate yours. A remote worker on residential PPPoE connecting through IPsec NAT-T isn't getting 1500. They're getting 1422 minus 8 = 1414 effective MTU, and an MSS budget of 1374. Clamp that user to a generic "1380" you saw in a vendor doc and every full-size segment is still 6 bytes over budget: the black hole survives your fix. Overshoot the other way with a paranoid 1300 and you pay a 74-byte tax on every segment of every connection crossing the tunnel.
One constant to memorize now, because the ping test later in this article leans on it: 28 bytes, the 8-byte ICMP header plus the 20-byte IP header. You will use it weekly.
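If you'd rather not redo the arithmetic per tunnel, a throwaway shell helper does it. The mss_budget function below is hypothetical, not a standard tool; feed it the overhead column from the table:

```
# Hypothetical helper: sum header overheads, print effective MTU and MSS budget.
mss_budget() {
  local total=0 o
  for o in "$@"; do total=$((total + o)); done
  echo "effective MTU: $((1500 - total))  MSS budget: $((1500 - total - 40))"
}

mss_budget 78 8   # IPsec NAT-T (~78 B) over PPPoE (8 B) -> MTU 1414, MSS 1374
```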
★ WHY PMTUD IS SUPPOSED TO SAVE YOU
Path MTU Discovery is the protocol that's supposed to make this entire article unnecessary. Here's how it's supposed to work:
- Sender sets the DF (Don't Fragment) bit on every packet.
- Some router along the path has a smaller MTU than the packet.
- That router drops the packet and returns an ICMP Type 3 Code 4 — "Fragmentation Needed and DF Set" — back to the sender.
- Sender sees the ICMP, lowers its effective MTU for that destination, retransmits.
- Connection continues at the smaller size.
Elegant. Self-healing. End-to-end. Designed by people who knew what they were doing.
And dead. PMTUD is the protocol your security team killed. Five ways it dies in the wild:
1. Firewalls that drop ICMP wholesale. A blanket deny-all-ICMP rule somewhere in the path eats the Type 3 Code 4 before it ever reaches the sender.
2. Asymmetric routing. The ICMP unreachable comes back via a different path that drops it before it reaches the original sender.
3. NAT devices that don't translate the embedded inner IP header inside the ICMP payload — sender sees the ICMP and can't match it to any session, so ignores it.
4. ECMP / anycast. The ICMP error originates from a node the sender never targeted, so the response gets discarded as unsolicited.
5. Sender stacks that give up silently. Linux ships with net.ipv4.tcp_mtu_probing=0 by default, meaning the kernel doesn't even try black-hole detection.
End result: the sender keeps blasting 1500-byte packets with DF=1 into a tunnel that can only carry 1422. The router drops them. The ICMP that's supposed to tell the sender gets blocked, dropped, or ignored. The application stalls. There is no error. There is no retransmission feedback. Just a black hole.
If you ever capture a stalled VPN connection in Wireshark, you'll see this exact pattern: SYN and SYN-ACK negotiate MSS=1460, the first 1500-byte data segment goes out with DF set, and then... silence. The TCP retransmit timer fires, sends the same packet, gets the same silence. The user sees a hung browser. You see a crime scene.
★ THE ASYMMETRY TRAP
Here's the part that makes this maddening to diagnose if you don't already know what to look for.
TCP traffic is rarely symmetric. Server sends data → client; client sends ACKs → server. The data direction has full-size segments (1500 bytes). The ACK direction has tiny ones (~50 bytes). Only the big direction hits the MTU ceiling.
Server → client full-size data segments: slam into the tunnel ceiling and vanish.
Client → server small ACKs: well under the limit, fly through fine.
Result: "Downloads hang. Uploads are fine." Or vice versa, depending on direction. Or weirder: login pages work (small response), file fetches hang (big response). Teams chat works. Teams uploads hang. M365 web app loads. M365 attachment download dies at 4%.
This is why tickets get misdiagnosed for weeks. Ops blames the firewall, then the WAN, then the user's WiFi, then antivirus, then the laptop. Nobody ever blames the MTU because half the traffic is working perfectly and that doesn't fit anyone's mental model of "broken."
★ THE TEST THAT ACTUALLY PROVES IT
The ping-with-DF test, properly explained. Most blog posts get this wrong by skipping the math.
Why 1472? Because 1472 payload + 8 ICMP header + 20 IP header = 1500 — exactly fills standard Ethernet MTU. If 1472 succeeds, your path MTU is 1500. If it fails, walk the size down: 1430, 1400, 1380, 1360. The largest size that succeeds + 28 = your real path MTU. Subtract 40 (TCP+IP) to get your real-world TCP MSS.
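In practice, assuming a reachable host on the far side of the tunnel (the hostname below is a placeholder), the test looks like this:

```
# Linux: -M do sets DF; -s is the ICMP payload size (1472 + 28 = 1500)
ping -c 3 -M do -s 1472 vpn-gw.example.com

# Windows:  ping -f -l 1472 vpn-gw.example.com
# macOS:    ping -c 3 -D -s 1472 vpn-gw.example.com

# On failure, walk the payload down. Largest success + 28 = path MTU.
ping -c 3 -M do -s 1394 vpn-gw.example.com   # works? path MTU is at least 1422
```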
One caveat that catches people: this tests ICMP. Some paths treat ICMP differently than TCP — load balancers, CGNATs, and DPI middleboxes can pass ICMP at sizes they drop for TCP, or vice versa. So always confirm with a real TCP test:
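One way to do that, sketched with a placeholder host and assuming nping is available (it ships with nmap):

```
# DF-marked TCP probes: 1380 bytes of data + 40 bytes TCP/IP = 1420 on the wire.
# Probes that silently vanish while smaller ones return confirm the black hole.
nping --tcp -p 443 --df --data-length 1380 -c 3 vpn-gw.example.com

# A second non-ICMP datapoint: tracepath probes with UDP and reports
# the path MTU it discovers per hop.
tracepath -n vpn-gw.example.com
```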
★ THE FIX, IN ORDER OF WHERE TO APPLY IT
"Set MSS to 1360 and pray" is not a fix. Here's the actual decision tree, in priority order.
1. MSS CLAMPING AT THE TUNNEL INGRESS — RIGHT 90% OF THE TIME
This is the answer for TCP traffic. The tunnel-terminating device rewrites the MSS option in every TCP SYN that crosses it, forcing both endpoints to negotiate a smaller segment size that fits inside the tunnel. End hosts never know — they just stop sending oversized segments.
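On a Linux gateway this is the classic netfilter one-liner; on vendor gear it's a policy setting (the FortiGate version appears later in this article). The interface name is illustrative:

```
# Rewrite the MSS option on every forwarded SYN. --clamp-mss-to-pmtu derives
# the value from the outgoing route's MTU; use --set-mss for an explicit number.
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
  -j TCPMSS --clamp-mss-to-pmtu

# Explicit clamp for the IPsec NAT-T budget from the table:
# iptables -t mangle -A FORWARD -o ipsec0 -p tcp --tcp-flags SYN,RST SYN \
#   -j TCPMSS --set-mss 1382
```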
2. LOWER THE TUNNEL MTU ITSELF — REQUIRED FOR UDP / QUIC
MSS clamping only touches TCP SYNs. It does nothing for UDP traffic, which means it does nothing for QUIC, VoIP, large DNS responses, IPsec inside another tunnel, or anything else that doesn't use TCP. For those, you have to lower the tunnel interface MTU itself so the OS fragments at the right size.
Trade-off: every packet pays the tax, even ones that didn't need to. But it works for everything, not just TCP.
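A minimal sketch, assuming a WireGuard tunnel named wg0 (the principle is identical for any tunnel interface):

```
# One-off, from the shell:
ip link set dev wg0 mtu 1400

# Or persistently, in the WireGuard config:
#   [Interface]
#   MTU = 1400
```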
3. ALLOW ICMP TYPE 3 CODE 4 END-TO-END — THE CORRECT FIX NOBODY DOES
Whitelist ICMP Type 3 Code 4 on every firewall in the path. PMTUD comes back to life. The protocol works as designed.
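On a Linux firewall the rule is two lines; iptables even knows the type/code pair by name. The IPv6 equivalent (Packet Too Big) is not optional, because IPv6 routers never fragment:

```
iptables  -A FORWARD -p icmp   --icmp-type fragmentation-needed -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type packet-too-big     -j ACCEPT
```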
Why nobody does it: the security team won't approve it because "ICMP is bad." The 1990s threat model that produced that policy is three decades old, but it's load-bearing in every enterprise security policy template. So we clamp MSS and move on. Hard truth.
4. ENABLE BLACK-HOLE DETECTION ON THE HOST — LAST RESORT
The host detects "I keep retransmitting and getting nothing" and probabilistically lowers MSS until traffic flows. Slow, ugly, but works on hostile networks where you can't fix the path.
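On Linux this is Packetization Layer PMTUD (RFC 4821), one sysctl away:

```
# 1 = probe only after a black hole is suspected; 2 = always probe.
# tcp_base_mss is the MSS the probing starts from.
sysctl -w net.ipv4.tcp_mtu_probing=1
sysctl -w net.ipv4.tcp_base_mss=1024
```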
Don't set MTU=1280 globally "to be safe." That's the IPv6 minimum and a reasonable last-resort tunnel MTU, but globally it costs you ~15% throughput on every link that didn't need it.
Don't switch protocols and call it fixed. Swapping IPsec for L2TP, OpenVPN, or WireGuard doesn't make MTU go away. The headers are different sizes but the principle is identical. Same problem, different costume.
★ THE FORTINET-SPECIFIC TRAP
If you're on FortiGate — and a fair chunk of you reading this are — there are a few traps worth knowing about.
First: FortiGate's IPsec MSS clamping is configured per firewall policy, not per tunnel interface. That means if you've got three policies all referencing the same IPsec tunnel and you only set the MSS on one of them, traffic crossing the other two policies goes unclamped. Easy to miss when cloning policies. If you're running a FortiGate at home, this'll bite you the first time you split your policies for different VLANs.
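A sketch of the per-policy clamp, assuming the tcp-mss-sender / tcp-mss-receiver options exposed on recent FortiOS releases (verify on yours); the policy ID is illustrative:

```
config firewall policy
    edit 12                      # repeat for EVERY policy referencing the tunnel
        set tcp-mss-sender 1382
        set tcp-mss-receiver 1382
    next
end
```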
Second: FortiClient SSL-VPN has its own tunnel MTU, hidden under config vpn ssl settings as tunnel-mtu. Default is 1273. Yes, 1273. That's because SSL-VPN runs TCP-inside-TCP-inside-TLS, and Fortinet's default value is conservative for hostile transit networks. If your remote users complain about VPN being "slower than the internet itself," the default tunnel-mtu is one of three places to start.
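Adjusting it is one block of CLI:

```
config vpn ssl settings
    set tunnel-mtu 1300
end
```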
Third: roaming users. Hotel WiFi, captive portals, and cellular tethers all do their own MTU shenanigans, often without telling you. FortiClient's tunnel MTU is set per gateway, not adapted to the underlying network. Lower it deliberately for any tunnel that carries roaming users: 1300 is a reasonable target.
Sane starting values, consistent with the table above. For an IPsec NAT-T tunnel: MTU 1422 / MSS 1382.
With residential PPPoE underneath: MTU 1414 / MSS 1374.
For FortiClient SSL-VPN with roaming users: tunnel-mtu 1300.
Useful diagnostics on FortiGate that are worth keeping in your back pocket:
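Exact command output varies a little across FortiOS versions, so treat these as starting points:

```
# Negotiated IPsec tunnels, including the computed tunnel MTU:
diagnose vpn tunnel list

# Interface MTUs as the kernel sees them:
diagnose netlink interface list

# Watch for ICMP Type 3 Code 4 arriving (or not) during a stalled transfer:
diagnose sniffer packet any 'icmp[0]==3 and icmp[1]==4' 4
```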
★ UDP DOESN'T CARE — AND QUIC IS COMING
Here's the part of this article that ages everyone else's blog posts on this topic: MSS clamping is a TCP fix. It rewrites the MSS option in TCP SYN packets. UDP has no equivalent. The fix is invisible to UDP traffic.
For most of internet history that was fine. UDP was DNS, NTP, and a few games. None of those sent MTU-sized packets. The TCP-only fix was effectively the whole fix.
Then came QUIC. HTTP/3 runs over UDP. Google services run over QUIC. M365 increasingly runs over QUIC. Cloudflare runs over QUIC. By 2026, a meaningful percentage of any browser session is UDP packets riding on top of QUIC, sometimes with full-MTU datagrams.
Your TCP MSS clamping does nothing for any of it. The problem you "fixed" last quarter regresses the day Chrome upgrades the user. The right answer in 2026 is increasingly option 2 from the fix list — lower the tunnel MTU itself, so fragmentation works across all protocols. The TCP-only fix is a relic.
Workaround that some firewalls implement: block UDP/443 outbound. Chrome falls back to TCP, MSS clamping works again. It's ugly, it costs you the QUIC performance benefit, but it makes the symptom stop while you do the proper fix. Don't ship this as your answer — ship it as the bridge while you lower tunnel MTU correctly.
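On a Linux firewall the bridge looks like this — one rule, easy to remove when done:

```
# Stopgap: QUIC-capable browsers fall back to TCP, where the MSS clamp applies.
iptables -A FORWARD -p udp --dport 443 -j REJECT
```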
★ HOW TO VERIFY YOU ACTUALLY FIXED IT
Don't ship a fix you haven't proven. Three quick verification steps:
- Capture a fresh TCP handshake post-fix in Wireshark. Look at the SYN. The MSS option should now be your clamped value (e.g. 1380), not 1460. If it's still 1460, your clamp isn't taking effect.
- Run a real-world large-file test. Pull a known 100MB+ file across the tunnel. Throughput should stay linear instead of stalling at packet boundaries. If it stalls, you've fixed TCP but not UDP — go back to fix #2.
- Run mtr --tcp from the client through the tunnel. Watch where loss begins (or doesn't, post-fix). The hop pattern tells you whether the fix is at the right node in the path. Command sketches below.
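A minimal verification sketch, assuming a Linux client with tshark and mtr installed (interface and target are placeholders):

```
# 1. Live SYNs: tcp.options.mss_val is the negotiated MSS option.
#    Post-fix this should print your clamped value, not 1460.
tshark -i eth0 -Y 'tcp.flags.syn == 1' \
  -T fields -e ip.src -e ip.dst -e tcp.options.mss_val

# 2. TCP-mode mtr through the tunnel; watch where loss begins (or doesn't).
mtr --tcp --port 443 --report 198.51.100.10
```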
★ THE TAKEAWAY
The network was working all along. The protocol that's supposed to find this for you (PMTUD) has been quietly broken for two decades because every firewall admin overpruned ICMP. Until that changes industry-wide — and it won't — MSS clamping at every tunnel ingress is non-negotiable infrastructure hygiene, not optional tuning. In real enterprise networks, this is one of the first things any senior network engineer checks when "the VPN is slow" comes across the wire.
Bookmark the math table. Memorize the 28-byte trick. Understand the asymmetry trap so you can diagnose it on the first ticket instead of the fifth. And start planning the move from MSS clamping (TCP-only) to tunnel-MTU lowering (universal), because QUIC isn't coming — it's already here.
One line to take with you:
_If your VPN seems to work but doesn't, the answer is almost always the headers._