Hi,
I’m writing this as a PSA after having spent way too much time figuring out why HTTP/3 would always stop working on my server after a couple of days.
My troubleshooting:
The problem always played out the same way: HTTP/3 connections worked as expected, then a couple of days would pass, and my server logs would show HTTP/3 traffic falling back to HTTP/2.
I kept trying to track down the cause, but with the multi-day timeline sometimes required to reproduce it, it took me a while to get to the bottom of it.
I eventually found that the issue consistently reproduced when I set `quic_bpf on;`. Looking around the Nginx bug tracker, I found issue #425.
This bug will occur when all of the following hold:

- `quic_bpf` is set to `on`
- More than 1 `worker_processes` is used
- `reuseport` is used with `listen 443 quic reuseport;`
- Nginx is reloaded with `nginx -s reload` or `systemctl reload nginx.service`
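For reference, here's a minimal configuration sketch that meets those conditions. The worker count and certificate paths are illustrative placeholders, not values from the bug report:

```nginx
# Illustrative nginx.conf sketch matching the trigger conditions.
# Worker count and certificate paths are placeholders.
worker_processes 4;        # more than 1 worker

quic_bpf on;               # main-context directive; part of the trigger

events {}

http {
    server {
        listen 443 quic reuseport;   # QUIC listener with reuseport
        listen 443 ssl;

        ssl_certificate     /path/to/cert.pem;
        ssl_certificate_key /path/to/key.pem;
    }
}
```

With this in place, each reload is what leaves the previous workers' QUIC sockets behind.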
TL;DR of the root cause: on reload, Nginx doesn't close stale reuseport QUIC sockets, skipping them as if reuseport wasn't specified. This is clearly wrong. You can see the old sockets by running `ss -lnpu 'sport = :443'`. Shout-out to tangxiao187 for finding the bug, reporting it, and submitting a PR with a fix.
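To check whether you're affected, here's a quick diagnostic sketch. It assumes the `listen 443 quic reuseport;` setup; adjust the port for your config:

```shell
# List listening UDP sockets on :443 (QUIC).
# -l listening, -n numeric, -p owning process (may need root), -u UDP
ss -lnpu 'sport = :443'

# Count them. With N workers and reuseport you expect roughly N sockets;
# noticeably more after a reload suggests stale sockets were left behind.
ss -lnu 'sport = :443' | tail -n +2 | wc -l
```

If the count keeps growing with each reload, you're likely hitting this bug.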
Available workarounds:
- Take the potential performance hit of disabling `quic_bpf`
- Use only 1 worker process
- Apply config updates with a full restart instead of a reload
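If you go the restart route, a sketch of how that might look under systemd (validating first so a config typo doesn't take the site down; note a full restart briefly drops connections, unlike a reload):

```shell
# Validate the new configuration before touching the running instance.
nginx -t

# Full restart closes all sockets, including stale QUIC reuseport ones,
# unlike `systemctl reload nginx.service`, which triggers the bug.
systemctl restart nginx.service
```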
In my opinion, all 3 of these options suck, especially given how easy this bug is to fix. A simple pull request fixing it has been open for over a year, and it has been sitting without any communication from the Nginx developers for almost as long.
In the big picture, my opinion doesn't really matter, but I think an undocumented, known issue that silently drops production traffic is unacceptable. Nginx leadership instead seems to be focused on "Agentic Observability in Nginx".
This isn't a takedown of the Nginx developers; I greatly appreciate your work. I'm writing this so that no one else spends weeks debugging this issue, and in the hope that someone from the Nginx team takes notice and lands a fix in time for the 1.30 release, which is supposed to ship in about a month. That seems like a no-brainer to me; at the very least, please document this behavior.