My network topology is somewhat convoluted, but it looks roughly like this:
server A: service ← Caddy (the service sits behind a local Caddy reverse proxy)
server B: just an NGINX stream proxy that uses ssl_preread to route connections based on the target domain name (SNI) of the TLS stream (see the config sketch below)
server A is hidden behind both on-premises NAT and CGNAT, so it has to connect out to server B by maintaining a WireGuard tunnel (with a persistent keepalive of 5 seconds), and server B routes connections to server A through that tunnel
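For context, server B's routing looks roughly like the sketch below. This is a minimal, hypothetical version: the domain name, server A's WireGuard address (10.0.0.2), and the ports are placeholders, not my actual config.

```
stream {
    # Pick the upstream based on the SNI seen by ssl_preread,
    # without terminating TLS on server B.
    map $ssl_preread_server_name $backend {
        service.example.com  10.0.0.2:443;   # server A, reached over the WireGuard tunnel
        default              127.0.0.1:8443; # anything else
    }

    server {
        listen 443;
        ssl_preread on;
        proxy_pass $backend;
    }
}
```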
Connections reaching the service hosted on server A occasionally hang, with server-side errors such as “Connection from client lost before response was sent”.
By temporarily replacing Caddy with NGINX on server A, and by testing with clients on unrelated networks, I was able to rule out server A's proxy and the clients' networks as the root cause.
That leaves three places that could be the root cause of the issue:
1. the NGINX stream proxy on server B
2. the WireGuard tunnel
3. the general infrastructure connecting server A to server B
Option 1 seems like the easiest to debug for now, which is why I'm here.
Does anyone have any ideas?
Just to be clear on your setup: there is a tunnel between the NGINX node and the backend application, correct? Does that mean your NGINX config is referencing a localhost address in the proxy_pass directive? Given the error you shared, it sounds like the most likely issue is with the WireGuard tunnel. Can you verify that the tunnel is up and working when you see this error? It would also be helpful if you could share the NGINX configuration you are using.
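For example, a quick way to check the tunnel from server B while a request is hanging (the interface name wg0 and the peer address 10.0.0.2 here are placeholders; substitute your own):

```
# When was the last successful WireGuard handshake with each peer?
sudo wg show wg0 latest-handshakes

# Is server A reachable over the tunnel right now?
ping -c 5 10.0.0.2
```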
To add to what @Damian_Curry mentioned, it would also be useful to see what your NGINX access and error logs say. You should be able to find them in the /var/log/nginx directory.
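For instance, while reproducing one of the hanging requests (paths assume the default package install locations):

```
# Follow the error log live while reproducing a hanging request
tail -f /var/log/nginx/error.log

# Or look back through recent upstream/timeout-related entries
grep -iE 'upstream|timed out' /var/log/nginx/error.log | tail -n 50
```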
However, I seem to have been able to make it work in the meantime by enabling TCP keepalives on the upstream connection (proxy_socket_keepalive on;). I'm honestly not sure why that helps, since it seems like odd behavior, but if I had to guess, the extra packets made the tunnel path more stable somehow? Perhaps they caused the CGNAT to keep the connection mappings alive more reliably (WireGuard relies on those mappings for NAT traversal). The exact reason is beyond me, but the hangs have stopped.
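In case it helps anyone else, this is roughly where the directive ended up in my stream config; the address and port are placeholders, and the real config routes by SNI as described above:

```
stream {
    server {
        listen 443;
        ssl_preread on;
        proxy_pass 10.0.0.2:443;       # server A over the WireGuard tunnel (placeholder address)
        proxy_socket_keepalive on;     # enable TCP keepalives on the connection to the upstream
    }
}
```

As far as I understand, the keepalive probe timing then comes from the operating system's TCP keepalive settings rather than anything in NGINX itself.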