My issue:
New NGINX Gateway Fabric hardcodes pod IPs directly in upstream blocks. When a pod restarts and receives a new IP, the upstream config becomes stale, causing 502 Bad Gateway errors. The upstream is not dynamically updated to reflect the new pod IP.
How I encountered the problem:
We started receiving 502 Bad Gateway errors from our service. Investigation revealed that the nginx upstream block had a hardcoded pod IP (10.4.149.153), but the pod had restarted and been assigned a new IP (10.4.150.240). Traffic was being routed to a dead IP until a manual nginx reload was triggered.
Upstream config observed in new fabric:
upstream default_formula1-care-portal-productqa_80 {
random two least_conn;
zone default_formula1-care-portal-productqa_80 512k;
server 10.4.149.153:80;
keepalive 16;
}
Our older nginx ingress controller routes via proxy_pass http://upstream_balancer (Lua-based dynamic balancer) and does not have this issue.
Solutions I’ve tried:
Manual nginx reload on the fabric controller — this temporarily resolves the issue by re-syncing the pod IP
Confirmed pod is healthy (1/1 Running, 0 restarts) — issue is purely with stale upstream IP in the fabric config
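For anyone who wants to check for this condition without reading the config by hand, here is a minimal Python sketch. It assumes you have already saved the output of nginx -T as text and collected the current pod IPs yourself (for example from the Service's endpoints). The function names parse_upstreams and stale_servers are hypothetical helpers, not part of NGINX Gateway Fabric, and the regex only handles simple upstream blocks like the one shown above.

```python
import re

# Matches "upstream <name> { ... }" blocks; assumes no nested braces inside.
UPSTREAM_RE = re.compile(r"\bupstream\s+(\S+)\s*\{(.*?)\}", re.DOTALL)
# Matches "server <address>" directives at the start of a line.
SERVER_RE = re.compile(r"^\s*server\s+([^\s;]+)", re.MULTILINE)

# The upstream block quoted in this thread, used as sample input.
SAMPLE_CONFIG = """
upstream default_formula1-care-portal-productqa_80 {
    random two least_conn;
    zone default_formula1-care-portal-productqa_80 512k;
    server 10.4.149.153:80;
    keepalive 16;
}
"""


def parse_upstreams(config_text):
    """Map each upstream name to the list of server addresses it declares."""
    upstreams = {}
    for name, body in UPSTREAM_RE.findall(config_text):
        upstreams[name] = SERVER_RE.findall(body)
    return upstreams


def stale_servers(config_text, upstream_name, live_ips, port=80):
    """Return configured server addresses that are not among the live pod IPs."""
    expected = {f"{ip}:{port}" for ip in live_ips}
    configured = parse_upstreams(config_text).get(upstream_name, [])
    return [addr for addr in configured if addr not in expected]


if __name__ == "__main__":
    name = "default_formula1-care-portal-productqa_80"
    # The pod now has 10.4.150.240, so the configured 10.4.149.153 is stale.
    print(stale_servers(SAMPLE_CONFIG, name, {"10.4.150.240"}))
```

An empty list means the upstream matches the live pod IPs; anything else is an address nginx would still route to even though no pod owns it.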
Hi @anwer_shahith, hardcoded IPs in the upstream are how NGINX is configured; this is not unique to NGINX Gateway Fabric. We also use proxy_pass http://upstream, which proxies the request to the server IPs defined in the upstream. ingress-nginx used a lot of Lua to change default nginx behavior, which we do not do.
Whenever Pod IPs change, our controller updates the nginx config with the new IP addresses, the same way we update the config when any other policy or route changes. If that’s not happening, then there’s a chance you’re facing a bug similar to the one described here, though that bug produces 504s, not 502s.
I’d be curious if you see any error logs in either the control plane or the data plane that you can share that might help us figure out why the addresses are not being updated. Also please share if you see any error statuses on your various resources (kubectl describe gateway, kubectl describe httproute, and so on).
Output from kubectl describe gateway and kubectl describe httproute may not be relevant at this stage, as the issue occurred a few days ago and the associated events are no longer available.
Hi @anwer_shahith, thanks for these details. I’m trying to find a way to reliably reproduce this bug.
Do you know what happened in your environment when you encountered this? Did you perform a rolling-restart/scale on your deployments or were the deployments deleted and re-created?
From my observation, this issue occurred when the application pod was recreated. At that time, there was only one backend pod running.
However, across the nginx replicas, I noticed inconsistent upstream configurations:
Some nginx pods were still routing traffic to the old pod IP
Others had already updated to the new pod IP
It seems that when the application pod was recreated and assigned a new IP, the update was not propagated consistently across all nginx replicas. Only a subset of nginx pods picked up the new endpoint, while others continued using the stale IP.
To fix this, I had to explicitly perform a rolling restart of nginx.
Let me know if you need additional details. I have attached the configuration snapshot from nginx -T.
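To make the inconsistency across replicas easier to spot, a small Python sketch like the following could group nginx replicas by the set of upstream servers each one holds. It assumes you have already collected the server addresses per replica (for example by running nginx -T on each pod). The replica names and the helper names group_replicas_by_servers and is_consistent are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def group_replicas_by_servers(replica_servers):
    """Group replica names by the exact set of upstream servers they hold.

    replica_servers maps a replica name to the list of "ip:port" strings
    found in that replica's upstream block.
    """
    groups = defaultdict(list)
    for replica, servers in replica_servers.items():
        groups[frozenset(servers)].append(replica)
    return {servers: sorted(names) for servers, names in groups.items()}


def is_consistent(replica_servers):
    """True when every replica holds the same set of upstream servers."""
    return len(group_replicas_by_servers(replica_servers)) <= 1


if __name__ == "__main__":
    # Hypothetical replica names; the IPs are the old and new pod IPs from this thread.
    observed = {
        "nginx-replica-a": ["10.4.149.153:80"],  # still on the old pod IP
        "nginx-replica-b": ["10.4.150.240:80"],  # updated to the new pod IP
        "nginx-replica-c": ["10.4.150.240:80"],
    }
    for servers, names in group_replicas_by_servers(observed).items():
        print(sorted(servers), "->", names)
    print("consistent:", is_consistent(observed))
```

More than one group means the replicas disagree about the backend, which matches what I saw before the rolling restart.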
Thanks @anwer_shahith for the logs attached, that should really help.
Right now, there are a few avenues we’re exploring.
From a Kubernetes perspective, it could be:
The Kubernetes API server not updating EndpointSlices in time
The local controller cache not being updated promptly because of a slow build time
Once we’ve reliably reproduced the bug, we’ll update you here.
One other thing I forgot to ask: do you know roughly how many replicas of NGINX you had? We will test with a varying number of replicas, but it would be good to know exactly how many you were running.