Why does NGF hardcode pod IPs in upstream instead of using dynamic endpoint tracking?

My issue:
New NGINX Gateway Fabric hardcodes pod IPs directly in upstream blocks. When a pod restarts and receives a new IP, the upstream config becomes stale, causing 502 Bad Gateway errors. The upstream is not dynamically updated to reflect the new pod IP.

How I encountered the problem:
We started receiving 502 Bad Gateway errors from our service. Investigation revealed that the nginx upstream block had a hardcoded pod IP (10.4.149.153), but the pod had restarted and been assigned a new IP (10.4.150.240). Traffic was routed to the dead IP until a manual nginx reload was triggered.

Upstream config observed in new fabric:

upstream default_formula1-care-portal-productqa_80 {
  random two least_conn;
  zone default_formula1-care-portal-productqa_80 512k;
  server 10.4.149.153:80;
  keepalive 16;
}

Our older nginx ingress controller routes via proxy_pass http://upstream_balancer (Lua-based dynamic balancer) and does not have this issue.

Solutions I’ve tried:

  • Manual nginx reload on the fabric controller — this temporarily resolves the issue by re-syncing the pod IP
  • Confirmed pod is healthy (1/1 Running, 0 restarts) — issue is purely with stale upstream IP in the fabric config

Version of NGF

  • NGF version: [nginx-gateway-fabric:2.4.2]

Deployment environment:


Hi @anwer_shahith, hardcoded IPs in the upstream are how NGINX is configured; this is not unique to NGINX Gateway Fabric. We also use proxy_pass http://upstream, which proxies the request to the server IPs defined in the upstream. ingress-nginx used a lot of Lua to change the default nginx behavior, which we do not use.

Whenever Pod IPs change, our controller updates the nginx config with the new IP addresses, the same way we update the config when any policy or route changes. If that’s not happening, there’s a chance you’re hitting a bug similar to the one described here, though that bug produces 504s, not 502s.
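
To illustrate the reload-based model described above, here is a minimal sketch in Python of regenerating an upstream block from the current endpoint addresses and deciding whether a reload is needed. This is not NGF's actual code: `render_upstream` and its parameters are hypothetical names, and the directives simply mirror the upstream block quoted earlier in this thread.

```python
from typing import Iterable


def render_upstream(name: str, endpoints: Iterable[str], zone_size: str = "512k") -> str:
    """Render an nginx upstream block from the current endpoint addresses.

    Hypothetical helper for illustration only; not NGF's real template.
    """
    lines = [f"upstream {name} {{"]
    lines.append("  random two least_conn;")
    lines.append(f"  zone {name} {zone_size};")
    # Sort and de-duplicate so the same set of endpoints always yields
    # identical text, which makes "did anything change?" a string compare.
    for addr in sorted(set(endpoints)):
        lines.append(f"  server {addr};")
    lines.append("  keepalive 16;")
    lines.append("}")
    return "\n".join(lines)


name = "default_formula1-care-portal-productqa_80"
old = render_upstream(name, ["10.4.149.153:80"])  # before the pod restart
new = render_upstream(name, ["10.4.150.240:80"])  # after the pod restart
needs_reload = old != new  # the config text differs, so nginx must reload
```

If the regeneration or the reload is skipped on a replica, that replica keeps proxying to the old address, which matches the symptom reported here. Note that nginx rejects an upstream with no `server` lines, so a real generator needs a fallback for the empty case.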

I’d be curious if you see any error logs in either the control plane or the data plane that you can share that might help us figure out why the addresses are not being updated. Also please share if you see any error statuses on your various resources (kubectl describe gateway, kubectl describe httproute, and so on).

Hi @sjberman, this seems similar to the issue mentioned in PR #4697 and appears relevant to our case.
Here is the diagnostic log you asked for:

2026/04/13 19:10:38 [error] 59646#59646: *419451 connect() failed (113: Host is unreachable) while connecting to upstream, client: 34.217.241.79, server: api-server-uat.company.com, request: "GET /oauth/token?grant_type=client_credentials&scope=all HTTP/1.1", upstream: "http://10.4.139.19:8080/oauth/token?grant_type=client_credentials&scope=all", host: "api-server-uat.company.com"
{ "message": "502 GET http://api-server-uat.company.com/oauth/token?grant_type=client_credentials&scope=all","host": "api-server-uat.company.com",
2026/04/13 19:10:41 [error] 59646#59646: *419451 connect() failed (113: Host is unreachable) while connecting to upstream, client: 34.217.241.79, server: api-server-uat.company.com, request: "GET /oauth/token?grant_type=client_credentials&scope=all HTTP/1.1", upstream: "http://10.4.139.19:8080/oauth/token?grant_type=client_credentials&scope=all", host: "api-server-uat.company.com"
{ "message": "502 GET http://api-server-uat.company.com/oauth/token?grant_type=client_credentials&scope=all","host": "api-server-uat.company.com",
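
For anyone triaging similar logs, the following is a small Python sketch that pulls the failing upstream address and errno out of nginx error-log text like the excerpt above. The function name `failed_upstreams` is my own, and the pattern only targets the `connect() failed` format shown here; other nginx error formats are not handled.

```python
import re
from collections import Counter

# Captures errno, reason, and upstream host:port from a connect() failure.
# Accepts straight or curly quotes, since pasted logs often get re-quoted.
UPSTREAM_RE = re.compile(
    r'connect\(\) failed \((\d+): ([^)]+)\).*?upstream: ["“]https?://([^/"”]+)'
)


def failed_upstreams(log_text: str) -> Counter:
    """Count connect() failures per (upstream address, errno, reason)."""
    hits = Counter()
    for match in UPSTREAM_RE.finditer(log_text):
        errno, reason, addr = match.groups()
        hits[(addr, int(errno), reason)] += 1
    return hits


sample = (
    '2026/04/13 19:10:38 [error] 59646#59646: *419451 connect() failed '
    '(113: Host is unreachable) while connecting to upstream, '
    'client: 34.217.241.79, server: api-server-uat.company.com, '
    'upstream: "http://10.4.139.19:8080/oauth/token", '
    'host: "api-server-uat.company.com"\n'
    '2026/04/13 19:10:41 [error] 59646#59646: *419451 connect() failed '
    '(113: Host is unreachable) while connecting to upstream, '
    'client: 34.217.241.79, server: api-server-uat.company.com, '
    'upstream: "http://10.4.139.19:8080/oauth/token", '
    'host: "api-server-uat.company.com"\n'
)
counts = failed_upstreams(sample)
```

Here `counts` groups both failures under 10.4.139.19:8080 with errno 113, which makes it easy to check whether that address still belongs to a running pod.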

Logs from kubectl describe gateway and kubectl describe httproute may not be relevant at this stage, as the issue occurred a few days ago and the associated events are no longer available.

Hi @anwer_shahith, thanks for these details. I’m trying to find a way to reliably reproduce this bug.

Do you know what happened in your environment when you encountered this? Did you perform a rolling-restart/scale on your deployments or were the deployments deleted and re-created?

Hi, thanks for looking into this.

From my observation, this issue occurred when the application pod was recreated. At that time, there was only one backend pod running.

However, across the nginx replicas, I noticed inconsistent upstream configurations:

  • Some nginx pods were still routing traffic to the old pod IP

  • Others had already updated to the new pod IP

It seems that when the application pod was recreated and assigned a new IP, the update was not propagated consistently across all nginx replicas. Only a subset of nginx pods picked up the new endpoint, while others continued using the stale IP.
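
A rough way to confirm this kind of drift is to compare the upstream addresses in each replica's `nginx -T` output against the addresses that are actually live. The Python sketch below assumes the dumps have already been collected (for example with `kubectl exec <pod> -- nginx -T`) into a dict keyed by pod name; `upstream_servers` and `stale_replicas` are hypothetical helper names, and only IPv4 `server ip:port` lines are considered.

```python
import re

# Matches "server 10.4.149.153:80;" style lines inside upstream blocks.
# It does not match "server {" blocks or unix-socket servers.
SERVER_RE = re.compile(r"^\s*server\s+(\d{1,3}(?:\.\d{1,3}){3}:\d+)\b", re.MULTILINE)


def upstream_servers(nginx_dump: str) -> set[str]:
    """Return every IPv4 `server ip:port` address found in `nginx -T` output."""
    return set(SERVER_RE.findall(nginx_dump))


def stale_replicas(dumps: dict[str, str], live: set[str]) -> dict[str, set[str]]:
    """Map each replica name to the upstream addresses that are not in `live`."""
    report = {}
    for replica, dump in dumps.items():
        stale = upstream_servers(dump) - live
        if stale:
            report[replica] = stale
    return report


# Illustrative data only: two replicas, one of which still has the old pod IP.
dumps = {
    "nginx-a": "upstream app {\n  server 10.4.149.153:80;\n  keepalive 16;\n}\n",
    "nginx-b": "upstream app {\n  server 10.4.150.240:80;\n  keepalive 16;\n}\n",
}
report = stale_replicas(dumps, live={"10.4.150.240:80"})
```

In this made-up example only `nginx-a` is reported, since it still lists the old pod IP.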

To fix this, I had to explicitly perform a rolling restart of nginx.

Let me know if you need additional details. I have attached a configuration snapshot from nginx -T.

nginx-log.txt (1.3 KB)

Thanks @anwer_shahith for the logs attached, that should really help.

Right now, there are a few avenues we’re exploring.

From a Kubernetes perspective, it could be

  1. The Kubernetes API server not updating EndpointSlices in time
  2. The local controller cache not being updated in time, possibly because of slow config build times
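
To check the first possibility from the outside, one option is to capture what the API server currently reports and compare it with what nginx has. The Python sketch below assumes the input is the JSON from `kubectl get endpointslices -l kubernetes.io/service-name=<service> -o json`; `ready_addresses` is a hypothetical helper name, and ports are not resolved here.

```python
import json


def ready_addresses(endpointslice_list_json: str) -> set[str]:
    """Collect the addresses of ready endpoints from an EndpointSlice list."""
    doc = json.loads(endpointslice_list_json)
    addrs: set[str] = set()
    for slice_ in doc.get("items", []):
        for endpoint in slice_.get("endpoints") or []:
            conditions = endpoint.get("conditions") or {}
            # The EndpointSlice API says an unset `ready` should generally be
            # treated as ready, so only an explicit false is skipped.
            if conditions.get("ready") is False:
                continue
            addrs.update(endpoint.get("addresses", []))
    return addrs


# Illustrative payload only; real output carries many more fields.
payload = json.dumps({
    "items": [{
        "endpoints": [
            {"addresses": ["10.4.150.240"], "conditions": {"ready": True}},
            {"addresses": ["10.4.149.153"], "conditions": {"ready": False}},
        ]
    }]
})
current = ready_addresses(payload)
```

If the API server already lists the new address while some nginx replicas still have the old one, the delay is more likely on the controller or config-push side than in EndpointSlice updates.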

Once we’ve reliably reproduced the bug, we’ll update you here.

One other thing I forgot to ask: do you know roughly how many NGINX replicas you had? We will test with a varying number of replicas, but it would be good to know exactly how many you were running.

Thanks!

We were running NGINX as a DaemonSet, so there were at least 10 NGINX pods.