Segmentation faults and recurring 502 Bad Gateway

My issue:
I am running Nginx 1.26.3 (stable) as a reverse proxy, along with the NTLM module, for my small Azure DevOps on-prem cluster (2 app tiers, 1 search server, 1 SQL Server).

For some time now I have been getting segfault errors, along with terminated worker processes, as shown below:

[944987.909723] nginx[700512]: segfault at 28 ip 0000000000442d7c sp 00007fffb0a70ca0 error 4 in nginx[413000+b1000] likely on CPU 0 (core 0, socket 0)
[944987.910787] Code: c5 ff ff ff ff e9 c9 fd ff ff 48 c7 c5 ff ff ff ff e9 bd fd ff ff 53 48 8b 1f f6 47 09 04 74 0f 48 8b 83 90 00 00 00 48 89 df <ff> 50 28 5b c3 48 89 df e8 88 f9 ff ff 48 83 f8 fe 74 f0 48 8b 83
[944988.684438] nginx[701522]: segfault at 28 ip 00000000004661f3 sp 00007fffb0a70ca0 error 4 in nginx[413000+b1000] likely on CPU 1 (core 1, socket 0)
[944988.685217] Code: 09 f7 e9 63 ff ff ff 48 89 df ff 53 38 4c 89 e7 e8 d4 f5 fe ff eb a8 53 48 8b 07 48 8b 00 48 8b 48 48 48 8b 58 08 48 8b 53 50 <48> 8b 52 28 48 89 42 10 0f b6 57 09 83 e2 14 80 fa 14 74 19 f6 47

And here, in the nginx error log:

2025/03/30 21:00:02 [alert] 699285#699285: connection already closed
2025/03/30 21:00:02 [alert] 699285#699285: connection already closed
2025/03/30 21:00:02 [notice] 613185#613185: signal 17 (SIGCHLD) received from 699285
2025/03/30 21:00:02 [notice] 613185#613185: signal 17 (SIGCHLD) received from 699285
2025/03/30 21:00:02 [notice] 613185#613185: signal 17 (SIGCHLD) received from 699285
2025/03/30 21:00:02 [alert] 613185#613185: worker process 699285 exited on signal 11
2025/03/30 21:00:02 [alert] 613185#613185: worker process 699285 exited on signal 11
2025/03/30 21:00:02 [alert] 613185#613185: worker process 699285 exited on signal 11
2025/03/30 21:00:02 [alert] 613185#613185: worker process 699285 exited on signal 11
2025/03/30 21:00:02 [notice] 613185#613185: start worker process 700202
2025/03/30 21:00:02 [notice] 613185#613185: start worker process 700202
2025/03/30 21:00:02 [notice] 613185#613185: start worker process 700202
2025/03/30 21:00:02 [notice] 613185#613185: signal 29 (SIGIO) received
2025/03/30 21:00:02 [notice] 613185#613185: signal 29 (SIGIO) received
2025/03/30 21:00:02 [notice] 613185#613185: signal 29 (SIGIO) received
2025/03/30 21:00:04 [alert] 699354#699354: connection already closed
2025/03/30 21:00:04 [alert] 699354#699354: connection already closed
2025/03/30 21:00:04 [alert] 699354#699354: connection already closed
2025/03/30 21:00:04 [alert] 699354#699354: connection already closed
2025/03/30 21:00:04 [notice] 613185#613185: signal 17 (SIGCHLD) received from 699354
2025/03/30 21:00:04 [notice] 613185#613185: signal 17 (SIGCHLD) received from 699354
2025/03/30 21:00:04 [notice] 613185#613185: signal 17 (SIGCHLD) received from 699354

How I encountered the problem:

The problem appeared once I enabled proxying of SSH connections to the backend servers declared in the upstream directive.

Solutions I’ve tried:

I tried to debug the issue with gdb, but it did not lead me anywhere. I attached to various worker PIDs (sudo gdb -p proc_id).
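
For anyone debugging something similar: besides attaching gdb to a live worker, the crash can be inspected after the fact through systemd-coredump (which, as the later posts show, is capturing the dumps on this box). A minimal sketch, assuming the nginx binary still carries its symbols; the PIDs are placeholders:

# attach to a running worker and wait for the crash
sudo gdb -p <worker_pid>
(gdb) continue
(gdb) bt full            # once SIGSEGV is reported, dump the full backtrace

# or pull a dump that systemd-coredump has already captured
coredumpctl list nginx
coredumpctl gdb <crashed_pid>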

My config:

This is my primary nginx.conf:

#user nobody;

error_log logs/error.log;
error_log logs/error.log notice;
error_log logs/error.log info;
error_log logs/error.log debug;

pid sbin/nginx.pid;

worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 4096;
}

#SSH block

stream {
    upstream ssh_backend {
        least_conn;
        server server1.domain.local:22;
        server server2.domain.local:22;
    }

    server {
        listen                22;
        proxy_pass            ssh_backend;
        proxy_timeout         1h;
        proxy_connect_timeout 600s;
    }
}

http {
    include mime.types;
    include /usr/local/nginx/conf_templates/*.conf;
    default_type application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" "$http_user_agent" '
                      '$request_length $request_time $upstream_addr '
                      '$upstream_response_length $upstream_response_time $upstream_status ';

    access_log  logs/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;
    proxy_read_timeout    1800s;
    proxy_send_timeout    1800s;
    proxy_connect_timeout 60s;

    proxy_next_upstream error timeout http_502 http_504;
    proxy_next_upstream_tries 3;

    #gzip  on;
    server_names_hash_bucket_size 128;
    client_max_body_size 200M;

    upstream qa_azure_devops {
        least_conn;
        server server1.domain.local:443 max_fails=3 fail_timeout=30s;
        server server2.domain.local:443 max_fails=3 fail_timeout=30s;
        ntlm;
    }

    server {
        listen       443 ssl;
        server_name  qa.azuredevops.domain.local;

        ssl_certificate     /etc/pki/tls/certs/qa-azuredevops-swdc.cer;
        ssl_certificate_key /etc/pki/tls/private/qa-azuredevops-swdc.key;
        ssl_protocols       TLSv1.2 TLSv1.3;
        ssl_session_timeout 4h;
        ssl_session_cache   shared:SSL:50m;
        ssl_session_tickets off;
        proxy_buffer_size       128k;
        proxy_buffers           4 256k;
        proxy_busy_buffers_size 256k;
        #charset koi8-r;

        #access_log  logs/host.access.log  main;

        location / {
            proxy_pass https://qa_azure_devops;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Authorization $http_authorization;
            proxy_set_header Connection 'Keep-Alive';
            proxy_set_header X-Forwarded-Proto $scheme;
            add_header X-Upstream-Server $upstream_addr;
            proxy_ssl_server_name on;
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}

Kindly asking for your help! 🙁

Can you please add some more detail about when/how this error is showing up? Does the segfault happen when you try to start nginx, or does it happen after it has been running?


Hello Damian_Curry,

Thank you for your answer.

The problem occurred only once more users started to log on to my working environment via Nginx. Please check the journalctl logs below, from when the issue first appeared:

Mar 20 07:43:24 ch00salmngx01 nginx[19706]: nginx: the configuration file /usr/local/nginx/conf/nginx.conf syntax is ok
Mar 20 07:43:24 ch00salmngx01 nginx[19706]: nginx: configuration file /usr/local/nginx/conf/nginx.conf test is successful
Mar 20 07:43:24 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:22 failed (98: Address already in use)
Mar 20 07:43:24 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
Mar 20 07:43:25 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:22 failed (98: Address already in use)
Mar 20 07:43:25 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
Mar 20 07:43:25 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:22 failed (98: Address already in use)
Mar 20 07:43:25 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
Mar 20 07:43:26 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:22 failed (98: Address already in use)
Mar 20 07:43:26 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
Mar 20 07:43:26 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:22 failed (98: Address already in use)
Mar 20 07:43:26 ch00salmngx01 nginx[19707]: nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
Mar 20 07:43:27 ch00salmngx01 nginx[19707]: nginx: [emerg] still could not bind()
Mar 20 07:43:27 ch00salmngx01 systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Mar 20 07:43:27 ch00salmngx01 systemd[1]: nginx.service: Failed with result ‘exit-code’.
Mar 20 07:43:27 ch00salmngx01 systemd[1]: Failed to start The Nginx HTTP and Load Balancer Server for CAD Team.
Mar 20 07:44:14 ch00salmngx01 systemd[19634]: Starting Mark boot as successful…
Mar 20 07:44:14 ch00salmngx01 systemd[19634]: Finished Mark boot as successful.
Mar 20 07:47:14 ch00salmngx01 systemd[19634]: Created slice User Background Tasks Slice.
Mar 20 07:47:14 ch00salmngx01 systemd[19634]: Starting Cleanup of User’s Temporary Files and Directories…
Mar 20 07:47:14 ch00salmngx01 systemd[19634]: Finished Cleanup of User’s Temporary Files and Directories.
Mar 20 07:49:14 ch00salmngx01 sssd_kcm[19632]: Shutting down (status = 0)
Mar 20 07:49:14 ch00salmngx01 systemd[1]: sssd-kcm.service: Deactivated successfully.
Mar 20 07:50:55 ch00salmngx01 kernel: nginx[19694]: segfault at 28 ip 00000000004661f3 sp 00007ffc26c40550 error 4 in nginx[413000+b1000] likely on CPU 1 (core 1, socke>

And then I noticed this:

Mar 20 07:50:56 ch00salmngx01 systemd-coredump[19776]: [🡕] Process 19694 (nginx) of user 65534 dumped core.

                                                   Stack trace of thread 19694:
                                                   #0  0x00000000004661f3 ngx_http_upstream_handler (nginx + 0x661f3)
                                                   #1  0x0000000000435486 ngx_event_process_posted (nginx + 0x35486)
                                                   #2  0x0000000000435025 ngx_process_events_and_timers (nginx + 0x35025)
                                                   #3  0x000000000043c806 ngx_worker_process_cycle (nginx + 0x3c806)
                                                   #4  0x000000000043b093 ngx_spawn_process (nginx + 0x3b093)
                                                   #5  0x000000000043bb76 ngx_start_worker_processes (nginx + 0x3bb76)
                                                   #6  0x000000000043ce8d ngx_master_process_cycle (nginx + 0x3ce8d)
                                                   #7  0x000000000041621c main (nginx + 0x1621c)
                                                   #8  0x00007f86336295d0 __libc_start_call_main (libc.so.6 + 0x295d0)
                                                   #9  0x00007f8633629680 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29680)
                                                   #10 0x0000000000414a55 _start (nginx + 0x14a55)
                                                   ELF object binary architecture: AMD x86-64

Mar 20 07:50:56 ch00salmngx01 systemd[1]: systemd-coredump@0-19775-0.service: Deactivated successfully.

No idea what caused this issue…

Those errors look like what happens when you try to start NGINX and either a) nginx is still running and bound to the ports (i.e. 0.0.0.0:443 for HTTPS), or b) other processes are already listening on those ports (especially 0.0.0.0:22, as I assume this is a Linux node with SSH running). In order to proxy on both of those ports, you will need to configure a secondary IP that SSH is not listening on, as only a single process can be bound to a given address and port.
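
A quick way to confirm what is actually bound to those ports before starting nginx (a generic sketch using ss from iproute2; netstat, as used in the next reply, works just as well):

sudo ss -ltnp 'sport = :22'
sudo ss -ltnp 'sport = :443'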

Hello Damian_Curry

Here is my output for listening ports 22, 2222 (sshd), and 443:

[root@ch00salmngx01 a-dabrowsk-1@domain.local]# netstat -tulpn | grep LISTEN | grep 2222
tcp        0      0 0.0.0.0:2222        0.0.0.0:*           LISTEN      1075/sshd: /usr/sbi
tcp6       0      0 :::2222             :::*                LISTEN      1075/sshd: /usr/sbi
[root@ch00salmngx01 a-dabrowsk-1@domain.local]# netstat -tulpn | grep LISTEN | grep 22
tcp        0      0 0.0.0.0:22          0.0.0.0:*           LISTEN      5974/nginx: master
tcp        0      0 0.0.0.0:2222        0.0.0.0:*           LISTEN      1075/sshd: /usr/sbi
tcp6       0      0 :::2222             :::*                LISTEN      1075/sshd: /usr/sbi
[root@ch00salmngx01 a-dabrowsk-1@domain.local]# netstat -tulpn | grep LISTEN | grep 443
tcp        0      0 0.0.0.0:443         0.0.0.0:*           LISTEN      5974/nginx: master
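
(For context on why port 22 is free for nginx here: sshd itself has been moved to 2222. Assuming that was done with a plain port change, the relevant part of /etc/ssh/sshd_config would look roughly like this, followed by an sshd restart:)

# /etc/ssh/sshd_config: keep the management SSH daemon off port 22
#Port 22
Port 2222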

So nginx is working as expected until multiple SSH connections are initiated? Is there anything in the nginx error log?

Hi @Damian_Curry

I have no idea what is causing this, but Nginx is listening on two ports: 22 and 443.
Incoming SSH requests are forwarded to our backend servers.

Yesterday I reinstalled Nginx along with the NTLM module, but the issue still persists.

These are the errors from dmesg:

[32152.462132] nginx[30401]: segfault at ffffffff ip 00007ff13b8c57a4 sp 00007ffc67afbdb0 error 6 in ngx_http_upstream_ntlm_module.so[7ff13b8c5000+1000] likely on CPU 1 (core 1, socket 0)
[32152.464504] Code: 00 00 48 89 ef e8 74 fe ff ff 48 8b 53 10 48 8b 43 08 48 89 02 48 8b 53 10 48 89 50 08 49 8b 54 24 18 48 89 53 10 48 8d 43 08 <48> 89 02 49 8d 54 24 10 48 89 53 08 49 89 44 24 18 48 83 c4 10 5b
[32253.294804] nginx[31474]: segfault at 28 ip 0000000000464f52 sp 00007ffc67afbdd0 error 4 in nginx[412000+ad000] likely on CPU 0 (core 0, socket 0)
[32253.296084] Code: 09 f7 e9 63 ff ff ff 48 89 df ff 53 38 4c 89 e7 e8 d4 f5 fe ff eb a8 53 48 8b 07 48 8b 00 48 8b 48 48 48 8b 58 08 48 8b 53 50 <48> 8b 52 28 48 89 42 10 0f b6 57 09 83 e2 14 80 fa 14 74 19 f6 47
[32255.770755] nginx[29771]: segfault at 28 ip 0000000000464f52 sp 00007ffc67afbdd0 error 4 in nginx[412000+ad000] likely on CPU 1 (core 1, socket 0)
[32255.771673] Code: 09 f7 e9 63 ff ff ff 48 89 df ff 53 38 4c 89 e7 e8 d4 f5 fe ff eb a8 53 48 8b 07 48 8b 00 48 8b 48 48 48 8b 58 08 48 8b 53 50 <48> 8b 52 28 48 89 42 10 0f b6 57 09 83 e2 14 80 fa 14 74 19 f6 47
[32306.003382] nginx[31513]: segfault at 28 ip 0000000000464f52 sp 00007ffc67afbdd0 error 4 in nginx[412000+ad000] likely on CPU 1 (core 1, socket 0)
[32306.004274] Code: 09 f7 e9 63 ff ff ff 48 89 df ff 53 38 4c 89 e7 e8 d4 f5 fe ff eb a8 53 48 8b 07 48 8b 00 48 8b 48 48 48 8b 58 08 48 8b 53 50 <48> 8b 52 28 48 89 42 10 0f b6 57 09 83 e2 14 80 fa 14 74 19 f6 47

and this is from error.log:

2025/04/10 08:00:07 [error] 31933#31933: *44831 upstream timed out (110: Connection timed out) while reading response header from upstream, client: ip.address, server: qa.environment.domain.local, request: "POST /IND_RD_Innovation/software_strategy/_apis/wit/wiql?api-version=2.0 HTTP/1.1", upstream: "https://ip.address/IND_RD_Innovation/software_strategy/_apis/wit/wiql?api-version=2.0", host: "qa.environment.domain.local"

Just to be explicit here: I am using the NTLM module from here:

I have a feeling the segfault is being caused by the NTLM module. Have you tried testing without the module to see if you can recreate the failure? Seeing as the repo has not been updated in roughly four years, it most likely has not been tested against more recent versions of NGINX. There is a good chance there have been changes in NGINX that affect how the module interacts with it and are causing these segfaults.
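
For reference, a rough sketch of what such a test could look like. The .so name comes from the dmesg output above (the modules/ path is the usual default and may differ); the keepalive fallback is only an approximation for the test, since plain upstream keepalive does not pin a backend connection to a client the way ntlm does, so NTLM logins may fail; the point is simply to see whether the segfaults stop:

# top of nginx.conf: comment out the dynamic module for the test
#load_module modules/ngx_http_upstream_ntlm_module.so;

upstream qa_azure_devops {
    least_conn;
    server server1.domain.local:443 max_fails=3 fail_timeout=30s;
    server server2.domain.local:443 max_fails=3 fail_timeout=30s;
    #ntlm;
    keepalive 32;    # plain keepalive instead of NTLM connection pinning, for the test only
}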

Hi @Damian_Curry

I had the same impression from the beginning, but I needed to rule things out on my end first. My Nginx server runs on an ESX host, and right before the faults appeared someone moved the VM to a different host, which changed the underlying CPU and memory hardware. I hoped that was the cause, so I decided to rebuild the service from source and this time add the NTLM module dynamically. That didn't help either; however, the new segmentation faults now pointed directly at nginx-ntlm-module.so…
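
For anyone retracing these steps, a rough sketch of the dynamic-module rebuild described above; the configure flags are illustrative and should mirror the original build (compare with the output of nginx -V), and the module directory name is an assumption:

# rebuild nginx with the NTLM module compiled as a dynamic module
./configure --prefix=/usr/local/nginx \
            --with-stream \
            --with-http_ssl_module \
            --add-dynamic-module=../nginx-ntlm-module
make && sudo make install

# then load it at the very top of nginx.conf
load_module modules/ngx_http_upstream_ntlm_module.so;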

Yesterday I acquired NGINX Plus from F5 support.

After installing it and restoring our config, all the segfaults are gone.

Thank you.
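
(Note for later readers: NGINX Plus implements ntlm as a native upstream directive, so the upstream block posted earlier keeps working unchanged and the third-party .so simply no longer needs to be loaded. Sketch, reusing the hostnames from the config above:)

upstream qa_azure_devops {
    least_conn;
    server server1.domain.local:443 max_fails=3 fail_timeout=30s;
    server server2.domain.local:443 max_fails=3 fail_timeout=30s;
    ntlm;    # built into NGINX Plus, no external module required
}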