Too many open files at 1000 req/sec

Hi,

I’m seeking advice on the most robust way to configure Nginx for a specific scenario that led to a caching issue.

I run a free vector tile map service (https://openfreemap.org/). The server’s primary job is to serve a massive number of small (~70 kB), pre-gzipped PBF files.

To optimize for ocean areas, tiles that don’t exist on disk should be served as a 200 OK with an empty body. These are then rendered as empty space on the map.

Recently, the server experienced an extremely high load: 100k req/sec on Cloudflare, and 1k req/sec on my two Hetzner servers. During this peak, Nginx started serving some existing tiles as empty bodies. Because these responses included cache-friendly headers (expires 10y), the CDN cached the incorrect empty responses, effectively making parts of the map disappear until a manual cache purge was performed.

My goal is to prevent this from happening again. A temporary server overload should result in a server error (e.g., 5xx), not incorrect content that gets permanently cached.

The Nginx error logs clearly showed the root cause of the system error:

2025/08/08 23:08:16 [crit] 1084275#1084275: *161914910 open() "/mnt/ofm/planet-20250730_001001_pt/tiles/8/138/83.pbf" failed (24: Too many open files), client: 172.69.122.170, server: ...

It appears my try_files directive interpreted this “Too many open files” error as a “file not found” condition and fell back to serving the empty tile.

System and Nginx Diagnostic Information

Here is the relevant information about the system and Nginx process state (captured at normal load, after the high-traffic incident was resolved; one worker was still showing high FD usage).

  • OS: Ubuntu 22.04 LTS, 64 GB RAM, local NVME SSD, physical server (not VPS)

  • nginx version: nginx/1.27.4

  • System-wide ulimit for nofile (/etc/security/limits.d):

    # cat /etc/security/limits.d/limits1m.conf
    * soft nofile 1048576
    * hard nofile 1048576
    
  • Nginx Worker Process Limits (worker_rlimit_nofile is set to 300000):

    # for pid in $(pgrep -f "nginx: worker"); do sudo cat /proc/$pid/limits | grep "Max open files"; done
    Max open files            300000               300000               files
    Max open files            300000               300000               files
    ... (all 8 workers show the same limit)
    
  • Open File Descriptor Count per Worker:

    # for pid in $(pgrep -f "nginx: worker"); do count=$(sudo lsof -p $pid 2>/dev/null | wc -l); echo "nginx PID $pid: $count open files"; done
    nginx PID 1090: 57 open files
    nginx PID 1091: 117 open files
    nginx PID 1092: 931 open files
    nginx PID 1093: 65027 open files
    nginx PID 1094: 7449 open files
    ...
    

    (Note the one worker with a very high count, ~98% of which are regular files).

  • sysctl fs.file-max:

    fs.file-max = 9223372036854775807
    
  • systemctl show nginx | grep LimitNOFILE:

    LimitNOFILE=524288
    LimitNOFILESoft=1024
    

Relevant Nginx Configuration

Here are the key parts of my configuration that led to the issue.

worker_processes auto;
worker_rlimit_nofile 300000;

events {
    worker_connections 40000;
    multi_accept on;
}

http {
    open_file_cache max=1000000 inactive=60m;
    open_file_cache_valid 60m;
    open_file_cache_min_uses 1;
    open_file_cache_errors on;
    # ...

Server block tile-serving logic:

location ^~ /monaco/20250806_231001_pt/ {
    alias /mnt/ofm/monaco-20250806_231001_pt/tiles/;
    try_files $uri @empty_tile;
    add_header Content-Encoding gzip;

    expires 10y;

    types {
        application/vnd.mapbox-vector-tile pbf;
    }

    add_header 'Access-Control-Allow-Origin' '*' always;
    add_header Cache-Control public;
    add_header X-Robots-Tag "noindex, nofollow" always;

    add_header x-ofm-debug 'specific PBF monaco 20250806_231001_pt';
}

location @empty_tile {
    return 200 '';

    expires 10y;

    types {
        application/vnd.mapbox-vector-tile pbf;
    }

    add_header 'Access-Control-Allow-Origin' '*' always;
    add_header Cache-Control public;
    add_header X-Robots-Tag "noindex, nofollow" always;

    add_header x-ofm-debug 'empty tile';
}

Full generated config: openfreemap/docs/assets/nginx.conf at commit 1a700fd5df4b0769a834a0564bb6900b3a116efd in the hyperknot/openfreemap repo on GitHub.

Questions

1. I think the combination of multi_accept and an open_file_cache max larger than worker_rlimit_nofile is causing the whole trouble: requests aren’t distributed evenly across workers, and a single worker then hits its file descriptor limit. Can you confirm whether this is the correct take?
2. How should I handle the “missing file should be an empty response, server error should be a 5xx” scenario? I’ve asked 5 LLMs and each gave a different answer, which I’m including below. I’d like your expert opinion rather than trusting LLMs on this.

o3

error_page 404 = @empty_tile;

Gemini

if (!-f $request_filename) {
    return 200 '';
}

Opus

location ^~ /{area}/{version}/ {
  # Check if file exists without opening it
  if (!-f $request_filename) {
      return 404;
  }

  # File exists, try to serve it
  try_files $uri =503;  # Return 503 if can't open (system error)

  add_header Content-Encoding gzip;
  expires 10y;

  types {
      application/vnd.mapbox-vector-tile pbf;
  }

  add_header 'Access-Control-Allow-Origin' '*' always;
  add_header Cache-Control public;
  add_header X-Robots-Tag "noindex, nofollow" always;
  add_header x-ofm-debug 'specific PBF {area} {version}';
}

# Handle 404s (file doesn't exist) - serve empty tile
error_page 404 = @empty_tile_safe;

# Handle 503s (system errors) - don't cache!
error_page 503 = @system_error;

location @empty_tile_safe {
    return 200 '';

    expires 10y;

    types {
        application/vnd.mapbox-vector-tile pbf;
    }

    add_header 'Access-Control-Allow-Origin' '*' always;
    add_header Cache-Control public;
    add_header X-Robots-Tag "noindex, nofollow" always;
    add_header x-ofm-debug 'empty tile (intentional)';
}

location @system_error {
    return 503 'Service temporarily unavailable';

    # SHORT cache for errors - don't poison the CDN cache!
    expires 5s;

    add_header 'Access-Control-Allow-Origin' '*' always;
    add_header Cache-Control "no-cache, must-revalidate";
    add_header Retry-After "5" always;
    add_header x-ofm-debug 'system error - temporary';
}

3. open_file_cache Tuning: My current open_file_cache settings are clearly too aggressive and caused the problem. For a workload of millions of tiny, static files, what would be considered a good configuration for max, inactive, and min_uses?

4. open_file_cache_errors: Should this be on or off? My intent for having it on was to cache the “not found” status for ocean tiles to reduce disk checks. I want to cache file-not-found scenarios, but not server errors. What is the correct usage in this context?

5. Limits: What values would you recommend for worker_rlimit_nofile and worker_connections? Should I raise LimitNOFILESoft?

Hi @hyperknot! First things first, this is a really cool project! One of my colleagues coincidentally ran across your blog post over the weekend and I loved reading through it :grin:

I have shared your questions internally, and in addition to the answers below, one of my colleagues has opened a PR in the OpenFreeMap repo with some suggestions! Now, as for your questions:

1. I think the combination of multi_accept and an open_file_cache max larger than worker_rlimit_nofile is causing the whole trouble: requests aren’t distributed evenly across workers, and a single worker then hits its file descriptor limit. Can you confirm whether this is the correct take?

You are indeed correct that this could be a point of failure under load. It’s hard to know if this was 100% the cause of the issue here, but it’s something to look into.
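
If you want to experiment with that, one illustrative direction (the numbers below are assumptions to benchmark against, not recommendations) is to turn off multi_accept and keep the open_file_cache ceiling well below worker_rlimit_nofile:

events {
    worker_connections 40000;
    multi_accept off;   # accept one new connection at a time per event-loop iteration
}

http {
    # keep cached descriptors bounded well below worker_rlimit_nofile (300000 in your config)
    open_file_cache max=200000 inactive=10m;
}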

2. How should I handle the “missing file should be an empty response, server error should be a 5xx” scenario? I’ve asked 5 LLMs and each gave a different answer, which I’m including below. I’d like your expert opinion rather than trusting LLMs on this.

The Opus suggestion is quite decent. The obvious AI/LLM caveats apply in that it might not work out of the box, but using if and try_files to check whether a file exists and to return a 503 if it can’t be opened is a sound approach.
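
For reference, here is a condensed, untested sketch of that shape, reusing the @empty_tile / @system_error named locations from the snippets above (header and expires directives omitted for brevity):

location ^~ /monaco/20250806_231001_pt/ {
    alias /mnt/ofm/monaco-20250806_231001_pt/tiles/;

    # Tile genuinely missing: return 404 before nginx tries to open the file
    if (!-f $request_filename) {
        return 404;
    }

    # File exists: if it still can't be opened (e.g. EMFILE), fall through to 503
    try_files $uri =503;
}

# 404 (missing tile) becomes a cacheable empty tile; 503 (system error) must not be cached
error_page 404 = @empty_tile;
error_page 503 = @system_error;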

3. open_file_cache Tuning: My current open_file_cache settings are clearly too aggressive and caused the problem. For a workload of millions of tiny, static files, what would be considered a good configuration for max, inactive, and min_uses?

As with most tuning questions, there isn’t a single right answer here. You probably want to tweak the values and benchmark your system (judging by your repo, it seems you are already actively running benchmarks!).
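
Purely as an illustrative starting point for such a benchmark (these numbers are assumptions, not a recommendation), a more conservative configuration could look like:

open_file_cache max=100000 inactive=5m;   # bounded well below worker_rlimit_nofile
open_file_cache_valid 2m;                 # re-validate cached entries every 2 minutes
open_file_cache_min_uses 2;               # a file must be hit twice within 'inactive' to keep its descriptor cached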

4. open_file_cache_errors: Should this be on or off? My intent for having it on was to cache the “not found” status for ocean tiles to reduce disk checks. I want to cache file-not-found scenarios, but not server errors. What is the correct usage in this context?

You can leave this on. Per our docs, open_file_cache_errors only caches file lookup errors; server errors are not open_file_cache errors.

5. Limits: What values would you recommend for worker_rlimit_nofile and worker_connections? Should I raise LimitNOFILESoft?

Setting worker_rlimit_nofile to roughly 2x worker_connections tends to be a good standard. Each worker will usually receive a request, open a connection to your upstream or open a local file, and then maybe do something else that involves caching and/or files.
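
As a rough sketch of how those numbers could line up (the exact values and the drop-in path are illustrative, not tested recommendations):

# nginx.conf: roughly 2x worker_connections, per the rule of thumb above
worker_rlimit_nofile 80000;

events {
    worker_connections 40000;
}

# /etc/systemd/system/nginx.service.d/override.conf
# (apply with: systemctl daemon-reload && systemctl restart nginx)
[Service]
LimitNOFILE=80000

Since worker_rlimit_nofile raises the limit for the worker processes itself, the drop-in mainly matters for the master process, which is otherwise constrained by the soft limit of 1024.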
