TCP Socket Pressure

The TCP/IP ephemeral port range can become a problem on systems that must create many outgoing connections: an FTP daemon serving non-PASV traffic, or perhaps a system that makes many web service calls in response to client requests. Older systems may use a very small range by default (some 4000 ports), while most modern operating systems have increased the range to 16K or more. Despite these increases, a very busy system could still run out of ephemeral ports, depending on:

  1. The number of simultaneous incoming client requests. These are assumed to be modern FTP or HTTP type requests that reach a single port or a small set of ports well below the ephemeral range.
  2. The number of outgoing ephemeral connections required for each client request. Critical system and site-specific processes may also require ephemeral ports, though these are not expected to consume a significant portion of the range. (These system processes might be adversely affected should the ephemeral port range be exhausted, though.)
  3. If TCP, the duration of the ephemeral sessions, including any time spent in TIME_WAIT.
  4. Any other related factors, such as the tcp_tw_reuse or tcp_tw_recycle settings on Linux (tcp_tw_recycle was removed in Linux 4.12, so it only applies to older kernels).
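The factors above can be combined into a rough ceiling on the sustained rate of new outgoing connections: the number of ports in the range divided by the number of seconds each port stays occupied. The following is only a back-of-the-envelope sketch. The function name max_conn_rate is made up for illustration, and the 60 second hold time assumes very short sessions followed by the 60 second TIME_WAIT that Linux uses; other systems and longer sessions will differ.

```shell
#!/bin/sh
# Rough ceiling on sustained new outgoing connections per second:
#   (ports in the range) / (seconds each port stays occupied)
# max_conn_rate is a hypothetical helper, not part of any standard tool.
max_conn_rate() {
    low=$1
    high=$2
    hold=$3    # session duration plus TIME_WAIT, in whole seconds
    echo $(( ($high - $low + 1) / $hold ))
}

# Example: a range of 32768-60999 (the default on recent Linux kernels)
# with each port held for an assumed 60 seconds.
max_conn_rate 32768 60999 60

# On Linux, repeat the estimate using the live setting, if readable.
if [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
    read -r low high < /proc/sys/net/ipv4/ip_local_port_range
    max_conn_rate "$low" "$high" 60
fi
```

With those example numbers the estimate comes out at roughly 470 new connections per second; multiplying the incoming request rate by the outgoing connections per request shows how close a host runs to that ceiling.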

Symptoms may include TCP connections lingering in TIME_WAIT, unexpected system load, or request latency and timeouts. While a site should definitely monitor the number of connections in TIME_WAIT and other relevant metrics over time, audit scripts should also be available to review the settings on a particular host, as a problematic host might have an incorrect setting not (as yet) controlled under configuration management:

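#!/bin/sh
# TODO this is only for Linux, not other Unix.

# Count TCP connections by state (the last column of netstat -tn output).
netstat -tn | awk '/^tcp/{print $NF}' | sort | uniq -c

# printf is used instead of "echo -n", which is not portable under /bin/sh.
printf 'Ephemeral Port Range: '
cat /proc/sys/net/ipv4/ip_local_port_range
printf 'TCP FIN Timeout (sec): '
cat /proc/sys/net/ipv4/tcp_fin_timeout
printf 'TCP TW reuse?: '
cat /proc/sys/net/ipv4/tcp_tw_reuse
printf 'TCP TW recycle?: '
# This file is absent on Linux 4.12 and later, where the setting was removed.
cat /proc/sys/net/ipv4/tcp_tw_recycle 2>/dev/null || echo "n/a"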
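An audit can also report how much of the ephemeral range is currently occupied. The sketch below is one possible approach, not a definitive tool: count_ephemeral is a hypothetical helper that reads netstat -tn style lines on standard input and assumes the local address is the fourth field, as in Linux net-tools output. It counts connections rather than distinct ports, which should be close enough for a quick check.

```shell
#!/bin/sh
# Count TCP connections whose local port lies within [low, high].
# count_ephemeral is a hypothetical helper; it expects `netstat -tn`
# style lines on stdin, with the local address in field 4.
count_ephemeral() {
    awk -v low="$1" -v high="$2" '
        /^tcp/ {
            n = split($4, part, ":")   # the port follows the last colon
            port = part[n] + 0
            if (port >= low + 0 && port <= high + 0) used++
        }
        END { print used + 0 }
    '
}

# On a Linux host with netstat installed, compare usage to the range size.
if [ -r /proc/sys/net/ipv4/ip_local_port_range ] &&
   command -v netstat >/dev/null 2>&1; then
    read -r low high < /proc/sys/net/ipv4/ip_local_port_range
    used=$(netstat -tn | count_ephemeral "$low" "$high")
    echo "Ephemeral ports in use: $used of $(( $high - $low + 1 ))"
fi
```

Graphing this figure over time alongside the TIME_WAIT count would show whether a host is trending toward exhaustion.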

Deliberately consuming most or all of the ephemeral port range would be a productive experiment to run in a lab, either by reducing the ephemeral port range or by generating enough load to consume all 16K or more ephemeral ports. The socket memory requirements of such a load might turn out to be the bottleneck, or the system could instead exhibit strange edge-case behavior. The lab findings may improve production monitoring and debugging techniques, or suggest how to code services so that they survive ephemeral port exhaustion gracefully.
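Reducing the range is probably the cheaper of the two experiments. The following lab-only sketch shrinks the Linux range to 100 ports and restores the previous value afterwards. The helper names range_setting and apply_setting, the DRY_RUN variable, and the 50000 starting port are all assumptions made for illustration. The script only prints what it would do unless DRY_RUN=0 is set, and actually changing the setting requires root.

```shell
#!/bin/sh
# Lab-only sketch: temporarily shrink the Linux ephemeral port range.
# DRY_RUN defaults to 1, so nothing changes unless it is set to 0.
DRY_RUN=${DRY_RUN:-1}

# Print the sysctl setting for COUNT ports starting at LOW.
# (Hypothetical helper, named for this example only.)
range_setting() {
    echo "net.ipv4.ip_local_port_range=$1 $(( $1 + $2 - 1 ))"
}

# Apply a setting with sysctl -w, or just report it in dry-run mode.
apply_setting() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: sysctl -w \"$1\""
    else
        sysctl -w "$1"
    fi
}

# Remember the current range so that it can be restored afterwards.
saved=""
if [ -r /proc/sys/net/ipv4/ip_local_port_range ]; then
    read -r old_low old_high < /proc/sys/net/ipv4/ip_local_port_range
    saved="net.ipv4.ip_local_port_range=$old_low $old_high"
fi

apply_setting "$(range_setting 50000 100)"

# ... generate outgoing connections here and observe the failures ...

if [ -n "$saved" ]; then
    apply_setting "$saved"
fi
```

With only 100 ports available, even a modest load generator should exhaust the range within seconds, which makes the failure mode easy to reproduce and observe.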

In production, exhaustion would most likely follow increased client load (e.g. the Slashdot effect), a scheduled change in which new software pushes the system to new limits, or a combination of the two.

Ideas for mitigation of this issue, in the probably rare case of it occurring:
