nginx Cannot Bind in Docker Container

The problem my team faced was a container running nginx that would start, only for nginx to fail with an error saying its port was already in use.

[emerg] 1#1: bind() to 0.0.0.0:8002 failed (98: Address already in use)
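
With the container stuck in a restart loop, the quickest way to re-read that message was the container's logs. A sketch, using the container name that appears in the inspect command below:

$ docker logs --tail 50 container-with-problem 2>&1 | grep emerg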

The salient configuration of the container was,

$ docker inspect container-with-problem | jq '.[] | {"cmd": .Config.Cmd, "network_mode": .HostConfig.NetworkMode, "port_bindings": .HostConfig.PortBindings, "exposed_ports": .Config.ExposedPorts, "ports": .NetworkSettings.Ports, "bridge": .NetworkSettings.Bridge, "ip_address": .NetworkSettings.IPAddress, "networks": .NetworkSettings.Networks, "volume_mounts": .Mounts}'
{
    "cmd": [
        "bash",
        "-c",
        "nginx -c /etc/nginx/nginx.conf;"
    ],
    "network_mode": "host",
    "port_bindings": null,
    "exposed_ports": {
        "8002/tcp": {}
    },
    "ports": {},
    "bridge": "",
    "ip_address": "",
    "networks": {},
    "volume_mounts": [
        {
            "Type": "bind",
            "Source": "REDACTED",
            "Destination": "REDACTED",
            "Mode": "ro",
            "RW": false,
            "Propagation": "rprivate"
        },
        {
            "Type": "bind",
            "Source": "REDACTED",
            "Destination": "REDACTED",
            "Mode": "ro",
            "RW": false,
            "Propagation": "rprivate"
        },
        {
            "Type": "bind",
            "Source": "REDACTED",
            "Destination": "REDACTED",
            "Mode": "rw",
            "RW": true,
            "Propagation": "rprivate"
        }
    ]
}
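
One detail worth calling out: "network_mode": "host" means there is no port mapping at all (hence the empty port_bindings and ports above); nginx in the container binds directly in the host's network namespace and competes with every other process on the host for port 8002. To see which other containers also run with host networking, something like this works:

$ docker ps -q | xargs docker inspect --format '{{.Name}}  {{.HostConfig.NetworkMode}}'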

I checked who was using port 8002,

$ ss -tplan | grep :8002 | awk '{print $4}' | sort | uniq
10.A.B.6:8002
127.0.0.1:7000
127.0.0.1:8001

I wasn't expecting 10.A.B.6:8002; I expected 127.0.0.1:8002. 10.A.B.6 was the host's IP address. The container should bind to localhost, not the host address.
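
At this point it helps to separate two cases: something actually listening on 8002, versus something merely holding 8002 as the local end of an established connection. ss's filter syntax can split the two; a sketch:

# Is anything LISTENing on 8002?
$ sudo ss -tlnp 'sport = :8002'

# Which established connections use 8002 as their local port?
$ sudo ss -tnp state established '( sport = :8002 )'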

I looked at the open files,

$ sudo lsof -i :8002
lsof: no pwd entry for UID 101
lsof: no pwd entry for UID 101
lsof: no pwd entry for UID 101
COMMAND     PID             USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
lsof: no pwd entry for UID 101
nginx      6510              101   25u  IPv4 306195896      0t0  TCP redacted.redacted.redacted.com:teradataordbms->redacted.redacted.redacted.com:redactedport (ESTABLISHED)
nginx    161218 systemd-coredump 2387u  IPv4 306167339      0t0  TCP redacted.redacted.redacted.com:teradataordbms->redacted.redacted.redacted.com:redactedport (ESTABLISHED)
lsof: no pwd entry for UID 101
nginx   4001159              101   71u  IPv4 306211025      0t0  TCP redacted.redacted.redacted.com:teradataordbms->redacted.redacted.redacted.com:redactedport (ESTABLISHED)
lsof: no pwd entry for UID 101
nginx   4001159              101  154u  IPv4 306211044      0t0  TCP redacted.redacted.redacted.com:teradataordbms->redacted.redacted.redacted.com:redactedport (ESTABLISHED)
nginx   4179792 systemd-coredump   47u  IPv4 306103886      0t0  TCP redacted.redacted.redacted.com:teradataordbms->redacted.redacted.redacted.com:redactedport (ESTABLISHED)
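
As an aside, the repeated "lsof: no pwd entry for UID 101" warnings only mean the host has no passwd entry for UID 101; that UID exists inside a container (in the official nginx image, for example, the nginx user is UID 101). A quick host-side check:

# getent fails because the host's passwd database has no entry for UID 101,
# which is all lsof is complaining about
$ getent passwd 101 || echo "no passwd entry for UID 101 on the host"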

Notice that systemd-coredump was shown as the user for some of the open files on port 8002, and nginx was the process that had opened them. systemd-coredump has user ID 999 and group ID 997 on the host. The container did not have that user ID or group ID at all.

$ id -u systemd-coredump
999

$ id -g systemd-coredump
997
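
To pin down where those nginx processes actually lived, the cgroup of a PID from the lsof output is enough, since the container ID is embedded in the cgroup path. A sketch (the exact path layout depends on the cgroup version and driver):

# Which UID is that nginx process (PID taken from the lsof output above) running as?
$ ps -o pid,uid,user,comm -p 4001159

# Which container does it belong to? Prints "docker-" or "docker/" followed by
# the first 12 characters of the container ID.
$ grep -oE 'docker[-/][0-9a-f]{12}' /proc/4001159/cgroup | head -n 1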

I found a reference, which I cannot substantiate from official documentation (source: the question "Docker install on linux sets data dir owner to systemd-coredump user"):

"directory is owned by systemd-coredump, which apparently happens when the kernel crashes in the middle of some operation"

While I don't know exactly what happened, the symptoms were:

  • nginx in the container could not bind to port 8002
  • the container was in a restart loop
  • nginx process(es) in some container(s) held open sockets on port 8002
  • the nginx process(es) were not always the same each time lsof was run
  • there were active network connections between the seemingly orphaned nginx process(es) and "upstream" servers

I had found the port conflict, which was a great first step.

The next realization was that, since nothing other than the container was supposed to be listening on port 8002, some rogue process (rogue only in the sense that it was unknown to me) had ended up using the same port the container needed.

My team had run into this problem before.

$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 1024 65535

The range of local ports was 1024 to 65535, and 8002 falls within that range. Since net.ipv4.ip_local_port_range governs the ephemeral source ports the kernel hands out for outgoing connections, it was just randomness that an outgoing connection from one of those nginx processes was assigned local port 8002, and that was the conflict.
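
A quick sanity check that a given listening port sits inside the ephemeral range, reading the same values from /proc:

$ read lo hi < /proc/sys/net/ipv4/ip_local_port_range
$ port=8002; [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ] && echo "$port is inside the ephemeral range $lo-$hi"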

The fix proposed by my team was to move the local port range away from the ports we use for our containers.

$ sudo sysctl -w net.ipv4.ip_local_port_range="25000 65535"
net.ipv4.ip_local_port_range = 25000 65535

$ echo 'net.ipv4.ip_local_port_range = 25000 65535' | sudo tee -a /etc/sysctl.conf

$ sudo sysctl -p
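
Not the route my team took, but worth noting as an alternative: the kernel can exclude individual ports from ephemeral assignment via net.ipv4.ip_local_reserved_ports, which leaves the range itself untouched. A sketch:

$ sudo sysctl -w net.ipv4.ip_local_reserved_ports="8002"
$ echo 'net.ipv4.ip_local_reserved_ports = 8002' | sudo tee -a /etc/sysctl.conf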

Restarted Docker,

$ sudo systemctl restart docker

Restarted container,

$ docker restart container-with-problem

And the problem went away. The container started and remained healthy.
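
To confirm the state afterwards, the same checks from earlier apply (the new range should be reported, and only the container's nginx should show up on 8002):

$ sysctl net.ipv4.ip_local_port_range
$ sudo ss -tlnp 'sport = :8002'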