May 05, 2024
Analyzing a network performance issue
In this case, the application team reports intermittent high RTT between a client and a server. Both are Linux virtual machines running under VMware vSphere, and the path between them crosses a lot of firewalls, routers, switches, load balancers… and so on.
The issue
The very first thing is to reproduce the issue from the client, incrementing the source port on each probe so every request can be identified later:
PORT=6000; while true; do echo -ne "${PORT} "; nmap -n -g ${PORT} -p 143 10.0.1.1 | grep scanned; PORT=$((${PORT} + 1)); done
After a while I could see some requests taking several seconds:
6000 Nmap finished: 1 IP address (1 host up) scanned in 0.119 seconds
6001 Nmap finished: 1 IP address (1 host up) scanned in 0.119 seconds
6002 Nmap finished: 1 IP address (1 host up) scanned in 0.119 seconds
6003 Nmap finished: 1 IP address (1 host up) scanned in 0.116 seconds
6004 Nmap finished: 1 IP address (1 host up) scanned in 0.118 seconds
6005 Nmap finished: 1 IP address (1 host up) scanned in 3.861 seconds
6006 Nmap finished: 1 IP address (1 host up) scanned in 0.116 seconds
6007 Nmap finished: 1 IP address (1 host up) scanned in 0.119 seconds
6008 Nmap finished: 1 IP address (1 host up) scanned in 0.128 seconds
6009 Nmap finished: 1 IP address (1 host up) scanned in 0.113 seconds
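Slow probes can also be picked out of the loop output automatically. A small sketch, assuming the output is saved to a hypothetical probes.log file in the format shown above:

```shell
# Print the source port and elapsed time of every probe slower than one second.
# probes.log is assumed to contain lines like:
#   6005 Nmap finished: 1 IP address (1 host up) scanned in 3.861 seconds
awk '$(NF-1) > 1 { print $1, $(NF-1) }' probes.log
```

The second-to-last field is always the scan duration, so the filter does not depend on the exact wording in the middle of the line.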
The issue is now confirmed. Before continuing, make sure the clocks on both client and server are synchronized:
client# clock
Fri 05 Feb 2016 12:28:57 PM CET -0.016003 seconds
server# clock
Fri 05 Feb 2016 12:26:57 PM CET -0.347022 seconds
Assuming the network is stable and deterministic, the next step is to capture the end-to-end traffic on both client and server:
client# tcpdump -i eth0 -s 0 -nn -w /tmp/dump.pcap host 10.0.2.2 and portrange 6000-6999 and host 10.0.1.1 and port 143
server# tcpdump -i eth0 -s 0 -nn -w /tmp/dump.pcap host 10.0.2.2 and portrange 6000-6999 and host 10.0.1.1 and port 143
The source port is pinned to a known range so the tcpdump filters match the test requests only. After a couple of high-latency requests show up, both captures can be stopped and analyzed.
Let’s review some high latency requests:
6007 Nmap finished: 1 IP address (1 host up) scanned in 3.285 seconds
6089 Nmap finished: 1 IP address (1 host up) scanned in 2.911 seconds
6200 Nmap finished: 1 IP address (1 host up) scanned in 4.025 seconds
Let’s dump the handshake of the request with source port 6200, first on the client, then on the server:
client# tcpdump -r /tmp/dump.pcap -nn "port 6200 and tcp[tcpflags] & (tcp-syn) != 0"
reading from file /tmp/dump.pcap, link-type EN10MB (Ethernet)
12:35:02.639543 IP 10.0.2.2.6200 > 10.0.1.1.143: S 2917316624:2917316624(0) win 2048 <mss 1460>
12:35:02.640273 IP 10.0.1.1.143 > 10.0.2.2.6200: S 2145376395:2145376395(0) ack 2917316625 win 5840 <mss 1460>
server# tcpdump -r /tmp/dump.pcap -nn "port 6200 and tcp[tcpflags] & (tcp-syn) != 0"
reading from file /tmp/dump.pcap, link-type EN10MB (Ethernet)
12:35:02.635480 IP 10.0.2.2.6200 > 10.0.1.1.143: S 2917316624:2917316624(0) win 2048 <mss 1460>
12:35:02.635488 IP 10.0.1.1.143 > 10.0.2.2.6200: S 2145376395:2145376395(0) ack 2917316625 win 5840 <mss 1460>
So, we have the following timeline:
- 12:35:02.639543 client sends the SYN packet
- 12:35:02.635480 server receives the SYN packet
- 12:35:02.635488 server sends back the SYN+ACK packet
- 12:35:02.640273 client receives the SYN+ACK packet
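From the two captures, the round-trip time seen by the client, the server's turnaround time, and the clock offset between the two hosts can be computed NTP-style. A sketch using the fractional-second timestamps from the dumps above:

```shell
#!/bin/sh
# Timestamps (seconds within the same minute) taken from the two captures.
t1=2.639543   # client sends SYN        (client clock)
t2=2.635480   # server receives SYN     (server clock)
t3=2.635488   # server sends SYN+ACK    (server clock)
t4=2.640273   # client receives SYN+ACK (client clock)

# NTP-style computation: network round-trip time, time spent inside the
# server, and the server clock offset relative to the client clock.
awk -v t1=$t1 -v t2=$t2 -v t3=$t3 -v t4=$t4 'BEGIN {
    rtt    = (t4 - t1) - (t3 - t2)
    turn   = t3 - t2
    offset = ((t2 - t1) + (t3 - t4)) / 2
    printf "rtt=%.6f turnaround=%.6f offset=%.6f\n", rtt, turn, offset
}'
# prints: rtt=0.000722 turnaround=0.000008 offset=-0.004424
```

The offset of about −4.4 ms explains why the server timestamps appear earlier than the client's; the actual round trip is well under a millisecond.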
Besides a slight clock misalignment of a few milliseconds, the whole handshake completed in under a millisecond on the client's own clock. So the network is not the cause of the latency: the extra seconds are spent on the source host itself, managing processes.