Daily Tshoot: Identifying Misconfigured Servers With Cisco NX-OS
It’s 1 AM when you get a call from the installation team, who have been working on a new server for hours. They say the server is configured and the technician is at the console, but they can’t reach remote locations. “It must be the network!” you hear, probably not for the first time in your career. 😄 This post goes through some of the troubleshooting steps in a scenario like this. We’re starting from the point where you have already verified that the interfaces are up.
Topology
We have a core layer (N7K-1/2) connected to the corporate network, and an aggregation layer (N7K-11/12) providing network access in the data center. We’re running OSPF with ECMP routes between the two layers.
Note: This lab is kept as simple as possible; in a real-world scenario there would be access/leaf switches below the aggregation layer.
IP addresses:
- Server: 192.168.1.55/24
- Gateway HSRP: 192.168.1.1/24
Troubleshooting steps
As I mentioned in the introduction, the initial report says the server is unable to communicate with the corporate network.
Step 1: Checks from the core layer
First check if the subnet is being advertised:
N7K-1# sh ip route 192.168.1.55
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>
192.168.1.0/24, ubest/mbest: 2/0
*via 10.0.0.2, Po11, [110/11], 1w1d, ospf-1, intra
*via 10.0.1.2, Po12, [110/11], 1w1d, ospf-1, intra
Then ping the HSRP IP:
N7K-1# ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1): 56 data bytes
64 bytes from 192.168.1.1: icmp_seq=0 ttl=126 time=4.626 ms
64 bytes from 192.168.1.1: icmp_seq=1 ttl=126 time=0.554 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=126 time=0.541 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=126 time=0.557 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=126 time=0.62 ms
We can safely say that routing and the FHRP work fine. Now let’s see if the server responds to ping:
N7K-1# ping 192.168.1.55
PING 192.168.1.55 (192.168.1.55): 56 data bytes
Request 0 timed out
Request 1 timed out
Request 2 timed out
^C
Step 2: Checks on the aggregation layer
Let’s see if we can ping the server from here:
N7K-11# ping 192.168.1.55
PING 192.168.1.55 (192.168.1.55): 56 data bytes
64 bytes from 192.168.1.55: icmp_seq=0 ttl=126 time=4.626 ms
64 bytes from 192.168.1.55: icmp_seq=1 ttl=126 time=0.554 ms
64 bytes from 192.168.1.55: icmp_seq=2 ttl=126 time=0.541 ms
64 bytes from 192.168.1.55: icmp_seq=3 ttl=126 time=0.557 ms
64 bytes from 192.168.1.55: icmp_seq=4 ttl=126 time=0.62 ms
Yep, it’s available.
Step 3: Preliminary conclusion
If a server answers pings from its own subnet but not from outside it, the cause is usually either a firewall policy or a misconfigured host. For simplicity, we’re not using any firewalls here, so the most likely culprit is a wrong default gateway on the server. But how can we prove it? Let’s check whether the server is trying to resolve an address other than the real gateway.
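The reasoning above rests on how a host picks the address it sends ARP requests for. Here is a minimal Python sketch of that decision, assuming the server uses the /24 mask from the address list and has 192.168.1.11 configured as its gateway (the value we suspect). The function name arp_target and the core-side address 10.0.0.1 are made up for illustration only.

```python
import ipaddress


def arp_target(host_interface: str, gateway: str, destination: str) -> str:
    """Return the IP a host would ARP for when sending to `destination`.

    host_interface is the host's own address with prefix length,
    e.g. "192.168.1.55/24".
    """
    network = ipaddress.ip_interface(host_interface).network
    if ipaddress.ip_address(destination) in network:
        # On-link destination: resolve the destination itself.
        return destination
    # Off-link destination: resolve the configured default gateway.
    return gateway


server = "192.168.1.55/24"
wrong_gateway = "192.168.1.11"  # assumption: the mistyped gateway

# Reply to the aggregation switch (same subnet): ARP goes to the switch itself.
print(arp_target(server, wrong_gateway, "192.168.1.2"))  # 192.168.1.2
# Reply to the core (hypothetical 10.0.0.1): ARP goes to the wrong gateway.
print(arp_target(server, wrong_gateway, "10.0.0.1"))     # 192.168.1.11
```

This matches the symptoms: on-link pings never touch the gateway setting, so they succeed, while replies to anything outside the subnet depend on an ARP request that nobody answers.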
Step 4: Capturing ARP traffic from server
When a host wants to reach a device on its own subnet, it resolves the destination’s MAC address using ARP; for anything outside the subnet, it resolves the MAC address of its default gateway instead. If the address it is looking for isn’t in use, nobody answers and the host keeps sending ARP requests. Let’s catch this behavior! First, look up the server’s MAC address:
N7K-11# sh ip arp 192.168.1.55
Flags: * - Adjacencies learnt on non-active FHRP router
+ - Adjacencies synced via CFSoE
# - Adjacencies Throttled for Glean
D - Static Adjacencies attached to down interface
IP ARP Table
Total number of entries: 1
Address Age MAC Address Interface
192.168.1.55 00:17:48 0016.3500.0001 Vlan100
Now we just have to use ethanalyzer to see if there are repeated ARP queries from 0016.3500.0001:
N7K-11# ethanalyzer local interface inband capture-filter "arp && ether host 0016.3500.0001"
Capturing on inband
1 2016-05-18 10:34:33.561866 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
2 2016-05-18 10:34:35.370072 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
3 2016-05-18 10:34:38.568311 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
4 2016-05-18 10:34:40.370053 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
5 2016-05-18 10:34:42.370027 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
7 2016-05-18 10:34:43.491489 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
8 2016-05-18 10:34:44.577075 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
9 2016-05-18 10:34:46.369973 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
10 2016-05-18 10:34:47.489782 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
Ta-dah! Tell the installation team to change the gateway IP from .11 to .1.
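If the capture is long or noisy, counting the requests can be easier than eyeballing them. The following is a rough Python sketch, not a Cisco tool: it assumes you have pasted the ethanalyzer output into a string and that the lines look like the ones above. The helper name count_arp_targets is hypothetical.

```python
import re
from collections import Counter

# Matches the "Who has X? Tell Y" part of an ARP request summary line.
ARP_REQUEST = re.compile(
    r"Who has (\d+\.\d+\.\d+\.\d+)\?\s+Tell (\d+\.\d+\.\d+\.\d+)"
)


def count_arp_targets(capture_text: str, sender_ip: str) -> Counter:
    """Count how often `sender_ip` asked for each target IP."""
    counts = Counter()
    for line in capture_text.splitlines():
        match = ARP_REQUEST.search(line)
        if match and match.group(2) == sender_ip:
            counts[match.group(1)] += 1
    return counts


sample = """\
1 2016-05-18 10:34:33.561866 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
2 2016-05-18 10:34:35.370072 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
3 2016-05-18 10:34:38.568311 Hewlett-_00:00:01 -> Broadcast ARP 60 Who has 192.168.1.11? Tell 192.168.1.55
"""

print(count_arp_targets(sample, "192.168.1.55").most_common(1))
# [('192.168.1.11', 3)]
```

A target that is requested over and over without ever showing up in the ARP table is a strong hint that the host is chasing an address that doesn’t exist.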
A possible workaround
Let’s assume the onsite team is not able to change the gateway IP because they’re only responsible for the application hosted on the server, and the OS team has already left (not that something like this would ever happen 😄). They have to finish their work, otherwise the company has to pay a penalty, so they ask you to make it work somehow. There is a way to do that: you can add the mistyped IP as a secondary HSRP IP (the last line of the output below):
N7K-11(config-if)# sh run int vl100
interface Vlan100
description Application_servers_192.168.1.0/24
no shutdown
no ip redirects
ip address 192.168.1.2/24
ip ospf passive-interface
ip router ospf 1 area 0.0.0.0
ip pim sparse-mode
hsrp 1
preempt
priority 150
ip 192.168.1.1
ip 192.168.1.11 secondary
You can remove it later, once the OS team has corrected the gateway on the server.
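Before adding an address like this, it’s worth a quick sanity check that it belongs to the VLAN’s subnet and isn’t already taken by another device. Below is a small Python sketch of that check, assuming the 192.168.1.0/24 subnet from the interface description. The function name and the list of addresses in use are illustrative, not taken from any real inventory.

```python
import ipaddress


def can_use_as_secondary_vip(svi_address: str, candidate: str, in_use: list) -> bool:
    """Check whether `candidate` is a usable extra virtual IP on the SVI's subnet."""
    network = ipaddress.ip_interface(svi_address).network
    ip = ipaddress.ip_address(candidate)
    if ip not in network:
        return False  # outside the VLAN's subnet
    if ip in (network.network_address, network.broadcast_address):
        return False  # reserved addresses
    used = {ipaddress.ip_address(a) for a in in_use}
    return ip not in used  # must not collide with an existing device


# Assumption: addresses known to be in use on VLAN 100 in this lab.
known_addresses = ["192.168.1.1", "192.168.1.2", "192.168.1.55"]

print(can_use_as_secondary_vip("192.168.1.2/24", "192.168.1.11", known_addresses))  # True
print(can_use_as_secondary_vip("192.168.1.2/24", "192.168.1.55", known_addresses))  # False
```

In this scenario the unanswered ARP requests already suggest that nobody owns 192.168.1.11, but checking against your IP address management data is safer than relying on a short capture.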