Andras Dosztal
Network architect
May 18, 2016 5 min read

Daily Tshoot: Identifying Misconfigured Servers With Cisco NX-OS

It’s 1 AM when you get a call from the installation team, who have been working on a new server for hours. They state that the server is configured and the technician is at the console, but they can’t reach remote locations. “It must be the network!” you hear, perhaps not for the first time in your career. 😄 This post goes through some of the troubleshooting steps in a scenario like this. We’re starting from the point where you have already checked that the interfaces are up.

Topology

We have a core layer (N7K-1/2) connected to the corporate network, and an aggregation layer (N7K-11/12) providing network access in the data center. We’re running OSPF with ECMP routes between the two layers.

Note: This lab was built to be as simple as possible; in a real-world scenario there would be access/leaf switches below the aggregation layer.

IP addresses:

  • Server: 192.168.1.55/24
  • Gateway HSRP: 192.168.1.1/24

Troubleshooting steps

As I mentioned in the introduction, the initial report says the server is unable to communicate with the corporate network.

Step 1: Checks from the core layer

First check if the subnet is being advertised:

N7K-1# sh ip route 192.168.1.55
IP Route Table for VRF "default"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

192.168.1.0/24, ubest/mbest: 2/0
    *via 10.0.0.2, Po11, [110/11], 1w1d, ospf-1, intra
    *via 10.0.1.2, Po12, [110/11], 1w1d, ospf-1, intra

Then ping the HSRP IP:

N7K-1# ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1): 56 data bytes
64 bytes from 192.168.1.1: icmp_seq=0 ttl=126 time=4.626 ms
64 bytes from 192.168.1.1: icmp_seq=1 ttl=126 time=0.554 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=126 time=0.541 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=126 time=0.557 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=126 time=0.62 ms

We can safely say that routing and the FHRP work fine. Now let’s see if the server responds to ping:

N7K-1# ping 192.168.1.55
PING 192.168.1.55 (192.168.1.55): 56 data bytes
Request 0 timed out
Request 1 timed out
Request 2 timed out
^C

Step 2: Checks on the aggregation layer

Let’s see if we can ping the server from here:

N7K-11# ping 192.168.1.55
PING 192.168.1.55 (192.168.1.55): 56 data bytes
64 bytes from 192.168.1.55: icmp_seq=0 ttl=126 time=4.626 ms
64 bytes from 192.168.1.55: icmp_seq=1 ttl=126 time=0.554 ms
64 bytes from 192.168.1.55: icmp_seq=2 ttl=126 time=0.541 ms
64 bytes from 192.168.1.55: icmp_seq=3 ttl=126 time=0.557 ms
64 bytes from 192.168.1.55: icmp_seq=4 ttl=126 time=0.62 ms

Yep, it’s available.

Step 3: Preliminary conclusion

If a server is pingable from its own subnet but not from outside, the cause is usually either a firewall policy or a misconfigured host. For simplicity we’re not using any firewalls here, so we’re quite sure the gateway IP is misconfigured on the server. But how can we prove it? Well, let’s check if the server is trying to resolve another address.
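
To see why a wrong gateway produces exactly this symptom, here is a minimal Python sketch of the next-hop decision a typical IP host makes. The function name `arp_target`, the wrong gateway value 192.168.1.254, and the remote address 10.0.0.1 are all made up for illustration; only the server address comes from this lab.

```python
import ipaddress

def arp_target(host_if: str, gateway: str, destination: str) -> str:
    """Return the IP a host would ARP for when sending to `destination`."""
    subnet = ipaddress.ip_interface(host_if).network
    dest = ipaddress.ip_address(destination)
    # On-link destinations are resolved directly; everything else goes via the gateway.
    return str(dest) if dest in subnet else gateway

# Server address from this lab; the gateway is a hypothetical wrong setting.
server = "192.168.1.55/24"
wrong_gw = "192.168.1.254"

print(arp_target(server, wrong_gw, "192.168.1.2"))  # same subnet: ARPs for the peer itself
print(arp_target(server, wrong_gw, "10.0.0.1"))     # remote: ARPs for the (wrong) gateway
```

Traffic inside the subnet never touches the gateway setting, which is why the ping from the aggregation layer works while anything routed fails.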

Step 4: Capturing ARP traffic from server

When a host wants to talk to another device, it first resolves the MAC address of the next hop using ARP: the destination itself if it’s on the same subnet, the default gateway otherwise. If that IP is not in use, nobody answers and the host keeps sending ARP requests. Let’s catch this behavior! First, look up the server’s MAC address:

N7K-11# sh ip arp 192.168.1.55
Flags: * - Adjacencies learnt on non-active FHRP router
       + - Adjacencies synced via CFSoE
       # - Adjacencies Throttled for Glean
       D - Static Adjacencies attached to down interface

IP ARP Table
Total number of entries: 1
Address         Age       MAC Address     Interface
192.168.1.55    00:17:48  0016.3500.0001  Vlan100

Now we just have to use ethanalyzer to see if there are repeated ARP queries from 0016.3500.0001:

N7K-11# ethanalyzer local interface inband capture-filter "arp && ether host 0016.3500.0001"
Capturing on inband
1 2016-05-18 10:34:33.561866 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
2 2016-05-18 10:34:35.370072 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
3 2016-05-18 10:34:38.568311 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
4 2016-05-18 10:34:40.370053 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
5 2016-05-18 10:34:42.370027 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
7 2016-05-18 10:34:43.491489 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
8 2016-05-18 10:34:44.577075 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
9 2016-05-18 10:34:46.369973 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
10 2016-05-18 10:34:47.489782 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55

Ta-dah! Tell the installation team to change the gateway IP from .11 to .1.
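
If the capture is longer or noisier than this one, counting the requests can help. The following is a rough Python sketch, assuming you have saved the ethanalyzer text output and that its lines look like the ones above; the function name `count_arp_requests` and the regular expression are my own, not part of any Cisco tooling.

```python
import re
from collections import Counter

# Matches the "Who has <target>?  Tell <sender>" part of an ethanalyzer ARP line.
ARP_REQUEST = re.compile(r"Who has (\d+\.\d+\.\d+\.\d+)\?\s+Tell (\d+\.\d+\.\d+\.\d+)")

def count_arp_requests(capture: str) -> Counter:
    """Count ARP requests per (sender, target) pair in ethanalyzer text output."""
    counts = Counter()
    for line in capture.splitlines():
        match = ARP_REQUEST.search(line)
        if match:
            target, sender = match.groups()
            counts[(sender, target)] += 1
    return counts

# First three lines of the capture shown in this post.
sample = """\
1 2016-05-18 10:34:33.561866 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
2 2016-05-18 10:34:35.370072 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
3 2016-05-18 10:34:38.568311 Hewlett-_00:00:01 -> Broadcast    ARP 60 Who has 192.168.1.11?  Tell 192.168.1.55
"""

for (sender, target), n in count_arp_requests(sample).items():
    print(f"{sender} asked for {target} {n} times")
```

A sender that keeps asking for the same target every couple of seconds without ever getting a reply is a strong hint that the target address is the misconfigured gateway.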

A possible workaround

Let’s assume the onsite team is not able to change the gateway IP because they’re only responsible for the application hosted on the server, and the OS team has already left (not that something like this would ever happen 😄). They have to finish their work, otherwise the company has to pay a penalty, so they ask you to make it work somehow. There’s a way to solve the problem: you can add the mistyped IP as a secondary HSRP IP (the last line of the config below):

N7K-11(config-if)# sh run int vl100

interface Vlan100
  description Application_servers_192.168.1.0/24
  no shutdown
  no ip redirects
  ip address 192.168.1.2/24
  ip ospf passive-interface
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
  hsrp 1
    preempt
    priority 150
    ip 192.168.1.1
    ip 192.168.1.11 secondary

You can remove it later, once the OS team has fixed the gateway setting on the server.