Monday, August 6, 2012

I'm using RackConnect and after my cloud server is built I can't ping anything

RackConnect is a Rackspace product that lets you combine dedicated hardware with the cloud products (Cloud Servers, Cloud Files and others). It gives cloud servers a secure way to reach the dedicated environment and the other way around. The majority of implementations are done with F5 load balancers and Cisco ASA firewalls, but the company plans to support other vendors as well. To get a better picture of RackConnect, an implementation can be done, for example, on:
  • standalone ASA
  • standalone F5
  • HA ASA 
  • HA F5 
  • single ASA and single F5 
  • single ASA and HA F5
  • HA ASA and single F5 
  • HA ASA and HA F5
There are sometimes situations when things don't work as expected, usually because of a misconfiguration and, in very rare cases, because of a bug. Below is an example of a problem I was asked to help troubleshoot.

Problem 

My newly built cloud server is isolated. After logging in over the console I can't ping any other servers. For example:
  • I can't ping my default gateway from the cloud server
  • I can't ping any other internal or external address (for example Google DNS 8.8.8.8)
Troubleshooting and Analysis

The customer is using RackConnect, which makes the network and routing configuration a little different compared to how a standard cloud server without RackConnect is set up.

Example of the routing table from a cloud server that is RackConnected:

$ ip r | sort | column -t
10.176.0.0/12    via  10.176.0.1     dev    eth1
10.176.0.0/18    dev  eth1           proto  kernel  scope   link  src  10.176.4.37
10.191.192.0/18  via  10.176.0.1     dev    eth1
default          via  10.176.11.111  dev    eth1

For comparison, the routing table from a cloud server on a cloud account that is not linked to RackConnect:

# ip r | sort | column -t
10.176.0.0/12     via  10.177.128.1   dev    eth1
10.177.128.0/18   dev  eth1           proto  kernel  scope   link  src  10.177.132.15
10.191.192.0/18   via  10.177.128.1   dev    eth1
164.177.146.0/24  dev  eth0           proto  kernel  scope   link  src  164.177.146.87
default           via  164.177.146.1  dev    eth0    metric  100
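
The key difference is that on the RackConnected server the default route goes via 10.176.11.111 on eth1, while the standard server defaults out of its public interface eth0. A quick way to confirm which gateway and interface the kernel would pick for a given destination is ip route get (shown here as a generic check, not output captured from the affected server):

$ ip route get 8.8.8.8          # shows the gateway ("via ...") and outgoing interface the kernel picks
$ ip route get 10.176.11.111    # on the RackConnected server the gateway itself should show as directly connected on eth1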

Starting to troubleshoot, I was able to reproduce the issue and confirm that you can't ping the default gateway or any internal or external IP:

$ ping 10.176.11.111
PING 10.176.11.111 (10.176.11.111) 56(84) bytes of data.
^C
--- 10.176.11.111 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 1293ms

Looking further, I saw that the host fails to resolve the IP 10.176.11.111 (the default gateway) to its MAC address. The following output from the cloud server confirmed this.

$ arp -an
? (10.176.11.111) at <incomplete> on eth1
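
The same check can be done with iproute2, which is handy on servers where net-tools is not installed; this is an equivalent command rather than output from the original session:

$ ip neigh show 10.176.11.111

An entry stuck in the INCOMPLETE or FAILED state confirms that no ARP reply is coming back for the gateway.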

As the customer was using RackConnect with an F5 load balancer, I went there to check what traffic was hitting the LB. I confirmed that the cloud default gateway is defined on the F5 as a self IP object, so the F5 should be responding to the pings.

# tmsh list /net self
net self 10.176.11.111/18 {
    vlan hybridServiceNet-100
}

When pinging from the cloud server I saw that the F5 sees the ARP request but never replies.

[lb:Active]~ # tcpdump -s0 -l -nn -i 0.0:nnn arp or icmp or host 10.176.4.37 | grep --color=always 10.176.4.37
15:14:26.465108 arp who-has 10.176.11.111 tell 10.176.4.37


I tried to ping the cloud server from the F5 and, to my surprise, it worked fine.

At some point, shortly after I had pinged the cloud server from the F5 successfully, I could even see ICMP requests coming from the cloud server, but the F5 still wasn't responding. That makes sense: once the cloud server learned the MAC address of its default gateway it stopped sending ARP requests and proceeded to sending ICMP requests, as expected.

The cloud server kept the MAC of its default gateway for a while, but once the ARP entry timed out there were only ARP requests on the wire again.

[lb:Active] ~ # tcpdump -s0 -l -nn -i 0.0:nnn arp or icmp or host 10.176.4.37 | grep --color=always 10.176.4.37
15:20:35.309694 IP 10.176.4.37 > 10.176.11.111: ICMP echo request, id 36453, seq 1, length 64 in slot1/tmm0 lis= flowtype=0 flowid=0 =00000000:00000000:00000000:00000000 remoteport=0 localport=0 proto=0 vlan=0
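
On the cloud server side this learn-and-expire cycle can be watched in the neighbour cache (an illustrative check, not something captured during the original troubleshooting):

$ watch -n 1 'ip neigh show 10.176.11.111'

The entry goes from REACHABLE to STALE and eventually to FAILED, because the probes the kernel sends to refresh it are never answered by the F5.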

Looking further at the F5 configuration, I found that the packet filter needed to allow this traffic was missing. This is important because the default RackConnect implementation drops any traffic sent from the cloud network towards the RackConnect device (the F5 in my customer's case). Below is an example of the filter that was needed for the ping to start working.

Example of a missing filter

net packet-filter RCAuto-ID_8-NP_28780-CS_57663-GW_62497 {
    action accept
    order 8
    rule "( src host 10.176.4.37 ) and ( dst host 10.176.11.111 )"
}

The easy way to create it is to go to the MyRackspace portal and add all the basic network policies under RackConnect. By adding these network policies, the RackConnect system generated a number of rules on the F5, and one of them was the missing filter described above:

Policy                               Source         Destination    Protocol  Ports
Basic Access Configuration Policy 1  CLOUD SERVERS  CLOUD SERVERS  ANY       ALL
Basic Access Configuration Policy 2  CLOUD SERVERS  DEDICATED      ANY       ALL
Basic Access Configuration Policy 3  CLOUD SERVERS  INTERNET       ANY       ALL
Basic Access Configuration Policy 4  DEDICATED      CLOUD SERVERS  ANY       ALL
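
Once the policies are pushed, the resulting filters can be verified straight from tmsh on the F5 (the filter names are auto-generated by the RackConnect automation, so they will differ per environment):

# tmsh list /net packet-filter

Any rule matching the cloud server's source address, like the RCAuto filter shown above, should now be present.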

This resolved the issue with the default gateway, but I was still unable to ping 8.8.8.8. Once again, looking at tcpdumps on the F5, I saw the traffic hitting the F5, but the load balancer never forwarded it any further; it was simply dropped on the incoming VLAN.

To understand why this happens, it is important to know that by default an F5 will drop any traffic unless there is a configuration object to handle it. In my case there were no specific virtual servers, NATs or SNATs that could handle the traffic; we had only a forwarding virtual server. As that VS was enabled only on specific VLANs, it was not picking up the cloud traffic, so everything arriving on the cloud-facing VLAN was dropped.

ltm virtual VS-FORWARDING {
    destination any:any
    ip-forward
    mask any
    profiles {
        PROF-FASTL4-FORWARDING { }
    }
    translate-address disabled
    translate-port disabled
    vlans {
        internal
        external
    }
    vlans-enabled
}

Once we enabled the VS on the necessary VLANs, all issues were resolved ;).
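
For reference, the fix itself is a small tmsh change. The VLAN name below is taken from the self IP shown earlier and is only my assumption about what the cloud-facing VLAN is called in this environment:

# tmsh modify /ltm virtual VS-FORWARDING vlans add { hybridServiceNet-100 }
# tmsh save /sys config

Because the virtual server already has vlans-enabled set, adding the VLAN to its list is enough for the forwarding VS to start handling traffic arriving on it.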
