Showing posts with label high availability. Show all posts

Friday, March 14, 2014

Interface redundancy on the host with TCP Multipath

TCP and UDP are the protocols hosts use to exchange data. They have been in use for decades, and how they work is very well documented.

Everyone knows the problem: when you lose the active link on a server, all of your TCP sessions die with it. Let's say your server has 2 active interfaces. By default there is no way to move or migrate a TCP session to the other active interface, so the other link can't be used automatically as a fallback mechanism.

There are a couple of reasons why this doesn't work. The simplest one is that the new link uses a different IP address. Even if the Linux kernel started using the new interface and sending IP/TCP packets sourced from the new IP address, these packets wouldn't be recognized on the remote side. The remote side expects TCP segments from one and only one source IP.
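To make the point above concrete, here is a minimal sketch of how a remote host matches incoming segments to a session. It is a toy model, not kernel code: the `FourTuple` and `ConnectionTable` names and all addresses are hypothetical, chosen only for illustration.

```python
from typing import Dict, NamedTuple, Optional


class FourTuple(NamedTuple):
    """The four values that together identify one TCP connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class ConnectionTable:
    """Toy stand-in for the remote host's table of established sessions."""

    def __init__(self) -> None:
        self._established: Dict[FourTuple, str] = {}

    def establish(self, key: FourTuple, name: str) -> None:
        self._established[key] = name

    def lookup(self, key: FourTuple) -> Optional[str]:
        # A segment matches a session only if all four fields are identical.
        return self._established.get(key)


# Hypothetical addresses taken from the documentation ranges.
remote = ConnectionTable()
over_eth0 = FourTuple("192.0.2.10", 40000, "198.51.100.5", 443)
remote.establish(over_eth0, "session-1")

# Same client port, but sourced from the second interface's address.
over_eth1 = over_eth0._replace(src_ip="192.0.2.20")

print(remote.lookup(over_eth0))  # session-1
print(remote.lookup(over_eth1))  # None
```

The second lookup finds nothing because the source IP is part of the key. A real TCP stack would typically answer such a segment with a reset rather than attach it to the existing session.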

Problem

How to provide link-level redundancy on the server so that a TCP session stays alive even if one interface experiences an error.

Analysis and solution Demonstration

The problem can be seen as a more generic issue: how to implement multihoming or link redundancy. There are a couple of working solutions out there. The simplest example:
  • Link bonding (link aggregation) on the server; requires support and proper configuration on both the switch and the server
We will look at another one: TCP Multipath. What is cool about it is that it is transparent to your application. It virtualizes the session and presents a single TCP session to the application, which then benefits from built-in multipath redundancy at the kernel level.
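The post describes TCP Multipath as transparent to the application. As an assumption that goes beyond this post: the MPTCP support later merged into the mainline Linux kernel is opt-in per socket, requested through the `IPPROTO_MPTCP` protocol number. The sketch below tries that and falls back to plain TCP when it is not available. The helper name `open_stream_socket` is hypothetical.

```python
import socket

# Linux protocol number for MPTCP. socket.IPPROTO_MPTCP only exists on
# newer Python builds on Linux, so fall back to the numeric value elsewhere.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def open_stream_socket(family: int = socket.AF_INET):
    """Return (sock, is_mptcp): an MPTCP socket if possible, else plain TCP."""
    try:
        return socket.socket(family, socket.SOCK_STREAM, IPPROTO_MPTCP), True
    except OSError:
        # Kernel without MPTCP support, MPTCP disabled, or a non-Linux OS.
        return socket.socket(family, socket.SOCK_STREAM, socket.IPPROTO_TCP), False


sock, is_mptcp = open_stream_socket()
print("MPTCP" if is_mptcp else "plain TCP")
sock.close()
```

Either way the application gets an ordinary stream socket and uses it with the usual `connect`, `send` and `recv` calls, which matches the single-session view described above.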


References

http://multipath-tcp.org/
Decoupled from IP, TCP is at last able to support multihomed hosts
https://devcentral.f5.com/articles/multipath-tcp-mptcp
https://devcentral.f5.com/articles/the-evolution-of-tcp


Wednesday, September 11, 2013

VLAN failsafe problem for F5 HA new builds

The F5 load balancers are powerful devices that support a variety of high availability features. A full list and description can be found in "Configuring High Availability" in the "TMOS Management Guide for BIG-IP Systems" document.

Supported HA features:

  • System fail-safe: monitors the switch board component and a set of key system services.
  • Gateway fail-safe: monitors traffic between the BIG-IP system and a gateway router.
  • VLAN fail-safe: monitors traffic on a VLAN.

Problem

Both F5s in an HA cluster fail over and go into standby mode when fail-safe is enabled on a VLAN that doesn't see any traffic.

Analysis and workaround descriptions

When you are building a new HA cluster this is not going to cause any major issues. Usually a new cluster is built without any servers behind the LTM devices. If both devices go into standby mode it may be surprising, but a simple ping from one F5 to the other should bring the pair back to standby/active.

Unfortunately the issue can also be experienced in production when you add a new VLAN to a running HA cluster. If the VLAN has fail-safe enabled and there are no devices behind the F5 LTMs on it, the new VLAN may trigger a failover on both nodes. As expected, it is caused by VLAN fail-safe firing when there is no traffic on the VLAN.
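The failure mode can be sketched with a toy timer model. This is only an illustration of the reasoning above, not F5's implementation: the `VlanFailSafe` class, the unit names and the 90-second timeout are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VlanFailSafe:
    """Toy model of one unit's fail-safe timer for a single VLAN."""
    timeout: float             # seconds of silence tolerated (illustrative)
    last_traffic: float = 0.0  # time the last frame was seen on the VLAN

    def frame_seen(self, now: float) -> None:
        self.last_traffic = now

    def expired(self, now: float) -> bool:
        return now - self.last_traffic >= self.timeout


def units_failing_over(units: Dict[str, VlanFailSafe], now: float) -> List[str]:
    """Names of the units whose fail-safe timer has run out."""
    return [name for name, fs in units.items() if fs.expired(now)]


units = {
    "f5-a": VlanFailSafe(timeout=90.0),
    "f5-b": VlanFailSafe(timeout=90.0),
}

# Empty VLAN: neither unit ever sees a frame, so both timers expire together.
print(units_failing_over(units, now=90.0))  # ['f5-a', 'f5-b']

# Any traffic on the VLAN before the timeout keeps both units in place.
for fs in units.values():
    fs.frame_seen(now=60.0)
print(units_failing_over(units, now=90.0))  # []
```

Because both units watch the same silent VLAN, their timers run out at the same moment, which is why the whole cluster is affected rather than a single node.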

There are a couple of workarounds that can be applied; two examples are listed below.

Workaround 1

As per SOL13297: Overview of VLAN failsafe (10.x - 11.x), set the LTM database variable failover.vlanfailsafe.resettimeronanyframe to true.
 
modify /sys db failover.vlanfailsafe.resettimeronanyframe value [true|false]

Workaround 2

Create a pool with all self IPs of the affected VLAN as members. This allows the LTM to detect traffic on the new VLAN and prevents fail-safe from kicking in.

Sunday, August 12, 2012

High Availability trends and architectures in cloud computing

Every now and then there is a shift in the IT industry. The changes try to solve our old problems and also to anticipate and introduce new ideas. Often it is not a change in hardware alone or in software alone but a mixture of both.

The Cloud is today's buzzword that drives the changes and powers the transformation. In my attempts to embrace and understand what it is, I have found a couple of videos on YouTube that give a very nice inside view of the cloud: what it is, how it works and what ideas it brings with it.

This is a link to the repository of all videos: http://www.youtube.com/user/TheCloudcastNET

The one I particularly like, because it discusses the concept of High Availability in the Cloud: