apr2002.tar

Geographic Failover for Cheapskates

Rob King

I work for a company whose network carries real-time financial data between banks and customers, so it's absolutely necessary that the network be up 24 hours a day. To ensure reliability, even if the building itself were to burn down, fall over, and sink into the swamp, the company created a Fallback Center. The original plan (before I was hired) was to have the Fallback Center in the data center of our primary ISP. However, what if the ISP were to sink into the swamp as well? The company had also made arrangements that if the primary center went offline, someone would have to call the ISP and have them manually change routing tables to point to the new machines -- the problem with this should be obvious. The problem for me became how to do failover between two locations, on two different ISPs, without any single points-of-failure. We could have had DNS servers at a third location (or two third parties), which would return round-robin answers for all the machines on the network. This wouldn't do though, because we wanted failover, not load-balancing. We could have purchased a DNS-based failover box, but what if this box were cut off from the network? Besides, they're expensive!

The Solution

I registered our primary and secondary DNS servers with the appropriate authority. The primary DNS server would be at the primary location, in the primary location's IP space. The secondary DNS server would be at the Fallback Center, in the Fallback Center's IP space. The secondary would be slaved to the primary, and the primary DNS server's records would point to the primary location's network. The secondary DNS server, because it's slaved, would also point to the primary network.

I then created a script (run by cron on the secondary server) that performs nslookups on a known good host, using the primary as a server. Then, if it can't resolve the known host, the script copies a set of DNS maps (which point to the machines at the Fallback Center) over the original DNS maps and does an ndc restart. Since the primary DNS server is unreachable, all queries must now go to the secondary DNS server, which now points to the Fallback Center. See Listing 1. (All code listings for this article can be found at: http://www.sysadminmag.com/code/.)

An Example

Assume you have a Web server at the primary location, with the IP address 192.168.10.10. You also have a fallback Web server at the fallback location, with the IP address 192.168.20.20. The nameservers point www.sysadminexample.com to 192.168.10.10. A user goes to www.sysadminexample.com and gets 192.168.10.10. So far, so good. Suddenly, the sys admin spills his caffienated-beverage-of-choice on the 7206 and the primary center goes offline. The DNS server at the fallback center sees this and copies the DNS maps to the fallback center over the originals, and restarts the nameserver. Another user tries to go to www.sysadminexample.com. The primary namserver times out, and the user automatically queries the secondary nameserver and gets 192.168.20.20. Problem solved. Local DNS caches could be a problem (but that's a "reload" away) and affect any sort of DNS-based failover. Listing 2 is the script to run when you're ready to fail back to the Primary Center.

Another option is to set up an SNMP daemon on the primary namserver, have machines send SNMP traps for critical issues, and have a script to stop the primary nameserver. You could theoretically use this for any number of Fallback Centers, though the timeout would get to be a problem. To use this for time-of-day routing, place a script in cron on the primary to stop the nameserver at certain times.

I hope this helps you sleep a little better at night. It serves our purposes well, although we thankfully haven't had to use it in a real failover...yet.

Rob King has been a UNIX/Network Administrator37 for four years. He currently is the Network Administrator38 for a banking firm in the Grand Duchy of Luxembourg. Rob would also like to say "hi" to the lovely Lesley.