Geographic Failover for Cheapskates
Rob King
I work for a company whose network carries real-time financial
data between banks and customers, so it's absolutely necessary
that the network be up 24 hours a day. To ensure reliability, even
if the building itself were to burn down, fall over, and sink into
the swamp, the company created a Fallback Center. The original plan
(before I was hired) was to have the Fallback Center in the data
center of our primary ISP. However, what if the ISP were to sink
into the swamp as well? The company had also made arrangements that
if the primary center went offline, someone would have to call the
ISP and have them manually change routing tables to point to the
new machines -- the problem with this should be obvious. The
problem for me became how to do failover between two locations,
on two different ISPs, without any single point of failure. We
could have had DNS servers at a third location (or two third parties),
which would return round-robin answers for all the machines on the
network. This wouldn't do though, because we wanted failover,
not load-balancing. We could have purchased a DNS-based failover
box, but what if this box were cut off from the network? Besides,
they're expensive!
The Solution
I registered our primary and secondary DNS servers with the appropriate
authority. The primary DNS server would be at the primary location,
in the primary location's IP space. The secondary DNS server
would be at the Fallback Center, in the Fallback Center's IP
space. The secondary would be slaved to the primary, and the primary
DNS server's records would point to the primary location's
network. The secondary DNS server, because it's slaved, would
also point to the primary network.
I then created a script (run by cron on the secondary server)
that performs nslookups on a known good host, using the primary
as a server. Then, if it can't resolve the known host, the
script copies a set of DNS maps (which point to the machines at
the Fallback Center) over the original DNS maps and does an ndc
restart. Since the primary DNS server is unreachable, all queries
must now go to the secondary DNS server, which now points to the
Fallback Center. See Listing 1. (All code listings for this article
can be found at: http://www.sysadminmag.com/code/.)
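Listing 1 itself is not reproduced here. As a rough illustration of the
check described above, the following Python sketch probes the primary
with nslookup and, on failure, copies the fallback maps into place and
restarts the nameserver. The server address, the directory names, and
the function names are all assumptions made for this example, not values
from the original listing.

```python
import shutil
import subprocess
from pathlib import Path

# Every value below is a placeholder chosen for illustration.
PRIMARY_NS = "192.168.10.1"                 # hypothetical address of the primary DNS server
KNOWN_HOST = "www.sysadminexample.com"      # a host the primary should always resolve
LIVE_DIR = Path("/var/named")               # hypothetical directory holding the live maps
FALLBACK_DIR = Path("/var/named.fallback")  # hypothetical maps pointing at the Fallback Center
RESTART_CMD = ["ndc", "restart"]            # the restart command named in the article


def primary_answers(host=KNOWN_HOST, server=PRIMARY_NS, timeout=15):
    """Return True if nslookup, told to use `server`, resolves `host`.

    Assumes nslookup exits non-zero when the lookup fails; verify this
    for the nslookup installed on your system before relying on it.
    """
    try:
        result = subprocess.run(
            ["nslookup", host, server],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def fail_over(live_dir=LIVE_DIR, fallback_dir=FALLBACK_DIR, restart=None):
    """Copy the fallback maps over the live maps, then restart the nameserver."""
    for src in Path(fallback_dir).iterdir():
        if src.is_file():
            shutil.copy2(src, Path(live_dir) / src.name)
    if restart is None:
        subprocess.run(RESTART_CMD, check=True)
    else:
        restart()  # injectable so the logic can be exercised without a nameserver


def check_once(probe=primary_answers, act=fail_over):
    """Fail over only when the probe reports the primary as unreachable."""
    if probe():
        return False
    act()
    return True
```

A cron job on the secondary would call check_once() every few minutes.
As written, the sketch repeats the copy and restart on every run while
the primary stays down; a marker file or similar guard would avoid that.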
An Example
Assume you have a Web server at the primary location, with the
IP address 192.168.10.10. You also have a fallback Web server at
the fallback location, with the IP address 192.168.20.20. The nameservers
point www.sysadminexample.com to 192.168.10.10. A user goes
to www.sysadminexample.com and gets 192.168.10.10. So far,
so good. Suddenly, the sys admin spills his caffeinated-beverage-of-choice
on the 7206 and the primary center goes offline. The DNS server
at the fallback center sees this, copies the DNS maps that point to
the fallback center over the originals, and restarts the nameserver.
Another user tries to go to www.sysadminexample.com. The
primary nameserver times out, and the user's resolver automatically
queries the secondary nameserver and gets 192.168.20.20. Problem solved.
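The sequence above can be traced with a toy model. The sketch below
involves no real DNS traffic; the dictionaries and the resolve()
function are invented for this illustration, and a real resolver picks
among the listed nameservers in its own order rather than strictly
primary first.

```python
def resolve(name, nameservers):
    """Ask each nameserver in turn, skipping any that do not respond."""
    for ns in nameservers:
        if not ns["reachable"]:
            continue  # the query to this server times out
        if name in ns["zone"]:
            return ns["zone"][name]
    return None


NAME = "www.sysadminexample.com"

primary = {"reachable": True, "zone": {NAME: "192.168.10.10"}}
# The secondary is slaved to the primary, so it starts with the same data.
secondary = {"reachable": True, "zone": dict(primary["zone"])}

print(resolve(NAME, [primary, secondary]))  # prints 192.168.10.10

# The primary center goes offline, and the cron script on the secondary
# swaps in the maps that point to the fallback center.
primary["reachable"] = False
secondary["zone"][NAME] = "192.168.20.20"

print(resolve(NAME, [primary, secondary]))  # prints 192.168.20.20
```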
Local DNS caches could be a problem (though that's a "reload"
away), but they affect any sort of DNS-based failover. Listing 2 is the
script to run when you're ready to fail back to the Primary
Center.
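Listing 2 is likewise not reproduced here. A failback step might look
something like the Python sketch below, which assumes a saved copy of
the original maps is kept in a separate directory. The paths and the
fail_back() name are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical paths; the article does not say where its maps live.
LIVE_DIR = Path("/var/named")
ORIGINAL_DIR = Path("/var/named.primary")  # saved maps that point to the primary network


def fail_back(live_dir=LIVE_DIR, original_dir=ORIGINAL_DIR, restart=None):
    """Restore the original maps and restart the nameserver.

    Returns the names of the restored files so the operator can see
    what changed.
    """
    restored = []
    for src in sorted(Path(original_dir).iterdir()):
        if src.is_file():
            shutil.copy2(src, Path(live_dir) / src.name)
            restored.append(src.name)
    if not restored:
        # Refuse to restart on an empty restore; something is misconfigured.
        raise RuntimeError(f"no maps found in {original_dir}; nothing restored")
    if restart is None:
        subprocess.run(["ndc", "restart"], check=True)
    else:
        restart()  # injectable so the logic can be tried without a nameserver
    return restored
```

Because failing back is a deliberate decision, this sketch is meant to
be run by hand once the Primary Center is confirmed healthy, not from
cron.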
Another option is to set up an SNMP daemon on the primary nameserver,
have machines send SNMP traps for critical issues, and have a script
stop the primary nameserver when one arrives. You could theoretically
use this for any number of Fallback Centers, though the time spent
waiting for each dead nameserver to time out would become a problem.
To use this for time-of-day routing, place a script in cron on the
primary to stop the nameserver at certain times.
I hope this helps you sleep a little better at night. It serves
our purposes well, although we thankfully haven't had to use
it in a real failover...yet.
Rob King has been a UNIX/Network Administrator for four years.
He currently is the Network Administrator for a banking firm in
the Grand Duchy of Luxembourg. Rob would also like to say "hi"
to the lovely Lesley.