Geographic Failover for Cheapskates
 Rob King
              I work for a company whose network carries real-time financial 
              data between banks and customers, so it's absolutely necessary 
              that the network be up 24 hours a day. To ensure reliability, even 
              if the building itself were to burn down, fall over, and sink into 
              the swamp, the company created a Fallback Center. The original plan 
              (before I was hired) was to have the Fallback Center in the data 
              center of our primary ISP. However, what if the ISP were to sink 
              into the swamp as well? The company had also made arrangements that 
              if the primary center went offline, someone would have to call the 
              ISP and have them manually change routing tables to point to the 
              new machines -- the problem with this should be obvious. The 
              problem for me became how to do failover between two locations, 
              on two different ISPs, without any single point of failure. We 
              could have had DNS servers at a third location (or two third parties), 
              which would return round-robin answers for all the machines on the 
              network. This wouldn't do though, because we wanted failover, 
              not load-balancing. We could have purchased a DNS-based failover 
              box, but what if this box were cut off from the network? Besides, 
              they're expensive!
              The Solution
              I registered our primary and secondary DNS servers with the appropriate 
              authority. The primary DNS server would be at the primary location, 
              in the primary location's IP space. The secondary DNS server 
              would be at the Fallback Center, in the Fallback Center's IP 
              space. The secondary would be slaved to the primary, and the primary 
              DNS server's records would point to the primary location's 
              network. The secondary DNS server, because it's slaved, would 
              also point to the primary network.
              I then created a script (run by cron on the secondary server) 
              that performs nslookups on a known good host, using the primary 
              as a server. Then, if it can't resolve the known host, the 
              script copies a set of DNS maps (which point to the machines at 
              the Fallback Center) over the original DNS maps and does an ndc 
              restart. Since the primary DNS server is unreachable, all queries 
              must now go to the secondary DNS server, which now points to the 
              Fallback Center. See Listing 1. (All code listings for this article 
              can be found at: http://www.sysadminmag.com/code/.)
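
The check described above can be sketched in Python. This is not Listing 1, only an illustration under stated assumptions: the nameserver address, the two directory paths, and the function names are hypothetical, and the sketch assumes an nslookup that exits non-zero when a query fails.

```python
import shutil
import subprocess
from pathlib import Path

# All of these values are assumptions for illustration only.
PRIMARY_NS = "192.168.10.53"                 # hypothetical primary nameserver
KNOWN_HOST = "www.sysadminexample.com"       # known-good host to resolve
FALLBACK_MAPS = Path("/var/named/fallback")  # maps pointing at the Fallback Center
LIVE_MAPS = Path("/var/named/live")          # maps the nameserver actually loads
RESTART_CMD = ["ndc", "restart"]             # restart command named in the article


def primary_answers(host=KNOWN_HOST, server=PRIMARY_NS, timeout=10,
                    run=subprocess.run):
    """Return True if nslookup, pointed at the primary, resolves the host."""
    try:
        result = run(["nslookup", host, server],
                     capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or missing nslookup is treated as "primary unreachable".
        return False
    return result.returncode == 0


def fail_over(fallback_dir=FALLBACK_MAPS, live_dir=LIVE_MAPS, restart=None):
    """Copy the fallback maps over the live maps, then restart the nameserver."""
    for src in Path(fallback_dir).iterdir():
        if src.is_file():
            shutil.copy2(src, Path(live_dir) / src.name)
    if restart is None:
        subprocess.run(RESTART_CMD, check=True)
    else:
        restart()


def check_once(is_up=primary_answers, on_failure=fail_over):
    """One cron-driven pass: fail over only if the primary does not answer."""
    if is_up():
        return "primary-ok"
    on_failure()
    return "failed-over"
```

A cron job on the secondary would call check_once() every few minutes. A production version would probably require several consecutive failures before switching, so that one dropped packet does not trigger a failover.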
              An Example
              Assume you have a Web server at the primary location, with the 
              IP address 192.168.10.10. You also have a fallback Web server at 
              the fallback location, with the IP address 192.168.20.20. The nameservers 
              point www.sysadminexample.com to 192.168.10.10. A user goes 
              to www.sysadminexample.com and gets 192.168.10.10. So far, 
              so good. Suddenly, the sys admin spills his caffeinated beverage of choice 
              on the 7206 and the primary center goes offline. The DNS server 
              at the fallback center sees this, copies the DNS maps that point 
              to the fallback center over the originals, and restarts the nameserver. 
              Another user tries to go to www.sysadminexample.com. The 
              primary nameserver times out, and the user's resolver automatically queries 
              the secondary nameserver and gets 192.168.20.20. Problem solved. 
              Cached DNS answers could be a problem, as they are for any sort of 
              DNS-based failover, although a local cache is only a "reload" 
              away. Listing 2 is the script to run when you're ready to fail 
              back to the Primary Center.
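
Failing back is the same copy in reverse. The following Python sketch is not Listing 2. It assumes a saved set of maps that point at the primary network is kept in its own directory, and the function names and parameters are hypothetical.

```python
import shutil
from pathlib import Path


def install_maps(source_dir, live_dir):
    """Copy every map file in source_dir over its counterpart in live_dir."""
    copied = []
    for src in sorted(Path(source_dir).iterdir()):
        if src.is_file():
            shutil.copy2(src, Path(live_dir) / src.name)
            copied.append(src.name)
    return copied


def fail_back(primary_is_up, primary_maps, live_dir, restart):
    """Restore the primary-pointing maps, but only if the primary answers.

    primary_is_up and restart are callables supplied by the caller, for
    example an nslookup check and a wrapper around the restart command.
    """
    if not primary_is_up():
        return False  # primary still down: keep serving Fallback Center addresses
    install_maps(primary_maps, live_dir)
    restart()
    return True
```

Checking the primary first is a design choice of this sketch. It keeps an operator from pointing users back at a site that is still offline.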
              Another option is to set up an SNMP daemon on the primary nameserver, 
              have machines send SNMP traps for critical issues, and have a script 
              stop the primary nameserver when one arrives. You could theoretically 
              use this for any number of Fallback Centers, though the accumulated 
              timeouts would become a problem. To use this for time-of-day routing, 
              place a script in cron on the primary to stop the nameserver at 
              certain times. 
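
The time-of-day variant comes down to a cron job deciding whether the primary nameserver should be running right now. Here is a minimal Python sketch of that decision, assuming a hypothetical overnight window during which the fallback site should answer. The function names and the hours are illustrative only.

```python
from datetime import time


def in_window(now, start, end):
    """Return True if now falls in [start, end), allowing a wrap past midnight."""
    if start <= end:
        return start <= now < end
    # Window such as 22:00 to 06:00 spans midnight.
    return now >= start or now < end


def desired_action(now, start, end):
    """Say what a cron job on the primary should do to its nameserver."""
    return "stop" if in_window(now, start, end) else "start"


# Hypothetical window: hand traffic to the fallback site from 22:00 to 06:00.
print(desired_action(time(23, 30), time(22, 0), time(6, 0)))
print(desired_action(time(12, 0), time(22, 0), time(6, 0)))
```

The cron job would then run the matching stop or start command for the nameserver. The check script on the secondary notices that the primary has stopped answering and switches maps as usual.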
              I hope this helps you sleep a little better at night. It serves 
              our purposes well, although we thankfully haven't had to use 
              it in a real failover...yet.
              Rob King has been a UNIX/Network Administrator for four years. 
              He currently is the Network Administrator for a banking firm in 
              the Grand Duchy of Luxembourg. Rob would also like to say "hi" 
              to the lovely Lesley.