An Economical Scheme for Quasi Real-Time Backup
Leo Liberti and Franco Raimondi
Backup is the most important systems administration task, yet it is often
neglected by sys admins. Restoring is another daunting task: during the
lengthy restore process, the main server is unable to serve, putting
further pressure on the admin. With freeware, inexpensive
PC hardware, and high-quality UNIX-like operating systems, it is
possible to have a couple of standby servers ready to supply the
services when the main server crashes. One such redundancy scheme
is described in the article "Quick Network Redundancy Schemes"
(Leo Liberti, Sys Admin, April 2001: http://www.samag.com/documents/s=1152/sam0104a/0104a.htm).
But what use is a backup server if the data it is supposed to serve
is all stored on the crashed main server?
The solution is a real-time backup system on the standby servers
so that when the main server is unreachable, the standby server
automatically takes its place and is ready to supply the service
while the admin works with the crashed main server. Unfortunately,
real-time backup is a beast in every respect: it requires complicated
(proprietary) software, costly hardware, and multiple locations,
all linked by high-speed dedicated connections (see "Real-Time
Remote Data Mirroring" by Primitivo Cervantes, Sys Admin,
July 2001: http://www.samag.com/documents/s=1150/sam0106sc/0106c.htm).
In this article, we will discuss the implementation of a quasi
real-time backup scheme for high-availability systems that updates
the standby server's data disks every given time interval,
down to cron's minimum granularity (one minute), or alternatively
in a "respawn"-like continuous fashion. Because user data
tends to accumulate in large quantities, and because of security
considerations, things are not as simple as running rsync as often as
possible. Our system relies on replicating the relevant network segment,
both to supply plenty of bandwidth for the quasi real-time backup tasks
and to let the backup happen on an isolated logical network.
How the System Works
This section provides the overall description of the system. We
envisage two computers -- A and B. Computer A is the main server,
and B is the standby backup server that provides the necessary redundancy.
Refer to the previously cited article on network redundancy schemes to set
up a standby server. We intend to keep B's data disks constantly
synchronized with A's data disks.
There are many ways to transfer data over a network, most
notably NFS, FTP, rcp, and rsync. All of these methods have advantages
and disadvantages. The main disadvantage of NFS is that it requires
a persistent network connection between client and server. Because
we are devising a solution for situations where the server has crashed,
this requirement may lead to problems (e.g., broken NFS connections are
infamous for requiring long timeouts when the server is down). FTP is difficult
to automate, and FTP clients usually cannot deal with incremental
transfers (i.e., where only the differing parts of a file are transferred
in order for the file to be updated). rcp suffers from security
problems (as all r- services do), and it has the same drawback as
FTP in that it cannot deal with incremental transfers. For all these
reasons, we are going to work with rsync.
Although rsync is slightly less standard than NFS, FTP, and rcp,
it is widespread in most UNIX-like operating systems. In any case,
it is easy to download and compile (see http://www.rsync.org/).
The advantages of rsync include:
- It is easy to run rsync on top of SSH (secure shell), so that
transfers can be secure.
- It supports incremental file transfers. Because we intend to
continuously run rsync, it is crucial that the amount of transferred
data be kept to a minimum.
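As a quick illustration of the incremental behavior, the following dry run (a sketch, assuming rsync and SSH are already available and that "standby" is a placeholder for the backup machine) reports how much data a repeated transfer would move, without actually copying anything:
# dry run: list what would be transferred and print transfer statistics
rsync -avn --stats -e ssh /home/ standby:/home/
On a tree that is already synchronized, the "Total transferred file size" reported by --stats stays close to zero, which is precisely the property we rely on.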
Running rsync on the existing local network is not wise. For one
thing, taking steps towards prevention of disaster does not mean
that disaster will actually occur, so clogging the local network
with data that may never be used is not the best idea. Also, the
amount of transferred data may be very large. This, of course, depends
on the network usage and cannot be generalized, but we don't
want a self-inflicted denial of service in which, whenever the network
is already under heavy load, the amount of traffic doubles because of
our real-time backup. Furthermore, because we are handling user
data, running continuous transfers on the local network may expose
it to packet sniffing. Finally, using a separate network adds resilience
to the system: even if the main network goes down, network backups
keep occurring.
This leads to the conclusion that data backup should take place
over a dedicated cable. Even though a network segment between computers
A and B must already exist on the local network, we are going to
replicate the network segment using two additional network cards
and a piece of network cable. Figure 1 shows the proposed network
layout. The data backup will take place on the replicated network
segment.
Hardware
We suggest using the same type of network hardware for the replicated
segments as that of the local network. If the local network runs
at 10 Mbps, it does not make sense to install two top-of-the-range
1-Gbps Ethernet cards on the segments. On the other hand, if you
choose to install 10-Mbps replicated network segments within a faster
network, data to be backed up may arrive at a faster pace than the
segment capacity allows.
We run a 100-Mbps local network at Ipnos, so we installed two
PCI 100-Mbps network cards on computers A and B with a crossover
UTP cable between them. We recommend against using a hub or a switch
for point-to-point Ethernets such as our replicated network segments,
because in addition to the added cost with no added benefit, network
devices add a small performance penalty to network transfers when
compared with transfers over the raw cable.
With respect to the choice of hard disk storage, it is better
to install disks of the same make, model, and size on each of the
servers. However, this requirement is not absolute. It suffices
to keep an eye on the disk capacity of the smallest hard disk.
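A simple way to keep that eye on capacity (a sketch; the /home mount point and the 90% threshold are assumptions) is a periodic df check on the standby server:
# on computer B: warn when the data partition is more than 90% full
df -P /home | awk 'NR==2 && $5+0 > 90 { print "data disk almost full: " $5 }'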
Network Configuration
There are basically two possibilities for integrating these network
segments into the existing network setup:
1. Let them participate in the same local network, letting ARP
and static routing take care of packets.
2. Create a separate network for each of the segments.
The advantage of the first choice is the ease of setup. However,
because we want to keep an eye on security issues, we want to enforce
a definite logical separation between the segments and the local
network. This is not the only advantage of logical network separation
-- extended resilience in case of failure of the local network
and greater ease of use with rsync are other advantages. Therefore,
we recommend using a separate network for each of the replicated network
segments.
Suppose then that the local network is 192.168.1.0/255.255.255.0,
computer A (the main server) has IP 192.168.1.1 on device eth0,
and computer B (the standby server) has IP 192.168.1.2 on device
eth0. (We assume the operating system on the computers is Linux;
just change the device name accordingly if this is not the case).
After installing the second network card on each of the two computers,
we configure the network segment 10.168.1.0/255.255.255.0 with 10.168.1.1
on device eth1 for computer A, and 10.168.1.2 on device eth1 for
computer B. Refer to Linux NET-3-HOWTO or Net-HOWTO if you have
problems in setting up the network segments. Remember a crossover
UTP cable must be used when connecting the two network cards on
the segment.
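For reference, a minimal sketch of this configuration with ifconfig (assuming a Linux system; add the equivalent settings to your distribution's network configuration files so they survive a reboot) is:
# on computer A, as root: bring up the replicated segment interface
ifconfig eth1 10.168.1.1 netmask 255.255.255.0 up
# on computer B, as root:
ifconfig eth1 10.168.1.2 netmask 255.255.255.0 up
# then, from computer B, check the segment:
ping -c 3 10.168.1.1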
Necessary Network Services
If the local network is completely separated from the rest of
the Internet, or if it is guarded by a strict, well-administered
firewall, we can use rsync client and daemon to transfer files.
These are the steps to follow:
1. Check that rsync is present on computers A and B. If not, download
and install it.
2. Install the rsync daemon service on computer A by adding the
following line to the file /etc/inetd.conf on computer A (inetd looks
up the service name, so make sure /etc/services contains an entry for
rsync on port 873/tcp):
rsync stream tcp nowait root /usr/bin/rsync rsync --daemon
3. Copy the following rsync daemon configuration file to /etc/rsyncd.conf
on computer A, and give it file permissions 600.
# on computer A
uid = root
gid = root
use chroot = no
max connections = 0
syslog facility = local5
pid file = /var/run/rsyncd.pid
read only = true
hosts allow = 10.168.1.2
hosts deny = 0.0.0.0/0.0.0.0
[home]
# path points to the user data directory
path = /home
comment = user data
auth users = realtime
secrets file = /etc/rsyncd.secrets
4. Reconfigure the running inetd daemon on computer A. As root:
A# killall -HUP inetd
5. Create the secrets file on computer A. As root:
A# echo realtime:mypasswd>/etc/rsyncd.secrets;chmod 600 /etc/rsyncd.secrets
6. Set up access rights on computer B. As the user that will run
the rsync client (may be root):
B# echo mypasswd > ~/.rsync-pwd ; chmod 600 ~/.rsync-pwd
7. Create the script that will read the data on computer A and transfer
it to computer B. Copy the following lines to the file /usr/local/sbin/realtimesync
on computer B, and give it file permissions 755.
#!/bin/sh
# on computer B
rsync -a --password-file ~/.rsync-pwd rsync://realtime@10.168.1.1/home/ $1
If we need a secure connection over the replicated network segment,
we can run rsync over SSH. Follow these steps:
1. Download, install, and configure SSH on computer A so that
unchallenged root logins are allowed from computer B (use the segment's
IP, 10.168.1.2, and see sshd man page, searching for the terms "PermitRootLogin",
"IgnoreRhosts", "RhostsAuthentication"). If
you don't want to allow unchallenged root logins from anywhere,
create a user that has the right to read all the data files that are
going to be backed up, and allow unchallenged logins as that user
instead (a key-based alternative is sketched below).
2. Install rsync on both computers A and B (if it's not already
there), and install the SSH client on computer B.
3. Create the script that will read the data on computer A and
transfer it to computer B. Copy the following lines to the file
/usr/local/sbin/realtimesync on computer B, and give it file
permissions 755.
#!/bin/sh
# on computer B
rsync -a -e ssh 10.168.1.1:/home/ $1
To check that everything has been successful, try running the
script /usr/local/sbin/realtimesync from computer B without arguments,
and verify that it prints the complete list of data files in the /home
directory on computer A.
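If you prefer key-based authentication to the host-based mechanisms named in step 1, a minimal sketch (assuming OpenSSH with its default file locations) is:
# on computer B, as the user that will run realtimesync (here root):
ssh-keygen -t rsa        # accept an empty passphrase for unattended use
# append the public key to root's authorized_keys on computer A,
# going through the replicated segment:
ssh 10.168.1.1 'mkdir -p /root/.ssh'
cat ~/.ssh/id_rsa.pub | ssh 10.168.1.1 'cat >> /root/.ssh/authorized_keys'
# in /etc/ssh/sshd_config on computer A, root logins must be permitted:
#   PermitRootLogin yes
The first connections still ask for root's password; once the key is in place, logins from computer B proceed unchallenged.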
Performing the Backup
Having put everything in place, we are now ready to perform the
backup by running realtimesync on computer B, either in a
cron job or in a continuous fashion. In each case, we wrap the script
in another script, which checks that the previously launched script
has finished running. Create the script /usr/local/sbin/run_realtimesync
on computer B as follows, and give it permissions 755:
#!/bin/sh
# on computer B
# start a new synchronization only if the previous one has finished
ps ax | grep /usr/local/sbin/realtimesync | grep -qv grep || \
    /usr/local/sbin/realtimesync /home/
Notice that the argument /home/ specifies the directory on
computer B to which data from A will be mirrored.
For a cron job, choose a time interval between 1 and 10 minutes
(or even higher if you like, but anything higher than 10 minutes
will make this real-time backup scheme sound more like seldom-time).
For a 3-minute time interval, add the following line to the system
crontab /etc/crontab (if you prefer root's personal crontab, edit it
with the crontab -e command and omit the "root" user field):
*/3 * * * * root /usr/local/sbin/run_realtimesync
For a continuous execution, modify the run_realtimesync script:
#!/bin/sh
# on computer B
while true ; do
    ps ax | grep /usr/local/sbin/realtimesync | grep -qv grep || \
        /usr/local/sbin/realtimesync /home/
    sleep 10
done
Launch this script from one of the initialization files in the init.d
directory (look in /etc, /etc/rc.d, or /sbin),
taking care to add an ampersand so that it is launched in the background:
/usr/local/sbin/run_realtimesync &
The sleep 10 command prevents overloading computer B with excessive
ps checks; it is the number of seconds to wait before trying to spawn
another backup process. You can lower this parameter to intervals as
short as 1 second.
Integration in Network Redundancy Schemes
The ideas described in this article can be effectively used in
conjunction with the quick network redundancy scheme described in
the "Quick Network Redundancy Schemes" article. In this
section, we discuss some of the issues arising from the combination
of the two schemes.
In the aforementioned article, network redundancy was obtained
by having two or more computers monitoring each other continuously.
When one of the computers failed to respond, another one would take
its IP address by means of IP aliasing. It would then spawn the
necessary daemons to supply the services which were originally supplied
by the crashed machine.
Diagnosing the Network Problem
In the above-cited article about redundancy schemes, the standby
server continuously monitors the main server by sending ping
probes. When the ping stops responding, the main server is
assumed to be down. However, we can now use the replicated network
segment to further diagnose the problem by having computer B continuously
ping computer A on both network interfaces. If both
ping attempts fail, we can assume computer A is down. If
the local network ping fails but the replicated segment is
alive, we can wait some time and try ping'ing again
before taking action; it may be that the local network is under
heavy load.
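A sketch of this double probe, to be merged into the monitor script from the previous article (probe counts and the retry policy are up to you), might be:
#!/bin/sh
# on computer B: probe computer A on the local network and on the segment
ping -c 3 192.168.1.1 > /dev/null 2>&1 ; lan_down=$?
ping -c 3 10.168.1.1 > /dev/null 2>&1 ; seg_down=$?
if [ $lan_down -ne 0 ] && [ $seg_down -ne 0 ] ; then
    echo "computer A appears to be down: take over"
elif [ $lan_down -ne 0 ] ; then
    echo "local network congested or down: wait and probe again"
fi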
Securing Communication
Some of the strategies computer B uses to try to resume service
on computer A (described in the above-cited article) rely on remote
unauthenticated root login. By running these commands on the replicated
network segment, we make it less likely that somebody on the local network
could eavesdrop on the connection and steal data.
Automatically Curing Network Woes
If one of the ping probes fails (say the one on the local
network), we can take automated remote action by using the replicated
network segment. For example, we can trigger an automatic reboot
to see if things improve. Otherwise, we might try to restart the
network configuration. Likewise, if the replicated network segment
fails, we can take the same actions on the local network.
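Assuming the SSH setup described earlier, such automated actions could look like the following sketch (init script names and locations differ between distributions):
# on computer B, acting on computer A over the replicated segment
ssh 10.168.1.1 '/etc/init.d/network restart'   # try restarting the network first
ssh 10.168.1.1 '/sbin/reboot'                  # last resort: reboot computer A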
Bring Interfaces Down Prior to Takeover
When computer B has decided that computer A is really not able
to provide the service, it takes over by grabbing its IP number
(as described in the previous article about redundancy schemes).
If the replicated network segment is still alive, it is a good idea
to use SSH or RSH to bring computer A's local network interface
down prior to taking its IP. This way, even if the interface on
computer A starts working again, there will never be the problem
of having two interfaces with the same IP address over the same
local network.
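Assuming the IP aliasing technique of the previous article and the SSH setup above (the device names are those of our example network), the takeover could be sketched as:
# on computer B: shut down A's local interface over the replicated segment...
ssh 10.168.1.1 '/sbin/ifconfig eth0 down'
# ...then alias A's local IP address onto B's local interface
/sbin/ifconfig eth0:0 192.168.1.1 netmask 255.255.255.0 up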
Stopping Real-Time Backup Services
Before computer B effectively takes over the service from computer
A, all real-time backup services must be stopped. This can be achieved
by inserting the commands:
- chmod 644 /usr/local/sbin/realtimesync
- chmod 644 /usr/local/sbin/run_realtimesync
- killall run_realtimesync
- killall realtimesync
in the "take action" section of the monitor scripts
(see previous article), so that the real-time synchronization scripts
are not executable anymore.
After the Main Server Has Resumed Service
Suppose computer A has failed, and computer B has taken over.
Now the sys admin is left with the task of resuming services on
computer A. There must be some interruption of service during the
"restore data" phase for computer A (the main server),
but because we can't lose any user data, all data gathered
on computer B during computer A's downtime must be transferred
to computer A in one isolated step (i.e., without any more user
data being stored on computer B in the meantime). We see this stage
(resuming services on computer A) as safer when conducted by a systems
administrator rather than by automated procedures. Ideally, these should
be the steps:
1. Mend computer A and, while it's disconnected from the
local network, test it, and make sure it works.
2. Bring down the local network interface on computer A, but make
sure the replicated network segment is working.
3. Suspend customer services on computer B and put up a banner
on the Web site informing customers that service will be resumed
as soon as possible (of course you will have advised them about
this downtime by email at least two days prior).
4. Use FTP, NFS, or what have you to transfer the data files from
computer B to computer A. (Do not attempt to use rsync because the
rsync daemon on computer A is configured for read-only operation.
However, you may still use rsync over RSH or SSH; a sketch follows this list.)
5. Bring down the local network interface on computer B corresponding
to computer A's IP address.
6. Bring up the local network interface on computer A (the service
has now been resumed and your customers can start working immediately).
7. On computer B, resume the monitoring services (by relaunching
the monitor script), and then remove the Web site banner advising
of the downtime. Resume the real-time backup services by setting
permissions 755 on the scripts /usr/local/sbin/run_realtimesync
and /usr/local/sbin/realtimesync. If you were using real-time
backup through the cron daemon, you are done. If you were using
continuous spawning of realtimesync, you must restart the
modified run_realtimesync script manually.
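As an illustration of step 4, assuming you keep SSH available on the replicated segment, the B-to-A transfer could be sketched as:
# on computer B, as root: push the data gathered during the downtime back to A
rsync -a -e ssh /home/ 10.168.1.1:/home/
Because rsync only transfers the files that changed while computer A was down, this step is usually much quicker than a full restore.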
Conclusion
We have described procedures that allow for quasi real-time data
backup services between a main server and a standby replacement
server. The procedures are based on replicating the network segment
between the two servers. Unlike commercial full-fledged real-time
backup services, which rely on costly hardware and proprietary software,
the procedures described here cost no more to implement than a couple
of spare network cards and a crossover cable.
Leo Liberti graduated in Mathematics from Imperial College,
London, UK, in 1995. He then received an M.Sc. in Mathematics from
Turin University and went on to do more research at Imperial College
and some part-time work as a systems administrator. He is a co-founder
of IrisTech S.r.l. (Italy -- http://www.iristech.net)
and Ipnos (UK -- http://www.ipnos.co.uk).
Franco Raimondi graduated in Physics from Milan University,
Italy, in 1997. He is a co-founder of IrisTech S.r.l. and Ipnos.
He has recently started a Ph.D. on Intelligent Software Agents at
the Department of Computing at Imperial College.