
Host-Based Replication and Oracle

Jim McKinstry

As more and more companies enter the e-business arena and, hence, global markets, the need for true 24x7 access to data has grown. The days of "9-5" databases with 8- to 12-hour backup windows are all but gone. This article describes how we implemented a 24x7 system to support the various needs of a zero-downtime database application using Veritas Storage Replicator and the snapshot feature of Veritas Volume Manager.

Requirements

Our client had all the typical needs of a production database:

1. The ability to back up Oracle with zero downtime and no performance degradation on the production system

2. The ability to run reports on relatively fresh Oracle data with no performance impact or downtime to the production system

The implemented solution can also be used to solve the following common business needs:

1. A viable, inexpensive disaster recovery solution for local, minor disasters (e.g., production server dies) as well as a solution that can easily be rolled out to remote sites

2. QA/Development server

Although a single Oracle database can meet some of these requirements, its performance can drop severely during backups and while running reports. The solution we implemented does not negatively impact the performance of the production database during backups, reports, or any QA/Development work.

Hardware Configuration

Before describing the software solution we implemented, I will describe the hardware configuration (see Figure 1). We had two Sun servers, a 4500 ("production") and a 450 ("reporting"). The 4500 had 8 CPUs and 8 GB of RAM. The 450 had 4 CPUs and 4 GB of RAM. Each system had 4 Fibre Channel host-bus adapters (HBAs), allowing each system to quad-attach into the MTI SAN. The MTI SAN consisted of two Vivant V20s. Each V20 was populated with 60 18-GB dual-ported Fibre Channel drives. Each V20 had two active-active RAID controllers, with 1 GB of mirrored cache. Each RAID controller was dual-attached to a pair of cross-connected, 16-port Fibre Channel switches (these are the switches to which the host systems are connected).

The SAN storage was configured in the following way (see sidebar "RAID Levels"):

  • 13 4-drive RAID 10 stores for Oracle data
  • 10 6-drive RAID 5 stores for replicated data and "snapshot" data
  • 2 2-drive RAID 1 stores for Redo logs
  • 2 hot-spares per Vivant

Each MTI hardware-based RAID 10 store was carved into one (or more) LUNs (the host sees each LUN as a disk device). For optimal performance, the client used Veritas Volume Manager to build file systems striped (RAID 0) across multiple LUNs, over multiple HBAs, and across multiple MTI RAID controllers (Figure 2). This is commonly called RAID 100 (RAID 10 and RAID 0 combined). The production database on the 4500 was laid out on these RAID 100-based file systems. (I'll further discuss the RAID 5 stores when I discuss the replication solution.) The client used Veritas DMP on both servers to allow failover between HBAs, protecting against HBA failure, broken fiber cables, etc. DMP also provides load balancing between the two controllers.
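To make this concrete, here is a sketch of how one such striped volume and file system might be built with Volume Manager. The disk group, LUN names, column count, stripe unit, and mount point are illustrative, not the client's actual layout:

vxassist -g datadg make oravol01 8g layout=stripe ncol=4 \
   stripeunit=64k mti01 mti02 mti03 mti04
mkfs -F vxfs /dev/vx/rdsk/datadg/oravol01
mount -F vxfs /dev/vx/dsk/datadg/oravol01 /u01/oradata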

Solution

The underlying requirement of this problem was that the solution had to be relatively transparent to the production server (i.e., no production database downtime). Here's the flow of the implemented solution:

1. The production server ("production") replicates the Oracle database to the reporting server ("reporting").

2. "Reporting" receives the replication on software mirrored disks (Figure 3). (This is where the RAID 5 stores on the Vivants come into play. The RAID 5 stores were carved into LUNs and used to hold the replicated data. These volumes were also mirrored (RAID 1) using Veritas Volume Manager. This combination of RAID 5 and RAID 1 is called RAID 51 (mirrored RAID 5 volumes). RAID 5 is traditionally not used for applications with a high percentage of writes (the receiving end of replication is 100% writes). However, in this case, the combination of the way the network delivers the data, the nature of the database IOs (small block), and MTI's write-gathering cache algorithms allowed RAID 5 to be used with no performance problems).

3. "Split the mirror" on "replication".

4. Mount the split-mirrored volumes on "replication" (Figure 4).

5. Use the mounted volumes for backups or bring up Oracle on these mounted volumes for reporting.

Gory Details

We used Veritas Storage Replicator for Volume Manager (SRVM) to implement the replication from "production" to "reporting". Replication must first be configured on the receiving end ("reporting"), and then on the sending end ("production"). Take the following steps, assuming that the disk groups are defined on the sending and receiving servers and the volumes are created on the sending server (i.e., Oracle is up and running):

On the Receiving (Secondary) Server: Reporting

I recommend scripting the following steps so that your testing will be easier.

1. Start by making the data volumes that will hold the data from each volume being replicated. This must be done once for each volume in the disk group that is being replicated. These volumes must be greater than or equal to the size of the volumes from which they are being replicated:

# vxassist -g disk_group_name make volume_name volume_size
ex:
vxassist -g datadg make vol01 4G
vxassist -g datadg make vol02 4G

2. Make the Storage Replicator Log (SRL) volume. Each write done on the primary system is stored in the SRL and then sent to the secondary. This allows the primary to keep running and to buffer any data that backs up due to network contention. It also stores updates on the primary server should the network between the primary and secondary drop. The SRL volume on the secondary is only used in recovery situations. Sizing the SRL can be very involved, so I recommend reading the manual for this. The goal when sizing the SRL is to avoid overflow:

# vxassist -g disk_group_name make srl_name srl_size
ex:
vxassist -g datadg make datadg_srl 1G

3. Make the RLINK. The RLINK defines the communication link between the primary server and the secondary server. A primary may speak to many secondary RLINKs (i.e., many secondary servers), but a secondary RLINK may only be associated with one primary server:

# vxmake -g disk_group_name rlink rlink_name \
   remote_host=remote_host_name remote_dg=remote_disk_group_name \
   remote_rlink=remote_rlink_name local_host=local_host_name \
   synchronous=off
ex:
vxmake -g datadg rlink datadg_to_primary remote_host=production \
   remote_dg=datadg remote_rlink=datadg_to_secondary \
   local_host=reporting synchronous=off
4. Make the Replicated Volume Group (RVG) (and associate the RLINK, data volumes, and SRL with the RVG). There is one RVG per disk group being replicated. Every volume in the disk group being replicated should be included in this command (although volumes can be added to an RVG later as well):

# vxmake -g disk_group_name rvg rvg_name rlink=rlink_name \
   datavol=volume_name srl=srl_name primary=false
ex:
vxmake -g datadg rvg datadg_rvg rlink=datadg_to_primary \
   datavol=vol01,vol02 srl=datadg_srl primary=false
5. Attach the RLINK. This "turns on" the link between the primary and secondary servers. At this point, the RLINK becomes enabled, but not connected, because the primary server has not been configured yet:

# vxrlink -g disk_group_name att rlink_name
ex:
vxrlink -g datadg att datadg_to_primary
6. Start the RVG. This allows the volumes to start receiving data (again, no data is being sent at this point because the primary is not configured):

# vxrvg -g disk_group_name start rvg_name
ex:
vxrvg -g datadg start datadg_rvg
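Since I recommended scripting these steps, here is a minimal sketch of what such a script might look like, using the example names above (error checking is omitted for brevity):

#!/bin/sh
# Sketch: the six secondary-side steps as one script.
DG=datadg

vxassist -g $DG make vol01 4G                  # 1. data volumes
vxassist -g $DG make vol02 4G
vxassist -g $DG make ${DG}_srl 1G              # 2. SRL volume

vxmake -g $DG rlink ${DG}_to_primary \
    remote_host=production remote_dg=$DG \
    remote_rlink=${DG}_to_secondary \
    local_host=reporting synchronous=off       # 3. RLINK

vxmake -g $DG rvg ${DG}_rvg rlink=${DG}_to_primary \
    datavol=vol01,vol02 srl=${DG}_srl primary=false   # 4. RVG

vxrlink -g $DG att ${DG}_to_primary            # 5. attach the RLINK
vxrvg -g $DG start ${DG}_rvg                   # 6. start the RVG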
The process is half done. Now, let's set up the primary side of the replication.

On the Sending (Primary) Server: Production

Note that the following steps are very similar to the steps performed on the secondary server.

1. Make the data volumes that will be replicated. This step is skipped if you are replicating file systems that already exist:

# vxassist -g disk_group_name make volume_name volume_size
ex:
vxassist -g datadg make vol01 4G
vxassist -g datadg make vol02 4G
2. Make the SRL. The SRL on the primary server needs to be carefully sized. More is better:

# vxassist -g disk_group_name make srl_name srl_size
ex: 
vxassist -g datadg make datadg_srl 1G 
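The manual's sizing procedure boils down to making sure the SRL can absorb your peak write load for as long as the network might be unavailable. As a back-of-the-envelope sketch of the arithmetic (the rate, outage window, and safety factor below are assumed, not from our configuration):

#!/bin/sh
# Rough SRL sizing: peak write rate * longest outage to survive,
# times a safety factor. All numbers here are assumptions.
PEAK_KB_SEC=2048                 # assumed peak write rate (2 MB/sec)
OUTAGE_SEC=3600                  # assumed worst-case outage (1 hour)
PAD=2                            # safety factor
SRL_KB=`expr $PEAK_KB_SEC \* $OUTAGE_SEC \* $PAD`
echo "SRL should be at least $SRL_KB KB (about `expr $SRL_KB / 1048576` GB)"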

3. Make the RLINK:

# vxmake -g disk_group_name rlink rlink_name \
   remote_host=remote_host_name remote_dg=remote_disk_group_name \
   remote_rlink=remote_rlink_name local_host=local_host_name \
   synchronous=off
ex:
vxmake -g datadg rlink datadg_to_secondary \
   remote_host=reporting remote_dg=datadg \
   remote_rlink=datadg_to_primary local_host=production \
   synchronous=off

4. Make an RVG, and associate the RLINK, data volumes, and SRL with the RVG:

# vxmake -g disk_group_name rvg rvg_name rlink=rlink_name \
   datavol=volume_name srl=srl_name primary=true
ex:
vxmake -g datadg rvg datadg_rvg rlink=datadg_to_secondary \
   datavol=vol01,vol02 srl=datadg_srl primary=true
5. Attach the RLINK to the RVG:

# vxrlink -g disk_group_name att rlink_name
ex:
vxrlink -g datadg att datadg_to_secondary
6. Start the RVG:

# vxrvg -g disk_group_name start rvg_name
ex:
vxrvg -g datadg start datadg_rvg
At this point, we are replicating!
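As on the secondary, these steps collapse nicely into a script. A minimal sketch, using the example names (step 1 is skipped because the Oracle volumes already exist):

#!/bin/sh
# Sketch: the primary-side steps as one script.
DG=datadg

vxassist -g $DG make ${DG}_srl 1G              # 2. SRL volume

vxmake -g $DG rlink ${DG}_to_secondary \
    remote_host=reporting remote_dg=$DG \
    remote_rlink=${DG}_to_primary \
    local_host=production synchronous=off      # 3. RLINK

vxmake -g $DG rvg ${DG}_rvg rlink=${DG}_to_secondary \
    datavol=vol01,vol02 srl=${DG}_srl primary=true    # 4. RVG

vxrlink -g $DG att ${DG}_to_secondary          # 5. attach the RLINK
vxrvg -g $DG start ${DG}_rvg                   # 6. start the RVG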

Important notes about the previous steps:

1. The secondary server must be set up first.

2. In Step 1, the volumes on the secondary server must be greater than, or equal to, the size of the volume on the primary server.

3. Disk group names and volume names should be identical on the primary and secondary servers. It's much easier this way, less confusing, and allows Oracle to come up on the receiving end more smoothly.

4. The documentation states that SRVM doesn't support RAID 5 volumes, yet I used RAID 5 volumes in this solution. The documentation is referring to software-based RAID 5 volumes. I used hardware-based RAID 5 with cache, which is supported because it is a fast RAID 5 implementation.

5. Keep disk group names, volume names, etc. simple. I recommend something like:

  • datadg -- the disk group that holds Oracle data. If you have more than one disk group for data, use datadg1, datadg2, etc.
  • redodg -- the disk group that holds Oracle redo logs. Again, redodg1, redodg2, etc., if necessary
  • archivedg -- the disk group that holds Oracle archive logs. Again, archivedg1, archivedg2, etc., if necessary
  • vol0x -- where x=1, 2, 3, etc., as necessary
Synchronize the Servers

Now that replication is set up for each volume, we need to synchronize the two systems. SRVM does block-level replication, so we need to synchronize each block. We did this by running the following command on the primary (once for each replicated volume). It's best if the file systems are unmounted and there is no activity on the volumes at all:

dd if=raw_device_name_of_volume of=raw_device_name_of_volume bs=32k
Note that if and of are the same. This is because dd will, block by block, read a block and then write it back over itself. Each write is sent over the replication link, synchronizing the primary and secondary servers.
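In practice, this means one dd per replicated volume. A sketch, using the example volume names and VxVM's raw device paths:

#!/bin/sh
# Rewrite every block of each replicated volume in place so that
# every block is sent over the RLINK to the secondary.
DG=datadg
for VOL in vol01 vol02
do
        RAW=/dev/vx/rdsk/$DG/$VOL
        dd if=$RAW of=$RAW bs=32k
done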

At this point, "production" and "reporting" are completely in sync. We could have stopped here. To bring up Oracle on the reporting server, we would have to stop the replication and mount the file systems on the replicated volumes. This is fine, except that every time we wanted to refresh the reporting database, we would have to repeat the re-synchronization steps. Even if Oracle is brought up in "read-only" mode, it still writes some data to disk, which puts the primary and secondary servers out of sync. This would mean extended downtime for the production database because the re-sync process can take a while. As long as the volumes on the secondary server are not modified (i.e., written to), replication can be stopped and started without performing a full resynchronization. If we were just performing backups, this would work: we could mount the replicated volumes read-only, do the backup, unmount the volumes, and restart the replication. The changes that took place on the primary while the replication was split would then be applied to the secondary, and replication would continue as normal.
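That backup flow might look like the following sketch. I am assuming vxrlink's pause and resume operations here (check the man page for your SRVM release) and a hypothetical /backup mount point:

#!/bin/sh
# Sketch: back up from the secondary without a full resync.
DG=datadg
vxrlink -g $DG pause ${DG}_to_primary          # assumed: pause replication
for VOL in vol01 vol02
do
        mount -F vxfs -o ro /dev/vx/dsk/$DG/$VOL /backup/$VOL
        # ... back up /backup/$VOL here ...
        umount /backup/$VOL
done
vxrlink -g $DG resume ${DG}_to_primary         # assumed: resume replication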

Split-Mirroring

To avoid full resynchronization each time, we brought up Oracle on the replicated data and implemented a split-mirror solution. Conceptually, this was done by mirroring the volumes that hold the replicated data. To bring up Oracle, we'd "split the mirror", mount the file systems, and bring it up. To refresh the data, we'd "remirror" the volumes. When the remirror was complete, we could split the mirror again and run reports on fresh data.

FMR

But wait: if I bring up Oracle on a split-mirrored volume, then when I re-mirror the volume, a complete synchronization has to take place. Although this is faster than re-synchronizing using replication, it is still unacceptable. This is where FMR (Fast Mirror Resynchronization) comes into play. When a mirror is split, an FMR map is created in memory. The FMR map is a bitmap, one bit per block, that tracks which blocks have changed. It doesn't matter whether a block was changed on the primary volume or the split volume. When the mirror is rejoined, the FMR map is scanned, and each block that is marked as changed is copied from the primary volume to the split volume (Figures 5 and 6). This is usually very quick, and when it completes, the volumes are completely mirrored.

The block size represented by a bit in the FMR map depends on the size of the volume being mirrored. The maximum size of the FMR map is 8 KB, which is 8192 bytes, or 65536 bits. For a 1-GB volume, the block size is 16 KB (1 GB = 1024 MB = 1048576 KB, and 1048576 KB / 65536 bits = 16 KB/bit). For a 4-GB volume, the block size is 64 KB. I think an 8-KB limit for an FMR map is too small. The maximum block size for the FMR map should be equal to the block size of an I/O on the system (usually 4-8 KB). Also, the larger the FMR map, the more granularly block changes can be tracked. I'd rather do a larger scan of the FMR map (very fast) to perform fewer (slow) I/Os. In the case of this article, we replicated/mirrored 16 volumes, which means we used 128 KB of our 4 GB of RAM for FMR maps.

Note that if the system is rebooted (on purpose or not) while a mirror is split (i.e., while the FMR map exists), then the next time the mirror is rejoined, a complete resynchronization takes place. This is because the FMR map is held in memory: if that map is lost, the system has no idea which blocks changed and will update them all. You can't protect yourself against a crash, but you do control a normal shutdown. Part of the shutdown process should be to rejoin the mirrors (add a "K" script in /etc/rc2.d) so that they are in sync at boot.
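Such a shutdown script might look like the following sketch. The volume names are the examples from this article; the script name and the /report mount points are assumptions:

#!/bin/sh
# Sketch of a shutdown "K" script (e.g., /etc/rc2.d/K01snapback) that
# rejoins the split mirrors before the in-memory FMR maps are lost.
DG=datadg
for VOL in vol01 vol02
do
        umount /report/$VOL 2>/dev/null        # ignore if not mounted
        vxassist -g $DG snapback SNAP-$VOL
done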

Setting It Up

Creating the mirrors of the replicated volumes is easy. Start by making sure that each disk group used for replication on the secondary has an unused disk assigned to it for the mirror. Then do the following for each volume:

1. Turn on the Fast Mirror Resynchronization (FMR) option. This only needs to be done once per volume:

vxvol -g disk_group_name set fmr=on volume_name
ex:
vxvol -g datadg set fmr=on vol01
2. Create the initial mirror. This takes a while because this is when the initial synchronization occurs:

vxassist -g disk_group_name snapstart volume_name
ex:
vxassist -g datadg snapstart vol01
Now the mirrors are created and synchronized. The mirrored volume, by default, is called SNAP-volume_name (e.g., SNAP-vol01). To split a mirror:

vxassist -g disk_group_name snapshot volume_name
ex:
vxassist -g datadg snapshot vol01
To resynchronize a mirror (this is where FMR comes into play):

vxassist -g disk_group_name snapback SNAP-volume_name
ex:
vxassist -g datadg snapback SNAP-vol01
Putting It All Together

Because we were plugging existing production servers into the MTI Vivant SAN, we were able to implement this setup with minimal downtime. We brought each server down long enough to add the Fibre Channel HBAs. After the HBAs were in the servers, we did not have to bring them down again. Oracle continued to run on the existing storage that was in the Sun systems. Once plugged into the SAN, we configured the disk groups and volumes on the primary server. We set up SRVM on the secondary server and then on the primary server. We then synchronized the primary and secondary servers using the dd command and finished by mirroring the replicated volumes on the secondary server. All of this took a couple of hours. After all this was completed, we did some simple tests (copied a file onto one of the new file systems on the primary server and verified that it appeared on the secondary server). Again, to this point there was no downtime for the servers or Oracle.

When everything was configured, synchronized, and tested, we were able to move the Oracle data to the MTI SAN. We then brought down Oracle. Once down, we copied the Oracle files to their new locations and brought Oracle up on the new volumes. Leaving the original data where it was allowed us the safety of bringing Oracle up on the old storage in case there was a problem.

To bring up Oracle on the reporting server, we had to:

1. Split the mirrors (once per volume):

vxassist -g disk_group_name snapshot volume_name
2. Fsck and mount each file system.

3. Start up Oracle and do a "recover database".

I scripted these steps, so they take only seconds to complete. Essentially, we've created a "system crash" scenario: because we didn't bring down the production Oracle database to do the snapshot on each volume, we don't have a consistent image of the database. The "recover database" command recovers from the redo logs/archive logs, and the database comes up fine (as if the system had crashed and rebooted). I played around with putting the production database in "hot-backup" mode and then doing the snapshot on the reporting server. When we did this, we were unable to open the "snapshotted" database; it was still in hot-backup mode, which makes perfect sense because we took the snapshot while the production database was in that mode. To use the "snapshotted" database, we had to do a "recover database" on it anyway. Because we were doing a "recover database" in either scenario, we took the easier, faster, and safer route and just snapshotted the replicated copy without touching the production database.
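For reference, here is a minimal sketch of such a bring-up script, using the example names from this article. The /report mount points, the oracle user, and the sqlplus invocation are assumptions; adjust for your environment:

#!/bin/sh
# Sketch: split the mirrors, mount them, and recover Oracle.
DG=datadg

for VOL in vol01 vol02
do
        vxassist -g $DG snapshot $VOL                  # 1. split the mirror
        fsck -F vxfs -y /dev/vx/rdsk/$DG/SNAP-$VOL     # 2. fsck ...
        mount -F vxfs /dev/vx/dsk/$DG/SNAP-$VOL /report/$VOL  # ... and mount
done

# 3. Start Oracle and recover, exactly as after a crash. "recover
# automatic" applies the needed logs without prompting.
su - oracle -c "sqlplus /nolog" <<EOF
connect / as sysdba
startup mount
recover automatic database
alter database open;
exit
EOF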

Summary

With this project, we implemented a system that the client could use to back up production databases, run reports, and prepare for disaster recovery without impacting the availability and performance of the production server. This solution, while currently occupying a couple of racks in their data center, can now be used to roll out a disaster recovery site or a remote reporting site.

Jim McKinstry is a Senior Sales Engineer for MTI Technology Corporation (www.mti.com). MTI is a leading international provider of data storage management products and services. He can be reached at: jrmckins@yahoo.com.