The Art of the Refresh

Bryon Beilman and Jamy Libsack

To support the activities of a 24X7 production environment, we have developed a method for refreshing 200 GB databases without affecting the production environment. This work is normally associated with the tasks of a Database Administrator (DBA), but by using the disk array tools and simple shell scripts, we can do the refreshes faster and more reliably than the DBAs.

To ensure that the production environment remains reliable, it is kept separate from the development environment. The model used to integrate new functionality into the production environment separates development, staging, and then production. Additionally, there are two direct copies of the production database used for reporting and support. The important point is that there are five copies of the database used for different reasons, and many of them must be copied or “refreshed” from the current production database, so that reports and debugging can take place on an exact copy (or at least close), without disrupting production.

We needed a method to quickly and accurately duplicate the data from production into reporting every day. Before we show how it is done, it is important to describe the hardware and technology we used to make this possible.

Figure 1 shows a diagram of the hardware. There are three machines connected to an EMC disk array and another disk array via Fibre Channel controllers. There is a production machine, a reporting machine, and a coordinating machine. The EMC array uses software called TimeFinder, which is capable of producing a BCV, which is a Business Continuation Volume. It allows the two different volume sets within the array to be synchronized very quickly. The BCV could be used as a snapshot of the current production environment for disaster recovery or backups. We are going to use it to create a new database with all the existing production data. The HP systems utilize the logical Volume Manager, which allows disk volumes to be exported and imported into a particular host.

The HP machines support the Peoplesoft application and utilize the Informix database. The methods presented here would work using Sun servers with Veritas Volume Manager and Oracle as well.

The EMC disk arrays were chosen because of their speed and the BCV functionality. There are other methods to do the refresh with similar results, and there may be other disk technologies that can do the same thing, so this process should be able to be easily customized. Table 1 shows the other methods of refreshing the databases that we use and their pros/cons. When analyzing the processes shown later, the BCV can be replaced by the methods in Table 1, and the remaining logic will still work. The dd method, which is a block level disk-to-disk copy, file restore, and export/import can be used between different disk arrays, which is an advantage.

Other Methods

Using the dd utility, the disk volumes can be replicated across arrays as long as the devices on each array are the same exact size. Using the example in Figure 1, the control host could be used to implement dd, with input file on the production array and the output going to the reporting array. This is one of the fastest methods to copy data between devices using generic UNIX tools. The primary disadvantage is that both the source and destination disks must be quiet, and no databases can be running.

A database tape restore is an excellent way to validate that the backups taken on a regular basis can indeed be restored. The results achieved in the above table were done using multiple DLT tape drives streaming in parallel. Only the destination devices need to be quiescent, but it is a slower method.

One additional way to refresh, that DBAs can use, is to export the data and schema to a file or files. These files are later read in to recreate the database. This allows selective filtering of data if needed, but takes the extra disk space for the export files. Because it is an application-level method, it will not be as fast as a raw I/O copy.

How It is Done

The nightly refresh script is performed from the Cntrl_host via cron and is connected to the same arrays and networks that the production and reporting system are using. The high-level method is shown as follows.

1. Unload the PeopleSoft database tables from the reporting database needed for security issues. This will be used later to recover the database after the refresh.
2. Stop the application servers, process schedulers, and database on the destination server and release the volume groups.
3. Synchronize the disks using BCVs.
4. Block the production database (pause it so transactions do not occur on production) and split the BCV. Once it is split, unblock the production database to continue processing.
5. Bring the refreshed volumes online on the destination server.
6. Start the reporting database on the destination server and change the table names from production to reporting.
7. Copy any surrounding applications files and query scripts that should accompany the database.
8. Load the PeopleSoft tables needed for security issues (previously unloaded in Step 1).
9. Start the Application Servers, then the Process Schedulers for the destination server.

By using the control host as the central coordinator, it can be the central point for many refreshes, including ones utilizing dd instead of BCVs. Because the devices can be exported and imported to and from various machines, the control host can take advantage of this to do the necessary work.

The whole process, which includes taking the applications and database both down and up, takes, on average 50 minutes here. The production database is blocked for about 90 seconds, and no transactions are lost. To keep track of how the process does nightly, many time stamps are inserted in the script. These are used to establish baselines for each section of the refresh script. When the process takes longer than expected, the source of the delay can easily be determined. The scripts are designed to be simple, with few moving parts, so they will be as reliable as possible.

The script (Listing 1) is called via cron using the following entry:

05 0 * * 0,1,2,3,4,5,6  /base/refresh.ksh | tee \
  /base/logs/refresh_db.log |
mailx -s "The reporting database was refreshed "
notify@yourdomain.com

There are various database tools within this script that are called, but the focus of this article is on the surrounding scripts and disk tools used to do this refresh quickly and efficiently. Stopping and starting the database, application servers, and process schedulers can vary, and the scripts are available via a Web site (www.sinecurve.com/refresh), but not specifically covered in this article. (Listings for this article are available at Sys Admin’s Web site: http://www.sysadminmag.com.)

Conclusion

We synchronize 200 GB of data every night in less than an hour, with a 90-second pause in production where no data or transactions are lost. This allows the reporting application and support teams to test against a current snapshot of production. The actual data synchronization takes about 15 minutes, and the starting and stopping of the environment takes approximately 30 minutes on the destination database. Using alternate hardware and method, we can synchronize it in 5 hours, which is necessary for machines not utilizing EMC hardware. Overall, this process can be automated.

About the Author

Bryon Beilman is a Senior Consultant with Collective Technologies. Bryon received a BSEE from the University of Colorado and immediately went to work as a UNIX systems administrator for StorageTek. He has worked for a number of companies in Colorado and the Silicon Valley. He has been doing systems administration for 10 years and has been consulting for Collective Technologies for the past two years. He can be reached at: beilman@colltech.com.

Jamy Libsack is a Senior Consultant with Collective Technologies. He received a BSME from Colorado State University and started his systems administration career at Shell Oil. Jamy has provided consulting expertise for numerous companies in the New England area and has been a systems administrator for 8 years. He is an expert in HP-UX and large storage arrays such as EMC. Jamy has been working for Collective Technologies for 5 years. He can be reached at: libsack@colltech.com.