Storage Capacity Planning

Dave D. Zwieback

The storage subsystem is the subject of a number of common misconceptions, which typically make it one of the most poorly configured parts of the system. Perhaps this can be explained by the "black box" effect: what actually happens inside a disk drive is not widely understood and is obscured by the misleading parameters touted by the disk's manufacturer. Unfortunately, as I will demonstrate in this article, it is generally not enough to use the biggest and fastest "black box" and to make sure that it is connected to the system by the fastest and widest SCSI controller available.

A Web server needs to be configured differently from a database host; furthermore, a read-only database system should be configured differently from one that is write-intensive. The bottom line is that you must have clear objectives for a particular system before beginning capacity planning. In this article, I will provide a methodology for completing the planning, as well as some tips and tricks along the way.

There are four specific areas of storage that must be examined when performing capacity planning:

1. Capacity (ability to store the necessary amount of data)

2. Speed (ability to read or write the data at the required speed)

3. Redundancy/data integrity (having the requisite data integrity and availability)

4. Ability to grow

Note that technical details presented here are specific to Solaris; however, the concepts apply to most UNIX systems, and beyond.

Capacity

Perhaps one of the most misleading parameters of a disk drive (or tape) is its capacity. While most think of a megabyte (MB) as 1,048,576 bytes, the storage industry consistently uses MB to mean one million bytes. Likewise, a gigabyte (GB) is not 1,073,741,824 bytes but one billion bytes. In today's large disks, this differential is quite noticeable: an 18-GB capacity reported on the label turns out to be almost 1.3 GB less than one would expect! In general, there is a 5-7% differential between the number on the disk label and the actual storage capacity in bytes.
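
To see the gap concretely, here is a quick back-of-the-envelope check in Perl (a sketch only, using the 18-GB example above):

  #!/usr/bin/perl
  # Back-of-the-envelope check of the label-vs-binary gap for an "18-GB" disk.
  my $label_bytes = 18 * 10**9;            # vendor gigabyte: one billion bytes
  my $binary_gb   = $label_bytes / 2**30;  # gigabyte as 1,073,741,824 bytes
  printf "18 GB on the label is %.2f GB; %.2f GB less than expected\n",
      $binary_gb, 18 - $binary_gb;         # prints 16.76 and 1.24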

Additionally, there is a considerable difference between the unformatted and formatted capacity of a disk drive due to the Error Correcting Codes (ECC) and other formatting information used by the drive. Make sure that the vendor is quoting the formatted capacity, since the difference may be up to 10%.

If that were not enough, once the disk is partitioned, a considerable part of the storage capacity is "lost" due to the overhead imposed by the file system. Thankfully, two key file system variables - inode density and minfree - are tunable, which allows you to "rescue" considerable disk resources.

By default, Solaris' newfs(1M) reserves 10% of the partition space for a free space reserve (minfree, configured by newfs -m). This reserve is used in emergency overflow situations, which is the reason that df sometimes reports file system space utilization in excess of 100%. While the reserve is extremely useful, on an 18-GB disk a whopping 1.8 GB would be allocated to it by default! It is generally more appropriate to set minfree to 1%, except on the root file system or on one that is very small.

The second important tunable parameter is inode density. Solaris uses one inode for every file or ACL in a file system. By default, newfs uses an inode density of 2 KB (one inode is allocated for every 2048 bytes of usable space). On an 18-GB disk, about 9,000,000 inodes would be created by default. Such a large number of inodes may be necessary for a file system that is, for instance, used to store Usenet feeds. However, most applications require a much lower inode density, typically in the range of 16-64 KB. For large DBMS partitions, an inode for every 1 MB of data space should be sufficient, eliminating about 99.8% of the default inode overhead.
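
To put numbers on both tunables, here is a rough Perl sketch (the figures assume an 18-GB vendor-rated partition and a 64-KB inode density; actual values depend on disk geometry and the newfs defaults for your release):

  #!/usr/bin/perl
  # Approximate overhead on an 18-GB (vendor) partition:
  # default newfs settings vs. minfree of 1% and one inode per 64 KB.
  my $bytes = 18 * 10**9;

  my $minfree_default = $bytes * 0.10;     # 10% free space reserve
  my $minfree_tuned   = $bytes * 0.01;     # newfs -m 1

  my $inodes_default  = $bytes / 2048;     # one inode per 2 KB
  my $inodes_tuned    = $bytes / 65536;    # newfs -i 65536

  printf "minfree reserve: %.1f GB by default, %.1f GB at 1%%\n",
      $minfree_default / 10**9, $minfree_tuned / 10**9;
  printf "inodes: %d by default, %d at 64-KB density\n",
      $inodes_default, $inodes_tuned;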

Capacity Revisited

It is safe to say that most sysadmins have at one time or another dealt with complaints about a lack of storage capacity. Obviously, this happens after the system is already set up, and it often takes the sysadmin by surprise. In the simplest case, the sysadmin can quickly add more storage; since this is not always feasible, however, additional capacity planning is often required.

This common scenario demonstrates that planning for disk capacity is often a continual process. Thus sysadmins need tools that enable them to monitor and react to storage usage trends proactively.

One such category of tools is Hierarchical Storage Management (HSM) systems, which archive unused or old files to offline storage, freeing valuable disk space. If an archived file is requested, it is retrieved automatically, albeit with a slight delay. (There are also homegrown systems that simply compress rarely used files). This may be an effective solution, depending on the nature of the files being archived. Such archival systems vary tremendously in functionality and warrant a separate discussion. For an introduction, I recommend reading the comp.arch.storage FAQ.

Another way to keep track of changing disk utilization is with a utility like the (exceedingly simple) script that I provide in the appendix. dcp (disk capacity profiler) is a Perl script and should be run at regular intervals from cron. dcp accepts a file system name as an argument and reports the number of hours (and days) before the specified file system is full, based on the average historical rate of usage. I have used this script to monitor /var/mail, which exhibited fairly regular growth patterns, and was able to address the decreasing available space well before it became a problem. Of course, this script does not protect against a sudden spike in usage, but it is fairly effective, especially if run at intervals appropriate to a particular file system (for instance, twice daily). Most system monitoring packages (OpenView, Unicenter, etc.) provide similar profiling tools.
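
The actual dcp listing appears in the appendix; the sketch below is only a minimal illustration of the idea (it assumes Solaris df -k output and keeps its history under /var/tmp; the script name and paths are arbitrary):

  #!/usr/bin/perl
  # Minimal disk capacity profiler sketch: run from cron, for example
  #   0 8,20 * * * /usr/local/bin/dcp.pl /var/mail
  # Appends a timestamped usage sample for the file system to a history
  # file, then estimates hours until full from the average growth rate.
  use strict;

  my $fs = shift or die "usage: $0 filesystem\n";
  (my $tag = $fs) =~ s{/}{_}g;
  my $hist = "/var/tmp/dcp$tag.log";

  # Sample current usage and capacity (in Kbytes) from df -k.
  my @df = `df -k $fs`;
  my (undef, $kbytes, $used) = split ' ', $df[1];

  # Append the sample: epoch seconds and Kbytes used.
  open HIST, ">>$hist" or die "cannot append to $hist: $!\n";
  print HIST time(), " $used\n";
  close HIST;

  # Compute the average growth rate over the recorded history.
  open HIST, $hist or die "cannot read $hist: $!\n";
  my @samples = map { [split] } <HIST>;
  close HIST;
  exit 0 if @samples < 2;

  my ($t0, $u0) = @{$samples[0]};
  my ($t1, $u1) = @{$samples[-1]};
  my $hours = ($t1 - $t0) / 3600;
  exit 0 unless $hours > 0;
  my $rate = ($u1 - $u0) / $hours;                 # Kbytes per hour

  if ($rate > 0) {
      my $left = ($kbytes - $u1) / $rate;
      printf "%s: full in about %.0f hours (%.1f days)\n",
          $fs, $left, $left / 24;
  } else {
      print "$fs: no measurable growth\n";
  }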

Finally, the role of system policies should not be overlooked in disk capacity planning. For instance, you might have a policy that each user is only allocated 50 MB of disk space, and this policy might be enforced with quotas. Another example is the /var/mail mentioned above: once notified, the company management issued a policy that allowed storage of only the last 30 days of mail (for legal liability reasons). This policy was enforced with nightly mail purges, and the disk usage has since stabilized.

Speed: The Disk Drive

Since typical disk drives are about 30,000 times slower than typical memory, the speed of the storage subsystem is crucial to the overall performance of the system. Unfortunately, it is rarely as simple as buying the biggest and the fastest equipment, despite what most vendors would have you believe. Once again, several of the parameters quoted by disk drive manufacturers are misleading.

Let's start with the fact that, as mentioned above, storage vendors consider a MB to be one million bytes. Thus, a quoted 20 MB/sec burst speed would deliver about 5% (971,520 bytes per second) less than one would expect.

Burst speed itself is another often-touted parameter: it is the rate at which the disk drive's embedded controller transfers data to the SCSI bus. This speed can be sustained only when the disk can supply the data as fast as the host can consume it, which happens only in the ideal situation where the data is already in the controller's cache and the host is not busy dealing with other data. In reality, considerable time is spent seeking and reading from (or writing to) the disk; thus, a single disk with a 40 MB/sec burst speed is unlikely to be much faster than a similar disk with a 10 MB/sec or 20 MB/sec burst speed.

One parameter that is much more indicative of the true performance of the drive is the range of internal transfer speeds. This is the range of speeds at which data is read from or written to the disk platters. It is always lower than the burst speed (sometimes an order of magnitude lower), and it almost always governs the overall speed of the transfer.

You should also be cautious of vendors' use of the average seek time, especially since the maximum seek time can be much longer than twice the average. Moreover, the average seek times do not vary significantly between commodity disk drives. This means that for most servers (where disk activity is typically random-access) the most important parameter to consider in choosing disks is their rotational speed. In practice, disk drives that have the same rotational speed perform about the same, regardless of their size and capacity.

Speed: The SCSI Bus

There are three electrical signaling conventions used in SCSI disk drives today: single-ended, differential, and low-voltage differential (LVD). The main difference between them is the length of cable that they are able to support. This is obviously important, especially since the "wide" SCSI bus is capable of supporting up to 16 devices, and the wiring within these devices counts in the total cable length. However, there is no significant performance difference between the three signaling conventions.

The SCSI standard defines several ways of increasing throughput on the bus. The "wide" approach increases throughput by doubling or quadrupling the width of the data bus from 8 to 16 or 32 bits; it also doubles the maximum number of devices that can be connected to the bus to 16. The "fast" method increases throughput by raising the bus clock rate from 5 to 10, 20, or even 40 MHz. Combined, "fast/wide" configurations are able to achieve 80 MB/sec throughput (160 MB/sec with Ultra-3). Table 1 contains a summary of the three SCSI standards in use today and their characteristics.
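
The quoted throughputs follow directly from the bus width and clock rate; as a quick check (using an Ultra-2 wide bus as the example):

  #!/usr/bin/perl
  # Bus throughput in MB/sec = (bus width in bytes) x (clock rate in MHz).
  # For example, a 16-bit wide bus clocked at 40 MHz (Ultra-2 wide):
  my ($width_bits, $clock_mhz) = (16, 40);
  printf "%d MB/sec\n", ($width_bits / 8) * $clock_mhz;   # 80 MB/sec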

The SCSI standard has changed over the years to address the ever-increasing need for more bandwidth, as well as an ability to connect more devices over longer distances. Fibre Channel (FC) is part of the SCSI-3 standard that addresses all of these issues extremely effectively. Specifically, FC allows distances up to 10 km and transfer rates up to 100 MB/sec over optical fiber, as well as an ability to connect a large number of devices in a variety of topologies. Refer to the bibliography for more information on FC.

Speed: RAID

RAID implementations are most often mentioned in the context of preserving data integrity and availability (a discussion of this topic follows). However, RAID is also instrumental in maximizing the performance of the disk subsystem. RAID 0, RAID 0+1 (or RAID 1+0), and RAID 5 are the most commonly used implementations that can accomplish exactly this.

The best performing RAID type is RAID 0 (striping), which derives its improved performance from a more efficient utilization of the disks and the SCSI bus. In a RAID 0 configuration, small chunks of data are read from and written to physically different disks. One negative aspect of RAID 0 is that the failure of a single member disk renders the entire RAID set useless. Mirroring is often added to striping to improve reliability while preserving most of the performance benefits of RAID 0. RAID 1+0 (mirroring + striping) and RAID 0+1 (striping + mirroring) also have the fringe benefit of reducing drive utilization for reads: since mirroring stores identical data on at least two disks (sub-mirrors), reads can be serviced by the least busy sub-mirror.

The difference between RAID 0+1 and RAID 1+0 is the order in which mirroring and striping are applied. RAID 0+1 stripes a number of disks together and then mirrors the data onto an identical set of disks. In this case, if one of the disks in either sub-mirror becomes unavailable, the whole sub-mirror becomes unavailable as well. RAID 1+0 mirrors a number of disks and then stripes across the mirrors. This configuration significantly improves data availability, since it can withstand the failure of up to half the total number of disks (provided no two disks within any mirror fail at the same time). Check with your RAID software or hardware vendor to see which mirroring and striping combination they implement.

RAID 0+1 and RAID 1+0 are twice as expensive as a similar RAID 0 configuration because they require twice as many disk drives. RAID 5 offers decreased cost while maintaining a reasonable level of integrity, but it can be considerably slower than RAID 1+0 or RAID 0+1, especially for writes. It does provide good random access performance suitable for most multi-tasking systems, and is thus the most widely used of the RAID types.

RAID functionality can be implemented either on the host (with software such as Veritas Volume Manager or DiskSuite) or embedded in a SCSI controller. In the former case, there is typically a 5-10% penalty in the host's CPU utilization. Most hardware RAID controllers also offer extra cache, which can speed up I/O considerably (especially if the cache has a battery backup, which allows the controller to safely commit the data to the faster cache memory and write it to disk later). For this reason, hardware RAID controllers are recommended for write-intensive RAID 5 implementations.

Speed: Putting It All Together

Configuring the storage subsystem for speed requires an intimate knowledge of the applications that will utilize the disks. Is the application write-intensive, or is it mostly read-only? Does the application perform sequential reads (as in database table scans), or is the system mostly random-access (as with multi-user systems, where users are performing a variety of tasks)? Below are some general tips and tricks for getting the best performance out of the disk subsystem:

Never exceed the maximum SCSI cable length. Performance and reliability cannot be assured if the prescribed length is exceeded. In the total length, you need to count the wiring inside the disk drives and add about a foot for each pair of connectors that the signal must cross.

Separate high-bandwidth from low-bandwidth devices. Typically, slower storage devices such as tape and CD-ROM drives (as well as disk drives with lower burst speeds) result in higher SCSI bus utilization, which may impede the throughput from higher-bandwidth disk drives. Of course, if the low-bandwidth devices are not utilized at the same time as the high-bandwidth devices (for instance, if most of the disk activity is during the day and backups take place only at night), their co-existence on the same SCSI bus is inconsequential. Otherwise, it is advisable to separate the devices onto different SCSI buses.

Separate frequently used partitions. If two frequently used partitions are on the same disk, the disk is likely to suffer from high utilization, which leads to extremely poor performance. This condition can be greatly reduced if the partitions reside on different disks. If two "hot" partitions must be on the same disk, ensure that they are located close to each other to minimize seek time.

Ensure no more than 60% disk utilization. Higher utilization leads to increased disk response time, which in turn severely degrades performance. To remedy this, separate frequently used partitions onto different disks.

Separate sequential from random-access partitions. Sequential performance suffers tremendously from being mixed with random-access requests. A typical disk is capable of about 1000 I/O operations per second in sequential mode compared to less than 200 in random access mode. For instance, it is beneficial to separate the partition where database data is stored (mostly sequential table scans) and the transaction logs (mostly random-access writes) onto different disks. Additionally, partition performance can be substantially increased if it is placed in the fastest part of the disk, which for most disks is in the lowest numbered cylinders.

Separate sequential from random-access disks. In sequential mode, as few as 4 disks can easily saturate a fast/wide SCSI bus, while up to 90 disks can be configured on the same bus for random-access conditions. In general, for sequential mode, the optimal number of drives per SCSI bus can be calculated by dividing the maximum SCSI bus burst speed by the average of the internal transfer speed range of individual disks. For instance, an 80 MB/s Ultra-2 bus can reasonably support at most 10 9.0-GB disks with the internal transfer speed range of 6.2-9.4 MB/sec. Regardless of the number of disks, random-access conditions rarely lead to a saturated SCSI bus; thus, in most cases, it is safe to configure as many disks as the cable length restriction allows.
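
The rule of thumb reduces to one line of arithmetic; using the example figures above:

  #!/usr/bin/perl
  # Sequential-mode sizing: drives per bus =
  #   maximum bus burst speed / average internal transfer speed.
  my $bus_speed = 80;               # MB/sec, Ultra-2 wide
  my ($lo, $hi) = (6.2, 9.4);       # MB/sec, internal transfer speed range
  my $avg       = ($lo + $hi) / 2;  # 7.8 MB/sec
  printf "about %d drives per bus\n", $bus_speed / $avg;   # about 10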

Separate sequential from random-access disks, again. Sequential performance is restricted only by the internal transfer time, so for sequential data use disks that have the best internal transfer time. Random-access performance is limited mostly by rotation speed, so for this purpose, pick disks that have the highest rotational speed (measured in revolutions per minute, or RPM) which today goes as high as 10,000 RPM.

Divide the I/O among many disks with striping. For both sequential and random-access requests, an appropriate striping configuration can greatly improve the performance. Use RAID 0, RAID 0+1, RAID 1+0, or RAID 5 to improve sequential disk performance. Use RAID 0, RAID 0+1, or RAID 1+0 for random-access writes.

Set the appropriate chunk and stripe size for best sequential striping performance. Sequential access performs best when the I/O is split between at least four disks. That is, stripe size should match the typical I/O size, and the chunk size should be such that at least four disks are utilized. For instance, if the typical sequential I/O size were 8 KB, the stripe size should also be 8 KB, and the chunk size should be 2 KB so that four drives are evenly utilized. In any case, the chunk size should never be smaller than 2 KB.

Set the appropriate chunk and stripe size for best random-access striping performance. In the random-access scenario, it is often difficult to determine the typical I/O size, in which case the best choice of chunk size is 64 KB or even 128 KB in order to minimize drive utilization. If the typical I/O size is known, the chunk size can be calculated by dividing the typical I/O size by the number of member disks (or the number of member disks minus one for RAID 5). For instance, if the typical I/O size is around 60 KB and there are six disks in a RAID 5 configuration, the chunk size should be 60 KB/(6-1) = 12 KB.
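
Both sizing rules reduce to simple arithmetic; using the example figures from this tip and the previous one:

  #!/usr/bin/perl
  # Sequential: a typical 8-KB I/O spread evenly over four drives.
  printf "sequential chunk size: %d KB\n", 8 / 4;             # 2 KB

  # Random access, RAID 5: typical I/O size / (member disks - 1).
  my ($io_kb, $disks) = (60, 6);
  printf "RAID 5 chunk size: %d KB\n", $io_kb / ($disks - 1); # 12 KB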

Implement combinations of RAID to fit particular needs. Don't settle on only one RAID type for the whole system; implement different RAID levels for each part of the system based on the predominant I/O characteristics. (RAID 5 is not particularly suited for write-intensive applications, while RAID 0 is perfect for them).

Redundancy/Data Integrity

Different systems (or different parts of the same system) may have varying requirements for the level of redundancy or data integrity. It's a simple truth that disks (and controllers) will fail, and the guiding configuration question is simply how much downtime a particular system can afford.

There are three key concepts in this area. Mean Time Between Failures (MTBF) is usually quoted by the disk vendors, and it is the average time between failures of a disk drive. Mean Time To Data Loss (MTTDL) emphasizes data reliability; MTTDL has the same value as MTBF for a single disk but can be substantially improved in some (but not all) RAID configurations. The last concept is Mean Time To Data Inaccessibility (MTTDI), which measures overall availability of the data, and depends not only on the MTBF and MTTDL of the disks in the subsystem, but also on the reliability of the host controllers. All three parameters are usually measured in hours. Note that MTBF, MTTDL, and MTTDI are statistical parameters (averages), and do not guarantee that the actual disk subsystem will or will not fail after the prescribed number of hours. Rather, these values are useful in comparing the relative redundancy/data integrity of various disk configurations.

Specifically, while the MTBF of a newer single disk can be around 1,000,000 hours, the MTBF of a disk subsystem consisting of 1,000 such disks is drastically reduced to 1,000 hours, or about 42 days! (Better have a spare ready.) This is exactly the case with RAID 0: since the loss of a single disk renders the whole stripe unusable, the subsystem's MTBF is also its MTTDL. Mirroring (RAID 1) improves the MTTDL tremendously: for a two-way mirror, the MTTDL is the square of the individual disk MTBF, or over 100 million years for 1,000,000-hour-MTBF drives! The reliability of RAID 0+1 and RAID 1+0 is comparable to that of RAID 1 configurations; however, RAID 1+0 delivers significantly improved MTTDI, since up to half of the total number of disk drives can fail before data becomes unavailable.

For RAID 5, failure of two member disks will result in loss of data, so its MTTDL is the square of the MTBF of individual disks divided by one less than the total number of disks in the RAID set. In all the RAID scenarios, hot sparing will improve MTTDL.
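
These relationships are easy to compare numerically. The sketch below applies the simplified formulas above (which ignore repair and rebuild time) to 1,000,000-hour drives, the 1,000-disk stripe mentioned earlier, and a hypothetical six-disk RAID 5 set:

  #!/usr/bin/perl
  # MTBF/MTTDL comparison using the simplified formulas in the text.
  my $mtbf = 1_000_000;                    # hours, single drive

  my $raid0_mttdl  = $mtbf / 1000;         # 1,000-disk stripe: any failure loses data
  my $mirror_mttdl = $mtbf ** 2;           # two-way mirror
  my $raid5_mttdl  = $mtbf ** 2 / (6 - 1); # six-disk RAID 5 set

  printf "RAID 0, 1000 disks: %d hours (about %.0f days)\n",
      $raid0_mttdl, $raid0_mttdl / 24;
  printf "two-way mirror:     %.1e hours (over %.0f million years)\n",
      $mirror_mttdl, $mirror_mttdl / 24 / 365 / 10**6;
  printf "RAID 5, 6 disks:    %.1e hours\n", $raid5_mttdl;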

For a proper calculation of MTTDI, one must consider the MTBF of each of the components of the disk subsystem, including any cabling, RAID controllers, and host controllers. For instance, if all of the system's disks reside on a single controller, the failure of this controller (or one of the cables) will render the data unavailable, although no disks have failed and data integrity is preserved. MTTDI is not formally defined in the industry, but is still a very useful metric for the system design.

Of course, you still must find a perfect balance between the cost of a particular RAID implementation, the needed reliability, and desired speed. That is the reason that a mix-and-match approach to RAID levels within a system is recommended.

Ability to Grow

Once the system is in production, situations often arise that require additional disk space to be added to the system. This can be a relatively painless process - as simple as adding more disks - provided the system is properly prepared. Host-based RAID software like Veritas Volume Manager (and to some extent the older Solstice DiskSuite) can be quite useful in such situations. With the use of such software (and, to be fair, with some of the hardware-based RAID systems available today), one can painlessly grow file systems, add additional RAID protection, and move partitions between disks. Additionally, having more than one SCSI controller will not only improve performance and reliability, but also allow for future growth.

Conclusions

In this article, I provided an overview of the components of the disk subsystem that need to be taken into account when configuring systems for optimal performance. In general, effective capacity planning is based on a thorough understanding of all of the system components involved. Without an all-inclusive approach, systems tend to be either severely strained or underutilized. The goal of capacity planning - and the philosophy behind it - is to attain a state of perfect balance between cost and performance.

Bibliography

Wong, Brian L. 1997. Configuration and Capacity Planning for Solaris (TM) Servers. Prentice Hall.

Cockcroft, Adrian. 1994. Sun Performance and Tuning: Sparc & Solaris. Prentice Hall.

Gray, Jim (Editor). 1993. The Benchmark Handbook for Database and Transaction Processing Systems, Second Edition. Morgan Kaufmann Publishers. http://www.benchmarkresources.com/handbook/introduction.asp

SCSI Trade Association: http://www.scsita.org/whitepaper/techinfo.html

Adrian Cockcroft's frequently asked questions:
http://www.sunworld.com/common/cockcroft.letters.html

SunWorld columns, Performance Q&A by Adrian Cockcroft: http://www.sunworld.com/sunworldonline/common/swol-backissues-columns.html#perf

SunWorld's Site Index, Storage: http://www.sunworld.com/common/swol-siteindex.html#storage

Answers to frequently asked questions for comp.arch.storage:
http://alumni.caltech.edu/~rdv/comp-arch-storage/FAQ-1.html

What is SCSI: http://www.whatis.com/scsi.htm

Ultra SCSI to Fibre: The Preferred Performance Path:
http://www.quantum.com/src/whitepapers/fibre

About the Author

Dave D. Zwieback is the Technical Director of inkcom (www.inkcom.com), a New Jersey-based consultancy specializing in UNIX, Networks, Security, and Internet/Intranet development. He can be reached at: zwieback@inkcom.com.