Managing Availability of SAP R/3
Ever since computers were invented, programmers, administrators, and, ultimately, users have suffered because of performance and availability problems. In the case of older environments, such as OS/360 (now OS/390) the reasons for problems and the methods to handle them have become well understood over time and more readily addressed with management tools.
With newer computing technologies, notably client/server applications, the problems have multiplied in both number and complexity. Nevertheless, certain basic techniques common to most computing technology can be applied to distributed environments to help administrators identify, or, better still, avoid frequent problems. Although this article focuses on managing availability of SAP R/3 applications, many of the methods discussed can be extrapolated for use with other business critical applications, such as Baan, Oracle Financials, or Lotus Notes.
The most important point is that any application, and R/3 is no exception, needs to have certain processes running before it can respond to user input. Examples with R/3 are the Oracle server process and the message server process. Normally, these are started and run until R/3 is shut down. Unfortunately, a process may occasionally die, and, as a result, nothing can run. Such a failure is always problematic, but imagine this happening in the early hours of the morning, with end users showing up for work only to find the system dead. The SAP internal alarms may or may not catch this situation, because they, too, may be affected by the process in jeopardy. To avoid such unpleasant surprises and their serious business consequences, a reliable means of detecting that a process is no longer running is extremely important.
No application can run properly without sufficient resources. The basic resources - CPU cycles, main and virtual storage, and disk storage - are supplied to the application processes by the operating system. In the case of R/3, this is usually UNIX or Windows NT. Since bottlenecks at this level usually affect the entire R/3 complex, the underlying operating system is the first place to look if system-wide problems occur.
If the main storage available is too small for the workload to be processed, it will cause excessive swapping or paging. If the swap space or the paging file is too small, this will result in poor performance across the board. Unfortunately, once the system is swapping or paging itself to death (a state known as thrashing), it is often difficult or impossible to start any tools to find out what is happening, because their effectiveness will also suffer.
In the worst case, the system may have to be restarted, leaving the administrator with little or no information to help her identify the original problem. Therefore, it is vitally important to have some form of monitoring in place that can provide early warning of impending problems by setting off an alarm if the swap rate starts to rise above acceptable thresholds.
Disk space problems also fall into this category. Once a file system or directory is full, there is little or nothing that can be done. If you don't want to spend your time in front of a Computer Center Management System (CCMS) screen making sure that all file systems are within comfortable limits, you must have some form of automatic, regular checking mechanism in place. Finally, your CPU might just be too slow to handle the workload. This is less common, because R/3, like most commercial applications, is not generally CPU-intensive. If inadequate CPU is a problem, it will happen at times of change, such as going into full production, or adding a new module. However, a poorly written ABAP (SAP's high-level programming language) program can eat CPU cycles, so monitoring for this type of excessive usage is mandatory to avoid a single program dragging down the entire system.
Since commercial applications are primarily concerned with storing and retrieving data, the performance of the database is one of the key factors for a successful installation. Unfortunately, the architecture of R/3 (see Figure 1) poses a major problem in this respect.
Most database applications define the fields of each record to the database and can therefore be tuned using the tools that came with the database itself. R/3, however, does not work this way. Instead, most R/3 tables consist, as far as the database is concerned, of a single file, the so-called "record bed." This single large file is split internally by R/3 into individual fields, such as name, address, and so on. For effective performance, it is vitally important that proper indices are maintained, and, due to this characteristic of the R/3 architecture, these must be maintained at both database and R/3 levels. Other problems that can occur with the database are usually documented in the log file, but there may be no free space in the log file or in the tablespaces for the database. Depending on the database type, this may result in a system simply stopping, with little or no warning. To avoid this, the messages sent to the log file should be permanently monitored.
There may also be problems with the physical storage of the database. This might result from frequently used files being stored on the same disks or controllers, leading to bottlenecks in data access. Or the database tables may need reorganizing. In an ideal world, critical tables would be reorganized daily or at least weekly, but some R/3 systems are so large that this is not practical, or the batch window may simply be too short for reorganizations except on the weekend. So, the database must be checked frequently to verify the fragmentation levels and to get a feeling for when performance starts to be negatively influenced.
When everything at the server level is working, the network may be at fault. Simple tests, such as ping-ing the R/3 application servers from the client can indicate whether a network problem exists. Changes to the hosts or the services file can result in being unable to reach R/3. The problem might also be on the client, for example, the TCP/IP installation.
A particular R/3 system is made up of "instances". The database server is an instance and so is each application server. A spool process runs within an instance, as does a dialog process or a batch process. Finding the correct mix is very important. Too many batch processes and too few dialog processes can cause poor response times for online users. The reverse will lead to excessive turnaround times for batch jobs.
Each instance has, in turn, a profile. In this profile, various parameters are stored that can influence the performance of R/3. One key parameter is the Shared Pool Space. If this runs out, serious problems result, but all R/3 does is write a message to a trace file. You have to check this file for the message "Shmget: Shared Pool Space exhausted. PoolKey=nn".
Checking the SAP system log using transaction SM21 is also an obvious first step when performance deteriorates. One interesting point: user-written programs, using SAP's high-level ABAP language, can write to the SAP system log. If this is done in a consistent and intelligent manner, then using a tool that monitors the SAP system log can provide full monitoring capabilities for your user-written ABAPs.
Now for an Update
One of the main features of the R/3 architecture is the concept of the database server. This is a separate process, or a separate physical machine, that processes all database I/O, and, more importantly, all updates to the database. While this feature provides for a database free of inconsistent data, it does constitute a potential bottleneck. It works like this: when a transaction is completed (e.g., the entering of a supplier's invoice) all the updates for that transaction are passed to the database server. A serial task is responsible for handling all the updates. If online clients are processing transactions faster than the update task can handle, a backlog builds up. This can cause the SAP internal lock-table to fill up, with sad consequences. You may have to increase the size of the lock table.
It is better, of course, to increase the throughput of the update task to avoid this situation. The key here, as for any online transaction processing system, is to minimize actual I/Os by setting buffer sizes properly. This is more of an art than a science, because the possible R/3 configurations run into millions, and each company has its own individual work pattern. There is no other single reason for poor performance that is as important as the SAP buffer pools. If buffers do not have a high enough hit ratio, then storage is being wasted that could benefit other buffers or spool tasks. If tasks are waiting for I/O, then the buffers are probably too small and are causing response times that are too high. There is no perfect recipe. You will have to monitor the buffers at times when the performance is good, and when it's not so good. The numbers you see will give you an indication of how you should set your buffers.
There have to be enough work processes defined to handle the workload. If the number is too small, someone will have to wait until another transaction is complete. If it is too high, you are wasting storage that could be used better elsewhere, such as to increase buffer sizes. The same applies to spool processes.
All these points must be borne in mind when managing an R/3 system. Any complex system requires management, and while a good system administrator can set up a system to get it running well, the situation changes immediately when external factors change. Examples are additional users, additional R/3 modules, additional R/3 systems using the same machines or network, additional disk capacity, growing file sizes, and many more.
It is the nature of client/server systems to be dynamic, because their very flexibility has led to their success. However, this places high demand on systems management. No matter how good an administrator is, there is a limit to what he or she can handle. This alone is justification for the acquisition and use of proper supporting tools.
Most of the potential issues mentioned can be determined either by using the SAP CCMS or by typing in UNIX commands such as ps -ef | grep gwrd | grep <SID>. But, this approach is equivalent to closing the barn door after the horse has bolted. It is not a good long-term solution to wait for disgruntled users to complain and then go looking for the problem. Commercial solutions, such as Boole & Babbage Command for R/3 are also available and can prevent many of the possible failures described in this article. With Command for R/3, harmful conditions can be proactively monitored, and effective thresholds and early warnings can be set. No tool will avoid all difficulties all of the time, but a substantial reduction in problems and downtime can be achieved with very little effort, freeing up your much-needed human resources to handle expansion and change.
About the Author
Frank Koensgen is a senior consultant with the Computer Software Organisation GmbH (CSO). Mr. Koensgen is an expert in customizing and tuning the SAP Base System, without which good SAP performance is impossible. Current SAP R/3 projects include release updates, client installation, monitoring, and data import/export. Frank Koensgen can be reached in Germany at +49 (0)2102 4084-0 or via email: email@example.com.