Article

Setting Priorities

Larry Reznick

nice(1) is nice. By prefixing a command line with the word "nice," SVR4 users can reduce the impact of their programs by 10 levels of priority. If users don't run nice, the system's default priority, which is set to level 20 out of 40 total levels, applies. Increasing the niceness of user programs by 10 can make a real difference to system performance. But just try to get your users to do that regularly.

renice(1) is nicer. This BSD extension to SVR4 lets users change the priority of their own jobs. Only the system administrator can reduce a job's niceness, making the job take more of the system's attention. But users can make their own jobs nicer to the system by setting one of the 20 niceness levels for any or all of their processes. renice has no default priority, though. Users must explicitly name some priority change amount to apply to all of the process IDs (pids) named. Again, this doesn't happen automatically. It's a dirty job and the system administrator is probably going to get stuck with it.

On SVR4, renice's work is done with priocntl(1). As with renice, all users can apply priocntl to their own jobs, but only a root user can apply it to any job. priocntl has more control over process scheduling than nice or renice. System administrators can also change the scheduler's priorities for handling all jobs using dispadmin(1M).

Some of this process priority information is available from ps(1). SVR4's version of ps shows detailed information when used with the -elf option. However, two other options show additional ps information useful for priocntl: -j gives process group ID numbers (pgid) and session ID numbers (sid), and -c gives scheduling classes and global priority levels. Figure 1 shows the information generated when these two options are added to the -elf option. Notice that -elf shows the PID (process ID), the PPID (parent process ID), the C (CPU's utilization percentage), the PRI (process priority), and the NI (nice number). The priority numbers in the PRI column show niceness -- that is, the higher the number, the lower the priority -- but on a larger scale than the NI numbers, which are constrained into the nice command's range.

ps's -j option adds the PGID and SID numbers between the PPID and the C columns. These numbers are useful in priocntl when you want to display (priocntl -d option) or set (priocntl -s option) priorities by specific group or session IDs (add the -i pgid option or the -i sid option to priocntl's command line).

ps's -c option is more significant when using priocntl. Using it replaces the C column with a CLS (class) column and changes the form of the PRI column to show the system's global priority levels. The CLS column shows abbreviations for the three job scheduling classes.

Global priority levels in the PRI column are very different from niceness numbers in two ways. First, the numbers show priority, not niceness -- that is, the higher the global priority number, the higher the priority. Second, the global priority numbers go very much higher than any niceness numbers. Figure 2 shows an abbreviated sample output of ps -elf, while Figure 3 shows a similarly abbreviated sample of ps -elfjc. In those figures, the ADDR, SZ, WCHAN, STIME, TTY, and TIME fields have been cut.

Job Scheduling Classes

While many system administrators are familiar with niceness levels, not all may be familiar with global priority levels. These levels are configurable in the kernel. On SVR4, jobs are in one of three classes: time-sharing, system, and real-time. Altogether, these three classes comprise the 160 default global priorities.

Time-sharing processes are the typical processes run by every UNIX system. These processes vary their priorities, sharing their use of the CPU with other processes. Time-sharing processes have the lowest global priority levels, ranging from 0 to 59.

System processes are the processes run by the kernel, such as those run by init(1M) and configured in /etc/inittab(4). These processes don't vary their priorities. User processes are never in the system class, even when they call the kernel to do some work. System processes have global priority levels ranging from 60 to 99, so the lowest system class priority level is higher than the highest time-sharing priority level.

Real-time processes are so critical that they take precedence over system class processes. The lowest real-time priority level is higher than the highest system class priority level. Like system processes, and unlike time-sharing processes, real-time processes use a fixed priority scheme. Global priority levels for real-time processes range from 100 to 159. Once a real-time process enters the scheduler, no lower-priority process -- not even a system process -- will get control again until the real-time process finishes, sleeps, or blocks.
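
The three ranges just described can be captured in a few lines of shell. This is only an illustrative sketch: the function name class_of is hypothetical, the boundaries are the default values quoted in this article (a reconfigured kernel may differ), and the abbreviation SYS for the system class is an assumption, while TS and RT are the names used with dispadmin and priocntl.

```shell
# class_of GLOBAL_PRI
# Print the scheduling class that owns a global priority level, using the
# default ranges described above (assumed here, not read from the kernel).
class_of() {
    pri=$1
    if [ "$pri" -ge 0 ] && [ "$pri" -le 59 ]; then
        echo TS         # time-sharing: 0-59
    elif [ "$pri" -ge 60 ] && [ "$pri" -le 99 ]; then
        echo SYS        # system: 60-99
    elif [ "$pri" -ge 100 ] && [ "$pri" -le 159 ]; then
        echo RT         # real-time: 100-159
    else
        echo unknown    # outside the 160 default levels
    fi
}

class_of 59     # top of the time-sharing range
class_of 60     # bottom of the system range
class_of 100    # bottom of the real-time range
```

The three calls print TS, SYS, and RT, which shows how a single step across a boundary moves a process into a class that outranks everything below it.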

/etc/conf/cf.d/mtune(4) contains several tunable parameters for the scheduler. Figure 4 shows default settings for some of the scheduler's parameters. RTMAXPRI defines the real-time class's maximum priority within the class. TSMAXUPRI defines the time-sharing class's maximum user-settable priority, which users may change using priocntl. The value in the default column, 20, sets both bounds, so user priorities range from -20 to +20. MAXCLSYSPRI identifies the maximum number of system class priorities. RTNPROCS and TSNPROCS identify the number of process levels for the real-time and time-sharing classes.

Scheduling Priorities

Once a process has used up or voluntarily given up its time slice, the scheduler is free to give another process a time slice. Ideally, the system's CPU resources are spread evenly across all jobs, but the scheduler rewards nice jobs and penalizes piggy jobs in the time-sharing class. There is no reward and penalty scheme for the system class and the real-time class. The system class is off limits to all users -- even system administrators unless they're changing the kernel. priocntl gives users control over their own time-sharing and real-time processes, and gives administrators control over all time-sharing and real-time processes. dispadmin gets or sets a class's priority tables, although only a root user can set the tables.

Figure 5 shows the time-sharing table's data, as output by the command

dispadmin -g -c TS

Detailed information about the time-sharing dispatcher parameter table is in ts_dptbl(4). Each row represents one priority level.

The first column in any row is the quantum, which is the amount of CPU time given to a process at that row's priority level. The RES value shown at the top of the dispadmin output indicates the resolution of the quantum column in fractions of a second. RES is set at 1000 by default, so the quantum column shows milliseconds. The system's clock tick is defined by HZ in /usr/include/sys/param.h and is echoed in /etc/default/login. HZ is the real resolution of your system in clock ticks per second. The tables may show millisecond, microsecond, or even nanosecond resolution, but any quantum with finer resolution than a clock tick is rounded up to the next whole tick.

For example, my Esix SVR4 system and a client's SCO system both use HZ=100, so each clock tick lasts 10 milliseconds. On those systems, priority level 0 uses 1000 milliseconds (1 second) and priority 59 uses 100 milliseconds (.1 second). If any quantum were not a whole number of ticks, such as 1024 milliseconds, which leaves a remainder of 4 when divided by 10, the system would round it up to the next tick. So, 83 milliseconds would round up to 90 milliseconds, 257 milliseconds would become 260 milliseconds, and 1024 milliseconds would become 1030 milliseconds. To see the resolution in HZ rather than in the default milliseconds, add the -r option, such as:

dispadmin -g -c TS -r $HZ

Using -r changes the resolution, and in this case would set it to the HZ value.
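
If you want to check what a quantum becomes after rounding, the arithmetic is a ceiling division. The sketch below is not part of dispadmin; round_quantum is a hypothetical helper, and it assumes each clock tick lasts 1000/HZ milliseconds (10 milliseconds when HZ is 100) and that HZ divides 1000 evenly.

```shell
# round_quantum QUANTUM_MS HZ
# Round a quantum given in milliseconds up to a whole number of clock ticks.
# Assumes HZ divides 1000 evenly, so one tick lasts 1000/HZ milliseconds.
round_quantum() {
    quantum_ms=$1
    hz=$2
    tick_ms=$((1000 / hz))
    ticks=$(( (quantum_ms + tick_ms - 1) / tick_ms ))   # ceiling division
    echo $((ticks * tick_ms))
}

round_quantum 100 100    # already a whole number of ticks
round_quantum 83 100     # rounds up to the next tick
```

With HZ=100, the first call prints 100 unchanged and the second prints 90, the next multiple of the 10-millisecond tick.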

Recall that the dispatcher table's priority level numbers show higher numbers for higher priorities. That means processes running at 0, the lowest priority in the 60 time-sharing levels, are allowed the most time, and processes running at 59, the highest priority, are given the least time. Processes that don't run often are allowed to run for a long time when they finally do run. High priority processes run frequently but briefly.

The second column identifies the priority level to use the next time the process gets its turn if its current time slice expires. A process's time slice expires when the process runs without sleeping and exhausts its time. Other interruptions pause the process's clock. Returning from the interrupt continues the process's clock right where it left off.

Notice that in Figure 5 the expired processes have their priorities reduced by 10 levels if they're high-priority, but by half if their priority is lower than 20. Level 59, the highest priority level in the time-sharing class shown, gets a quantum of 100 milliseconds. If a process at that level uses up all of its time without sleeping, next time the process will run at level 49, allowing any higher priority process to run first. At that new level, the process gets a 200-millisecond slice, but if the process uses all of that up next time, it will be reduced to level 39 (400 milliseconds, not shown on the abbreviated table), then to level 29 (600 milliseconds), then to level 19 (800 milliseconds), then to levels 9, 4, 2, 1, and finally 0 (1000 milliseconds each). Such a voracious process would get big time slices, but with decreasing frequency, so as to avoid dragging the system down.
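
The descent of a CPU-bound process can be traced with a short loop. This is a sketch of the pattern described for Figure 5, not the kernel's actual table lookup: next_on_expiry is a hypothetical function, and the rule it applies (subtract 10 at level 20 and above, halve below that) is inferred from the levels listed above.

```shell
# next_on_expiry LEVEL
# Priority a time-sharing process drops to after using its whole quantum,
# following the pattern described for Figure 5 (an inferred rule).
next_on_expiry() {
    lvl=$1
    if [ "$lvl" -ge 20 ]; then
        echo $((lvl - 10))    # high priorities lose 10 levels
    else
        echo $((lvl / 2))     # below 20 the level is halved (integer division)
    fi
}

# Follow a CPU-bound process down from the top time-sharing level.
level=59
chain=$level
while [ "$level" -gt 0 ]; do
    level=$(next_on_expiry "$level")
    chain="$chain $level"
done
echo "$chain"    # 59 49 39 29 19 9 4 2 1 0
```

A real table is free to use any expiry level in each row, so if you tune yours, replace the rule with a lookup of the second column.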

The third column names the priority level assigned to the process if it sleeps. Processes voluntarily giving up their time slices get rewarded with a higher priority. When a process is blocked, waiting for I/O, this return-from-sleep column also applies. In Figure 5, sleeping processes with low priority have their priorities raised by increments of 10 levels. Starting with level 40, priorities are raised by half the previous increment. So, sleeping processes that initially run occasionally, but with a large quantum, eventually rise to become frequently run processes, although with a small quantum.

A process at any priority level may have to wait for other, higher-priority processes. Column four sets the maximum time in seconds a process may wait for its turn to execute. If any process waits longer than this number of seconds, the process is compensated for being so patient. Column five contains the new priority level given to the delayed process. Typically, column five contains the same level numbers as column three, the sleep return column. In Figure 5, every priority level has a five-second maximum wait time. If any low-priority process is kept waiting for more than five seconds, its priority is increased 10 levels. The process will not necessarily execute right away if there are still many higher-priority processes, but this change increases its likelihood of getting executed. If the process is again forced to wait longer than five seconds, its level will increase by another 10. Eventually, such a process will rise to a top priority and execute, falling or rising from there according to its CPU usage.

Use the command

dispadmin -g -c RT

to see the real-time dispatcher parameter table. This table is described in rt_dptbl(4). An abbreviated version of it is shown in Figure 6. This table is far simpler than the time-sharing table. As before, each row represents one priority level, but there is only one column. Real-time processes are explicitly assigned one level and they stay there unless someone manually changes them.

Remember that the lowest priority real-time process, at level 0, still has a higher priority than every system and time-sharing process. Once a real-time process starts, little else on the system will get a chance. At real-time level 0, a process has a quantum of 1000. When that quantum expires without the process sleeping or blocking, the scheduler will look to see if there's another process above or at the same level. If not, this same real-time process will execute again because no system or time-sharing process is at a higher priority. Thus, once any real-time process starts, it drags the system away from any other work. The down side is, of course, that other system work suffers. On the up side, real-time processes get the full attention of the CPU, and finish soon, as they should -- otherwise they shouldn't use real-time priority. Ongoing real-time processes shouldn't run on the same system as other time-sharing processes.
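
To see why a real-time process always wins, it helps to convert each class-relative level to a global priority and pick the largest. The sketch below is a toy model, not the dispatcher's code: global_pri and pick_next are hypothetical functions, the offsets are the default class boundaries given earlier (TS from 0, system from 60, real-time from 100), and the process names are made up.

```shell
# global_pri CLASS LEVEL
# Convert a class-relative level to a global priority using the default
# class offsets described earlier (TS starts at 0, SYS at 60, RT at 100).
global_pri() {
    case $1 in
        TS)  echo $(( $2 )) ;;
        SYS) echo $(( $2 + 60 )) ;;
        RT)  echo $(( $2 + 100 )) ;;
        *)   echo "unknown class: $1" >&2; return 1 ;;
    esac
}

# pick_next: read "NAME CLASS LEVEL" lines on standard input and print the
# name of the runnable process with the highest global priority.
pick_next() {
    best_name=
    best_pri=-1
    while read -r name class level; do
        pri=$(global_pri "$class" "$level") || continue
        if [ "$pri" -gt "$best_pri" ]; then
            best_pri=$pri
            best_name=$name
        fi
    done
    echo "$best_name"
}

# A real-time process at level 0 still beats the top TS and SYS levels.
printf '%s\n' 'editor TS 59' 'pageout SYS 39' 'sampler RT 0' | pick_next
```

The pipeline prints sampler: its global priority of 100 outranks the system process at 99 and the time-sharing process at 59.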

If you want to tune your time-sharing or real-time dispatcher parameter table, use dispadmin -g to get the current table's settings and redirect the output to a file. Edit that file to use whatever settings you prefer. Don't add any new priorities because the new table must have the same number of priority levels as the original table the -g option showed. If you want a different number of priorities, you must change the relevant tunable parameters in the kernel or in the space.c file in either /etc/conf/pack.d/ts or /etc/conf/pack.d/rt, and then rebuild the kernel. When you're finished editing the dispatcher table file, use the dispadmin -s option.

For instance, if you want to edit the time-sharing table, you might execute:

dispadmin -g -c TS >ts_dptbl.new

After editing the file, you can set the new table in the kernel's space with the command:

dispadmin -s ts_dptbl.new -c TS
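
Because the new table must have the same number of priority levels as the original, a quick row count before loading it can save a failed dispadmin -s. The helper functions below are hypothetical, and they assume the saved file uses lines beginning with # for comments, a line beginning with RES= for the resolution, and one remaining non-blank line per priority level. Check that assumption against your own dispadmin -g output.

```shell
# count_levels FILE
# Count the priority-level rows in a table saved from dispadmin -g.
# Assumes comment lines start with "#", the resolution line starts with
# "RES=", and every other non-blank line is one priority level.
count_levels() {
    awk 'NF && $1 !~ /^#/ && $1 !~ /^RES=/ { n++ } END { print n + 0 }' "$1"
}

# same_levels ORIGINAL EDITED
# Succeed only when both files describe the same number of priority levels.
same_levels() {
    [ "$(count_levels "$1")" -eq "$(count_levels "$2")" ]
}

# Typical use (file names are placeholders):
#   dispadmin -g -c TS >ts_dptbl.orig
#   cp ts_dptbl.orig ts_dptbl.new      # then edit ts_dptbl.new
#   same_levels ts_dptbl.orig ts_dptbl.new && dispadmin -s ts_dptbl.new -c TS
```

Keeping an untouched copy of the original table also gives you something to reload if the new settings misbehave.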

Changing Priorities

The most important issue to decide when changing a process's priorities is probably the easiest to decide. Should the process be a time-sharing process or a real-time process? Typically, the answer to that question is time-sharing. But if you need to set an occasional real-time process, check whether your system is currently configured for real-time processing. Both dispadmin and priocntl can tell you this with their -l options. If you use dispadmin -l, only the class names appear. priocntl -l gives slightly more information, as shown in Figure 7.
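
A script can make the same check by scanning the class listing for a line that starts with the class name. This is a sketch under an assumption about the listing's layout, namely that each configured class appears as the first word on its own line; has_class is a hypothetical helper, so compare it with what your system actually prints.

```shell
# has_class NAME
# Read a class listing on standard input and succeed if NAME appears as the
# first word of any line. The listing layout is an assumption; compare it
# with what dispadmin -l or priocntl -l prints on your system.
has_class() {
    awk -v want="$1" '$1 == want { found = 1 } END { exit !found }'
}

# Typical use:
#   if priocntl -l | has_class RT; then
#       echo "real-time class is configured"
#   fi
```

Matching on the first word only keeps a description such as "(Real Time)" elsewhere on a line from producing a false match.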

If your system doesn't have the real-time class and you want it, you must remake the kernel. Edit /etc/conf/pack.d/ts/space.c and find a line reading

EXCLUDE:RT

SVR4 automatically includes the real-time class by default. If someone excluded it, find out why before you enable it again. To enable the real-time class, replace the exclude line with

INCLUDE:RT

Rebuild the kernel and reboot. The class should appear in the next dispadmin -l or priocntl -l option output. You can do the same thing to eliminate the time-sharing class for systems that you want to dedicate to real-time processing, but given real-time's privileged priorities, there are not many reasons for eliminating the time-sharing class.

priocntl has three primary options: -d, to display scheduling parameters; -s, to set scheduling parameters; and -e, to execute a command using specific scheduling parameters. Anyone may display process parameters, but users can set parameters only for their own processes. Of course, administrators may set parameters for other users' processes when they have root permission.

When displaying or setting scheduling parameters, using an -i option followed by a keyword identifies which kinds of process information to apply to the list. After the keyword comes a list of ID numbers associated with that keyword. The obvious keywords use the pid or the ppid, but pgid and sid come into play here also. This is where adding the ps -j option, which shows those values, may be helpful. Most of the time, select processes are changed using one of these keywords, but you could use the uid or gid keywords to change all of the processes associated with a particular user or a group of users. Use the ID numbers as they appear in the /etc/passwd(4) or /etc/group(4) files, not the user or group names.

Use a -c option to specify RT or TS class. If you don't use the -c option, all of the ID numbers named must be in the same class. Each of those -c classes has special options associated with it. RT class options let you set the process's priority level and quantum. TS class options let you set the priority level and the user-changeable limit for the processes identified. Using the -c option, you can change a running process from one class to another.
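
A few command lines show how these options combine. Treat them as a sketch: the -p option letter for the class-specific priority comes from the priocntl(1) manual page on SVR4-derived systems rather than from this article, so confirm it on your own system, and all ID numbers are placeholders. The hypothetical run wrapper prints each command instead of executing it unless you set DRY_RUN=0.

```shell
# run CMD [ARG...]
# Print the command instead of executing it unless DRY_RUN=0 is set, so the
# examples can be reviewed safely before they touch real processes.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Display the scheduling parameters of every process in process group 215.
run priocntl -d -i pgid 215

# Lower the time-sharing user priority of all processes owned by uid 1001.
run priocntl -s -c TS -p -10 -i uid 1001

# Move process 1234 into the real-time class at level 5.
run priocntl -s -c RT -p 5 -i pid 1234
```

Reviewing the printed commands first is worthwhile because a mistyped ID list with -i uid or -i gid can change many processes at once.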

Finally, the -e option lets you execute processes in either the TS or RT class, applying other class options as needed. This is similar to the nice command's operation, but you have more control. With the -s option simulating renice but exceeding renice's control, the priocntl program puts all of the scheduling operations into one place. Use priocntl to adjust priorities where necessary or launch programs with the appropriate priority. Programmers at your site may use the priocntl(2) function to embed this same control in their programs. Judicious use of priocntl and sleep will improve your system's performance.

About the Author

Larry Reznick has been programming professionally since 1978. He is currently working on systems programming in UNIX, MS-DOS, and OS/2. He teaches C, C++, and UNIX language courses at American River College and at the University of California, Davis extension.