Customer Support - The View from the Other End of the Phone
Whether we like it or not, part of a system administrator's job is handling users' requests. Using a broad definition of "customer," we can look upon users as customers and consider customer support part of system administration. In doing customer support, system administrators receive problems from users/customers, determine the nature of the problem, and produce a solution. When system administrators have problems, however, who do they turn to? That's right, customer support - vendor customer support. A support contract and an 800 number have saved the hide of many a system administrator. Just as the user may not see exactly what the system administrator does to solve the problem, the system administrator may not see all the steps and work that the customer support engineer does at the other end of the phone. This article will give a brief look into the other side of the 800 number, and what a customer support engineer does in the process of solving a problem. It will describe what outcomes you can expect from support, what information you should have ready when opening a support case, and what you can do to help solve your support case.
How Customer Support Is Organized
Although different vendors may organize customer support in various ways, the task is usually broken down into two levels - Front-Line and Back-Line. Front-Line support is composed of the people who answer the 800 number. It's their job to open a case, obtain information, and determine whether they can solve the problem themselves. Front-Line's job is to fix the problem or answer the question as quickly as possible. To do this, they use a database of past calls and solutions, online versions of the manual and man pages, a database of bugs, and other tools. If the problem is too complicated, beyond their expertise, or requires a fair amount of research, it is passed to the Back-Line support persons.
Back-Line support is composed of people with mid-level to senior-level system administration experience. They may specialize in specific parts of an operating system or in specific applications. It is Back-Line's job to pick up the case where Front-Line left off. Back-Line people gather more information, work in the lab testing the problem on systems, and go over the source code. Back-Line support cases can take days, weeks, or even months to solve. The more difficult the case, the longer it will stay open. Back-Line is also known as the "last-line". Once a problem comes to Back-Line support, it will stay there until it is resolved. Back-Line support can pass the problem to Development Engineering, say in the case of a bug, but Back-Line will still own the case and keep track of the problem's progress.
What Customer Support Does
For the purposes of this article, I'm going to lump both Front-Line and Back-Line support into one group. I will cover the process used to resolve a case as one general process with one owner.
Collect Problem Information - The first thing to do is collect as much information about the problem as possible. We need to get the full scope of the hardware and software configuration, what the exact problem is (an error message, a certain behavior, etc.), what seems to have caused the problem (I've heard "We haven't changed a thing" too many times), and any other data we think will help. This step is really an ongoing process. As we try different theories of what the problem might be, we will often ask for additional information as the case progresses.
One default question, which can be a little overused, is whether the system is at the latest recommended patch level. Each vendor designates certain patches as "recommended" and suggests that they be installed on all systems to keep them current. Although this may sound like a way of putting off the customer, it does serve a purpose. It puts the machine in a "known state" that support staff can work with.
Most customer support organizations have a case- or call-tracking program that documents work performed on each case. Call-tracking programs also allow the case database to be searched for known solutions. Thus, the first step in finding a solution to a case is searching the database for similar problems. This database also lets less-experienced support staff solve cases that would otherwise be outside their experience, learning as they go. The problem with case databases, however, is that they require clear and consistent phrasings of the problem's descriptions and solutions. Phrase the question wrong and you might not find the answer.
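To make the phrasing problem concrete, here is a minimal sketch of a case search that maps different wordings to one canonical term before matching. It is not how any particular vendor's call-tracking program works. The synonym table, the case IDs, and the case descriptions are all made up for illustration.

```python
import re

# Hypothetical synonym table: maps variant wordings to one canonical term.
SYNONYMS = {
    "hang": "hung", "hangs": "hung",
    "freeze": "hung", "freezes": "hung", "frozen": "hung",
    "crashes": "crash", "crashed": "crash",
}

def normalize(text):
    """Lowercase the text, split it into words, and map variants to canonical terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {SYNONYMS.get(word, word) for word in words}

def search_cases(query, cases):
    """Return case IDs ranked by how many normalized terms they share with the query."""
    terms = normalize(query)
    scored = []
    for case_id, description in cases.items():
        overlap = len(terms & normalize(description))
        if overlap:
            scored.append((overlap, case_id))
    # Highest overlap first; ties are broken by case ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [case_id for _, case_id in scored]

# Made-up sample cases, for illustration only.
CASES = {
    "C-1001": "System hangs during nightly backup",
    "C-1002": "Printer queue rejects jobs",
}

print(search_cases("server freezes", CASES))
```

Without the synonym table, a search for "server freezes" would miss the case that was logged as "System hangs", which is exactly the failure described above.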
Another support resource used in conjunction with the case database is a bug database. As bugs are reported to a vendor, they are usually given a number and tracked. The end result of a bug report is a bug fix in the form of a patch. If a customer calls in with a known bug, the customer support person can say "Yes, that is a known problem, and it is fixed in patch number XXX", or "It is slated to be fixed in patch YYY." Often, there is a workaround that lets the customer get the job done.
If we have some idea of what the problem might be or we want to determine the scope of the problem, we'll suggest trying a few things, such as different options of a command, getting output from some analysis commands, changing a configuration file, and so on. Sometimes, performance-based problems can be solved this way. For example, if you are having network speed issues, we may suggest changing specific network settings to see whether the problem goes away.
One way to determine whether the problem is isolated to a single machine or exists across multiple machines is to try it in the lab. Most support organizations have some sort of lab where support persons can try different hardware platforms, different software settings, and so on. If the problem is not reproducible in the lab, then the problem is likely a configuration issue with the customer's machine. If, however, the problem can be reproduced, it is more likely evidence of a bug.
Another possibility is that the application is not being used properly, and the lab results show that given the exact same conditions the same problem occurs. Sometimes the manual does not cover what behavior should be expected from an application given a certain set of conditions. Even though a lab system reproduced the problem, further consideration could show that a particular interpretation of the manual could be wrong and that the behavior is okay.
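The reasoning in the last two paragraphs can be summed up as a small decision function. This is only a sketch of the rule of thumb described here; the function name and the wording of the results are my own, and real cases are rarely this clean.

```python
def interpret_lab_result(reproduced, matches_intended_behavior=False):
    """Suggest where to look next, based on a lab reproduction attempt.

    reproduced: True if the lab system showed the same problem.
    matches_intended_behavior: True if, on closer reading, the observed
        behavior turns out to be what the application is supposed to do.
    """
    if not reproduced:
        # Only the customer's machine shows the problem.
        return "likely a configuration issue on the customer's machine"
    if matches_intended_behavior:
        # Reproducible, but the behavior is correct; the manual was misread
        # or does not cover this set of conditions.
        return "behavior is as designed; review usage and the manual"
    return "likely a bug"

print(interpret_lab_result(reproduced=True))
```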
Use the Source, Luke
Examining the source code to see where an error message originates can be helpful in determining what conditions will generate the error message, thereby pointing toward the source of the problem. The source code can also point out where further research needs to be done, or it can show that the problem is a coding bug.
As much as each case is different and takes its own path, the above steps form the basis of the case-resolution process. Sometimes a support person will skip a step, such as checking for previous cases, and the case may lag until that step is remembered. I try to follow these steps methodically so that I don't perform unnecessary work or unnecessarily delay the customer.
Once a support case is opened, it must reach some sort of resolution - the faster, the better for all involved. Not all cases are resolved to the full satisfaction of the customer, and it's not always easy to give the customer bad news. Here are the various end results of cases:
Fixed - This is the best resolution for all. The problem has been found and fixed, either in an existing patch or in changing how the system is configured.
Bug and Patch Requested - This means that the problem is a bug and is either being worked on or going to be worked on. It may take some time, but the problem will be fixed, depending upon the severity of the problem and the scope of affected users. In most cases, the customers are happy knowing that a fix is forthcoming.
Bug with Workaround - This means that the problem is a bug, which may or may not be fixed in the future, but there is a way to get around the bug. Some customers are willing to accept the limitation and the workaround. Others are less enthusiastic about workarounds.
Bug and Won't Fix - Some bugs are fairly trivial and may have a simple workaround. This means that they are fairly low priority and a business decision has been made not to fix them. These are usually nuisance problems, such as incorrect man pages, spelling errors in application menus, and so on. It's not easy to tell customers "no", but it is sometimes necessary.
Unsupported Hardware - Most vendors specify which third-party hardware they have tested and will support. Supported means that if the vendor says that the third-party hardware will work with their product, they are on the hook to make it work. Sometimes, support may be limited to a particular firmware revision of the third-party hardware, meaning that newer and older hardware may not be supported.
Improper Usage - This result is usually encountered when customers try to push a system past its design parameters. For example, don't buy a desktop PC and expect it to become the backup server for the entire network.
Support personnel will need information besides your support contract number to start working on your problem. Here is a checklist that will help you be more prepared.
OS Version - The exact version number of the operating system you are using.
Hardware Configuration - To understand the problem, we must understand the setup of the computer. Details that seem unrelated to the problem may be important.
Patch List - A patch list can tell us if the first order of business is to install a newer patch. If a patch list is provided, we can look up existing problems, see what patches they are fixed in, and see whether that patch is installed. Sometimes the easy answer is to just install a newer patch. Problems can also be introduced by patches.
Exact Error Message - If the problem is the type that generates an error message, then getting the exact error message can help us look through our case and bug databases for previous occurrences of the error. If that fails, it can help us know exactly what program is generating the error and even find the error in the source code.
What Preceded the Problem - The steps leading up to a problem can be almost as important as the problem description. Knowing them also helps if we need to recreate the problem in the lab. It helps, too, to know whether the problem is repeatable or has only happened once. If you can come up with a test case that outlines the exact steps required to recreate the problem, this is extremely helpful.
Core Dump - If the system crashes (panics), and a core file is generated, then getting this to us can really help in figuring out what is going on. Be aware that core dumps can be large. If you don't have a good Internet connection, sending the core file via tape is probably about the best way to go.
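Much of this checklist can be gathered before you pick up the phone. The following is a minimal sketch, assuming Python is available on the affected system; the function name and report layout are my own invention. Note that `platform.uname()` reports only the OS release and machine architecture, not the full hardware configuration, and the patch list must come from your vendor's own patch-listing command, so it is passed in by the caller here.

```python
import platform

def build_case_report(error_message, steps, patch_list=None):
    """Assemble the checklist items into one plain-text report."""
    info = platform.uname()
    lines = [
        f"OS: {info.system} {info.release} ({info.version})",
        f"Hardware (architecture only): {info.machine}",
        # The patch list is vendor-specific; flag it loudly if it is missing.
        "Patch list: " + (", ".join(patch_list) if patch_list else "NOT PROVIDED"),
        f"Exact error message: {error_message}",
        "Steps leading up to the problem:",
    ]
    # Number the steps so support can refer to them individually.
    lines += [f"  {n}. {step}" for n, step in enumerate(steps, start=1)]
    return "\n".join(lines)
```

Pasting the resulting text into the case when you open it saves a round of questions, and the numbered steps double as a starting point for a test case.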
What You Can Do To Help
From my years of both doing customer support and using it, I know that it can be tempting to pick up the phone and call customer support at the first hint of a problem. Sometimes, the better the customer support, the easier it is to make the call. Before you jump on the phone, however, there are a few things that you can do to try to solve the problem yourself or, at the very least, provide additional information for us to deal with.
Read the Manual - As fundamental as this may sound, many support calls do come in from users who have not read the manual. The majority of these are from new or novice system administrators, and they just have not taken the time to go through the system administration books and want a quick answer.
Get Training - If you are unfamiliar with a specific application that your company is relying on, do the right thing and get some training. Don't rely on the vendor to provide support on how to use the application.
Read OS and Patch Release Notes - Most patch release notes describe the bugs that are fixed by the patch. In some cases, the release notes will suggest other options if the problem still exists. If you load patches from a patch set, you may not know exactly what bugs are fixed, so reading the release notes is fairly important.
Report System Status Accurately - Is it crashed, hung, or busy? Crashed means that the system has panicked and produced a core dump file and a panic message on the console. Hung means that the system is totally unresponsive but has not panicked. Busy means that the system is so busy that it cannot respond to users or the network. Crashed, hung, and busy are all very different, and we approach each symptom differently. Reporting a busy system as a crash will cause us to start down the wrong path.
Find the Scope of the Problem - See what other settings or factors seem to affect the problem. If you have a network problem getting to a specific server, can your system get to any other servers? If a certain command-line option causes an error, does it happen when you add another option, or remove a different option? Knowing when the problem does not occur is also very useful.
Use Analysis Tools - Most versions of UNIX come with a variety of system analysis tools that you can use to scope out the problem yourself. par is handy for figuring out exactly what is going on inside a program. sar, top, and ps are useful for determining system usage and what is going on with the system. For network problems, snoop is the key tool to use. These are the same tools that we would use to solve the problem.
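The scoping question raised above (if you cannot get to one server, can your system get to any others?) is easy to answer with a short script. This is a minimal sketch rather than a replacement for the tools just mentioned: it only checks whether a TCP connection can be opened, and the function name and the hosts you pass in are up to you.

```python
import socket

def probe(targets, timeout=3.0):
    """Try a TCP connection to each (host, port) pair and report which succeed."""
    results = {}
    for host, port in targets:
        try:
            # create_connection resolves the name and connects, or raises OSError.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            # Covers refused connections, timeouts, and name-lookup failures.
            results[(host, port)] = False
    return results
```

Calling `probe` with the problem server plus a few others gives a quick table of what is reachable. If only one target fails, the problem is probably on that server or the path to it; if all of them fail, look at your own machine first.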
I've approached the issue of tech support from my end of the phone and provided tips about what you can do to help research and solve your problem. You can also apply these techniques when dealing with the users you support, helping them to help you solve their problems. Good communication is at the heart of most problem resolutions, and informing your users of your procedures will go far toward making your life as a system administrator more productive and enjoyable.
About the Author
Timothy Swenson is a UNIX contractor with Taos Mountain, in Silicon Valley. After receiving a B.S. in Computer Science from Cal State Hayward, Tim spent 8 years as an Air Force Officer, doing sys admin and acquisition work. Tim is also the editor of the QL Hacker's Journal and writes for other Sinclair publications. He can be reached at firstname.lastname@example.org.