
Web Caching: Why the Continued Interest?

Ron McCarty

“World Wide Wait”, “56-kbps...subject to line conditions”, “It's the network, not our server”. All these statements are used in connection with Internet bandwidth. Luckily for most of us, high-speed Internet access is very affordable, with T1 (1.544 Mbps) becoming almost the minimum for medium and large organizations. Even so, caching technology has remained popular with many organizations. Why is this?

Before looking at some of the reasons cache servers have remained popular, I'll briefly review the technology.

Overview

A cache server works by listening for requests for particular Web pages. When the cache receives a request, it checks to see whether it has a current copy of the page. If it does, the content is delivered to the browser for presentation. If the page is not present, the cache server requests the page on behalf of the user, returns it to the user, and stores it for future requests. The first request is unlikely to see a speed increase, and may actually be slower than a direct connection, but later requests for the same page will be much quicker.
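The hit-or-miss logic described above can be sketched in a few lines of Python. This is only an illustration: the SimpleCache class and the fake_origin function are hypothetical names, the store is a plain in-memory dictionary, and the freshness check a real cache performs (deciding whether its copy is still current) is deliberately left out.

```python
from typing import Callable, Dict


class SimpleCache:
    """Toy in-memory page cache keyed by URL (hypothetical, for illustration)."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin      # called only on a miss
        self._store: Dict[str, bytes] = {}   # URL -> cached page body
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        # Hit: answer from the local store without contacting the Web server.
        if url in self._store:
            self.hits += 1
            return self._store[url]
        # Miss: fetch on behalf of the user, keep a copy, then return it.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body


def fake_origin(url: str) -> bytes:
    """Stand-in for a real Web server request (assumption for the demo)."""
    return f"<html>content of {url}</html>".encode()


cache = SimpleCache(fake_origin)
cache.get("http://www.example.com/")   # first request: miss, goes to origin
cache.get("http://www.example.com/")   # second request: served from cache
print(cache.hits, cache.misses)        # prints: 1 1
```

The first call pays the full cost of the origin fetch plus the bookkeeping, which is why a first request through a cache can be slightly slower than a direct connection.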

Cache servers have often been integrated with proxy servers, or proxy servers have often included caching capability, depending on your perspective. Proxy servers are often used to enhance security policy (all http requests through a firewall will be from one address, for example), and in smaller organizations have acted as the firewall and network address translator to connect multiple hosts to the Internet through one address. Since the proxy is already requesting pages on behalf of the client, it makes sense to cache those requested pages for subsequent requests.

Current Technology

Two basic methods are used by cache servers: application-aware and transparent. An application-level cache is technically seen as a proxy server by the browser: the client browser is not aware that caching is happening, only that it is using a proxy server. Figure 1 shows the configuration parameters possible for Netscape Communicator. In Figure 1, ftp and secure socket layer (noted as “security”) proxying is possible and is supported by most cache servers as well as proxy servers. Cache servers can also cache ftp downloads. Secure socket layer connections, since they are encrypted, are not cached, so the cache only proxies those requests.
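The same application-aware arrangement applies to any client, not just a browser. As a minimal sketch, the Python standard library's urllib can be pointed at a proxy for http, https, and ftp in much the same way as the browser settings. The host name proxy.example.com and port 3128 are assumptions made up for this example, and the make_proxy_opener helper is a hypothetical name.

```python
import urllib.request

# Hypothetical proxy address; substitute your own cache server and port.
PROXY = "http://proxy.example.com:3128"


def make_proxy_opener(proxy_url: str = PROXY) -> urllib.request.OpenerDirector:
    """Build a URL opener that sends http, https, and ftp requests via a proxy."""
    handler = urllib.request.ProxyHandler({
        "http": proxy_url,
        "https": proxy_url,   # encrypted traffic is relayed, not cached
        "ftp": proxy_url,
    })
    return urllib.request.build_opener(handler)


opener = make_proxy_opener()
# opener.open("http://www.example.com/") would now be sent through the proxy.
```

As with the browser, the client only knows that it is talking to a proxy. Whether that proxy also caches is invisible to it.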

A transparent cache server listens for http requests and then caches the page for future use. (This configuration requires that the packets be routed through the cache server, which can be achieved on a firewall with a port redirect, or on a router with policy-based routing.) When other clients request the page, the transparent cache answers with the page like the application-aware proxy; however, in this case, the client believes it is communicating with the Web server when in fact the cache server is spoofing addresses. It replaces the client's address with its own to make the request to the server, then substitutes the Web server's address for its own in the reply packets and forwards them to the client as shown in Figure 2.

Transparent cache servers are very popular in large organizations where reconfiguration of a large number of browsers is not feasible. Cache servers can also support distributed models to enhance network performance based upon an organization's needs. For example, assuming your ISP is running a cache server that uses the same caching protocol as your server, you can configure your cache server to query the upstream cache before seeking the page itself. Of course, this same topology can be used internally. An example of internal use would be where the human resource department is firewalled off and has a dedicated cache server. Its cache server could be configured to query the organizational cache server before fetching the page. Of course, the topology does not have to flow upstream -- cache servers can also act as peers to each other. Each cache server will query the other before requesting the page from the Web server.
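The lookup order in such a hierarchy can be sketched as follows. This is a simplified model under stated assumptions, not how any particular product works: CacheNode is a hypothetical class, neighbors are asked only whether they already hold the page (a real upstream cache would normally fetch the page itself on a miss), and the origin fetch is a stand-in function.

```python
from typing import Callable, Dict, List, Optional, Tuple


class CacheNode:
    """Toy cache that consults neighbor caches before the Web server."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}
        self.neighbors: List["CacheNode"] = []   # upstream caches or peers

    def lookup_local(self, url: str) -> Optional[bytes]:
        return self.store.get(url)

    def get(self, url: str,
            fetch_from_origin: Callable[[str], bytes]) -> Tuple[bytes, str]:
        """Return (page, source), where source names who supplied the page."""
        body = self.lookup_local(url)
        if body is not None:
            return body, self.name
        # Ask each neighbor before going out to the Web server.
        for neighbor in self.neighbors:
            body = neighbor.lookup_local(url)
            if body is not None:
                self.store[url] = body
                return body, neighbor.name
        body = fetch_from_origin(url)
        self.store[url] = body
        return body, "origin"


def origin(url: str) -> bytes:
    """Stand-in for the real Web server (assumption for the demo)."""
    return b"page from origin"


hr = CacheNode("hr-cache")
org = CacheNode("org-cache")
hr.neighbors.append(org)
org.store["http://www.example.com/"] = b"page held by org-cache"

print(hr.get("http://www.example.com/", origin)[1])   # prints: org-cache
print(hr.get("http://www.example.com/", origin)[1])   # prints: hr-cache
print(hr.get("http://www.example.org/", origin)[1])   # prints: origin
```

Making two nodes list each other as neighbors gives the peer arrangement; listing only the upstream node gives the hierarchical one.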

To support a distributed model, the cache servers involved need to speak the same protocol. Let's cover some popular offerings, as well as the cache protocols they support.

Squid (http://squid.nlanr.net/) uses the Internet Cache Protocol (ICP) developed for Squid and its commercial cousin NetCache (http://www.netapp.com/). ICP is an open standard documented in RFC 2186 (http://www.ietf.org/rfc/rfc2186.txt). An introduction to distributed caches using ICP can be found at:

http://squid.nlanr.net/Doc/Hierarchy-Tutorial/
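To give a feel for how lightweight an ICP query is, the sketch below builds one in Python following the packet layout given in RFC 2186: a 20-byte header (opcode, version, message length, request number, options, option data, sender host address) followed, for a query, by a 4-byte requester address and the null-terminated URL. The function names are hypothetical, the address fields are simply zeroed as a simplification, and nothing is actually sent. A real cache would send the packet to its neighbor over UDP.

```python
import struct

ICP_OP_QUERY = 1   # opcode for a query, per RFC 2186
ICP_VERSION = 2

# opcode, version, message length, request number,
# options, option data, sender host address = 20 bytes in network byte order
HEADER = struct.Struct("!BBHIII4s")


def build_icp_query(url: str, request_number: int) -> bytes:
    """Build an ICP query packet for url (address fields zeroed for brevity)."""
    requester = b"\x00\x00\x00\x00"                 # requester host address
    payload = requester + url.encode("ascii") + b"\x00"
    length = HEADER.size + len(payload)
    header = HEADER.pack(ICP_OP_QUERY, ICP_VERSION, length,
                         request_number, 0, 0, b"\x00\x00\x00\x00")
    return header + payload


def parse_icp_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte ICP header of a packet."""
    opcode, version, length, reqnum, options, option_data, sender = \
        HEADER.unpack_from(packet)
    return {"opcode": opcode, "version": version, "length": length,
            "request_number": reqnum, "options": options,
            "option_data": option_data, "sender": sender}


packet = build_icp_query("http://www.example.com/", request_number=7)
print(parse_icp_header(packet)["length"] == len(packet))   # prints: True
```

The neighbor answers with a similarly small packet carrying a hit or miss opcode and the same request number, which lets the querying cache match replies to outstanding queries.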

Squid is not limited to ICP -- it supports Cisco's Web Cache Communication Protocol and Microsoft's Cache Array Routing Protocol, as well as several others. Squid will run on pretty much any modern UNIX operating system. Squid is often the choice for ISPs and comes with many distributions of Linux and *BSD systems.

Cisco's Cache Engine:

http://www.cisco.com/warp/public/cc/cisco/mkt/scale/cache/index.shtml

supports a proprietary protocol called Web Cache Communication Protocol (WCCP):

http://www.cisco.com/univercd/cc/td/doc/product/iaabu/webcache/ce20/ver20/wc20wcc2.htm

Cisco allows other vendors access to the protocol, but you should keep interoperability in mind and check with your vendor if you foresee a need for distributed cache servers. Both Inktomi's Traffic Server (mentioned below) and Squid support Cisco's WCCP. Many organizations have large investments in Cisco equipment and Cisco service agreements, so the Cisco Cache Engine could be the system of choice based upon brand name.

Microsoft has also chosen a proprietary protocol for cache communication, called Cache Array Routing Protocol (CARP), which is supported on MS Proxy:

http://www.microsoft.com/proxy/default.asp

Microsoft's protocol has been designed with built-in intelligence to support multiple cache servers that keep track of each other's hits as well as ensuring that connectivity is available to all cache servers within the distributed cache (referred to as a cache array by Microsoft). The protocol uses remote procedure calls, so placement outside a firewall should be considered very carefully. Hardening the host becomes much tougher when RPC must be available.

Apache (http://www.apache.org/), well known as the most popular Web server on the Internet, also has caching support (http://www.apache.org/docs/mod/mod_proxy.html) and should be a definite consideration for organizations seeking caching technology that already have Apache know-how. Apache is well known in the UNIX world, but it has also been ported to NT.

Inktomi's Traffic Server:

http://www.inktomi.com/products/network/traffic/product.html

available for both UNIX and NT operating systems, has an impressive list of customers:

http://www.inktomi.com/products/network/traffic/customers.html

but do keep in mind that the cache market does not have the marketing visibility of many other commercial products found in edge networks such as firewalls and routers. The Traffic Server is available for commercial versions of UNIX as well as NT.

Cobalt's Cache Cube:

http://noram.cobaltnet.com/products/cache/index2.html

is a turn-key cache solution. In addition to its proprietary InstaCache Cluster technology, it also supports ICP.

Cache Popularity

Now that I've covered the technology, I will address why cache servers have remained popular despite dramatic decreases in data communication costs. Although most cache vendors focus on the money that will be saved by a cache server, administrators' arithmetic may not add up to what vendors claim. As a result, administrators are often turned off the technology because the sales tactic emphasizes savings that may never be realized while adding yet another system to be managed. Administrators who take the extra time to study the technology, however, realize that there are very good reasons to deploy cache servers.

Small businesses often use proxy servers to connect multiple machines to the Internet, so a cache server is a natural choice -- why turn off the cache portion when the cache will hardly be noticed and will likely speed up access for a small percentage of Web requests? There are no added capital costs to the company, nor likely any additional management costs.

Medium and large organizations have many needs that can be met by a cache and proxy server. Many firewalls are used to log http requests for analysis. Off-loading that logging to an appropriately sized cache server removes the logging load, and with it one disk-I/O-bound bottleneck, from the firewall. This can bring new life to an overworked firewall.

Many businesses are using internal firewalls to secure areas such as human resources and research and development. Internal firewalls are also becoming popular to limit the network resources reachable through remote-user access routers and modem pools. In those cases, a distributed cache server model can provide both caching and logging capability. Where packet filters are used, the cache server's logs can also enhance monitoring by providing the detailed reporting that packet filters often do not support. Additionally, a cache server can restrict access to particular sites. In other words, connections to specific adult sites can be stopped by the cache server.
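Site blocking of this kind usually comes down to comparing the requested host name against a list. The sketch below shows one simple way to do that in Python. The domain names in BLOCKED_DOMAINS are placeholders, the is_blocked function is a hypothetical name, and real products typically offer richer matching (patterns, categories, time of day) than this.

```python
from urllib.parse import urlsplit

# Placeholder block list; a real list would come from local policy.
BLOCKED_DOMAINS = {"blocked.example", "adult.example"}


def is_blocked(url: str, blocked=BLOCKED_DOMAINS) -> bool:
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain)
               for domain in blocked)


print(is_blocked("http://www.adult.example/page.html"))   # prints: True
print(is_blocked("http://www.example.com/"))              # prints: False
```

Matching on the dot-prefixed suffix rather than a bare substring keeps an unrelated host such as notadult.example from being caught by the adult.example entry.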

Overseas sites are often ideal candidates for cache servers because Internet billing is often packet based -- a 5% hit rate can mean a direct impact on the cost of the Internet access for a company that has high Internet utilization. Keep in mind that many countries' laws limit companies on what or how they monitor employees' Internet use.
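A quick back-of-the-envelope calculation shows why even a small hit rate matters under volume-based billing. The figures below (200 GB per month at 50 currency units per GB) are invented purely for illustration, and the sketch assumes the 5% figure is a byte hit rate, that is, the fraction of traffic volume served from the cache rather than the fraction of requests.

```python
def monthly_savings(traffic_gb: float, cost_per_gb: float,
                    byte_hit_rate: float) -> float:
    """Estimate the monthly cost avoided by serving part of the traffic locally."""
    if not 0.0 <= byte_hit_rate <= 1.0:
        raise ValueError("byte_hit_rate must be between 0 and 1")
    # Traffic answered from the cache never crosses the billed link.
    return traffic_gb * cost_per_gb * byte_hit_rate


# Hypothetical figures: 200 GB/month, 50 per GB, 5% byte hit rate.
print(f"{monthly_savings(200, 50.0, 0.05):.2f}")   # prints: 500.00
```

Because a request hit rate and a byte hit rate can differ noticeably (small, popular objects are cached more often than large ones), it is worth checking which figure a vendor is quoting before running this arithmetic.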

Cache servers can also be used to solve Intranet sizing issues. Cache servers have the capability to rewrite URLs to point the request to a different Web server. This feature can be utilized to distribute mirrored Intranet sites, yet keep the same name for the site globally. For example, an overseas operation wishes to connect to intra.internalnetwork.com, your Intranet knowledge base. The overseas cache server rewrites the URL to point to overseas.intra.internalnetwork.com, and the user views the local copy of the Web server. This control can be very granular to make exceptions for dynamic pages or other requirements. Creative use of internal caching can often have much better returns than Internet caching.
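The rewrite itself is a small piece of logic. The following Python sketch uses the host names from the example above. The REWRITE_MAP table, the /cgi-bin/ exception for dynamic pages, and the rewrite_url function are assumptions made for illustration and do not reflect any particular cache server's rewriting interface.

```python
from urllib.parse import urlsplit, urlunsplit

# Global name -> local mirror, as seen from the overseas cache server.
REWRITE_MAP = {
    "intra.internalnetwork.com": "overseas.intra.internalnetwork.com",
}

# Hypothetical exception: dynamic pages always go to the central server.
PASS_THROUGH_PREFIXES = ("/cgi-bin/",)


def rewrite_url(url: str) -> str:
    """Point requests for the global Intranet name at the local mirror."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    mirror = REWRITE_MAP.get(host)
    if mirror is None or parts.path.startswith(PASS_THROUGH_PREFIXES):
        return url   # not a mirrored site, or an excepted path
    netloc = mirror if parts.port is None else f"{mirror}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path,
                       parts.query, parts.fragment))


print(rewrite_url("http://intra.internalnetwork.com/kb/index.html"))
# prints: http://overseas.intra.internalnetwork.com/kb/index.html
```

The user keeps typing the same global name everywhere, and only the table on each site's cache server decides which copy is actually served.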

Caching is often unpopular among marketing specialists and Web developers because cache servers can distort the number of hits a site receives. Although I cannot cover the issues here, you might wish to check out http://www.mnot.net/cache_docs/ as a starting point to help ease selling cache technology to your users.

Caching, if well understood and properly implemented, is a good technology to have in your net belt.

About the Author

Ronald McCarty received his bachelor's degree in Computer and Information Systems at the University of Maryland's international campus at Schwaebisch Gmuend, Germany. After completing his degree, Ronald McCarty started his network career as network administrator at the Schwaebisch Gmuend campus. Ronald McCarty currently works for Software Spectrum, Inc. as a network engineer in the IT&S Network Services Project's team. He spends his free time with his two best friends in the world: his daughter, Janice, and his wife, Claudia. Ron can be reached at: mccarty@my-own-domain.to.