Caching the Web

Improve your users’ browsing and save your bandwidth by using proxy servers to cache web pages.

David Guerrero <david@boe.es>
January 1999




Some basic concepts

The web is everywhere. Everyone uses it. Everybody talks about it. But in this less-than-perfect world, you know there are problems. Bandwidth is a problem. Web document latency (the time a document takes to arrive at your browser once its URL is requested) is a problem. As more of your bandwidth is used, the latency of documents retrieved from the Internet increases.

Bandwidth is expensive, perhaps the most expensive element of an Internet connection.

Despite the fact that the web is growing fast, the same documents get requested and the same web sites are visited repeatedly. We can take advantage of this to avoid downloading redundant objects. You would be surprised to learn how many of your users read the NBA.COM web pages, or how many times the GIFs from Altavista cross your line.

Even if you know nothing about web caching, you are probably using it with your web browser.

Most common browsers use this approach with the documents and objects you retrieve from the Web, keeping a copy of recent documents in memory or on disk. Each time you click on the "back" button or visit the same page, that page is in memory and does not need to be retrieved. This is the first level of caching, and the technique can be expanded to the entire web.

The basic idea behind caching is to store the documents retrieved by one user in a common location, and thus avoid retrieving the same document for a second user from its source. Instead, the second user gets the document from the common place. This is very important when you deal with organizations in Europe, where most of the inbound traffic comes from the other side of the Atlantic, frequently across slow links.
Figure 1
The main benefit of this approach is the fact that your users’ browsing is now collaborative, and a significant number of the documents your users retrieve are served in a very short time. In a medium-sized organization (between 50 and 100 users), you can serve up to 60% of URL requests from the local cache.
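
The idea of a shared cache can be sketched in a few lines of JavaScript. This is only an illustration, not how any real proxy is implemented: the SharedCache class and the fetchFromOrigin callback are hypothetical names, and the "origin server" is simulated by a function that counts how often it is called.

```javascript
// Minimal sketch of a shared cache. fetchFromOrigin is a hypothetical
// stand-in for a real HTTP request to the origin server.
class SharedCache {
  constructor(fetchFromOrigin) {
    this.fetchFromOrigin = fetchFromOrigin;
    this.store = new Map(); // URL -> document body
    this.hits = 0;
    this.misses = 0;
  }

  get(url) {
    if (this.store.has(url)) {
      this.hits += 1; // served locally, no Internet traffic
      return this.store.get(url);
    }
    this.misses += 1; // first request: go to the source
    const body = this.fetchFromOrigin(url);
    this.store.set(url, body);
    return body;
  }

  hitRatio() {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

// Two users ask for the same page; only the first request reaches the origin.
let originRequests = 0;
const cache = new SharedCache((url) => {
  originRequests += 1;
  return `contents of ${url}`;
});
cache.get("http://www.example.com/index.html"); // user A: miss
cache.get("http://www.example.com/index.html"); // user B: hit
console.log(originRequests, cache.hitRatio());
```

With two requests for the same URL, the origin is contacted once and half of the requests are served from the cache.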

The difference between a browser cache and a proxy-cache server is that the browser cache works for only one user and is located on the end user’s workstation, while the proxy-cache server is a program that acts on behalf of a number of web browser clients, allowing one client to read documents requested earlier by others. This proxy-cache server runs on a shared machine that usually lies between the local network and the Internet. All browsers request documents from the proxy server, which retrieves the documents and returns them to the browsers. It’s the second level of caching in an organization. Figure 1 shows this type of network configuration.
Figure 2

A proxy-cache is not just a solution to the bandwidth crisis; it is also desirable when a network firewall is needed to guarantee the security of your organization. In this case, the proxy-cache sits on a computer accessible from all local browsers, but isolates them from the Internet at the same time. This computer must have two network interfaces, attached to the internal and external networks, and must be the only computer reachable from the Internet. Figure 2 illustrates such a configuration. The proxy-cache server must be accessible only by internal systems to ensure that no one on the Internet can access your internal documents by requesting them from the proxy-cache. I will discuss access control to the proxy-cache later in this article.
 

Multi-level Web Caching

One step forward from this approach is the concept of a cache hierarchy, where two or more proxy-cache servers cooperate by serving documents to each other. A proxy-cache can play two different roles in a hierarchy, depending on network topology, ISP policies and system resources. A neighbor (or sibling) cache is one that serves only documents it already has. A parent cache can get documents from another cache higher in the hierarchy or from the source, depending on whether it has more parent or neighbor caches at its level. A parent cache should be used when there are no more opportunities to get the document from a cache on the same level.
 

Figure 3
Choosing a good cache topology is very important in order to avoid generating more network traffic than without web caching. An organization can choose to have several sibling caches in its departmental networks and a parent cache close to the network link to the Internet. This parent cache can be configured to request documents from another parent cache at the upstream ISP, if it has one (most do). Agreements can be made between organizations and ISPs to build sibling or parent caches to reduce traffic overload on their links, or to route web traffic through a different path than the regular IP traffic. Web caching can be considered an application-level routing mechanism, which uses ICP (Internet Cache Protocol) as its main protocol. Figure 3 is an example of how an organization can implement multi-level web caching.
 

What’s ICP?

ICP, Internet Cache Protocol, is a protocol used for communication among web caches. A lightweight protocol built on top of UDP, ICP is used to locate specific web objects in neighboring caches. Most transfers between caches are done with the TCP-based HTTP protocol, but the decision of where to retrieve an object from must be made with a simpler and faster mechanism. Other information needed is which caches are down or have congested links.

One cache, in order to find the best location from which to download an object, sends an ICP request packet to all of its sibling and parent caches, and they send back ICP replies with a HIT or MISS code. A HIT code means this cache has the object and agrees to serve it. A MISS code means it doesn’t have the object. Thus, the cache now knows who has the object it needs and, combining this information with other factors such as the round-trip times of each response, can perform the cache selection and make the request via HTTP to its choice. If all the caches reply with MISS packets, it requests the document from its parent cache. An ICP request/reply exchange should occur in a second or two, so latency increases by that amount for the browser, but this is usually not noticed by the end user.
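
The selection step described above can be illustrated with a small JavaScript sketch. It is a simplification under stated assumptions: the reply objects, the peer names and the selectSource function are invented for this example (real ICP replies are binary UDP packets), and round-trip time is the only tie-breaker considered.

```javascript
// Sketch of neighbor selection from ICP replies.
// Each reply is a made-up object: { peer, code, rttMs }.
function selectSource(replies, parentName) {
  // Keep only the caches that answered HIT.
  const hits = replies.filter((r) => r.code === "HIT");
  if (hits.length === 0) {
    // Every neighbor answered MISS: fall back to the parent cache.
    return parentName;
  }
  // Among the HITs, prefer the reply with the shortest round-trip time.
  hits.sort((a, b) => a.rttMs - b.rttMs);
  return hits[0].peer;
}

const replies = [
  { peer: "sibling1", code: "MISS", rttMs: 12 },
  { peer: "sibling2", code: "HIT", rttMs: 40 },
  { peer: "parent", code: "HIT", rttMs: 95 },
];
console.log(selectSource(replies, "parent"));
```

Here sibling2 is chosen because it is the fastest cache that reported a HIT; if every reply were a MISS, the request would go to the parent.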

If the object requested via ICP is small enough, it can be included in the ICP HIT reply, like an HTTP redirect, but this is not a very common situation. Of course, ICP is needed only in a multi-level cache environment with multiple sibling and parent caches. ICP is not needed in situations like the ones in Figures 1 and 2. When only one cache is involved, or when one cache always requests documents from the same higher-level cache, ICP would only add unwanted overhead.
 

To cache or not to cache?

At this point, we must realize that not all objects on the web are cacheable. Most FTP files are, as well as most static web pages, but a large number of CGI-generated web pages (dynamic documents) are not. This kind of document is non-cacheable because it is different each time you request it. Two good examples of this kind of object are access counters and live database queries. Caching a reply from a flight reservation system is senseless, since the next query will most likely return more up-to-date values. Other kinds of documents which should not be cached include SSL documents (securely transmitted documents).
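
A rough version of this decision can be written as a small function. The rules below are illustrative heuristics chosen for this sketch (an assumption, not the rule set of any particular proxy): treat only GET requests as cacheable, and skip SSL URLs, anything under a cgi-bin path and anything with a query string.

```javascript
// Rough cacheability check based on the request alone.
// The function name and the exact rules are assumptions for illustration.
function looksCacheable(method, url) {
  if (method !== "GET") return false; // only plain retrievals
  const parsed = new URL(url);
  if (parsed.protocol === "https:") return false; // SSL documents
  if (parsed.pathname.includes("/cgi-bin/")) return false; // CGI output
  if (parsed.search !== "") return false; // query strings usually mean dynamic pages
  return true;
}

console.log(looksCacheable("GET", "http://www.example.com/logo.gif"));
console.log(looksCacheable("GET", "http://www.example.com/cgi-bin/counter"));
```

A real cache also looks at the reply headers, which are covered in the next section, before deciding whether and for how long to keep an object.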
 

OK, cache, but for how long?

Even if you do not have a proxy-cache server, you must be aware of the effects other proxy-cache servers are causing on the Internet. You may be publishing information on your web server that other caches are storing and serving for longer than you probably want. This is particularly true if you periodically update your site and it’s important to you that an end user never gets out-of-date pages or graphics.

A document in a cache server can have three different states: FRESH, NORMAL and STALE. When an object is FRESH, it is served normally when a request for it arrives, without checking the source to see if the object has been modified since its last retrieval. If it’s in NORMAL state, an If-Modified-Since GET request is sent to the source, so the cache server only downloads the object from the source if it has changed since its last retrieval. A STALE document is no longer valid, and it’s retrieved from the source again.

Normally, when a web server sends a document, it adds an HTTP header called Last-Modified containing the date the object was created or last modified. This date is used by cache servers to heuristically calculate how long the object may still be considered FRESH. Usually, a proportion of the time elapsed between the date the document was last modified and the date when the document was received is used. A normal proportion is 10%-30% of this time. If this proportion is set to 20%, a document modified ten days before it was retrieved stays FRESH for two days before being checked for changes.
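
The heuristic is simple arithmetic, shown here as a short JavaScript sketch. The function name freshnessLifetimeMs and the sample dates are made up for the example; the 20% factor is the one used in the text.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// How long a document may be served without checking the source, as a
// proportion (factor) of its age at the moment it was received.
function freshnessLifetimeMs(lastModified, received, factor) {
  const ageAtRetrieval = received.getTime() - lastModified.getTime();
  return Math.max(0, ageAtRetrieval * factor);
}

// A document last modified ten days before the cache received it.
const lastModified = new Date(Date.UTC(1999, 0, 1));
const received = new Date(Date.UTC(1999, 0, 11));
const lifetime = freshnessLifetimeMs(lastModified, received, 0.2);
console.log(lifetime / DAY_MS, "days");
```

With a factor of 0.2, the ten-day-old document is treated as FRESH for about two days after it was received.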

Webmasters who frequently update their information need more control over the time their documents remain unchecked in web caches. In this case, the Expires HTTP header in the documents served by your server can be used to indicate when a document must be dropped by any cache server. This header explicitly gives the caches the expiration date of a document. A valid RFC1123 time format should be used with this header, for example:

Expires: Mon, 25 Aug 1997 10:00:00 GMT

This header can be generated easily in CGI scripts or with the mod_expires module included in Apache 1.2. For example, the following Apache directives (in a <Directory></Directory> section or a .htaccess file) would do it:

ExpiresActive On

ExpiresByType image/gif A432000

ExpiresByType image/jpeg A432000

ExpiresByType text/html A10800

The Expires header is activated for all the subsequent documents with a value of five days for JPEG and GIF images, and three hours for HTML documents.
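
To see what such a header looks like, the following JavaScript sketch builds one by adding a number of seconds to an access time, which is what the A<seconds> values above express. The function name expiresHeader and the sample access time are assumptions made for this example; Date.prototype.toUTCString is the standard method that produces the RFC1123-style date.

```javascript
// Build an Expires header that lies "seconds" after the moment of access.
function expiresHeader(accessTime, seconds) {
  const expiry = new Date(accessTime.getTime() + seconds * 1000);
  // toUTCString() produces dates such as "Mon, 25 Aug 1997 10:00:00 GMT".
  return "Expires: " + expiry.toUTCString();
}

// Hypothetical access time: 20 Aug 1997, 10:00:00 GMT.
const accessed = new Date(Date.UTC(1997, 7, 20, 10, 0, 0));
console.log(expiresHeader(accessed, 432000)); // images: 432000 s = five days
console.log(expiresHeader(accessed, 10800)); // HTML: 10800 s = three hours
```

A CGI script could print a line like this among its reply headers to achieve the same effect for dynamically generated pages.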

If you have documents which should never be cached in any server or browser, use the HTTP header called:

Pragma: no-cache

Of course, a cache may expire an object sooner, based on site configuration, lack of free disk space, LRU (Least Recently Used) policies, etc., but it can never cache an object beyond its Expires time.
 

SQUID, the definitive proxy-cache server

Once you have the basic concepts in mind, let’s see how to implement this technology in your organization. A few proxy-server programs are on the market, such as MS-PROXY, aka Catapult, available only for Windows NT, and Netscape Proxy Server, available for different UNIX platforms and Windows NT. Both have two main drawbacks: they are commercial software and they don’t support ICP. The excellent Apache web server has included a proxy-cache module since version 1.2. This module is a very interesting option: it’s free, and works with the most popular web server on the Net. However, it doesn’t use ICP, and its robustness is not comparable to the best choice for a proxy-cache server: SQUID.

Squid is a high-performance proxy-cache server derived from the cache module of the Harvest Research Project, maintained by Duane Wessels. It supports FTP, gopher, WAIS and HTTP objects. It stores hot objects in RAM, and maintains a robust database of objects in disk directories. Squid also supports the SSL protocol for proxying secure connections and has a complex access control mechanism.

Another interesting feature of Squid is negative caching, which saves "connection refused" and "404 Not Found" replies for a short period of time (usually five minutes).

Squid consists of these programs:

It also provides a CGI program, designed to be run through a web interface, that outputs statistics about its configuration and performance and allows some management capabilities.
 

SQUID installation

Installing Squid is easy: just download the source archive from http://squid.nlanr.net and, in a temporary directory, type:

% gzip -dc squid-x.y.z-src.tar.gz | tar xvf -

Next, compile and install the software by typing:

% cd squid-x.y.z

% ./configure

% make all

% make install

These commands install all needed programs and configuration files to /usr/local/squid. The binary programs are installed in its bin subdirectory, and the configuration files in conf. Log files are located in the logs subdirectory, and the object database in the cache directory and its subdirectories. A shell script called RunCache in the bin directory is used to run the squid binary, and ensures that if the process dies for any reason, it is restarted automatically. So, put the following line in your rc.local file:

/usr/local/squid/bin/RunCache &

This will generate an error log in /usr/local/squid/squid.out if Squid could not start because of some configuration problem.

Of course, you can choose to install an RPM version of Squid if you use Red Hat Linux or another distribution that supports RPM packages.

Squid installs a sample configuration file called squid.conf with many comments for each option. Here you can change the ICP and HTTP ports (3128 by default) and define how much memory and disk space you reserve for caching objects, as well as other parameters such as refresh patterns and access control restrictions. Of course, you need an ICP port only if your cache is going to be a sibling or parent of other caches. The directives for changing these values are http_port, icp_port, cache_mem and cache_swap. Additionally, you can set the maximum object size to be stored in the database; the default is 4 MB.

Also, you should uncomment the following lines in this file:

cache_effective_user nobody

cache_effective_group nobody

This avoids running Squid as root, a dangerous habit for anyone who runs servers like httpd or gopherd. If you are using a recent version of Squid (at the time of writing, the current version is 1.1.16), it will not start running as root, but will write an error message to the squid.out file.

To let Squid use 100MB of your hard disk, the directive cache_dir should be something like this:

cache_dir /usr/local/squid/cache 100 16 256

Before starting Squid for the first time, create the cache and logs directories.

To build the cache and hashed subdirectories, you should execute the commands:

cd /usr/local/squid

mkdir cache

chown -R nobody cache

cd /usr/local/squid/bin

./squid -z

Finally, to create and change the owner of the logs directory:

# cd /usr/local/squid

# mkdir logs

# chown nobody logs

Now Squid can be run safely for the first time, with the above RunCache invocation. It will spawn several dnsserver processes and write its pid in the file logs/squid.pid. Important warning or error messages can be found in the squid.out and logs/cache.log files.

Remember, if you want to shut down the cache, you must first kill the RunCache process to avoid an immediate restart and then type:

/usr/local/squid/bin/squid -k shutdown

Never use kill -9 to shut down the cache, because it doesn’t close the object database in such a way that it can be recovered; you’ll probably lose part of it.
 

Restricting Access to Your Cache

In order to enable only those users who are in your organization to access your cache, you must set up some access control lists (ACLs). Defining access lists in Squid is quite easy; all access lists are defined with a name and are used to define a subset of elements. You can make a subset of IP addresses, protocols, destination URLs and even browser brands. The directive to define an ACL or subset is:

acl name type data

You can learn more about ACL types in the example squid.conf. In the case of restricting access to only our users, the type needed is src. For example, suppose you want to allow access to the cache to all browsers in the 172.16.236.0 class C, the first 32 addresses of the next class C and your PC, 172.16.237.180. You can define an ACL like this:

acl my_users src 172.16.236.0/255.255.255.0

acl my_users src 172.16.237.1-172.16.237.32/255.255.255.255

acl my_users src 172.16.237.180/255.255.255.255

Next, define an ACL for the rest of the addresses. This line is included in the squid.conf example file:

acl all src 0.0.0.0/0.0.0.0

Apply these ACLs in an ordered way with the http_access directive. The syntax is:

http_access allow|deny <[!]acl> .. <[!]acl>

For example:

http_access allow my_users

http_access deny all
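
The way these two rules are evaluated can be modelled in JavaScript. This is only a sketch of the logic, not Squid's code: the address ranges restate the my_users ACL from the text, the function names are invented, and denying a request when no rule matches is an assumption made for the example.

```javascript
// Convert dotted-quad IPv4 text to a number so ranges can be compared.
function ipToInt(ip) {
  return ip.split(".").reduce((acc, part) => acc * 256 + Number(part), 0);
}

// The my_users ACL from the text, written as inclusive address ranges.
const myUsers = [
  ["172.16.236.0", "172.16.236.255"], // the whole 172.16.236.0 class C
  ["172.16.237.1", "172.16.237.32"], // first 32 addresses of the next one
  ["172.16.237.180", "172.16.237.180"], // a single PC
].map(([lo, hi]) => [ipToInt(lo), ipToInt(hi)]);

const acls = {
  my_users: (ip) => myUsers.some(([lo, hi]) => ip >= lo && ip <= hi),
  all: () => true,
};

// http_access rules are checked in order; the first matching rule decides.
const rules = [
  { action: "allow", acl: "my_users" },
  { action: "deny", acl: "all" },
];

function httpAccess(clientIp) {
  const ip = ipToInt(clientIp);
  for (const rule of rules) {
    if (acls[rule.acl](ip)) return rule.action;
  }
  return "deny"; // assumption for this sketch: deny when nothing matches
}

console.log(httpAccess("172.16.236.17"), httpAccess("10.0.0.1"));
```

Because the rules are applied in order, swapping the two http_access lines would deny everybody, including your own users.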

More than one ACL can be combined in the same http_access directive, and each can be used in its negative form (i.e., preceded by !). The example shown is the simplest use of ACLs, but more complex forms will allow connections only during designated hours and days, allow only defined URLs or domains to be fetched and restrict some protocols such as FTP. This powerful feature of Squid can help you to implement and enforce your security policy whether you use Squid on your firewall or the Squid machine is the only one allowed to cross your firewall. Just look for examples in squid.conf.

There is also an ACL to permit setting the desired web ports you allow your users to use. This is the Safe_ports ACL. You should uncomment this line and add the 443 port to this ACL in order to allow the use of secure web servers through your Squid server.
 

A Look at the Logs

Squid can generate huge logs of your proxy-cache usage. With this information and the help of some scripts, we can generate complete access statistics, like those generated for web servers. Squid maintains three main log files:

of expired objects. served. It also includes information on each ICP query received by the cache from other servers using this server as a neighbour.

The format of this file is:

Time Elapsed Host Status/HTTP Size Method URL Ident Hier_Status/Host

Many utilities are available for generating statistics from the access_log file (see Resources).
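
As a taste of what such utilities do, here is a small JavaScript sketch that splits lines in the format shown above and computes a hit ratio. The two sample lines are made up for illustration, and the status names TCP_HIT and TCP_MISS, as well as the rule "any status containing HIT counts as a hit", are assumptions of this sketch.

```javascript
const FIELDS = ["time", "elapsed", "host", "status", "size",
                "method", "url", "ident", "hierarchy"];

// Turn one whitespace-separated log line into an object.
function parseAccessLine(line) {
  const parts = line.trim().split(/\s+/);
  const entry = {};
  FIELDS.forEach((name, i) => { entry[name] = parts[i]; });
  // "Status/HTTP" combines the cache result and the HTTP reply code.
  const [cacheResult, httpCode] = entry.status.split("/");
  entry.cacheResult = cacheResult;
  entry.httpCode = Number(httpCode);
  entry.size = Number(entry.size);
  return entry;
}

// Fraction of requests whose cache result contains "HIT".
function hitRatio(lines) {
  const entries = lines.map(parseAccessLine);
  const hits = entries.filter((e) => e.cacheResult.includes("HIT")).length;
  return entries.length === 0 ? 0 : hits / entries.length;
}

// Made-up sample lines, for illustration only.
const sample = [
  "915148800.123 250 172.16.236.17 TCP_MISS/200 4120 GET http://www.example.com/ - DIRECT/www.example.com",
  "915148805.456 12 172.16.236.18 TCP_HIT/200 4120 GET http://www.example.com/ - NONE/-",
];
console.log(hitRatio(sample));
```

Aggregating by status or by size rather than by client host keeps the statistics useful without exposing what individual users visit.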

Remember, it is not considered ethical to surf your access_log to see which places your users visit. Some sites have chosen not to publish processed statistics in any form to guard their users’ privacy, which is an important concern for all of us involved in the Internet community.

The logs grow very quickly and in a few days can eat up your remaining disk space. To safely clean your log files, you should rotate them with the SIGUSR1 signal. A single line can be added to your crontab to begin new log files every night:

/usr/local/squid/bin/squid -k rotate

This command will create the files access_log.0, store_log.0 and cache_log.0, and begin logging to new empty log files. Now you can safely remove these files or process them for statistical purposes. The next time you rotate the logs, files.0 will be moved to files.1 and so on. You can configure how many extensions Squid will use for these rotations to save disk space with the logfile_rotate n directive in the squid.conf file.
 

Configuring Browsers to Use Cache
Figure 4
To begin using your new proxy-cache server, you must first instruct your users’ browsers to fetch objects from your server instead of retrieving them directly. In most modern web browsers, one of the configuration options is the specification of the proxy setup. Another option is to specify a list of domains or URL patterns which must be fetched through the proxy.

In Netscape Navigator or Communicator, you can include a proxy server and its port for each service to be proxied. With Squid, you can use these settings for the HTTP, Security (SSL), FTP and WAIS services, all with the same port (3128, by default). First, select the "Manual proxy configuration" radio button, then click the "View" button to type in your settings. Figures 4 and 5 show examples of these screens.
Figure 5

Another solution is the Automatic Proxy Configuration, available since Netscape Navigator 3.0, which allows multiple proxy servers, backup servers and different servers by domain. This configuration sits in a JavaScript-like file that must be retrieved from a server. Using it, you can change the topology of your cache mesh or introduce new servers that must be treated as "No proxy for" servers without telling your users to change their configurations; the new configuration script is reloaded each time the browser is launched. MS Internet Explorer has also supported the automatic proxy configuration feature since version 3.02.
Figure 6

An example of this kind of configuration for Netscape Navigator and Communicator is shown in Figure 6. In this example, each time the browser is started, it loads the file proxy.pac from the server intranet.mec.es.
 

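This file must be returned with the MIME type application/x-ns-proxy-autoconfig, which can be accomplished in two ways:

By adding this entry to the web server’s mime.types file:

application/x-ns-proxy-autoconfig pac

Or, with Apache, by adding this directive to its configuration:

AddType application/x-ns-proxy-autoconfig pac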

For the changes to take effect, you must name your proxy auto-configuration file with the .pac extension and restart your web server. The Netscape documentation will tell you about the syntax of the .pac file (see Resources).

Nevertheless, we’ll look at a couple of basic examples of how to write them.

No HTML tags should be embedded in the JavaScript file, just the function FindProxyForURL with arguments url and host. This function should return a single string containing DIRECT (get the object directly from the source), or PROXY host:port (get the object through this server and port). The string can contain more than one of these directives, separated by semicolons. For example:
 

function FindProxyForURL(url, host)
{
    return "PROXY proxy1.mec.es:3128; PROXY proxy2.mec.es:80; DIRECT";
}

will instruct the browser to use the first proxy to fetch the object. If it can’t contact the first (proxy1), it will try the second (proxy2); if both are down, it will fetch the object from the source. This gives our cache system a level of fault tolerance.

One interesting feature is using different proxies for different domains and including support for internal servers where we don’t want to use the cache. For example:

function FindProxyForURL(url, host)
{
    if (isPlainHostName(host) || dnsDomainIs(host, "intranet.mec.es"))
        return "DIRECT";
    else if (shExpMatch(host, "*.com"))
        return "PROXY proxy1.mec.es:3128";
    else
        return "PROXY proxy2.mec.es:80";
}

This function will directly fetch all objects whose host name is a single word with no dots or belongs to the intranet server, fetch all .COM objects through proxy1, and fetch the rest through proxy2.
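
It can be handy to try a .pac function outside the browser before publishing it. The sketch below does that with simplified stand-ins for the three helper functions, which the browser normally provides; these stand-ins are assumptions written for this example and may not match the browser's behavior in every corner case.

```javascript
// Simplified stand-ins for helpers the browser normally provides.
function isPlainHostName(host) {
  return !host.includes(".");
}
function dnsDomainIs(host, domain) {
  return host.endsWith(domain);
}
function shExpMatch(str, pattern) {
  // Translate the shell pattern (* and ?) into a regular expression.
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const regex = "^" + escaped.replace(/\*/g, ".*").replace(/\?/g, ".") + "$";
  return new RegExp(regex).test(str);
}

// The per-domain example function from the text.
function FindProxyForURL(url, host) {
  if (isPlainHostName(host) || dnsDomainIs(host, "intranet.mec.es"))
    return "DIRECT";
  else if (shExpMatch(host, "*.com"))
    return "PROXY proxy1.mec.es:3128";
  else
    return "PROXY proxy2.mec.es:80";
}

for (const host of ["www", "intranet.mec.es", "www.nba.com", "www.boe.es"]) {
  console.log(host, "->", FindProxyForURL("http://" + host + "/", host));
}
```

Running it with Node prints the proxy decision for each sample host, which makes mistakes in the matching rules easy to spot.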

As a tip, the .pac file can be generated "on the fly" by a CGI script, giving different proxy configurations for different browsers, e.g., depending on the REMOTE_HOST environment variable provided by the CGI interface. In this way, load balancing between different networks can be achieved.

Always remember that the MIME-type returned by the CGI must be application/x-ns-proxy-autoconfig.
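
A generator of this kind might look like the following sketch, written here as a Node.js script for illustration. The department suffix .dept1.mec.es and the mapping of hosts to proxies are hypothetical; only the MIME type, the REMOTE_HOST variable and the proxy names come from the text.

```javascript
// Pick a proxy for the requesting browser.
// Assumption: hosts in one department use proxy1, everyone else proxy2.
function proxyFor(remoteHost) {
  return remoteHost.endsWith(".dept1.mec.es")
    ? "proxy1.mec.es:3128"
    : "proxy2.mec.es:80";
}

// Build the complete CGI reply: headers, a blank line, then the body.
function buildResponse(remoteHost) {
  const pac =
    "function FindProxyForURL(url, host)\n" +
    "{\n" +
    '    return "PROXY ' + proxyFor(remoteHost) + '; DIRECT";\n' +
    "}\n";
  return "Content-Type: application/x-ns-proxy-autoconfig\n\n" + pac;
}

// REMOTE_HOST is set by the web server when it runs the script as a CGI.
process.stdout.write(buildResponse(process.env.REMOTE_HOST || ""));
```

Keeping DIRECT as the last alternative means browsers still work if the chosen proxy is down.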
 

Joining a Hierarchy

If your cache is to be part of a cache mesh or your proxy server is to be connected to another proxy that will be its parent, you must use the cache_peer directive. You must include one line for each of your neighbours. The syntax for this line is:

cache_peer hostname type http_port icp_port options

where:

You can specify the option default to use this host as a last resort in case you can’t speak ICP with your parent cache. Another option is weight=N, which favors a specific parent or sibling in the neighbor selection algorithm. Larger values give higher weights.

If you have a stand-alone cache, you should not include any of these directives. If you have one parent that runs its HTTP port on 3128 and its ICP port on 3130, the line to include in the squid.conf file is:

cache_peer your.parent.cache parent 3128 3130

With the cache_peer_domain directive, you can limit which neighbors are queried for specific domains. For example:

cache_peer_domain your.first.parent.cache .com .edu

cache_peer_domain your.second.parent.cache .es .it .uk .fr

will query the first cache only for the .COM and .EDU domains, and the second one for some of the European domains.
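
The effect of these two lines can be modelled as a simple lookup. The JavaScript below is only a sketch of the idea, not Squid's implementation; the peersFor function is invented for the example, and matching is reduced to a plain suffix test on the host name.

```javascript
// Domain lists per neighbor, mirroring the two directives above.
const peerDomains = {
  "your.first.parent.cache": [".com", ".edu"],
  "your.second.parent.cache": [".es", ".it", ".uk", ".fr"],
};

// Return the neighbors that may be queried for a given host name.
function peersFor(host) {
  return Object.keys(peerDomains).filter((peer) =>
    peerDomains[peer].some((domain) => host.endsWith(domain))
  );
}

console.log(peersFor("www.nba.com"));
console.log(peersFor("www.boe.es"));
console.log(peersFor("www.example.org"));
```

In this model, a host that matches neither list is not sent to either of these two neighbors.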

If you have only one parent cache, the overhead of the ICP protocol is unnecessary. Since you are going to fetch all objects (HITs and MISSes) from the parent, you can use the no-query option in the cache_peer directive to send only HTTP queries to that cache.

Also, there are some domains you will always want to fetch directly rather than from your neighbours. Your own domain is a good example. Fetching objects belonging to your local web servers from a faraway cache is not efficient. In this case, use the always_direct acl command.

For example, in our organization we use:

acl intranet dstdomain mec.es
always_direct allow intranet

to avoid getting our own objects from the national cache server.
 

The Cache Manager

Squid includes a simple, web-based interface called cachemgr.cgi to monitor the cache performance and provide useful statistics, such as:

Be sure to copy the cachemgr.cgi program installed in your /usr/local/squid/bin (or wherever you chose to install it) to your standard CGI directory, and point your browser to

http://your.cache.host/cgi-bin/cachemgr.cgi

There, you should type your cache host name, usually "localhost" or the name of your system, and the port your cache is running on, usually 3128, and check all the options.
 

Conclusions and Tips

A proxy-cache server is a necessary service for almost any organization connected to the Internet. In this article, we have tried to show the whys and hows of implementing this technology, and to give a brief tutorial on Squid, the most advanced and powerful tool for this purpose. Don’t forget to read all the comments in the example configuration file. They are complete and useful and show a lot of features not mentioned in this article.

Perhaps, in a few years, with the growth of the PUSH technology and the use of dynamic content on the Web, caching won’t be a solution to the bandwidth crisis. Today, it’s the best we have.

One problem proxy caches don’t solve is making certain your users configure their browsers to use the caches. Users can always choose to bypass your proxy server by not configuring their browsers. Some organizations have chosen to block port 80 in their routers except for the system running the proxy-cache server. It’s a radical solution, but very effective.

Another thing you can do to improve the speed of your users’ browsers is pre-fetching the most accessed web sites from your cache. Recursive web-fetching tools which support proxy connections can help do this task in non-peak hours, e.g., url_get, webcopy. Launching one of these retrieval tools with the standard output redirected to /dev/null updates the cache with fresh objects.
 
 

Resources

http://squid.nlanr.net
http://squid.nlanr.net/Squid/FAQ/FAQ.html
http://ircache.nlanr.net/Cache/ICP/ICP-id.txt
mailto:squid-users-request@nlanr.net with "subscribe squid-users" in the body.
mailto:squid-announce-request@nlanr.net with "subscribe squid-announce" in the body.
http://home.netscape.com/eng/mozilla/2.0/relnotes/demo/proxy-live.html
Cache Stats, by Iain Lea: