Cover V09, I06
Article

jun2000.tar


Link Management System

Reinhard Erich Voglmaier

There are many good books discussing the different phases of creating a Web site, such as Web planning, Web design, and so on. Reading these books leads the Webmaster into an ideal world and returning to the everyday job may then be frustrating, because the Webmaster finds herself facing a poorly designed and perhaps even more poorly planned Web site. This is because most Web sites are grown over time. As a site grows, the initial lack of planning is felt more strongly. When attempting to move a whole directory, it can seem as if the site either must remain as it is or will need to be redesigned from the ground up. Neither approach is the best one; the former will keep the site in the unorganized state, and the latter will waste a lot of good work.

I needed a tool that allowed me simply to move a number of directories to another location without ending up in a nightmare of links. Furthermore, there are a number of references from external sites that may need to be notified of such a reorganization in order to keep their links up to date. Thus, I developed the Link Management System, which keeps the internal links up to date and notifies registered clients if the referenced document is moved. The source code of the Link Management System is available from Sys Admin's Web site:

http://www.sysadminmag.com
as well as from my company's Web site:

http://www.GlaxoWellcome.co.uk
The whole application is written in Perl.

The Basis

The underlying idea behind the Link Management System is to save information about which document is referenced where and to update all links when a document moves to a new directory. For example, if I change /webmaster/utility with all files and subdirectories to /system/admin/tools, the program will ensure that all the links to:

http://webmaster/utility/some_link.jpg
change to:

http://system/admin/tools/some_link.jpg
With this goal in mind, I designed the Link Management System around a concept that is used in the password management. I call this a shadow Web. Every document on the real Web site has a counterpart in the shadow Web site. The path is identical except for the prefix of the root. For example the document /html_documents/webmaster/utility/index.html has its counterpart in the shadow Web document: /html_shadow/webmaster/utility/index.html. Every shadow document contains (instead of the document content) the name of the documents referring to it. Let's see a dump of the abovementioned /html_shadow/webmaster/utility/index.html:

Link: /research/ITtools/search.html

Link: /HR/telefone/index.html

Link: /news/IT/SearchUpdate.html

Link: /news/IT/Problems.html

HREF: http://www.anothersite.it/programs.html

e-mail: admin@www.anothersite.it

The first four lines say that the documents contain one or more links on the counterpart of the shadowfile, that is, /html_documents/webmaster/utility/index.html. The last line remembers that there's an external site pointing to this file as well, and it would like to be informed (via email) if the location should change. More on this later on.

Now it's very fast to find the documents referring to each document (i.e., as fast as the access to the document itself). The whole application is written in Perl, so not only is it easy to install, but it's also platform independent.

Construction of the Shadow Web

The heart of the Link Management System is the shadow Web, as it contains all the information about the link structure and internal referrals, as well as external ones. The construction of the shadow Web is straightforward. A Perl script scans the whole Web site, parsing each document to understand its link structure. For each link found in the documents, two actions are performed. First, the procedure checks whether the referenced document really does exist. Second, the procedure opens the shadow document of the referenced document and generates an entry containing the physical path of the referencing document.

Link Maintenance

Once we have an updated shadow Web site, we can move our documents (images, presentations, HTML files...) without losing the links to other documents. Or, we can work on our Web site, adding new documents, deleting obsolete ones, or just changing them. Let's look at an example. Suppose I want to move the Webmaster's home page:

http://www.glaxowellcome.it/webmaster
to:

http://www.glaxowellcome.it/homes/webmaster
There are probably some pages on my company's Web site that refer users to the Webmaster for problems. For example, “If you have problems seeing this page look at: http://www.glaxowellcome.it/webmaster where you may find contact information”. So, I must update all links in all pages referring to my home pages. There may be other Web sites referring, for example, to my script upload directory that need to be informed of the change. To do this, the system opens the shadow file of the file to be moved, updates the links contained in the pages referring to the page to be moved, moves the pages, and sends (via email) a change notification to the external sites referring to the moved file. That's all!

The more frequent case is that of an update. The case of single documents being updated is easy. The new document is compared to the old one, links added are first controlled for existence, and then updated in the shadow Web. Links being removed are removed from the shadow Web also (i.e., the file describing the link is updated and removed). If a document will removed, the application generates an error message if there's still another document referring to it, otherwise it is removed.

Modification of entire directories is more complex. The easiest solution is to put the update into production and to run the procedure again to generate the shadow Web. This can be time consuming, but can be done at night. A more sophisticated procedure is actually under construction. This solution supposes that the intranet Web site has a test site and a production site. Only the production site is under the control of the Link Management System. If the test site has been updated, it can be put into production. At that time, two actions must be performed: update the shadow Web, and update the link structure. The update procedure does the following. It looks at the shadow Web to see whether any file referenced does not exist anymore. If there is a file referenced that does not exist, an error message is generated and human intervention is needed. The next step is to look at what file has been changed (timestamp), and how the link structure has been modified (new links, old links removed), and then the links for the new files must be updated in the shadow Web.

External Referrals

As mentioned above, there are two sections in the files on the shadow Web. The first one describes internal links, the second describes external ones. I discussed how internal referrals get into the file, and external ones are handled differently. Each external site sends the information about which pages are referenced from where. Since this rarely happens, the procedure does not handle it, and the information that other sites are subscribed to the actual document is inserted manually.

The data exchange between different Web sites could be achieved in the following way. For each external document being referenced in a document a line an XML file is generated. This XML file can be processed later by another program that decides what to do with it. The most common action is to send the information to the other link management system. The other system then takes the appropriate action. The same thing happens if the site is being updated, but this feature has yet to be implemented in the program.

Performance Questions

Clearly, the construction of the shadow Web takes time, and the movement of a whole directory requires updating a lot of files. Sometimes it means updating a single file several times, because several links contained in the file need to be updated, or the file is first updated and then moved. Let's see what can be done to address these problems.

The construction of the shadow Web takes a lot of time because we continue to open and close files. The following is an example from one of our production Web sites. The whole Web site occupies 2.3 GB; the indexing is done in 20 minutes and produces a shadow Web site of 21 MB. How could we speed this up? I observed that many files are opened and closed several times. Keeping all files open isn't possible, so I had to maintain a sort of cache for the file accesses and access the files only when the cache is full. The cache is organized as an associative array of lists. The key to the associative arrays is the names of the shadow files; the values are lists of the documents referencing the original files.

Another problem regards the movement of a whole directory structure. Let's look again at an example. Suppose I am moving the directory /webmaster to /homes/webmaster. First, I move index.html. In the same directory, there is a file manual.html pointing to index.html. Thus, I must update manual.html if I move index.html. Since manual.html is moved later on, I have to again change the link that was changed before in manual.html.

A strategy module is helpful here. The strategy module has a link plan, which contains details about what needs to be moved where. So, consulting the link plan the first time manual.html is updated, the other links in manual.html are also updated. The strategy module also helps in other ways. When the links are updated, the program must decide how to refer to the new location. There are two ways to express a link inside the same Web server: absolute links and relative links. Absolute links are links beginning from the document root, such as /webmaster/manuals/ \
java/images/folder.jpeg
. A relative link would be ../images/folder.jpeg. In constructing a link, the system has two possibilities to reference other documents: by an absolute link or by a relative link. The strategy module decides which way to do it. At the time of the writing, the strategy module is not yet finished and exists only in the most primitve version: the user can configure the maximum amount of traversal executed with relative links. For example, three maximal parent traversals means that ../../../images is allowed, for ../../../../data an absolute link is used.

Future Development

For now, the system only notifies remote Web sites if the documents for which those sites registered a change notification have been modified. The notification is standard email. In the next version, it will be in XML format. The next version of the system also will contain a subsystem that reads notification in XML format and automatically changes the Web site.

Conclusion

The Link Management System is an easy to install, easy to maintain, and platform-independent application for maintaining a Web site. It not only allows you to move documents and entire directories inside the same Web site, but also to notify other Web sites of the modification of the local URLs. The whole application is written in Perl. Parts of the application have been recycled in a Link Control System. These parts are the traversal of the Web site and the control of the existence of the referenced files.

About the Author

Reinhard Voglmaier studied physics at the University of Munich in Germany and graduated from Max Planck Institute for Astrophysics and Extraterrestrial Physics in Munich. After working in the IT department at the German University of the Army in the field of computer architecture, he was employed as a Specialist for Automation in Honeywell and then as a UNIX Systems Specialist for performance questions in database/network installations in Siemens Nixdorf. Currently, he is the Internet and Intranet Manager at GlaxoWellcome, Italy. He can be reached at: rv33100@GlaxoWellcome.co.uk.