Web Publishing with Perl Objects

Reinhard Voglmaier

The Internet has grown exponentially in the past ten years, and along with technical improvements have come changes to the way we work. At first, a few people may have worked on an entire Web site; now a whole team of Web editors may work on only a small part of an enterprise portal.

A standard life cycle for Web pages has evolved in our company. Web pages initially live only on a developer's desktop and are eventually added to a test site. After the updated pages are approved, they're put into production and the whole site is then in a constant state of change. This cycle of Web page development -- from test to production -- continues as long as the site is on the Internet.

In our environment, many Web editors produce material for our Web sites, and it would be unwise to give them all full access to our Apache servers. Furthermore, our internal review process for Web pages requires an official (a publisher) to approve every page before it goes online. The sheer volume of material flowing to the site requires some degree of automation so that the few full-access Web managers do not have to manually post every new page.

We have developed the following system to help meet the requirements of our process:

1. Web editors post their completed pages to a test Web site.

2. Supervisory Web publishers review and edit the completed pages at the test site. When the page meets the publisher's approval, he or she executes a script that places the page in the queue for the next production update.

3. An automated script with full access to the Apache environment posts the approved pages to a production Web server at regular intervals.

I described this environment in a previous article ("Simplifying Web Production", Sys Admin magazine, May 2000: http://www.sysadminmag.com/documents/sam0005h/). Since then, the network has become much more complicated, and our original approach proved too simplistic. Some of the changes we've made to the system are:

1. The sites hosted on one of the test servers can be transferred to different production servers.

2. The server receiving the site does not have to be a Web server -- it could be a staging server.

3. A number of different test servers can put their pages on one production server.

4. The system can work across firewalls.

In this article, I will describe these changes to our original system, particularly the framework for moving Web sites from a test environment to a production server. Note that our site has grown considerably since we originally envisioned the system. We now have several different test or staging servers on an intranet and also on the Internet. One requirement of the system is that it now must be able to move production pages to any server inside or outside the firewall.

The Basic Idea

The centerpiece of this system is the code that transfers Web content from a directory on a testing server to a directory on one of the production servers. The idea behind the content transfer is an agreement between the receiving and the sending server that specifies exactly what is going to be transferred, how this data has been produced, and where it has to go. It models the same process our Webmaster previously did by hand -- the supplier of the Web site sends the pages he produces, and the Webmaster installs them. The server producing the data sends this agreement along with the data to be installed. With this information at hand, the receiving server knows which site the data is intended for, where to install it, and how to install it.

I initially had only a test and production environment in mind, but I am now also using this solution for other applications (e.g., replication servers or staging servers designed to collect parts of Web sites produced from different suppliers).

After we send data and instructions from server A to server B, we can also send instructions for after the page installation (e.g., instructions to launch the indexing process of the search engine or other type of Web site maintenance tools).
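Such post-installation steps can travel with the data in the same self-describing format as everything else. A hypothetical instruction record might look like the following (the element names and reindexSearchEngine() are illustrative only; putInProduction() is the procedure used later in this article):

```xml
<Instructions>
<Step>putInProduction()</Step>
<Step>reindexSearchEngine()</Step>
</Instructions>
```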

The Design

The system consists of three parts:

  • A GUI
  • A database
  • Procedures to let the GUI "speak" with the database

The GUI is the interface to the user, so it is the visible part of the application. It offers users a list of pages and lets them choose which site to put into production (Figure 1). The GUI also offers administration options, including the ability to add users, modify user configurations, and add new sites or servers.

The database holds all data about users, sites, and permissions and the procedures execute the various requests made by the GUI. The procedures ensure that the data in the database is updated as needed. See the sidebar "MVC Paradigm".

In our case, the separation developed naturally:

  • The presentation happens in the client browser and is produced by a number of CGI scripts on the server side.
  • The data structure is fixed in a database.
  • We needed a way to access the database from the CGI scripts; this happens via a number of methods offered by a library of objects.

Perl offers object-oriented features as well as a large number of libraries ranging from CGI to database management modules. However, you can rebuild the whole framework in any other language. It would be very interesting, for example, to rewrite the whole thing in Java or Python.

DBM, RDBMS, or XML?

The first approach, as described in my previous article, used DBM as a repository. DBM exists on almost every platform (including Win32 systems) and is available for free from the Free Software Foundation (http://www.fsf.org/software) or Sleepycat (http://www.sleepycat.com). An RDBMS would be useful if you handle multiple sites with thousands of publishers and editors. In the approach described in this article, I used an XML-based database to hold all the necessary information. I chose XML because it is ASCII and, as such, human readable. You can also record new information without using the administrator interface, and if your database gets corrupted, you can quickly identify and correct the damaged data.

Here is our Web site information:

<WebSite Name="www.gsk.it">
<DocumentRoot>/disk1/htdocs/gsk</DocumentRoot>
<TransferProcedure>TransferServer1</TransferProcedure>
<LastModified>10.10.2001</LastModified>
<Modifier>admin</Modifier>
</WebSite>
Information about our users, their roles, and where they work:

<User UserId="webuser1">
<Roles>
          <Role>Administrator</Role>
          <Role>Publisher</Role>
</Roles>
<WebSites>
          <WebSite>www.gsk.it</WebSite>
          <WebSite>www.ricercaesanita.it</WebSite>
          <WebSite>www.someinternalesite.italy</WebSite>
</WebSites>
</User>

Information about the Web sites:

<WebSite Name="/medical/projects">
<ServerName>www.gsk.it</ServerName>
<LastModified>01.08.2001</LastModified>
<Modifier>user22</Modifier>
</WebSite>

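One payoff of choosing ASCII is visible immediately: even without an XML parser module, a few lines of Perl can read a record. The sketch below is only an illustration and deliberately uses plain regular expressions; a production system would use a real parser (e.g., XML::Simple):

```perl
#!/usr/bin/perl -w
use strict;

# A WebSite record in the format used by the XML database.
my $record = <<'XML';
<WebSite Name="www.gsk.it">
<DocumentRoot>/disk1/htdocs/gsk</DocumentRoot>
<TransferProcedure>TransferServer1</TransferProcedure>
<LastModified>10.10.2001</LastModified>
<Modifier>admin</Modifier>
</WebSite>
XML

# Extract the text of one single-valued element.
sub element {
    my ($xml, $tag) = @_;
    return $xml =~ m{<$tag>([^<]*)</$tag>} ? $1 : undef;
}

print element($record, 'DocumentRoot'), "\n";   # /disk1/htdocs/gsk
print element($record, 'Modifier'),     "\n";   # admin
```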
We also need tools to access this data. I wrote a public interface describing the necessary functionality and then the private access functions. The application doesn't know about the private methods, so they can be changed without changing the application.

Here is the user object:

1     sub new  {
2               my ($class,%args) = @_ ;
3               my $self = {
4                                   UserId      => $args{UserId},
5                                   CName       => $args{CName},
6                                   SurName     => $args{SurName},
7                                   Roles       => $args{Roles},
8                                   WebSites    => $args{WebSites}
9               };
10              bless($self,$class);
11              return($self);
12    }

Line 2 puts the first argument, the class name, into the class variable and the rest of the arguments into a hash table called %args. The hash table is used from line 3 to line 9 to initialize the object's data. The variable $self is a reference to a hash table containing the object's data. We then turn the reference into an object (function bless) and return it to the calling procedure. Look at the public interface:

1    sub save {
2             my $this = shift;
3             my $User = $this->_save();
4             #   ....     Some housekeeping can go on here anyway
5             return($User);
6    }

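The private _save() itself is not shown in this article. As a minimal sketch of what it might do, assuming the XML user record layout shown earlier, it could serialize the object's hash like this (the helper name _user_to_xml and the hash layout are assumptions for illustration, not the actual private interface):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical private serializer: turns a user hash into the XML
# record format used by the database shown above.
sub _user_to_xml {
    my ($self) = @_;
    my $xml = qq{<User UserId="$self->{UserId}">\n};
    $xml .= "<Roles>\n";
    $xml .= "<Role>$_</Role>\n" for @{ $self->{Roles} };
    $xml .= "</Roles>\n<WebSites>\n";
    $xml .= "<WebSite>$_</WebSite>\n" for @{ $self->{WebSites} };
    $xml .= "</WebSites>\n</User>\n";
    return $xml;
}

my $self = {
    UserId   => "webuser1",
    Roles    => [ "Administrator", "Publisher" ],
    WebSites => [ "www.gsk.it" ],
};
print _user_to_xml($self);
```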
To create a new user:

1    $user = new User( UserId   => "Admin001",
2                      CName    => "Reinhard Erich",
3                      SurName  => "Voglmaier",
4                      Roles    => "Administrator",
5                      WebSites => "www.gsk.it",
6                    ) ;
7    $user->save();

GUI -- The Cool Part

Remember when you wrote code for Motif or Visual C++ with hundreds of lines of code for one single widget? The combination of HTML and Perl gives you the same power, but with a few lines of code plus the ability to run anywhere:

1  #!/usr/bin/perl -w
2  use WSP;
3  use CGI;
4  use Template;
5
6  my $Provider = new WSP();
7  my $Widget = new CGI();
8  my $Template = new Template("/d01/web/WSP/templates/Hello.html");
9
10  my $UserId  = $ENV{'REMOTE_USER'};
11  my $User = $Provider->User($UserId);
12
13  $Template->display();
14  if ( ! init_Template($User) ) {
15            die "Could not load application";
16  }
17
18
19  sub init_Template {
20            for my $Site ( $User->webSites() ) {
21            # Generates JavaScript to put the User Data into the form
22            . . . .
We set up some variables until line 12 and initialize our objects -- Provider (the main object containing all configuration data), Widget (holding the CGI object), and, at the end, Template (a facility to read a template HTML file). Line 10 gets the UserId, and line 11 gets the User object via this UserId. Line 13 displays the empty template, and the function init_Template generates JavaScript that fills the user data into the HTML file. The HTML file contains a submit button that calls a procedure to post the request (more about this later).

This is simplified -- the real application lets you choose how to proceed with the site prior to production. You could put into production only the files changed since the last site update, choose a method to update the database of a search engine, and so on. There are also scripts for administering users, sites, and requests; the logic is the same: we create the HTML parts via CGI, display the template, and put the data into the template using JavaScript.

The Glue -- The Application Logic

The system knows of the existence of four directories:

  • Accepted
  • Scheduled
  • Executed
  • Data
When a publisher decides that it's time to update the site, the system creates a file in "accepted". This file contains enough information to permit the system to begin the update process (i.e., the directory to be put into production, the site the directory lives on, the person requesting the update, the type of update, and the date of the request). Because the Web server normally runs at the lowest privilege level (let's assume as "nobody"), the directory "accepted" must be writable by the user "nobody".
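The exact layout of the file dropped into "accepted" is not spelled out above; a minimal sketch recording the five items listed could look like the following (the helper name and field labels are assumptions, and a temporary directory stands in for the real, "nobody"-writable "accepted" directory):

```perl
#!/usr/bin/perl -w
use strict;
use File::Temp qw(tempdir);

# Hypothetical helper: write a request record into the "accepted"
# directory carrying the five items the update process needs.
sub write_accepted {
    my ($accepted_dir, %req) = @_;
    my $file = "$accepted_dir/$req{UserId}_$$";
    open my $fh, '>', $file or die "cannot write $file: $!";
    print $fh "Directory: $req{Directory}\n",
              "Site: $req{Site}\n",
              "UserId: $req{UserId}\n",
              "Type: $req{Type}\n",
              "Date: " . localtime() . "\n";
    close $fh;
    return $file;
}

# A temporary directory stands in for the real "accepted" directory.
my $dir  = tempdir(CLEANUP => 1);
my $file = write_accepted($dir,
    Directory => "/medical/projects",
    Site      => "www.gsk.it",
    UserId    => "user1",
    Type      => "full",
);
print "wrote $file\n";
```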

A process that runs under the UserId of the Web site owner looks into the "accepted" directory for instructions. When it finds a file, it writes the file into "scheduled" and, when finished, puts the data produced into "data". The process then notifies the person who requested the update; additional site updates can be done without putting them into production. A cron job then collects all scheduled procedures and executes them at predefined times.

Controller -- Three Objects

The first object is called "WSP" and not only holds the configuration files but also administers the other two objects (Server and Job). It knows where the previously mentioned directories live and makes this information available via method calls. As with the database, the information held in the objects is made persistent via XML files.

Here is the information held in the WSP file:

<WSP Name="MainWebSite">
<ScheduledDir>/WSP01jobs/scheduled</ScheduledDir>
<AcceptedDir>/WSP01jobs/accepted</AcceptedDir>
<ServerConfigDir>/WSP01Configuration/Server</ServerConfigDir>
<ServerDir>/WSP01Configuration/Server</ServerDir>
<DataDir>/WSP01jobs/data</DataDir>
<LogDir>/WSP01logs</LogDir>
</WSP>
All servers are held in the ServerDir and look like this:

<SERVER Name="gsk">
<DocumentRoot>/d01/htdocs</DocumentRoot>
<SendDataProcedure>SendData()</SendDataProcedure>
<ServerName>gsk</ServerName>
<ConfigDir>/home/projects/WSP/Configuration/Server</ConfigDir>
</SERVER>
These two objects can be considered static, because the information seldom changes. The SendDataProcedure entry is the name of the procedure called to transfer the data from one server to another.
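A common Perl idiom for turning the name stored in <SendDataProcedure> into running code is a dispatch table of code references. The sketch below is an assumption about how such a lookup could work; the procedure bodies are placeholders, not the real transfer code:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical dispatch table: the name stored in <SendDataProcedure>
# selects the code that actually moves the data between servers.
my %transfer = (
    'SendData()'        => sub { return "local copy of $_[0]" },
    'TransferServer1()' => sub { return "remote transfer of $_[0]" },
);

sub send_data {
    my ($procedure, $docroot) = @_;
    my $code = $transfer{$procedure}
        or die "unknown transfer procedure: $procedure";
    return $code->($docroot);
}

print send_data('SendData()', '/d01/htdocs'), "\n";  # local copy of /d01/htdocs
```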

The third object is dynamic because it gets generated when a Web site goes into production:

<Job Name="user1_13">
<WebServer>gsk</WebServer>
<UserId>user1</UserId>
<SName>Voglmaier</SName>
<CName>Reinhard</CName>
<Date>Mon Jul 16 09:11:55 2001</Date>
<Procedure>putInProduction()</Procedure>
</Job>

Here again, all data is self-explanatory thanks to XML: the user (UserId, surname, and Christian name) who wishes to put the site into production, the date of the request, and what is required to put the site into production (in this case, executing the procedure putInProduction()) are all declared in the job object.

Controller -- The Behavior

The three objects are the heart of WSP inasmuch as they determine the behavior of the whole application. When the user decides to transfer the site, he chooses the site to be transferred from a list of files (Figure 1):

1  my $User = new User($UserId);
2  my @SiteList = $User->SiteList();
3       # code to get the data in the HTML Template as described above
4  my $WSP = new WSP();
5  my $Job = $WSP->createJob();
We create a new User object from the UserId. This object contains all the user data, including the list of sites we need in order to display which sites this particular user may transfer. In line 4, the controller springs into life: it reads the configuration files to obtain the list of directories it should know about and the servers available. With this knowledge at hand, the WSP object creates a new Job object and calls the job's accepted method, which creates a new file in the "Accepted" directory. Because these actions (caused by the createJob call in line 5) are executed under the UserId the Web server is running as (normally "nobody"), the "Accepted" directory has to be writable by this user ("nobody"). This is the point at which security must be discussed.

The rest can be executed as a higher-privileged user, normally the owner of the system. Once the job has been accepted, the scheduler does its job:

1  my $WSP = new WSP();
2  while ( $Job = $WSP->getNextAcceptedJob() ) {
3            $Job->scheduledJob();
4            if ( my $Server = $Job->transfer() ) {
5                      # we transfer the data to $Server
6                      ...........
This code creates a new WSP object to get all information about the system. Line 2 gets the first job from the "accepted" directory and executes the schedule procedure requested by the user. The Job object knows what schedule procedures exist and how to invoke them. Furthermore, it knows to which server the data should be transferred, opens the corresponding Server object, and uses the transfer function defined for that server. The information about the scheduling process is kept in the "scheduled" directory to permit a control process to understand the state of the job.

Everything is now finished on the sending server. However, the site is still not in production. The same framework is installed on the receiving server, which could forward the data to another server, or keep it and work as a staging server. Look at the "final" receiving server that updates the Web site:

1  my $WSP = new WSP();
2  while ( my $Server = $WSP->sourceServers()) {
3            $Data = $WSP->transferData();
4            $Instructions = $WSP->transferInstructions();
5            if ( $Instructions->execute($Data) ) {
6                      # here we could also execute Control Functions
7                      # or Security Controls
8              . . .   we're done !!!
It looks very simple because we reused what we had constructed for the sending server. The WSP object knows which servers could have sent data and how to get the data. This could be as simple as opening a directory containing the data; however, the data could also be on a staging server or still on the sending server (e.g., if the administrator of the sending server has decided not to transfer the data and to let the receiving server fetch it instead). After the data is transferred, it is installed using the instructions from the sending server. The instruction object is either created again, transferred, or read from a file just transferred. The instructions are then executed using the data objects previously created.

Current Development

The framework is installed and working as described above. Two important features are now being added: security and control. All the objects described in this article will get two additional methods: security() and notification(). These methods are called at certain times in the life of an object -- when the object is created and when an important action is executed on behalf of the object (e.g., when the job is transferred or a new job is created). This will permit callbacks to be executed in certain situations and in a well-defined context.

When the job object is created, security() is called with the job object and the reason as arguments; you can define a callback procedure that is then invoked with these two arguments. The same holds for notification(), again with the two parameters as above. You can register a notification callback that could, for example, send an email to the user who made the request. If the user contests the request, you can invoke a method on the job to cancel it; the job cancel event can in turn launch a notification asking the administrator for permission for the action. Updates of these new features will be uploaded to my home page (http://voglmaier.getmyhomepage.com/).
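A minimal sketch of how such a notification callback registry could look (the function names and event strings here are assumptions about the planned interface, not code that ships today):

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical notification registry: callbacks run when an event
# such as "created" or "cancelled" happens to a job object.
my %callbacks;

sub register_notification {
    my ($event, $code) = @_;
    push @{ $callbacks{$event} }, $code;
}

sub notification {
    my ($job, $event) = @_;
    $_->($job, $event) for @{ $callbacks{$event} || [] };
}

# Example: on job creation, queue a mail to the requesting user.
my @mail;
register_notification('created',
    sub { my ($job, $why) = @_; push @mail, "mail to $job->{UserId}: $why" });

notification({ UserId => 'user1' }, 'created');
print $mail[0], "\n";  # mail to user1: created
```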

Conclusion

This article described a framework for taking Web sites from a test environment to a production site. It is not limited to this particular situation, but can be used for a wide range of requirements. Furthermore, it does not impose a particular server architecture, and it could even be used if the server is based on an RDBMS (such as Oracle's WebPortal). Although you have to export and re-import the database to transport data from one site to another, this works, and you maintain only one framework. The framework can fit all security requirements because you determine which security features you will implement in the system.

References

Berkeley DB -- http://www.sleepycat.com

GDBM -- http://www.fsf.org/software

Lincoln Stein CGI -- http://stein.cshl.org/WWW/CGI

Reinhard E. Voglmaier HomePage --
http://voglmaier.getmyhomepage.com/

Samba -- http://www.samba.org

Sys Admin magazine, "Simplifying Web Production" by Reinhard Voglmaier, May 2000 -- http://www.sysadminmag.com/documents/sam0005h/

WebDav -- http://www.webdav.org

Reinhard Voglmaier studied physics at the University of Munich in Germany and graduated from Max Planck Institute for Astrophysics and Extraterrestrial Physics in Munich. After working in the IT department at the German University of the Army in the field of computer architecture, he was employed as a Specialist for Automation in Honeywell and then as a UNIX Systems Specialist for performance questions in database/network installations in Siemens Nixdorf. Currently, he is the Internet and Intranet Manager at GlaxoWellcome, Italy. He can be reached at: rv33100@GlaxoWellcome.co.uk.