Local News - Exploring Web-based Threaded Discussion Server Functionality

Lars Magnusson

Although I installed my first Web server in 1994, I didn't pay much attention to Web technology itself until last autumn when I was at a convention and saw a demonstration by Swedish Telecom, Telia. They showed an application for their Net documentation - a text repository able to check in new or changed texts. It was not as streamlined as Tim Berners-Lee's original Web system, but it was quite functional.

Unfortunately, Telia wasn't interested in releasing the CGI script they used, partly because they felt it was not ready. But it inspired a lot of the conference participants and gave me an excuse to take a closer look at what Web applications really could do. There was a lot of discussion about text repositories going on at my primary employer, a municipal council, so it was natural to look into the possibility of replicating Telia's application.

Since I did not have access to a Web server with CGI capability at that time, I tried using the posting capability instead, and I found that making a user-friendly repository with that method was not possible. This was mainly because the present HTML standard does not support a predefined Subject-line in a post, which is necessary to get reference information into the return post.

I was drawn into a discussion about FirstClass, a Canadian conference system that used a proprietary protocol. Naturally, the idea arose to look at a local Web-based conference server, with an interface not unlike Netscape 1.x's Netnews. (See Listing 1, lnews.sh.)

After exploring FirstClass, I found I now had some code, including an old mailserver (see "An Internet Gate for a Local Email System," Sys Admin Jan/Feb 1995: 49-56), that could be used. And, with the post option in Web forms, I gained much of the control over the flow that I previously lacked. I was then faced with five general problems to solve; I needed to:

1. Restore the mailed form to a structured text, handling both older Netscape 1.x string data and newer 2.x MIME-code data.

2. Transform the HTML/MIME codes for special characters, especially the Swedish characters (most of which are vowels, and thus very important in getting words correct).

3. Generate a message-id in the form of an index number.

4. Build the resulting HTML page.

5. Create a threaded index page, similar to that of Netscape Netnews.

Restoring Mail and Transforming Special Characters

Submissions sent via the Web-forms interface come in as conventional email, potentially manipulated by several intervening mail transport agents. To restore the email back to a formatted letter, I needed to resolve three other problems:

1. Netscape version 1.x and older delivers form data in one continuous string, whether it comes through CGI or a post.

2. Both rmail and mailx/Mail break lines at 80 characters.

3. Netscape 2.x delivers some posted data in a MIME format, so I had to identify the type of reply by looking for the MIME header in the mail header.
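As an illustration of that test, a minimal sketch (not taken from Listing 1; the temporary file name reply.tmp is an assumption) could look like this:

# Scan only the mail header (everything up to the first blank line)
# for the MIME header that marks a Netscape 2.x reply.
if sed '/^$/q' reply.tmp | grep -i '^MIME-Version:' > /dev/null
then
    FORMAT=mime        # Netscape 2.x reply, fields arrive as MIME parts
else
    FORMAT=string      # Netscape 1.x reply, one long form-data string
fi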

Pre-Netscape 2.x

To allow use of versions of Netscape prior to 2.x, I was faced with both dividing the string at certain points and joining lines elsewhere. Most UNIX tools do not handle strings longer than 1024 characters, so I wanted to keep lines as short as possible. In examining this problem, I also found that the HTML code for CR/NL, %0D%0A, often got broken at the 80-character boundary.

So, I started breaking the lines that could be broken, and joining the lines that should be joined. Then, I had to break "joined" lines. Only then could I substitute the special characters. The following segment of code accomplished the required task:

# 1) Break the form data at the encoded line breaks (%0D%0A) and at the
#    field separators (&).
# 2) Rejoin lines that the mailer wrapped at 80 characters: strip the
#    trailing continuation characters and join with the next line.
# 3) Break the %0D%0A sequences that only became whole after the join,
#    then substitute the special characters.

sed 's/%0D%0A/\
/g
s/&/\
/g' infile | nawk '{
    if (index($0, "\\") != 0) {
        oldlin = substr($0, 1, length($0) - 2)
        getline newlin
        print oldlin "" newlin
    } else {
        print $0
    }
}' | sed 's/%0D%0A/\
/g
.......           # rest of substitutions follows

After experimenting, I used awk to join the lines, as shown above, and used sed for the substitutions. I could have used a single awk script, but I got better performance by piping it through one sed script, then through awk, and finally to the last sed script.

Post-Netscape 2.x

Having a MIME-encoded reply from a user with Netscape 2.x simplified some of the process, because the filter for special characters was much less complicated. But I still had to extract the subject-line, the letter, and the message-id. By postulating that the reply's layout was controlled by the form and not by the viewer, I extracted the line numbers of the MIME field-headers. These were then used to make three sed scripts that cut up the reply, stored the subject and message-id in their respective shell variables, and stored the main text in a temporary file.
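A minimal sketch of that line-number approach, not taken from Listing 1 and with assumed form-field names (subject, msgid, letter) and temporary file names, might look like this. It assumes each value sits two lines below its field header, since a blank line follows the header in a MIME part:

SUBJ_LN=`grep -n 'name="subject"' reply.tmp | cut -d: -f1`
SUBJ_LN=`expr $SUBJ_LN + 2`                  # value is two lines below the header
SUBJECT=`sed -n "${SUBJ_LN}p" reply.tmp`

MSG_LN=`grep -n 'name="msgid"' reply.tmp | cut -d: -f1`
MSG_LN=`expr $MSG_LN + 2`
MSGID=`sed -n "${MSG_LN}p" reply.tmp`

TXT_LN=`grep -n 'name="letter"' reply.tmp | cut -d: -f1`
TXT_LN=`expr $TXT_LN + 2`
sed -n "${TXT_LN},\$p" reply.tmp | sed '$d' > letter.tmp   # drop the closing boundary line

Note that expr only ever touches the line numbers produced by grep -n, never the user's text.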

After all this, I finally had a readable letter - text that was easy to process further. I then retrieved some additional information from the reply. These values were also stored in shell variables for future use. Since mail address handling seems to differ between viewers, I paid some extra attention to parsing mail addresses and usernames from the standard formats I know of.
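As an illustration of that parsing (not the code in Listing 1; the variable names are assumptions), a minimal sketch covering the two most common From: layouts could be:

case "$FROMLINE" in
*"<"*">"*)   # From: Lars Magnusson <m8827@abc.se>
    MAILADDR=`echo "$FROMLINE" | sed 's/.*<\(.*\)>.*/\1/'`
    POSTER=`echo "$FROMLINE" | sed 's/^From: *//; s/ *<.*//'`
    ;;
*"("*")"*)   # From: m8827@abc.se (Lars Magnusson)
    MAILADDR=`echo "$FROMLINE" | sed 's/^From: *//; s/ *(.*//'`
    POSTER=`echo "$FROMLINE" | sed 's/.*(\(.*\)).*/\1/'`
    ;;
*)           # bare address, no real name
    MAILADDR=`echo "$FROMLINE" | sed 's/^From: *//'`
    POSTER=$MAILADDR
    ;;
esac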

Generation of the Message-id

In Netnews, relationships between articles are maintained by adding a special header line, with the parent article's message-id written into it. If it's a new thread, the header line is empty. Such a message-id is simple to generate, but it demands large resources when the threaded hierarchy is built later.

I wanted a simpler, less demanding solution, allowing rapid threading of hundreds of articles. I therefore chose to connect the message-id to the filename, so they reflected each other and thus simplified threading.

I chose to create a hierarchical numbering, in which a first post had a single number, say, 7. Then replies would have numbers 7.1, 7.2, replies to replies 7.2.1 and 7.2.2 and so forth. This gave me the same well-defined structure as an outline.

To separate the groups, I gave every group a beginning character from A to Z. It limits the present version to 26 groups, but that can easily be changed by using double letters or some other pattern. By combining the group char and the number and adding a suffix [.html], I had a filename I could use. However, since I started the design under MS-DOS with MKS Toolkit, I was faced with the limitations of MS-DOS filenames (e.g., only one dot allowed in the filename). I substituted all but the suffix dot with an underscore, so in reality the filenames look like B7_2_3.html instead of B7.2.3.html, while the message-id still says B7.2.3. This is only a minor inconvenience, though. The MS-DOS limit of eight characters in the leading portion of the filename also limits the number of postings that the system is capable of handling.

Note that msg-id is the message-id for the parent posting. When a discussion participant posts a new thread or a reply, the message-id for the new posting is unknown. lnews has to calculate it when the post arrives by adding one to the last discussion thread number at the appropriate group level. With the help of find, I retrieve the last valid post at that level by extracting the parent posting's message-id from the reply and using it as the value for the search. The find action looks like this:

find dir \( -name "msg-id.html" -o -name "msg-id_*.html" -a \
     ! -name "msg-id_*_*.html" \) -print | tail -1

If this is a new thread, the parent-id is always 0, forcing a special find search that returns the last main-level number. Otherwise, the program looks for the parent-id and for already processed replies at that level, but not in the underlying levels. The ! -name directive excludes those deeper-level names, so they don't interfere. The program then reads the last name found, which is used to calculate the new message-id.

Choosing Perl for this function would have simplified the calculation and string handling. But, I wanted to try standard UNIX functions, so I used find, awk, and expr.
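To illustrate the idea, a minimal sketch of the calculation, with assumed variable names and not the exact code in Listing 1, could look like this. The new-thread case (parent-id 0) needs its own search and is not shown, and like the article's approach it simply takes the last name find returns:

PARENT=B7.2                                    # parent message-id from the reply (example)
BASE=`echo "$PARENT" | tr '.' '_'`             # B7.2 -> B7_2, the filename form

LAST=`find "$WEBDIR/$GRP" \( -name "$BASE.html" -o -name "${BASE}_*.html" -a \
      ! -name "${BASE}_*_*.html" \) -print | tail -1`

LASTBASE=`basename "$LAST" .html`
if [ "$LASTBASE" = "$BASE" ]
then
    NEXT=1                                     # only the parent exists: first reply
else
    N=`echo "$LASTBASE" | nawk -F_ '{ print $NF }'`   # last sub-number used so far
    NEXT=`expr $N + 1`
fi
MSGID="$PARENT.$NEXT"                          # e.g., B7.2.3 if B7_2_2.html was last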

Building the Resulting HTML Page

I wanted to find a style for the HTML page that was as simple and easy to read as possible. I dislike not seeing either the top or the bottom of long Web pages, so I used a Web form, which allowed the header and footer to be visible at the same time. The text was in a 14-line scrollable area in the middle of the page. The buttons in the footer made it easy to skip dull postings, and the small scrollable area was not too uncomfortable.

But, using a form for reading also meant that the user was always in posting mode. I showed this to some colleagues, and they felt that being in reading as well as posting mode at the same time was confusing. So, I separated the text into a Netscape v.1.x Netnews-like reading page and a form-based posting page. The latter was actually the original page, minus some buttons.

I generate the reading page with the suffix .html and the posting page with .xhtml. This means there are two pages for every post, which is both a disadvantage and an advantage. The disadvantage is the increased disk usage - having two pages stored. The advantage is that while the posting form could be generated "on the fly" using CGI, this solution gives a faster response. I have found that most CGI-generators are quite slow. Given present economical disk prices, the additional storage may be more a matter of taste than a real design problem. Java or Inferno might handle this issue more effectively.
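As a rough illustration of the layout, and not the markup actually produced by Listing 1, the generated posting page is essentially a skeleton like the one below. The mail address, the field names, and the variable names are assumptions; the real pages also carry the JPEG buttons described in the next section:

cat <<EOF > "$WEBDIR/$GRP/${NEWNAME}.xhtml"
<HTML><HEAD><TITLE>$SUBJECT</TITLE></HEAD><BODY>
<B>From:</B> $POSTER<BR>
<B>Subject:</B> $SUBJECT<P>
<FORM METHOD=POST ACTION="mailto:lnews@some.host">
<INPUT TYPE=hidden NAME=msgid VALUE="$MSGID">
Subject: <INPUT TYPE=text NAME=subject SIZE=40 VALUE="Re: $SUBJECT"><BR>
<TEXTAREA NAME=letter ROWS=14 COLS=72></TEXTAREA><P>
<INPUT TYPE=submit VALUE="Post reply">
<A HREF="$HTTPTAG/index.html">Back to index</A>
</FORM>
</BODY></HTML>
EOF

The hidden msgid field is what carries the parent message-id back in the mailed reply, so the filter can thread the new post.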

Page Layout

The design of the pages was done by making a simulation of what I wanted with a Web editor. When I had the correct layout, I took the HTML code for the page and transformed it into print statements in the two awk scripts generating the pages. I also experimented with the Netnews buttons in Netscape, but because the newsreader is a proprietary addition to Netscape, I decided not to use them. Instead, I created some submit buttons in a form, with the actions I wanted written on them. I then took a screenshot, cutting out the different buttons into JPEG files. I have had some problems with alignment on the form page and still need to clean up the JPEGs a bit, but otherwise this method worked.

Lastly, I had to import some variables, among them the shell variables I saved in the beginning. Because I originally had some problems transferring shell variables into awk when using MKS Toolkit, I saved these in a temporary data file. This file is read by the subsequent awk scripts, which assign the data to the appropriate variables in the scripts. For portability, I have retained this construction. The variables are the poster's mail address and name, the subject, the calculated message-id for the post, the http-tag for the conference main directory (used in the button HREFs), and the mailing address for the conference maintainer.
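A minimal sketch of that construction, with assumed variable names and greatly simplified page markup compared to Listing 1, might be the following. The From: and Subject: lines are kept in plain form because the index generation later greps for them:

cat > /tmp/lnews.$$ <<EOF
$MAILADDR
$POSTER
$SUBJECT
$MSGID
$HTTPTAG
$MAINTAINER
EOF

# The data file is read first (NR == FNR), the letter text second.
nawk 'NR == FNR { var[FNR] = $0; next }
      FNR == 1  { printf "<HTML><HEAD><TITLE>%s</TITLE></HEAD><BODY><PRE>\n", var[3]
                  printf "From: %s &lt;%s&gt;\n", var[2], var[1]
                  printf "Subject: %s\n\n", var[3] }
      { print }
      END       { printf "</PRE><P>\n<A HREF=\"%s/index.html\">Index</A>\n", var[5]
                  printf "<ADDRESS>%s</ADDRESS></BODY></HTML>\n", var[6] }' \
    /tmp/lnews.$$ letter.tmp > "$WEBDIR/$GRP/$NEWNAME.html"
rm /tmp/lnews.$$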

I used the http-tag because I felt it would be too much work to change the value in every tag in the lnews script when porting it. By using shell variables, the value needs to be changed in only one place in the script; site-specific data is always assigned in a single place.

Making a Threaded Index

After using tin to read news for some years, I am very fond of getting related postings grouped together. When I saw how the Netscape designers solved this for their newsreader in version 1.x, I felt it was a natural evolution from tin. Although tin's paged approach is neater, with the command menu always available at the bottom, the Netscape index gave a good overview.

With the numbering scheme discussed earlier, I had a lot of the hierarchical structure already in place. But I still needed the program to sort the major numbers in reverse (the highest on top), so the newest post would be seen first, as in Netscape. However, a reverse sort would put the replies before the first post in the thread, and I didn't want that. The solution proved to be simple enough, as shown in the following code:

find "/$WEBDIR/$GRP" -type f -name "*.html" -print |
nawk 'BEGIN{ FS="/" }{print substr($NF,2)}' |
sort +0nr |
nawk '{print "'$WEBDIR'/'$GRP'/'$GRP'"$0 }' |
xargs egrep "From: |Subject: " |
nawk '{ .....

First, the program lists the .html files in the directory where the particular conference resides. It then filters out the path and the conference character. Then, the list is sorted in reverse on the first position and normally on the following positions. The full path is then reconstructed. When the sorted list is in order, the program retrieves the poster's name and the subject from each post and makes an index line that is concatenated onto the new index page. A header and a footer are then added to the page. This index page is reconstructed for every new post to the particular conference.
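The elided nawk stage above does that last step. A minimal sketch of it, not the Listing 1 code and with simplified markup, could look like this; it assumes the From: line precedes the Subject: line in each page, since egrep preserves their order, and counts the underscores in the filename to find the reply depth:

nawk '
/:From: /    { file = $0; sub(/:From: .*/, "", file)      # path part of the grep line
               from = $0; sub(/.*:From: /, "", from) }    # poster name
/:Subject: / { subj = $0; sub(/.*:Subject: /, "", subj)
               depth = gsub("_", "_", file)               # underscores = reply depth
               ind = ""
               for (i = 0; i < depth; i++) ind = ind "&nbsp;&nbsp;"
               printf "%s<A HREF=\"%s\">%s</A> - %s<BR>\n", ind, file, subj, from }
'

The header and footer mentioned above are then wrapped around this output to form the finished index page.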

Administration and Problems

For now, the script runs as a mail filter, activated every time a post is delivered to the mailbox. The alias file contains the line: lnews: "| /usr/local/bin/lnews". I have also tried running this under cron, but I prefer the mail filter approach. However, if there are many posts, the cron solution might ease the system load somewhat and allow queueing as well.

Note that with some mailers, the process of checking the mailbox with cron leaves status letters for the script owner. Therefore, run the script under another UID, so these messages don't foul up the script. Also remember to remove the mailbox for that UID, so you don't run out of disk space, since the script needs to be activated every minute.

Site-specific information is stored at the beginning of the shell script, separated into site dependencies and OS dependencies. Some changes may be required in these sections to reflect the platform the script is running on. And, a word of caution: the find code in the message-id generation was created under Solaris 2.x and has only been tested under Solaris 2.x and MKS. The same goes for the threading function in the index generation. The sort +0nr used behaves as described under Solaris, but not under MKS Toolkit or Linux. By including the conference letter and having a small awk script do an ls of each main thread, it is possible to mimic the Solaris behavior. An example can be found at my Web page http://www.abc.se/~m8827/prog.htm.

Security

I use a mail filter, handling the headers and text only with basic UNIX tools, and I have never heard of anyone being able to break these in this context. The main risk command, expr, is never in direct contact with user input. Therefore, most of the risks associated with CGI scripts and the Internet should be neutralized. I would be happy to hear from readers if they see security holes in this approach.

Conclusion

When I started the project, my objective was to test the limits of Web scripts. That exploration showed that there is much to be done in the way of adding core functionality to Web scripting. Although the resulting software is functional, with mail filtering I cannot deliver specific information to a specific user, as I could with CGI. But much of the processing done at a Web site can probably be done this way. For example, a company or organization low on funds could rent some Web pages at a Web hotel, putting up forms that direct the mail back to a small Linux box at the office. A mail filter like this would then take appropriate action, maybe even mailing back requested information. It could also sit on a dial-up connection, serving customers/members just like a fancy, direct-connected Web server.

About the Author

Lars has a B.Sc. in Geology from Gothenburg University in Sweden (1980) and an M.Sc. in Petroleum Exploration from Chalmers University of Technology, also in Sweden (1984). He has worked with mining geology/geophysics in Sweden and Greenland, but since the mid-1980s he has worked as a sysadmin, system manager, and teacher in Sweden, Greenland, and Denmark. At present, he is working as a consultant within the areas of UNIX, the Internet, and email.