Chapter 20 -- Gateway Programming Fundamentals
CONTENTS
Multipurpose Internet Mail Extensions (MIME) in the CGI Environment
HTTP Information about the Server That Does Not Depend on the Client Request
From Client to Server to Gateway and Back
How the Server Passes the Data to the Gateway Program
Code Sample: The Print Everything Script
Manipulating the Client Data with the Bourne Shell
Manipulating the Client Data with Perl
To Imagemap or Not to Imagemap
An Integrated E-Mail Gateway Application
Discussion of the Resumé Application
Extending the Transaction: Serial Transmission of Many Data Files in One Transaction
Gateway Programming Fundamentals Check
Chapter 19 laid the groundwork for practical
CGI programming. Now it is time to focus on the essentials of
gateway web development: how to use CGI environment variables
and how to manipulate standard input to receive and process the
client request. The goal, in broad terms, is to create a CGI program
that builds a response and prefaces it with a necessary MIME header.
This response is highly flexible; it can be HTML, another data
type, or it might build another form for the client to fill out.
Recall the thematic HyperText Transfer Protocol elements of openness
and extensibility throughout the discussion.
Perl and the Bourne Shell are used to explain the fundamentals
of environment variables, MIME types, and data-passing methods.
I then present practical Perl and Bourne Shell scripts to illustrate
these points.
Multipurpose Internet Mail Extensions (MIME) in the CGI Environment
The novice web developer's bane is failing to pay attention
to the strict MIME requirements that HTTP imposes on the client
request-server response cycle.
When a client request arrives via a METHOD=GET
or METHOD=POST (refer to
Chapter 19 for introductory remarks on
these methods) and a CGI program executes to fulfill the request,
data of one form or another is written to standard out (stdout-the
terminal screen if the program is run as a stand-alone program)
and then sent by the server to the client. The very first print
statement must output a string of this form:
Content-type: type/subtype <line
feed> <line feed>
Perl uses \n as the line
feed escape sequence, for example, and therefore must start output
of plain text or HTML with this statement:
print "Content-type: text/html\n\n";
The type text refers to the
standard set of printable characters; historically, the subtype
plain is defined. On the
Web, html is an additional
subtype-plain text with HTML formatting tags added. Web clients
can handle the formatting of HTML directly.
Note that the first \n escape
ends the Content-type line itself, and the second \n
escape produces the completely blank second line that the protocol
requires.
The next little "Hello, World!" Bourne Shell script
demonstrates the MIME header requirements without the benefit
of a Perl \n escape sequence.
In the Bourne Shell, a bare echo
statement is the brute-force way to produce the line feeds:
#!/bin/sh
echo "Content-type: text/html"
echo
echo "<HTML>"
echo "<HEAD><TITLE>Hello</TITLE></HEAD>"
echo "<BODY>Hello, World!</BODY></HTML>"
Caution
If the second line of the CGI script's output is not completely blank, the server cannot parse the header and the request fails. If the developer is confronted by code that is syntactically correct, runs on the command line, but dies swiftly and mysteriously in a Web environment, a
malformed MIME header might be the culprit. See "Code Debugging," later in this chapter, for more details.
Consider the following equivalent two-line code in the Bourne
Shell:
echo "Content-type: text/html"
echo
Again, the second line of output is blank.
Tip
Be aware of standard Perl toolkits that web developers can take advantage of. The NCSA httpd server distribution includes the useful cgi-handlers.pl (see note), which includes the following html_header
subroutine to ensure a proper MIME header:
#
# from the cgi-handlers.pl package
#
sub html_header {
    local($title) = @_;
    print "Content-type: text/html\n\n";
    print "<html><head>\n";
    print "<title>$title</title>\n";
    print "</head>\n<body>\n";
}
This handy subroutine accepts an argument that forms the title of the HTML response, outputs the required MIME header, inserts the title within the HTML <head> and </head> tags, and then outputs the HTML tag
<body>. The body of the response follows.
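A Bourne Shell counterpart is easy to write as well. The following sketch (a hypothetical html_header function of my own, not part of any NCSA distribution) emits the same header, the mandatory blank line, and the opening boilerplate:

```shell
# html_header: emit the MIME header, the required blank line, and the
# opening HTML boilerplate; $1 supplies the page title
html_header() {
    echo "Content-type: text/html"
    echo
    echo "<html><head>"
    echo "<title>$1</title>"
    echo "</head>"
    echo "<body>"
}
```

A script would call html_header "Some Title" as its very first output and then print the body of the response.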
Other important type/subtype pairs are worth mentioning: image/gif
is decoded inline by all Web graphical clients; image/jpeg
is not universally decoded inline. In UNIX, the important file
~/.mailcap (the ~/
prefix means that this file is in the user's home directory) is
the map between MIME extensions and external executable files
that can handle the corresponding multimedia extension. Here is
a sample ~/.mailcap file:
audio/*; showaudio %s
video/mpeg; mpeg_play %s
image/*; xv %s
application/x-dvi; xdvi %s
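A rough shell sketch of the lookup a client performs against this file (my illustration; real browsers are more elaborate): scan for the first entry whose type field matches the MIME type exactly or via a type/* wildcard, and return the viewer command.

```shell
# mailcap_viewer: print the viewer command for a MIME type.
# $1 = MIME type (for example, image/jpeg), $2 = mailcap-style file
mailcap_viewer() {
    mime="$1"
    wild=$(printf '%s' "$mime" | sed 's;/.*;/*;')   # image/jpeg -> image/*
    awk -F';' -v t="$mime" -v w="$wild" '
        $1 == t || $1 == w { sub(/^ */, "", $2); print $2; exit }
    ' "$2"
}
```

Given the sample ~/.mailcap shown previously, mailcap_viewer image/jpeg ~/.mailcap prints xv %s; the client then substitutes the downloaded file name for %s before spawning the viewer.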
If the image/jpeg is not
decodable inline by the client directly, for example, the .mailcap
file is referenced. The line starting with image/*
is found and the corresponding viewer, xv, is spawned with the
file name as its argument. Note that external viewers spawn processes
that are independent from the client Web session. After the Web
session terminates, external viewers still may be active. Plugins,
popularized by Netscape in 1996, are quite different animals than
external viewers. Adobe, for example, makes external viewers (Acrobat
Reader and Acrobat Exchange, for example) for Portable Document
Format (PDF) files, and starting with its Amber (Version 3.x)
product line, its viewer now is integrated with the Netscape browser
as a plugin. Plugin viewers extend the browser's functionality;
in the case of Amber, PDF files can be viewed inline (integrated
in the browser window with an additional Adobe toolbar). Plugins
are not defined in the ~/.mailcap
file; they ship with their own idiosyncratic installation instructions.
The CGI programmer must have a good understanding of the set of
available environment variables.(See note)
When the client sends a request and the gateway program executes,
the CGI programmer has access to the full set of environmental
variables. These variables fall into two broad categories:
The first type is independent of the client request and has the
same value no matter what the request. These values are properties
of the server that also are known as server metainformation.
The second type does depend on the client request. Most of these
are client-specific, but some do depend on the server to which
the request is being sent.
CGI programs sometimes rely on the contents of some of these variables
to fulfill the client request. Other variables are not essential
to logical processing but can be manipulated and echoed back to
the user for cosmetic or informational reasons. Examples of both
scenarios are given in this chapter. In bimodal.pl,
the variable $ENV{'HTTP_USER_AGENT'}
is queried to determine the interface type. After that, I illustrate
environmental variables serving a useful purpose in a Perl-to-e-mail
gateway.
Here are important examples of both types. You can look on-line
for a discussion of the full range of environmental variables;
the definitions that follow also come from the on-line NCSA documentation.
HTTP Information about the Server That Does Not Depend on the Client Request
AUTH_TYPE If
the server supports user authentication and the script is protected,
this is the protocol-specific authentication method used to validate
the user.
CONTENT_LENGTH The
length of data buffer sent by the client. The CGI script reads
the input buffer and uses the CONTENT_LENGTH
to cut off the data stream at the appropriate point.
CONTENT_TYPE For
queries that have attached information, such as HTTP
POST and PUT,
this is the content type of the data.
GATEWAY_INTERFACE The
server CGI type and revision level. Format: CGI/revision.
HTTP_USER_AGENT The
browser that the client is using to send the request. General
format: software/version library/version.
PATH_INFO As
you saw in Chapter 19, extra path information
can be communicated by the client as follows:
METHOD=GET(POST) ACTION=http://machine/path/progname/extra-path-info
The extra information is sent as PATH_INFO.
PATH_TRANSLATED The
server takes the virtual path represented in PATH_INFO
and translates it to a physical path.
QUERY_STRING The
information that follows the ? in the URL that referenced this
script. This variable was introduced in Chapter 19
as a technique to pass data to the CGI program.
REMOTE_ADDR The
IP address of the client.
REMOTE_HOST The
client host name. If the server does not have this information,
it should set REMOTE_ADDR
and leave this unset.
REMOTE_IDENT If
the HTTP server supports RFC 931 identification, this variable
is set to the remote user name retrieved from the client's machine. Use
of this variable should be limited to writing to the log file only
(be careful not to compromise unwittingly the privacy of the user).
Caution
It is very dangerous, for performance reasons, for the web server administrator to turn on RFC 931 identification, also known as ident. Granted, developers and administrators often are curious about identifying users accessing the web site. Ident adds
an extra preliminary chat step between client and server, however, and only if the client machine is also running ident is the user ID identified. Empirically, this occurred on the EDGAR server for less than 10 percent of the accesses in July
and August 1994. Worse, according to Rob McCool (formerly of NCSA Mosaic's development team, now at Netscape Communications Corporation), the use of ident on the server side can cause great headaches for clients hiding behind corporate firewalls.
The preliminary conversation, in which the server queries the firewall in an attempt to identify the client, confuses and even might hang those clients. By way of anecdotal evidence, I noticed during my reign as Web master at the NYU EDGAR development
site that several large corporate clients did suffer inexplicable delays when my server's ident was on.
REMOTE_USER If
the server supports user authentication and the script is protected,
this is the authenticated user name.
REQUEST_METHOD The
method of the request. An HTML form uses a METHOD=GET
or a METHOD=POST, and these two
are the methods that CGI programs most commonly face.
SCRIPT_NAME A
virtual path to the script being executed, used for self-referencing
URLs such as ISINDEX queries.
SERVER_NAME The
server's host name, DNS alias, or IP address.
SERVER_PORT The
port number to which the client request was sent. Recall that
port 80 is the HTTP standard.
SERVER_PROTOCOL The
protocol that the client request is using: HTTP 1.0 or the more
recent HTTP 1.1. Format: protocol/revision.
SERVER_SOFTWARE The
name and version of the Web server. Format: name/version.
The test-cgi Bourne Shell
script from NCSA displays some of these variables, as shown in
Listing 20.1.
Listing 20.1. The NCSA test-cgi
Bourne Shell script.
#!/bin/sh
echo Content-type: text/plain
echo
echo CGI/1.0 test script report:
echo
echo argc is $#. argv is "$*".
echo
echo SERVER_SOFTWARE = $SERVER_SOFTWARE
echo SERVER_NAME = $SERVER_NAME
echo GATEWAY_INTERFACE = $GATEWAY_INTERFACE
echo SERVER_PROTOCOL = $SERVER_PROTOCOL
echo SERVER_PORT = $SERVER_PORT
echo REQUEST_METHOD = $REQUEST_METHOD
echo HTTP_ACCEPT = $HTTP_ACCEPT
echo PATH_INFO = $PATH_INFO
echo PATH_TRANSLATED = $PATH_TRANSLATED
echo SCRIPT_NAME = $SCRIPT_NAME
echo QUERY_STRING = $QUERY_STRING
echo REMOTE_HOST = $REMOTE_HOST
echo REMOTE_ADDR = $REMOTE_ADDR
echo REMOTE_USER = $REMOTE_USER
echo CONTENT_TYPE = $CONTENT_TYPE
echo CONTENT_LENGTH = $CONTENT_LENGTH
Figure 20.1 shows the result of the test-cgi
environmental variable report.
Figure 20.1 : Sample output from NCSA's test-cgi Bourne Shell script.
Server-side includes (SSIs) use special extensions to HTML
tagging.(See note) SSI
files look like HTML; they use the HTML tagging conventions. They
are not quite the same as regular HTML files, however. I mention
them here because they make interesting use of a superset of CGI
environmental variables. They aren't strictly part of CGI programming,
because HTML document preparers can use them without interfacing
with a gateway program.
The best way to understand SSI directives is to look at a simple
example of the SSI tags, tools.shtml,
as shown in Listing 20.2.
Listing 20.2. The tools.shtml
code.
<title> Filing Retrieval Tools
</title>
<A HREF="http://edgar.stern.nyu.edu/formco_array.html">
<h2> Company Search </a></h2>
<A HREF="http://edgar.stern.nyu.edu/formlynx.html">
<h2> Company and Filing Type Search </a></h2>
<A HREF="http://edgar.stern.nyu.edu/formonly.html">
<h2>Form ONLY! Lookup</A></h2>
<A HREF="http://edgar.stern.nyu.edu/form2date.html">
<h2>Form and Date Range Lookup </A></h2>
<A HREF="http://edgar.stern.nyu.edu/current.html">
<h2> Current Filing Analysis </a> </h2>
<A HREF="http://edgar.stern.nyu.edu/mutual.html">
<h2> Mutual Funds Retrieval </a></h2>
<A HREF="http://edgar.stern.nyu.edu/EDGAR.html">
<img src="http://edgar.stern.nyu.edu/icons/back.gif">
Return to Home Page</a>
This toolkit was last modified on <!--#echo var="LAST_MODIFIED" -->
<!--#include virtual="/mgtest/" file="included.html" -->
Note that the document in Listing 20.2 has the odd extension of
shtml. This is because my
server is configured to recognize shtml
as a file containing SSI tags. When my server receives a request
to show a file with SSI directives, it must parse the document
into HTML; only then is it returned to the client. Thus, the parsing
represents a performance hit that the client must suffer. The
upside is that the included information is dropped in on-the-fly
at request time. The web developer should note that the Web master
must take the necessary steps beforehand to configure the server
to understand SSIs (enabling them in selected directories and
defining a magic extension such as *.shtml
that alerts the server to expect the extension tags). It would
be a poor idea to enable SSIs on all *.html
files because the server would have to parse every *.html
file served (a big performance hit).
What does the tools.shtml
file do? Before the server returns this document to the client,
it parses the SSI directives. There are two such directives in
Listing 20.2. The first,
<!--#echo var="LAST_MODIFIED" -->
instructs the server to resolve the variable LAST_MODIFIED
and echo it in place. The second,
<!--#include virtual="/mgtest/" file="included.html" -->
is a directive to the server to include the file included.html
in the HTML output, and the virtual
tag tells the server that the directory alias mgtest
should be suffixed to the document root.
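As a rough illustration of what the server's SSI parser does (a toy sketch of my own, not NCSA's actual implementation), the echo and include directives can be expanded with ordinary text tools:

```shell
# ssi_expand: a toy SSI parser. Reads an .shtml document on stdin and
# writes HTML on stdout, expanding two directives:
#   <!--#echo var="LAST_MODIFIED" -->   (value passed in as $1)
#   <!--#include ... file="NAME" -->    (file inlined from directory $2)
ssi_expand() {
    lastmod="$1"
    dir="$2"
    while IFS= read -r line; do
        case "$line" in
            *'<!--#include'*)
                # pull out the file="..." attribute and inline that file
                f=$(printf '%s' "$line" | sed 's/.*file="\([^"]*\)".*/\1/')
                cat "$dir/$f"
                ;;
            *)
                printf '%s\n' "$line" |
                    sed "s;<!--#echo var=\"LAST_MODIFIED\" *-->;$lastmod;"
                ;;
        esac
    done
}
```

A real server also resolves the virtual path, handles flastmod and fsize, and so on; the point is simply that the document is rewritten before it ever leaves the server.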
Figure 20.2 shows the client's view of tools.shtml after it is
parsed by the server.
Figure 20.2 : The client requests tools.shtml; the server parses the server-side includes and returns HTML.
It is possible to include, at request time, other information,
such as a file size (substitute #fsize
for #include in Listing 20.2).
The following variables (not part of the core set of CGI environment
variables) also are available to be displayed via the echo
directive:
DATE_GMT The
current date using Greenwich Mean Time.
DATE_LOCAL The
current date using the local time zone.
DOCUMENT_NAME The
current file name.
DOCUMENT_URI The
virtual path to the document (starting from the server's document
root).
LAST_MODIFIED The
last date and time that the current file was "touched."
If you want to display the modification date of included.html,
for example, the following directive would do the trick:
<!--#flastmod virtual="/mgtest/" file="included.html" -->
QUERY_STRING_UNESCAPED The
unescaped QUERY_STRING environment
variable sent by the client.
Caution
Server-side includes can be very dangerous. If the Web master defines html as the SSI extension, every HTML file will be parsed prior to returning to the client-a huge performance hit. SSIs pose no special security risk (no more so than CGI
scripts, as long as the site administrators are aware that non-traditional CGI directories now are launching CGI scripts), but you must consider their potential to drag down the site's performance before you use them.
Another (rather improbable) danger is the infinite loop. If I construct a file (let's call it loop.shtml), and somewhere in that file, include the line
<!--#include virtual="/mgtest/" file="loop.shtml" -->
the file loop.shtml is dropped in within loop.shtml, again and again, ad infinitum-a recursive loop.
The web developer should make an independent judgment when weighing the performance loss of SSIs against the utility of showing useful information such as the file modification.
Perl, C-Shell, Bourne Shell, and other UNIX command shells are
all interpreted scripting languages. They generally start with
#!<path>/<binary-executable>
If there is uncertainty about where the interpreter (for example,
Perl) resides, the following UNIX command will locate it:
which perl
Perl often is installed by the superuser in the /usr/local/bin
directory. Thus, Perl programs at many installations start with
#!/usr/local/bin/perl
and shell programs usually start with
#!/bin/sh
Thereafter, the scripts are checked one line at a time by the
interpreter for syntactic correctness. They run slower than compiled
code (for example, C or C++), but if the underlying data is well
organized, even multimegabyte datastores can be managed effectively.
Caution
The web developer must know how the Web site administrator has configured the server's capability to execute CGI scripts. Only a few directories are eligible to run CGI scripts; alternatively, the server might allow CGI programs to be in all the
HTML
directories. In other words, it is insufficient to turn on the execute bits in UNIX, check the syntax, and hope that the script runs. If a script is in an invalid location, the server might output an Authorization Failed message or, worse, it
might die silently. Furthermore, the file extension often is critical. It is a common configuration of NCSA servers to recognize extensions of *.csh (C-Shell), *.pl (Perl), *.sh (Bourne Shell), and *.cgi (generic CGI
scripts) as legitimate CGI scripts. Some servers-for example, Netscape's-default to allowing only *.cgi as an executable extension. This is another argument to (1) make friends with your system administrator, and (2) avoid oddball script file
extensions.
In gateway programming, it is easy to envision the script returning
simple lines of formatted output in response to a client's data
request. The reader should keep in mind, however, that scripts
just as easily can output valid HTML that the server will return
to the client. A client therefore can go directly to the URL of
a gateway program, which then executes and displays HTML on the
client screen. This might be a form that posts data to yet another
script (I demonstrate this technique in Chapter 21's
discussion of the company-stock ticker application). Or, the script
program is gathering important information about the client and
outputs the appropriate HTML, as I show later in this chapter
with the bimodal.pl example.
Although Perl or C generally are the languages of choice for a
budding developer, some people might not have access to Perl or
might find C difficult to learn.
To further demonstrate the basics of the various methods of sending
and receiving data between the client and CGI program, I start
with simple Bourne Shell examples. The Bourne Shell, sh,
is available on all UNIX boxes (well, it should be!), and
these examples easily are adaptable to almost any other environment
that has a batch command-line processing language and/or a shell
with environment variables.
From Client to Server to Gateway and Back
A developer needs to understand three areas in client-to-server-to-gateway
communication:
How a client can send data
How the server can pass that data to the gateway program
How the gateway can send data back to the server and then back
to the client
The two basic means for the client to send data through the server
to the gateway program are via the URL and the message body (in
a METHOD=POST form). It is
much more common for the client to use METHOD=POST
but it is important that the web developer be familiar with all
the routes. Passing data via the URL sometimes is necessary (in
ISINDEX keyword searches)
and sometimes a good idea, perhaps even with METHOD=POST.
To send data via the message body, use a form with METHOD=POST.
This passes the data to the gateway program via the program's
stdin. The CONTENT_LENGTH
environment variable is set to the number of characters being
sent; the CONTENT_TYPE variable
is set to application/x-www-form-urlencoded.
Passing data via the URL has several variations:
A URL with ?[field]=[value]+[field]=[value]
such
as
http://www.some.box/cgi-bin/name.pl?FirstName=Bill+SecondName=Elmer
is equivalent to the browser sending data to the server via a
form and the METHOD=GET request,
because the equal signs are unencoded. An encoded = sign is the
character string %3D; the
hexadecimal representation for the = character is 3D.
A URL with ?[data]
containing no unencoded = characters Even if there
are encoded = characters (that is, %3D
in the URL), the server treats this as an ISINDEX
query. For example,
http://www.hydra.com/cgi-bin/sams/nothing.pl?20
http://www.hydra.com/cgi-bin/sams/nothing.pl?chapter%3D20
both are treated as ISINDEX
queries. Recall that an ISINDEX
query usually is a keyword search using a text engine such as
WAIS or freeWAIS; the general form of this request follows:
http://machine/path/text-gateway-script.pl?keyword1+keyword2+keyword3+...
Note that the ISINDEX query
passes data via the command line. Unlike other methods of passing
data, ISINDEX data is not
encoded by the server before it is passed to the gateway program.
No special decoding is necessary. Note that the + character, separating
the keywords, was not encoded into its hexadecimal equivalent
of %2B.
Tip
Although it is possible to create an HTML file with the <isindex> tag, there is no point; it will do nothing because an ISINDEX query is self-referencing (it calls itself). In other words, an ISINDEX screen should be
generated by the script that also includes the code to perform the query.
A URL with extra path data With
this method, immediately following the gateway program name, information
is appended in the format of a data path:
http://www.hydra.com/cgi-bin/sams/nothing.pl/Bill/Elmer/
After the server finds the gateway program, it puts everything
that follows into the PATH_INFO
environment variable. With the preceding URL, PATH_INFO
contains /Bill/Elmer/.
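The self-referencing ISINDEX pattern described in the Tip earlier can be sketched in a few lines of Bourne Shell. This is a skeleton only (the real work, such as a WAIS lookup, would replace the echo of the keywords, and the function wrapper is mine, for illustration):

```shell
#!/bin/sh
# isindex_response: with no arguments, emit the <ISINDEX> prompt page;
# with arguments (the decoded keywords), emit the "results"
isindex_response() {
    echo "Content-type: text/html"
    echo
    if [ $# -eq 0 ]; then
        echo "<HEAD><TITLE>Keyword Search</TITLE><ISINDEX></HEAD>"
    else
        echo "<BODY>You searched for: $*</BODY>"
    fi
}
isindex_response "$@"
```

The first request (no keywords) returns the prompt screen; when the user enters keywords, the client requests the same URL with ?keyword1+keyword2 appended, and the server re-runs the script with the keywords on its command line.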
Before the Server Passes the Data: Encoding
With the exception of ISINDEX,
the data first is encoded by the server: spaces are changed to
plus signs (+), certain keyboard
characters are translated to their hexadecimal equivalent (represented
as %[hex
equivalent]) (for example, a !
becomes %21), and fields
within forms are concatenated with &.
As an example, if a form contains:
Field 1<INPUT NAME=FIELD1> Field
2<INPUT NAME=FIELD2>
and data such as 1 !@#$%
and 2 ^&*()_+| are input
for fields one and two, respectively, the server encodes the data
into the following string:
FIELD1=1+%21@%23%24%25&FIELD2=2+%5E%26*%28%29_%2B%7C
Notice that
The fields are separated by the unencoded &.
With each field, an unencoded =
separates the field name input form and the data.
Spaces within the field data are translated to +.
Certain other keyboard characters are encoded, as mentioned, to
%[hex equivalent].
The protocol designers decided to use readable characters only
in the encoding scheme for clarity and ease of use; no high-end
ASCII (unprintable) characters can appear.
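Going the other way, a gateway program must undo this encoding before it can use the data. Here is a compact shell sketch of the decoding step (my own portable take; it converts each %XX escape through its octal form so that a plain POSIX printf can emit the character, and it ignores malformed input):

```shell
# url_decode: turn '+' back into spaces, then each %XX hex escape back
# into the character it represents
url_decode() {
    s=$(printf '%s' "$1" | sed 's/+/ /g')
    out=""
    while [ -n "$s" ]; do
        c=${s%"${s#?}"}            # peel off the first character
        s=${s#?}
        if [ "$c" = "%" ]; then
            hex=${s%"${s#??}"}     # the two hex digits after the %
            s=${s#??}
            c=$(printf "\\$(printf '%03o' "0x$hex")")
        fi
        out="$out$c"
    done
    printf '%s\n' "$out"
}
```

For example, url_decode 'FIELD1=1+%21' yields FIELD1=1 !. Note that splitting on the unencoded & and = separators must happen before decoding; otherwise, decoded data that happens to contain those characters would corrupt the field boundaries.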
How the Server Passes the Data to the Gateway Program
After the server receives the data, it has three ways to send
that data to the gateway program:
Via the gateway program's stdin. If the REQUEST_METHOD
is POST, the data, encoded
as described previously, is sent to the gateway program's stdin.
In a UNIX Shell, you can simulate this on the
command line by creating a file with the data and running the
script like this:
$ test-cgi.sh < test.data
It is important to note that there is no end-of-file terminating
the data. The CONTENT_LENGTH
variable is set to the number of characters in the data stream,
and the script must include code to read only that amount of data
from the stdin datastream. In Perl, the statement
read(STDIN, $input_line, $ENV{'CONTENT_LENGTH'});
properly reads exactly that much stdin data into the variable $input_line.
Via the command line. When the REQUEST_METHOD
is GET and the server recognizes
the incoming data as an ISINDEX
query, the server passes the data onto the gateway program as
command-line arguments without encoding the data. This is the
same as running the script on the shell command line as the following:
$ test-cgi.sh arg1 arg2 arg3 . . .
Via the server's environment variables. Recall
the discussion of environment variables at the start of this chapter.
Any variables set by the client also are passed along by the server
to the gateway program. To test the script on the command line
with environment variables, the variables first must be set. How
this is done depends on the type of shell being used. In the Bourne
Shell, for example,
$ QUERY_STRING='FNAME=foo&LNAME=bar'
$ export QUERY_STRING
$ echo $QUERY_STRING
FNAME=foo&LNAME=bar
sets the QUERY_STRING variable
to FNAME=foo&LNAME=bar
for testing with a script. Note that a user, when sending a browser
to a URL of the form http://www.some.box/cgi-bin/test.pl?foo,
is setting the QUERY_STRING
variable to foo. Similarly,
the URL http://www.some.box/cgi-bin/test.pl/foo
sets the PATH_INFO variable
to foo. Often, the developer
will test GET methods via
a browser instead of operating on the command line.
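Putting the stdin pieces together, a POST can be simulated end to end on the command line. This sketch (the file name and data are of course arbitrary) writes encoded form data to a scratch file, sets the variables the server would set, and then reads exactly CONTENT_LENGTH bytes from stdin with dd, because there is no end-of-file marker to rely on:

```shell
# fabricate the data the browser would send (no trailing newline)
printf 'FIELD1=foo&FIELD2=bar' > /tmp/test.data

# set the environment the server would provide to the gateway
CONTENT_LENGTH=$(wc -c < /tmp/test.data | tr -d ' ')
REQUEST_METHOD=POST
CONTENT_TYPE=application/x-www-form-urlencoded
export CONTENT_LENGTH REQUEST_METHOD CONTENT_TYPE

# read exactly CONTENT_LENGTH bytes from stdin, as a CGI script must
input_line=$(dd bs=1 count="$CONTENT_LENGTH" 2>/dev/null < /tmp/test.data)
echo "read $CONTENT_LENGTH bytes: $input_line"
```

Pointing the script under test at this environment and stdin stream exercises the same code path a real POST would.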
Code Sample: The Print Everything Script
To aid the developer in understanding how data flows between the
client, server, and gateway, Listing 20.3 shows a simple script,
in both Bourne and Perl, for testing the various data-passing
methods.
Listing 20.3. A Bourne Shell script to demonstrate GET
and POST methods.
#!/bin/sh
echo "Content-type: text/html"
echo
progname=print-everything.sh
action=cgi-bin/bourne/$progname
if [ $# = 0 ]
then
echo "<HEAD><TITLE>The Print Everything Form</TITLE><ISINDEX></HEAD><BODY>"
echo "GET form:"
echo "<FORM METHOD=GET ACTION=/$action>"
echo "Field 1<INPUT NAME=FIELD1>"
echo "Field 2<INPUT NAME=FIELD2>"
echo "<INPUT TYPE=submit VALUE=SUBMIT>"
echo "</FORM>"
echo "POST form:"
echo "<FORM METHOD=POST ACTION=/$action>"
echo "Field 1<INPUT NAME=FIELD1>"
echo "Field 2<INPUT NAME=FIELD2>"
echo "<INPUT TYPE=submit VALUE=SUBMIT>"
echo "</FORM></BODY>"
case "$REQUEST_METHOD" in
GET) echo "You made a GET Request<BR>"
;;
POST) read input_line
echo "You made a POST Request passing:<BR>"
echo " $input_line<BR>"
echo "to <I>stdin</I><BR>"
;;
*) echo "I don't understand the REQUEST_METHOD: $REQUEST_METHOD<BR>";;
esac
else
echo "<HEAD><TITLE>The Print Everything Form</TITLE><ISINDEX></HEAD><BODY>"
echo "GET form:"
echo "<FORM METHOD=GET ACTION=/$action>"
echo "Field 1<INPUT NAME=FIELD1>"
echo "Field 2<INPUT NAME=FIELD2>"
echo "<INPUT TYPE=submit VALUE=SUBMIT>"
echo "</FORM>"
echo "POST form:"
echo "<FORM METHOD=POST ACTION=/$action>"
echo "Field 1<INPUT NAME=FIELD1>"
echo "Field 2<INPUT NAME=FIELD2>"
echo "<INPUT TYPE=submit VALUE=SUBMIT>"
echo "</FORM></BODY>"
echo "This is an <B>ISINDEX</B> query:<BR>"
echo "and you input: $*"
fi
echo "<PRE>"
echo "REQUEST_METHOD: $REQUEST_METHOD"
echo "Command line arguments: $*"
echo "QUERY_STRING: $QUERY_STRING"
echo "PATH_INFO: $PATH_INFO"
echo "</PRE>"
echo "<HR>"
echo "back to <A HREF=$progname>Print Everything</A><BR>"
Run this script, and the screen shown in Figure 20.3 appears.
Figure 20.3 : The input screen for the print-everything.sh script.
The reader might want to try this script with input such as the
following:
In the browser's Document URL input field, follow these steps:
Put extra path info after the URL.
Put [field]=[data] after
the URL.
Put [data]%3D after the
URL.
Put [data] with either
= or %3D
after the URL, and put data into the GET
or POST input fields and
click Submit for that field.
Add other environment variables to the output screen.
After trying different types of input or modifying the script,
the developer should have a better feel for how the server looks
at the incoming data.
To begin your transition to Perl, Listing 20.4 shows a version
of print_everything in Perl.
Listing 20.4. The print_everything.pl
code.
#!/usr/local/bin/perl
#
# print_everything.pl
#
print "Content-type: text/html\n\n";
$progname = "print_everything.pl";
$action= "cgi-bin/bourne/$progname";
if (@ARGV == 0) {
    print "<HEAD><TITLE>The Print Everything Form</TITLE><ISINDEX></HEAD><BODY>";
    print "GET form:";
    print "<FORM METHOD=GET ACTION=/$action>";
    print "Field 1<INPUT NAME=FIELD1>";
    print "Field 2<INPUT NAME=FIELD2>";
    print "<INPUT TYPE=submit VALUE=SUBMIT>";
    print "</FORM>";
    print "POST form:";
    print "<FORM METHOD=POST ACTION=/$action>";
    print "Field 1<INPUT NAME=FIELD1>";
    print "Field 2<INPUT NAME=FIELD2>";
    print "<INPUT TYPE=submit VALUE=SUBMIT>";
    print "</FORM></BODY>";
    if ($ENV{REQUEST_METHOD} eq "GET") {
        print "You made a GET Request<BR>";
    }
    elsif ($ENV{REQUEST_METHOD} eq "POST") {
        read(STDIN, $input_line, $ENV{CONTENT_LENGTH});
        print "You made a POST Request passing:<BR>";
        print " $input_line<BR>";
        print "to <I>stdin</I><BR>";
    }
    else {
        print "I don't understand the REQUEST_METHOD: $ENV{REQUEST_METHOD}<BR>";
    }
} # end argv-if test
else # in case command-line argument(s) given
{
    print "<HEAD><TITLE>The Print Everything Form</TITLE><ISINDEX></HEAD><BODY>";
    print "GET form:";
    print "<FORM METHOD=GET ACTION=/$action>";
    print "Field 1<INPUT NAME=FIELD1>";
    print "Field 2<INPUT NAME=FIELD2>";
    print "<INPUT TYPE=submit VALUE=SUBMIT>";
    print "</FORM>";
    print "POST form:";
    print "<FORM METHOD=POST ACTION=/$action>";
    print "Field 1<INPUT NAME=FIELD1>";
    print "Field 2<INPUT NAME=FIELD2>";
    print "<INPUT TYPE=submit VALUE=SUBMIT>";
    print "</FORM></BODY>";
    print "This is an <B>ISINDEX</B> query:<BR>";
    print "and you input: @ARGV";
}
print "<PRE>";
print "REQUEST_METHOD: $ENV{REQUEST_METHOD}\n";
print "Command line arguments: @ARGV\n";
print "QUERY_STRING: $ENV{QUERY_STRING}\n";
print "PATH_INFO: $ENV{PATH_INFO}\n";
print "</PRE>\n";
print "<HR>";
print "back to <A HREF=$progname>Print Everything</A><BR>";
exit;
If the developer wants to see all the environmental variables,
not just those related to the Web transaction, it is a simple
matter in Perl, as Listing 20.5 shows. Figure 20.4 shows the output
of this script.
Figure 20.4 : The output of the dump_vars script.
Listing 20.5. dump_vars:
A short Perl program to list the environmental
variables.
#!/usr/local/bin/perl
#
# dump_vars : dump all the Environmental Variables (sorted by name),
#             formatting them nicely in HTML
#######################################################
print "Content-type: text/html\n\n";
print "<ul>"; # an unordered bullet list
foreach (sort keys %ENV) {
    print "<li> Env Var key: $_ value $ENV{$_}";
}
print "</ul>"; # end the bullet list
exit 0;
Think about Figure 20.4 for a moment. Why are only Web-related
environmental variables showing up, when I specified that the
entire %ENV array be listed?
The answer goes back to the fundamentals; the environmental variables
shown belong not to a specific user, but to the CGI process owner
(typically nobody). The process
owner had no environmental variables set up before the script
started-hence, the minimal list you see in this figure.
A gateway program must begin its output with a proper header that
the server will understand. The server recognizes three headers
(at this time):
Content-type: [type]/[subtype] This
was discussed at the beginning of this chapter. For the most part,
the developer will be using the following in Perl:
print "Content-type: text/html\n\n";
Location: [URL] This
causes the server to ignore any trailing data and perform a redirect; that
is, it tells the client to retrieve the data specified by the
URL as if the client originally had requested that URL. The code
print "Location: http://www.some.box.com/the_other_file.html\n\n";
for example, causes the server to tell the client to retrieve
the_other_file.html. Here
is a brief Perl script that takes advantage of the Location
header:
#!/usr/local/bin/perl
chop($filename = `ls -t /web/updates/ | head -1`);
print "Location: http://www.some.box/$filename\n\n";
exit;
In this sample, the value of the $filename
variable is the most recently modified file in the specified directory.
Using the Location header
directs the client to retrieve that file, even though the client
has no prior knowledge of which file that is.
Status: [message string] This
causes the server to alter the default status number and message text
that it normally would return to the client:
print "Status: 301 Document moved\n";
Note that only a certain range of numbers is valid here: 200-599.
Anything else causes an error.
Note
No-parse header scripts are gateway programs in which file names historically began with nph-; newer servers have dropped that requirement. The server does not parse or create its own headers; it passes the gateway output directly to the client
untouched. The gateway output must begin with a valid HTTP response:
print "HTTP/1.0 200 OK\n";
print "Content-type: text/html\n\n";
One reason a developer might want to use nph- scripts is that, because the server doesn't parse the output, the client receives a response more quickly. Of course, other factors could affect the response time. Another reason to use nph-
scripts is if you want to display a series of images or text strings serially to the client, each one overlaying the previous item (a poor man's animation); this is the last code example presented in this chapter.
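The opening of a no-parse-header script can be sketched in a few lines. This is a hypothetical example (the file would historically be named something like nph-demo.pl); the helper subroutine is my own illustration, not part of any library.

```perl
#!/usr/local/bin/perl
# A minimal sketch of an nph- style script.  Because the server passes
# this output through untouched, the script itself must emit the full
# HTTP status line and the MIME header before any content.
sub nph_header {
    my ($status, $type) = @_;
    return "HTTP/1.0 $status\nContent-type: $type\n\n";
}
print nph_header("200 OK", "text/html");
print "<TITLE>nph demo</TITLE>\n";
print "<H1>Hello from an nph- script</H1>\n";
```

Run as an ordinary CGI script, a parsed-header program would begin at the Content-type line instead; here the status line comes first because nothing downstream will add it.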
Manipulating
the Client Data with the Bourne Shell
The Bourne Shell is great for doing UNIX-specific activities,
but is a weak tool for web development because it lacks the text-manipulation
facilities of Perl. As an example, Listing 20.6 presents a simple
Bourne Shell script, called by a METHOD=POST
form, that separates the fields into shell variables.
Listing 20.6. A Bourne Shell script to handle METHOD=POST.
#!/bin/sh
echo "Content-type: text/html"
echo
echo "<HEAD><TITLE>Display Form Variables</TITLE></HEAD>"
read buffer
echo $buffer > /tmp/awk.temp.$$
awk '{
elements = split($0, fields, "&")
print "number of elements = " elements
print "<P>"
for (i in fields) {
    junk = split(fields[i], value, "=")
    printf "value of record = %s", value[2]
    print "<BR>"
}
}' /tmp/awk.temp.$$
rm /tmp/awk.temp.$$
echo "<BR>"
The output from this script still will be encoded. UNIX programs
such as sed or tr
can be used to decode the data, and the gnu
version of awk (gawk)
does have a substitution function. Things are getting a bit unwieldy
at this point, however, and the developer does not need to reinvent
the wheel. There are easier ways to accomplish these tasks: with
Perl.
Note
If you're unable or unwilling to use Perl, there is a package that lets you access and decode form variables while still using shells such as the Bourne shell. The Un-CGI package decodes form variables and places them in the shell's environment
variables. (See note)
Manipulating
the Client Data with Perl
As you can see from Listing 20.6, there's a bit of work to be
done before the developer can get to the client's data and accomplish
real tasks.
Fortunately, Larry Wall created the Practical Extraction and
Reporting Language (Perl). Perl looks like C but subsumes
a lot of features originally found in utilities such as sed, awk,
and tr. Although it doesn't allow you to get as close to the system
as C, it is an excellent choice to quickly develop complex CGI
programs. Perl's strength is precisely what most CGI programs
need-powerful and flexible text-manipulation facilities. For these
reasons, Perl has become a popular software choice for CGI programming.(See note)
To decode a variable in Perl, for example, you can use code such
as the following (from cgi-handlers.pl):
tr/+/ /;
s/%(..)/pack("c",hex($1))/ge;
These two simple lines decode all the encoded characters in a
string variable in one step.
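The same two lines can be seen at work on a whole form submission. The sketch below is illustrative: the variable names ($buffer, %FORM) and the sample data are mine, not part of cgi-handlers.pl.

```perl
#!/usr/local/bin/perl
# A sketch of decoding an entire METHOD=POST submission into an
# associative array.  On a real request, $buffer would be read from
# STDIN; here it is hard-coded for demonstration.
$buffer = "cname=Jane+Doe&lang=Perl%2FCGI";
foreach $pair (split(/&/, $buffer)) {
    ($name, $value) = split(/=/, $pair);
    foreach ($name, $value) {             # decode both halves in place
        tr/+/ /;                          # a '+' encodes a space
        s/%(..)/pack("c",hex($1))/ge;     # a %XX sequence encodes one char
    }
    $FORM{$name} = $value;
}
print "$FORM{'cname'} likes $FORM{'lang'}\n";
```

After the loop, %FORM holds the decoded name-value pairs: the + becomes a space and %2F becomes a slash.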
This completes the discussion of the CGI fundamentals. Now I'll
move onto real-life code that illustrates how the simpler pieces
fit together to form useful applications.
To
Imagemap or Not to Imagemap
Users without access to full graphical Web interfaces often use
line browsers such as Lynx or W3. Imagemaps do not appear on line
browser terminals; the word [IMAGE]
appears in Lynx, but it is not clickable. Therefore, it is important
to cater to the Lynx users of the world when developing an imagemap
front end. How do you distinguish Lynx and its peers from the
Mosaics and Netscapes of the world? This will become clear when
I show you bimodal.pl, which
uses a little environmental variable trick.
Code Walkthrough: bimodal.pl
The program bimodal.pl is
so named because it offers two modes: an imagemap and a standard
textual link interface. It queries the environmental variable
ENV{HTTP_USER_AGENT} and
switches to the mode appropriate to the user's browser. If a line
browser such as Lynx is detected, displaying an imagemap would be
pointless: the image would appear only as [IMAGE], with no clickable
region, leaving it functionally useless to a Lynx user. The program
detects these cases and reverts to text links. For
graphical browsers such as Mosaic or Netscape, the imagemap is
displayed. Listing 20.7 shows the bimodal.pl
code.
Listing 20.7. The Perl script bimodal.pl
queries the HTTP_USER_AGENT
environmental variable.
#!/usr/local/bin/perl
#
# bimodal.pl
#
# First things first, supply the MIME header
print "Content-type: text/html\n\n";
# If line-browser detected, print the textual HTML. Else,
# user has a GUI browser and I use the imagemap.
if ( $ENV{HTTP_USER_AGENT} =~ /Lynx|LineMode|W3/i ) {
#
print <<EndOfText;
<TITLE> What's for Dinner? - Text version</TITLE>
<H1>What's for Dinner? - Text version</H1>
<A HREF=http://www.some.box/enchilada.html>Enchilada</A>
|
<A HREF=http://www.some.box/hamburger.html>Hamburger</A>
|
<A HREF=http://www.some.box/kabob.html>Shish Kabob</A>
|
<A HREF=http://www.some.box/hotdog.html>Hot Dog</A>
|
<A HREF=http://www.some.box/spag.html>Spaghetti</A>
<BR><HR>
EndOfText
# The label EndOfText is reached. Now the "else" part of the if
# statement takes over, presenting GUI browsers with an imagemap.
}
else {
print <<EndOfImap;
<title>What's for Dinner? - Graphic version</title>
<H1>What's for Dinner? - Graphic version</H1>
<A HREF="http://www.some.box/cgi-bin/imagemap/dinner.map">
<img src="http://www.some.box/icons/dinner.gif"
ismap>
</a>
<HR>
<A HREF=http://www.some.box/sams/>Index of WDG Web Pages</A>
EndOfImap
}
exit;
Tip
The bimodal.pl script uses a trick common to the original Bourne Shell and Perl that can be very handy when a developer needs to output lots of HTML. The following code prints the HTML block exactly as is until it encounters the terminating string
SomeLabel:
print <<SomeLabel;
<HTML-block-line-1>
<HTML-block-line-2>
<HTML-block-line-3>
<HTML-block-line-4>
...
<HTML-block-last-line>
SomeLabel
This technique is very handy because it produces very readable code with a minimum of fuss. The alternative, outputting HTML with multiple Perl print <some-HTML> statements, can cause headaches because special characters within the
<some-HTML> string must be escaped in order to print properly, or, more fundamentally, in order for the Perl program to run without syntax errors. As a simple example, if I want to output the following HTML in a Perl CGI program,
<A HREF="http://is-2.stern.nyu.edu/">The InfoSys Home Page</A>
I can use a Perl print statement and escape the interior quotation marks by using this code:
print "<A HREF=\"http://is-2.stern.nyu.edu/\">The InfoSys Home Page </A>";
Or, I can say
print <<EndHTML;
<A HREF="http://is-2.stern.nyu.edu/">The InfoSys Home Page</A>
EndHTML
Caution
In Perl 5, there is a hidden danger using this technique:
print <<some-label;
HTML-BLOCK
some-label
An unescaped @ character inside the HTML-BLOCK causes a compile-time error, because Perl 5 tries to interpolate it as an array. In any Perl version, another trap must be avoided in this construction: the terminating string, some-label, must appear flush left without any leading
white space. Failure to place some-label flush left results in runtime errors, even though it passes a syntax check.
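The @ trap is easy to demonstrate. In this short sketch (the address is made up), the backslash before the @ is what keeps Perl 5 from treating @some as an array:

```perl
#!/usr/local/bin/perl
# Escaping @ inside a double-quoted here-document.  Without the
# backslash, Perl 5 refuses to compile the script.
$html = <<EndHTML;
<A HREF="mailto:webmaster\@some.box">Write to us</A>
EndHTML
print $html;
```

The printed anchor contains a literal @, exactly as the HTML requires.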
Figure 20.5 shows the result of bimodal.pl
executing from a GUI Web browser, Mosaic 2.5 for X.
Figure 20.5 : Because a GUI Web browser is used, bimodal.pl displays an imagemap front end.
Figure 20.6 shows the result of bimodal.pl
executing from a line Web browser, the University of Kansas's Lynx.
(See note) The script bimodal.pl
avoids showing the imagemap, which would have no meaning to a
Lynx user, and reverts to a standard textual hyperlink front end
that has the same functionality.
Figure 20.6 : A line browser's view of the Web site shown in Figure 20.5.
Now look at another useful example. Suppose that I want to fetch
one or more documents from a Web server, but only if the modification
date and time (the timestamp) has changed from the last time I
checked it.
Here's how I can do it: I can use Perl to set up a TCP client
socket connection between my machine (the client) and the Web
server and send the server a HEAD
method to get metainformation about the files (specifically, their
timestamps) sent in the socket back to my machine. I then consult
a database of the file timestamps and compare my database to the
newly received information. If they match, the file in question
was unchanged and I take no action. If they don't match, I fetch
the contents of the file to my local machine.
The code in Listing 20.8, get_head,(See note)
follows this scheme. The code to set up a socket connection is
fairly dense but, thankfully, it's all in the important book Programming
Perl, by Wall and Schwartz, published by O'Reilly & Associates,
1991. System V-style UNIX, such as Solaris 2.X and SGI, will need
the file socket.ph to run
this code. Also note the use of the Perl dbmopen
function to keep a database of file timestamps.
Listing 20.8. The get_head
program to demonstrate sockets and the
HEAD method.
#!/opt/bin/perl
#
# get_head : uses HEAD method to test timestamp modification on a
#            group of remote files, and saves locally those files
#            that were modified since the last time we checked.
#
# First, define some useful HTTP Protocol Status Codes and Messages
# in two associative arrays.
#
%OkStatusMsgs = (
200, "OK 200",
201, "CREATED 201",
202, "Accepted 202",
203, "Partial Information 203",
204, "No Response 204",
);
%FailStatusMsgs = (
-1, "Could not lookup server",
-2, "Could not open socket",
-3, "Could not bind socket",
-4, "Could not connect",
301, "Found, but moved",
302, "Found, but data resides under different URL (add a /)",
303, "Method",
304, "Not Modified",
400, "Bad request",
401, "Unauthorized",
402, "Payment Required",
403, "Forbidden",
404, "Not found",
500, "Internal Error",
501, "Not implemented",
502, "Service temporarily overloaded",
503, "Gateway timeout",
600, "Bad request",
601, "Not implemented",
602, "Connection failed (host not found?)",
603, "Timed out",
);
$outfile = "/home/mginsbur/filecontents.txt"; # we'll append all changed
                                              # files to this local file.
open(OUTFILE,">>$outfile") || die "cannot open $outfile \n";
$baseurlpath = "/usr/local/aries/web/testsock";
$server = "http://edgar.stern.nyu.edu/testsock";
chdir($baseurlpath) || die "cannot chdir to $baseurlpath \n";
foreach $f (<*.html>) {
print "Processing file $server/$f \n";
dbmopen (%time_stamps,"timedb",0666);   # open the database of timestamps
$status = &Check_URL ("$server/$f");
print "Status: $status\n";
dbmclose (%time_stamps);
}
exit 0;
###################
# Subroutines #
###################
sub Check_URL {
local($URL) = @_;
if ($URL !~ m#^http://.*#i) {
print "wrong format http!\n";
return;
}
else { # Get the host and port
if ($URL =~ m#^http://([\w-\.]+):?(\d*)($|/(.*))#)
{
$host = $1;
$port = $2;
$path = $3;
}
if ($path eq "") {
$path = '/'; }    # give a "/" if none supplied in the path
if ($port eq "") {
$port = 80; }     # port 80 is standard
$path =~ s/#.*//; # Delete name anchor
}
#####################################################################
# The following is largely taken from the 'Programming Perl' book, #
# Wall and Schwartz, on a sample Perl TCP/IP Client: pages 342-344.  #
#####################################################################
$AF_INET = 2;
$SOCK_STREAM = 1;
$sockaddr = 'S n a4 x8';
chop($hostname = `hostname`);
($name,$aliases,$proto) = getprotobyname('tcp');
($name,$aliases,$port) = getservbyname($port,'tcp') unless $port =~ /^\d+$/;
($name,$aliases,$type,$len,$thisaddr) = gethostbyname($hostname);
if (!(($name,$aliases,$type,$len,$thataddr) = gethostbyname($host)))
{
return -1;
}
$this = pack($sockaddr, $AF_INET, 0, $thisaddr);
$that = pack($sockaddr, $AF_INET, $port, $thataddr);
# Make the socket filehandle.
if (!(socket(S, $AF_INET, $SOCK_STREAM, $proto))) {
$SOCK_STREAM = 2;
if (!(socket(S, $AF_INET, $SOCK_STREAM, $proto)))
{ return -2; }
}
if (!(bind(S, $this))) {   # bind locally
return -3;
}
if (!(connect(S,$that))) { # connect remotely
return -4;
}
select(S);
$| = 1;
# unbuffer the i/o because we have 2 filehandles
select(STDOUT);
print S "HEAD $path HTTP/1.0\n\n";   # send the web server a HEAD request
#print S "GET $path HTTP/1.0\n";     # could have used a CONDITIONAL GET
#print S "If-Modified-Since: Monday, 03-Jun-96 14:57:50 GMT\n\n";
#
$response = <S>;
($protocol, $status) = split(/ /, $response);
print "Response from HEAD request is: $response \n";
#
# check the Response. If it's OK, get the modification time and
# compare that to the entry in our timestamp database. If they
# match, set the return value to 1. Otherwise, set the return value
# to 0 and use a GET to get the contents and write to a file.
#
for ($i = 0 ; $i < 100; $i++) {   # give the response a chance to form
$response = <S>;
print "$response";   # display it on STDOUT
if ($response =~ /Last-Modified/i) {   # expect Last-Modified
($junk, $time) = split (/: /,$response);
if (!(($time_stamps{$path})) || ($time_stamps{$path} ne $time)) {
$time_stamps{$path} = $time;
close (S);
&write_file_to_disk;   # if file changed, save it to disk
return 0;   # 0 means the file has been changed since the
            # last time we built a timestamp entry for it.
}
}
}
close(S); # close the Socket
return 1; # 1 means the file has not been changed.
}
#
# If the database timestamp does not match the actual file modification
# timestamp, write its contents to local disk using the c-program
# http_get. (see Listing 20.17 for the source of http_get).
#
sub write_file_to_disk {
print "Capturing File ... \n";
$contents = `/home/mginsbur/bin/http_get $server/$f`;
print "Captured: $server/$f successfully ... \n";
print "Appending $server/$f to file $outfile ... \n";
print OUTFILE "$contents";
print "$server/$f has been written to file $outfile. \n\n";
}
This code is best illustrated with an example. Suppose that I
have a directory on a Web server corresponding to the URL http://edgar.stern.nyu.edu/testsock.
Here is a listing of the files in that directory:
-rw-r--r--  1 mginsbur staff  50284 Jun 27 17:21 analog.html
-rw-rw-r--  1 mginsbur staff   1286 Jun 27 17:29 hydrant.html
Let's say that I run the program for the first time from the directory
~/test. Because it is the
first time, no timestamp database has been built yet and the files
are all new. Therefore, I capture both of them to a local file,
as shown in Listing 20.9.
Listing 20.9. get_head:
First program execution.
Processing file http://edgar.stern.nyu.edu/testsock/analog.html
Response from HEAD request is: HTTP/1.0 200 Document follows
Date: Fri, 28 Jun 1996 20:30:27 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:21:38 GMT
Capturing File . . .
Captured: http://edgar.stern.nyu.edu/testsock/analog.html successfully
. . .
Appending http://edgar.stern.nyu.edu/testsock/analog.html to file
/home/mginsbur/filecontents.txt . . .
http://edgar.stern.nyu.edu/testsock/analog.html has been written
to file /home/mginsbur/filecontents.txt.
Status: 0
Processing file http://edgar.stern.nyu.edu/testsock/hydrant.html
Response from HEAD request is: HTTP/1.0 200 Document follows
Date: Fri, 28 Jun 1996 20:30:28 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:29:51 GMT
Capturing File . . .
Captured: http://edgar.stern.nyu.edu/testsock/hydrant.html successfully
. . .
Appending http://edgar.stern.nyu.edu/testsock/hydrant.html to
file /home/mginsbur/filecontents.txt . . .
http://edgar.stern.nyu.edu/testsock/hydrant.html has
been written to file /home/mginsbur/filecontents.txt.
Status: 0
Now, I run the program a second time without altering any of the
files on the Web server. Study Listing 20.9 and see whether you
can follow what action the program will take. Listing 20.10 shows
the output.
Listing 20.10. get_head:
Second program execution.
Processing file http://edgar.stern.nyu.edu/testsock/analog.html
Response from HEAD request is: HTTP/1.0 200 Document follows
Date: Fri, 28 Jun 1996 20:31:05 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:21:38 GMT
Content-length: 50284
Status: 1
Processing file http://edgar.stern.nyu.edu/testsock/hydrant.html
Response from HEAD request is: HTTP/1.0 200 Document follows
Date: Fri, 28 Jun 1996 20:31:05 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:29:51 GMT
Content-length: 1286
Status: 1
Sure enough, because neither file was modified since the last
time I collected their timestamp information, the program takes
no action and returns a status code of 1
for each file.
It's time to complete the picture by changing one of the file's
timestamps. I can do this easily with the UNIX touch
command:
touch /usr/local/edgar/web/testsock/hydrant.html
Now hydrant.html has been
updated; analog.html's timestamp
still matches the original information collected in the program's
first run. The directory listing of /usr/local/edgar/web/testsock
has been correspondingly updated and two new files are present:
-rw-r--r--  1 mginsbur staff  50284 Jun 27 17:21 analog.html
-rw-rw-r--  1 mginsbur staff   1286 Jun 28 16:31 hydrant.html
-rw-rw-r--  1 mginsbur staff      0 Jun 28 16:30 timedb.dir
-rw-rw-r--  1 mginsbur staff   1024 Jun 28 16:30 timedb.pag
The two timedb.* files compose
the timestamp database that the program get_head
creates the first time it is run and updates every subsequent
time it is run.
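The dbmopen mechanism that produces these files can be sketched in isolation. In this illustration the /tmp database path and the stored key are made up; dbmopen itself creates the underlying files (for example, .dir and .pag files with some DBM libraries) automatically.

```perl
#!/usr/local/bin/perl
# A sketch of how dbmopen ties an associative array to disk, as
# get_head does with its timedb database.
$db = "/tmp/timedb_demo.$$";
dbmopen(%time_stamps, $db, 0666) || die "cannot open $db";
$time_stamps{"/testsock/hydrant.html"} = "Fri, 28 Jun 1996 20:31:43 GMT";
dbmclose(%time_stamps);
# A later run reopens the same files and sees the stored timestamp.
dbmopen(%time_stamps, $db, 0666) || die "cannot reopen $db";
$stamp = $time_stamps{"/testsock/hydrant.html"};
print "$stamp\n";
dbmclose(%time_stamps);
unlink(glob("$db*"));   # remove the demonstration database files
```

Because the array is tied to disk, entries written in one program execution persist for the next, which is exactly how get_head remembers timestamps between runs.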
I run the program for a third time and the output in Listing 20.11
appears.
Listing 20.11. get_head:
Third program execution.
Processing file http://edgar.stern.nyu.edu/testsock/analog.html
Response from HEAD request is: HTTP/1.0 200 Document follows
Date: Fri, 28 Jun 1996 20:32:12 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Thu, 27 Jun 1996 21:21:38 GMT
Content-length: 50284
Status: 1
Processing file http://edgar.stern.nyu.edu/testsock/hydrant.html
Response from HEAD request is: HTTP/1.0 200 Document follows
Date: Fri, 28 Jun 1996 20:32:12 GMT
Server: NCSA/1.5
Content-type: text/html
Last-modified: Fri, 28 Jun 1996 20:31:43 GMT
Capturing File . . .
Captured: http://edgar.stern.nyu.edu/testsock/hydrant.html
successfully . . .
Appending http://edgar.stern.nyu.edu/testsock/hydrant.html
to file /home/mginsbur/filecontents.txt . . .
http://edgar.stern.nyu.edu/testsock/hydrant.html has been
written to file /home/mginsbur/filecontents.txt.
Status: 0
As expected, the file analog.html
is checked against the timestamp database and, because it has
not been modified, no action is taken. The other file, hydrant.html,
was updated and new contents are fetched to the local file system.
One obvious use for the techniques presented in get_head.pl
is in the case of a Web-crawling search program; this process
traverses the Web looking for new content to index. If it can
effectively check the timestamp of files it encounters, it does
not need to download every single file it finds to index. It can
incrementally index and save a lot of network and CPU time.
Note
You might have noticed in the get_head.pl code listing a commented-out section near the HEAD method. This is a CONDITIONAL GET method, which is very similar logically. If a document does not meet the criteria specified in the
CONDITIONAL GET, a status code of 304 is returned, which means that the document was not modified in that timeframe. If it was modified in that timeframe, the contents immediately are fetched by the script. The HEAD method, by
contrast, generates a 200 (OK) message if all is well, and it's up to the script to do logical comparisons on the file metainformation received from the server.
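The contrast in the note can be sketched without opening a socket. The request strings below are what would be written to the server socket (filehandle S in Listing 20.8); the file name is illustrative, and the modified_since subroutine is my own helper showing how little logic the conditional GET leaves to the script.

```perl
#!/usr/local/bin/perl
# The two request styles, side by side.  No connection is made here.
$head_req = "HEAD /testsock/hydrant.html HTTP/1.0\n\n";
$cond_req = "GET /testsock/hydrant.html HTTP/1.0\n" .
            "If-Modified-Since: Monday, 03-Jun-96 14:57:50 GMT\n\n";
# With a conditional GET, the status line alone settles the question:
sub modified_since {
    local($response) = @_;
    ($protocol, $status) = split(/ /, $response);
    return ($status == 304) ? 0 : 1;   # 304 means "not modified"
}
print modified_since("HTTP/1.0 304 Not Modified\n"), "\n";
print modified_since("HTTP/1.0 200 OK\n"), "\n";
```

With the HEAD request, by contrast, both responses would carry a 200 status, and the script would have to read on to the Last-Modified header and compare it against its database, as Check_URL does.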
The concepts presented in the Perl socket application are very
powerful and well worth study. As the Web grows, so does the noise-to-signal
ratio, and filtering mechanisms become essential. The idea of
selectively fetching only new documents is appealing to newsfeed
applications, text-index searching, and generalized agent technology.
A user can launch an application, for example, to fetch only new
documents from a favorite Web site. Such an agent quite easily
could automatically update the user's browser bookmarks file.
An
Integrated E-Mail Gateway Application
One of the advantages of Perl from the developer's perspective
is that a small building-block program easily can be customized
and integrated into a bigger application.
Consider the following real-life design problem stemming from
a telecommunications class final project at the NYU Stern School
of Business. A group of students wanted to write a set of Perl
CGI programs to provide on-line corporate recruiting, as shown
in these steps:(See note)
As a necessary preliminary step, the students create HTML
resumés and place them in a common directory.
The first CGI program, resume_builder.pl,
is launched by the resumé system administrator
and automatically creates a table of contents linking to each resumé.
The program is smart enough to avoid creating links to files that
are not student resumés.
The output of resume_builder.pl
is resume_toc.html, which
provides the corporate recruiter with an Action button. If the
recruiter clicks this button, a picklist of all the resumés
appears (built at request time by resume_form.pl)
and the recruiter can click one or more names to receive a broadcast
e-mail message.
The third CGI program, resume_mail.pl,
is the e-mail gateway back end to resume_form.pl.
This program is the glue between the picklist and the actual UNIX
mail program.
The system makes the implicit assumption that between steps
2 and 3, the recruiter has scanned the resumés and located
the most promising ones.
I think it will be instructive to see the code that went into
resume_builder.pl, resume_form.pl,
and resume_mail.pl. Listings
20.12 through 20.14 provide this code.
Listing 20.12. resume_builder.pl.
#!/usr/local/bin/perl
#
# resume_builder.pl
#
# Resume Project
#
# this program will read in the directory and output
# HTML links to each valid resume (studentname.html is valid).
#
$site = "www.stern.nyu.edu";
$basepath = "/usr/users/mark/book/src";
$output = "$basepath/resume.toc.html";
$link = "$basepath/index.html";
$hits = $misses = 0;
$prefix = "<dd><A HREF=\"http://www.stern.nyu.edu/~lma/project/";
$suffix = "\">";
open(OP, ">$output") || die "cannot open the OUTPUT file";
@my_array = `ls`;    # set an array to the unix output of 'ls'
chop(@my_array);     # remove the trailing newline from each entry
&init; # write the header HTML lines
#
# Now loop through and pull out only the valid resumes, which are of
# the form (name).html.
# Avoid this program's output (resume.toc.html), any pictures (*.pic)
# files, and the special index.html file, which is a symbolic link to
# resume.toc.html
#
for ($i=0; $i<=$#my_array; $i++) {
($name,$ext) = split(/\./,$my_array[$i]);   # split xxxx.html on the period;
                                            # note the assumption that the
                                            # file name has no extra
                                            # embedded periods!
if (($name =~ /resume/) || ($name =~ /index/) || ($name =~ /pic/)) {
$misses++;
print "skipping $name.$ext \n";   # command-line info msg
}
else{
$hits++;
$combo = $prefix.$my_array[$i];
print OP "$combo";
$real_suffix = $suffix.$name."</a>";
print "picking up $name resume \n";   # command-line info msg
print OP "$real_suffix </dd><br> \n";
}
}
print "\n $hits Hits and $misses Misses \n";   # closing info msg
&trlr;
close(OP) || die "cannot close output";
#
# build a symbolic link to index.html * if one does not yet exist *
#
unless (-e $link) {
`ln -s resume.toc.html index.html`;
print "$link symbolic link built \n";
}
exit 0;
#
# init - outputs the Title and header and introductory msg
#
sub init{
print OP "<TITLE>WWW Resume Collection</TITLE><br>";
print OP "<H1>WWW Resume Collection</H1><br>";
print OP "Welcome to the NYU resume database. ";
print OP "It will match recruiters to qualified candidates. ";
print OP "Recruiters can screen through our resume database and contact ";
print OP "selected candidates via email by filling out a form. <p>";
print OP "<HR><b> Click on a name to view a resume. </b><br>";
print OP "<br>";
}
#
# trlr - outputs the trailing info and credits
#
sub trlr{
print OP "<br><br>If you wish to contact any of the people in our ";
print OP "database, you have the option to send them an email message. ";
print OP "To do so, click <A HREF = ";
print OP "\"http://$site/~lma/project/resume_form.pl\"><B>CONTACT FORM</B></a><p>";
print OP "<Hr>Thank you for using our database.<br> ";
print OP "We hope that you have found it useful.<p>";
print OP "<b>Project Team</b>";
print OP "<a href= \"http://$site/~pcheng\">Peter Cheng</a>";
print OP "<a href= \"http://$site/~pliu\">Peggy Liu</a>";
print OP "<a href= \"http://$site/~lma\">Lisa Ma</a>";
print OP "<a href= \"http://$site/~hshayovi\">Heshy Shayovitz</a><p>";
print OP "<HR>";
}
Tip
The technique of defining an index.html symbolic link is very useful. If a user enters the resumé system and does not supply a file name, the server usually is configured to look for the file index.html (home.html is
another popular choice). Thus, in Listing 20.12, I check to see whether index.html exists. If it does not yet exist, I build the symbolic link to the output of the program. This step is necessary only once, of course; hence the existence check.
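The same check-then-link step can be done entirely in Perl, without shelling out to ln, using the -e file test and the built-in symlink function. The /tmp file names below are made up for the demonstration.

```perl
#!/usr/local/bin/perl
# Sketch: build an index.html symbolic link only if one does not
# already exist, using Perl's own file tests and symlink().
$target = "/tmp/resume.toc.$$.html";
$link   = "/tmp/index.$$.html";
open(T, ">$target") || die "cannot create $target";
print T "<TITLE>toc</TITLE>\n";
close(T);
symlink($target, $link) unless -e $link;   # build the link only once
$built = (-l $link) ? 1 : 0;
print $built ? "symbolic link built\n" : "no link built\n";
unlink($link, $target);   # clean up the demonstration files
```

Keeping the work in Perl avoids spawning a subshell and lets the script report failure through its own error handling rather than silently ignoring a failed ln.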
The next program, resume_form.pl,
builds the picklist of candidate resumés dynamically (see
Listing 20.13). Its structure is quite similar to resume_builder.pl.
Notice the high degree of modularity-the form is broken into rather
small subroutines. The dynamic build of the picklist is separated
into its own routine for easy readability and maintenance.
Listing 20.13. resume_form.pl.
#!/usr/local/bin/perl
#
# resume_form.pl
#
print "Content-type: text/html\n\n";
@my_array = `ls`;    # set an array to the unix output of 'ls'
chop(@my_array);     # remove the trailing newline from each entry
$site = "www.stern.nyu.edu";
$prefix = "<A HREF=\"http://$site/~lma/project/";
$suffix = "\">";
&init;
&build_top_of_form;
&build_picklist;
&build_rest_of_form;
&trlr;
#
sub init{
print "<TITLE>WWW Resume Contact Form</TITLE><br>";
print "<H1>WWW Resume Contact Form</H1><br><HR>";
print "The following is a form which will allow you to send messages ";
print "to the candidates whose resumes you have just viewed. ";
print "You have the option to send to multiple candidates from the ";
print "picklist by holding down the CONTROL or SHIFT keys and ";
print "clicking on the desired names.<hr>";
}
#
# build_top_of_form - write common form header, up to the point
# where the list of resumes must be generated.
#
sub build_top_of_form{
print "<FORM METHOD=\"POST\" ";
print "ACTION=\"http://$site/~lma/project/resume_mail.cgi\">";
print "<b> Contact Name: </b>";
print "<br>";
print "<INPUT NAME=\"cname\"><br>";
print "<b>Company: </b>";
print "<br>";
print "<INPUT NAME=\"Company\"><br>";
print "<b>Address: </b>";
print "<br>";
print "<INPUT NAME=\"Address\"><br>";
print "<b>Telephone #: </b>";
print "<br>";
print "<INPUT NAME=\"Tel\"><br>";
print "<b>Fax #: </b>";
print "<br>";
print "<INPUT NAME=\"Fax\"><br>";
print "<b>What is the subject of this message?</b>";
print "<br>";
print "<INPUT NAME=\"Subj\"><p>";
print "<b>Send to: </b><br>";
print "<SELECT NAME=\"resume\" size=7 MULTIPLE>";
}
#
# Note: the C-style for loop is quite unnecessary in Perl. I could
# say for (@my_array) and accomplish the same thing.
#
sub build_picklist{
for ($i=0; $i<=$#my_array; $i++) {
($name,$ext) = split(/\./,$my_array[$i]);   # split xxxx.html on pd.
unless (($name =~ /resume/) || ($name =~ /index/) || ($name =~ /pic/)) {
print "<OPTION>$name";
}   # end the unless statement
} # end the for loop
print "</SELECT><p>";
}
sub build_rest_of_form{
print "<b>Please type your message here: </b><br>";
print "<TEXTAREA NAME=\"message\" ROWS=10 COLS=50></TEXTAREA><p>";
print "<INPUT TYPE=\"submit\" VALUE=\"Send Message\">";
print "<p>";
print "<INPUT TYPE=\"reset\" VALUE=\"Clear Form\">";
print "</form><p>";
print "<hr>";
}
sub trlr{
print "<a href=\"http://www.stern.nyu.edu/~lma/project\">";
print "<img src=\"http://edgar.stern.nyu.edu/icons/back.gif\">";
print "Return to the Resume System</A>";
print "<HR>";
}
Two scripts down, one to go. I'll complete the trilogy with resume_mail.pl,
the program that takes the output of resume_form.pl
(that is, the recruiter's name, company, telephone, fax, e-mail
message, and recipient list) and pipes it to the UNIX mail
program. Listing 20.14 contains the code.
Listing 20.14. resume_mail.pl.
#!/usr/local/bin/perl
#
# resume_mail.pl
#
#
$mailprog = '/usr/ucb/mail ';
$mailsuffix = '@stern.nyu.edu';
$comma = ',';
#
require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl';   # modified cgi-lib.pl
# Print a title and initial heading and the Right Header.
&html_header("Mail Form");   # modified because html_header takes an arg.
$i = 0;
# Get the input
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
# Split the name-value pairs
@pairs = split(/&/, $buffer);
#
# The next code is equivalent to using the &parse_request subroutine
# that comes with the cgi-lib.pl Perl toolkit. The goal is to get a
# series of name-value pairs from the form.
#
foreach $pair (@pairs)
{
($name, $value) = split(/=/, $pair);
# decode the values passed by the form
$value =~ tr/+/ /;
$value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
# Stop people from using subshells to execute commands
$value =~ s/~!/ ~!/g;
#
# build an array r_array composed of all the names on the recipient list.
#
if ($name eq "resume") {
$r_array[$i] = $value;
$i++;
}
$recip = "";
#
# Now build $recip - the valid string of recipients, delimited by commas,
# e.g. csmith@stern.nyu.edu,bjones@stern.nyu.edu,
# the minor problem: this technique ends with a faulty final comma.
#
for (@r_array) {
#
$temp = $_.$mailsuffix.$comma;
$recip = $recip.$temp;
$temp = "";
}
substr($recip,-1,1) = "";   # get rid of comma at end. Now $recip is fine.
$FORM{$name} = $value;      # assoc. array for rest of the form.
}   # end foreach
# print "Final recipient List is $recip";   # uncomment this for debugging.
# Now send mail to $recip which is one or more students.
#
# Include form info plus info at end about the user's machine hostname and
# IP address.
#
open (MAIL, "|$mailprog -s \"$FORM{'Subj'}\" $recip") ||
die "Can't open $mailprog!\n";
print MAIL "Contact $FORM{'cname'} from company $FORM{'Company'}\n";
print MAIL "has sent you the following message regarding your resume:\n\n";
print MAIL "------------------------------------------------------------\n";
print MAIL "$FORM{'message'}";
print MAIL "\n------------------------------------------------------------\n";
print MAIL "Their fax: $FORM{'Fax'}\n";
print MAIL "Their tel: $FORM{'Tel'}\n";
print MAIL "Their addr: $FORM{'Address'}\n";
print MAIL "Their co: $FORM{'Company'}\n";
print MAIL "\n----S E N D E R I N F O --------------------------------\n";
print MAIL "Recruiter at host: $ENV{'REMOTE_HOST'}\n";
print MAIL "Recruiter at IP address: $ENV{'REMOTE_ADDR'}\n";
close (MAIL);
&thanks;
exit 0;
#
# Acknowledge mail
#
sub thanks{
print "<H2>Mail Sent!</H2><P>";
print "<B>Your mail has been sent.</B><br>";
print "<B>Thank you for using our resume database!</B><br>";
print "<hr>";
print "<a href=\"http://www.stern.nyu.edu/~lma/project\">";
print "<img src=\"http://edgar.stern.nyu.edu/icons/back.gif\">";
print "Return to the Resume System</A>";
print "<HR>";
}
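For comparison, the same decode-and-join steps can be sketched outside Perl. The following Python sketch (the names `MAIL_SUFFIX`, `decode`, and `parse_form` are my own, not from the chapter) mirrors the script above: split on `&`, split each pair on `=`, turn `+` into spaces, expand `%XX` escapes, and gather the picklist selections into a comma-delimited recipient string.

```python
# A sketch, in Python, of the decode-and-join steps the Perl script performs.
import re

MAIL_SUFFIX = "@stern.nyu.edu"   # same suffix the Perl script appends

def decode(value):
    """Undo form encoding: '+' means space, %XX is a hex-escaped byte."""
    value = value.replace("+", " ")
    return re.sub(r"%([0-9a-fA-F]{2})",
                  lambda m: chr(int(m.group(1), 16)), value)

def parse_form(buffer):
    """Split name-value pairs; collect 'resume' picks as recipients."""
    form, recipients = {}, []
    for pair in buffer.split("&"):
        name, _, value = pair.partition("=")
        value = decode(value)
        if name == "resume":
            recipients.append(value)      # picklist selections accumulate
        else:
            form[name] = value
    # join with commas; no trailing comma to trim with this approach
    return form, ",".join(r + MAIL_SUFFIX for r in recipients)
```

Note how a join operation sidesteps the "faulty final comma" the Perl version has to trim with `substr`.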
Discussion of the Resumé Application
Starting from scratch, the entire application was built (by three
novice programmers and one supervisor) in three days. This is
a great advertisement for Perl and, more generally, the ease with
which on-line applications can be built using CGI scripting. The
system offers unlimited scope to grow (thousands of resumés
conceivably could be stored in the base directory) and an excellent
window by which corporate recruiters can interface with top students.
What's missing in the resumé-recruiter interface? Number
one on my wish list is database functionality to permit search
by keyword or other ad hoc criteria; for example, "show
me all students with programming skills in C and C++" or
"show me all students who are graduating next term with foreign
language proficiency in French or Spanish." This falls within
the realm of database gateway programming and is discussed in
Chapter 21.
Figure 20.7 shows the output of the resume_builder.pl
program.
Figure 20.7 : The corporate recruiter travels to the URL http://www.stern.nyu.edu/~lma/project/ and sees a series of HTML links to student resumes, created by resume_builder.pl.
Figure 20.8 shows the screen corporate recruiters see after they
submit the information in Figure 20.7. Now you have the opportunity
to send an e-mail message to one or more people in the picklist.
Figure 20.8 : The recruiter selects two lucky students to broadcast an overture to; who knows, perhaps a high-paying job?
Extending the Transaction: Serial Transmission of Many Data Files in One Transaction
Often the developer is not content with sending one MIME header
and one body of data to the client. Suppose that I want to send
a series of images to the client in a logical loop. This is where
a Netscape MIME extension called x-mixed-replace
proves useful. X-mixed-replace
supports data transfer to the client in this general manner (shown
in Perl syntax):
A. print "Content-type: multipart/x-mixed-replace; boundary=$sep\n";
B. print "\n--$sep\n";
B. print "Content-type: type/subtype\n"; # fill in type/subtype
B. print "Content-length: $len\n\n";
B. print $buf; # where $buf is the data; make sure to measure its length!
The first line, labeled A, is always required. A boundary delimiter
$sep must be defined, but
it doesn't matter which string is assigned to $sep.
Then, the programmer picks the appropriate MIME type and subtype
to display and repeats the lines in block B as often as required.
As the last line's comment indicates, it is important to measure
$buf exactly before sending
it to the client. The x denotes
that this is an experimental MIME type. Unfortunately, mixing
different MIME types (for example, text and image,
or image and video) in one x-mixed-replace
transmission is not supported. One Usenet reader complained that this was "pretty
x-mixed-up," especially because standard MIME (SMTP) messages
support multipart/mixed format. So for the time being at least,
you're confined to sending a single data format in this manner.
Also, until the scheme gains further acceptance, you need to require
a Netscape browser to handle the data stream.
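The A/B pattern above can be sketched as a small routine. This Python version (names `SEP` and `push_parts` are my own; the chapter's scripts are Perl) emits the single block-A header, then one correctly measured block B per image, and finally the closing boundary:

```python
# A minimal sketch of the x-mixed-replace pattern: one multipart header,
# then one exactly measured part per image, written to any binary stream.
SEP = "MULTI__PART__SEPARATOR"   # an arbitrary boundary string

def push_parts(out, parts, subtype="jpeg"):
    # Block A: sent once, names the boundary for the whole transmission.
    out.write(f"Content-type: multipart/x-mixed-replace; boundary={SEP}\n".encode())
    for buf in parts:                       # block B, repeated per image
        out.write(f"\n--{SEP}\n".encode())
        out.write(f"Content-type: image/{subtype}\n".encode())
        out.write(f"Content-length: {len(buf)}\n\n".encode())
        out.write(buf)                      # exactly len(buf) bytes follow
    out.write(f"\n--{SEP}--\n".encode())    # closing boundary ends the stream
```

Because `len(buf)` is taken from the buffer itself, the Content-length line can never disagree with the data that follows it, which is the property the text stresses.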
The program presented in Listing 20.15, nph-image,
shows the x-mixed-replace
notion in action. This script displays a series of photographs
serially, in a logical loop, to the client. It is an NPH-script
that does not use operating system buffering.
Listing 20.15. nph-image.
#!/usr/local/bin/perl
#
# nph-image: a no-parse-header script to display a series of jpeg photos
# serially; it is referenced within an IMG SRC tag in x.html.
# Its output is understood by Netscape clients.
##############################################################################
require '/usr/local/etc/httpd/cgi-bin/cgi-lib.pl';
$photo_dir = "/is-too/tisakowi/web/isweb/testsite/photos/*.jpg";
$type = "image/jpeg"; # or image/gif depending on the application
@photos = `ls $photo_dir`; # assemble the photo array
$SIG{"ALRM"} = "exit"; # in case user hits STOP during the transmission
alarm 10*60; # timing delay
#
# set the delay between pictures from the Query String, otherwise
# set it to 1 second.
#
if (defined($ENV{'QUERY_STRING'}) && $ENV{'QUERY_STRING'} ne '') {
    $delay = $ENV{'QUERY_STRING'}; } # try to get delay from Query String
else {
    $delay = 1; }
$sep = "=-+=-+=-+=MULTI__PART__SEPARATOR-+=-+=-+="; # this is arbitrary
$| = 1; # unbuffered i/o is important in nph-scripts
print "HTTP/1.0 200 OK\n"; # an NPH script must emit the status line itself
print "Content-type: multipart/x-mixed-replace; boundary=$sep\n"; # req'd.
$first = 1;
do {
    foreach $f (@photos) {
        if ($first) {
            $first = 0; }
        else {
            sleep($delay) }
        &output($f); }
}
while (1);
print "\n--$sep--\n"; # this will never occur unless infinite loop broken.
#
# subroutine output: print out *exactly* the buffer needed
# for each picture. Measure the picture's length (normally
# with a stat function, except with jpegs needed to do it
# a clumsier way).
#
sub output {
local($file) = @_;
local($len);
print "\n--$sep\n";
open(FILE, $file) || die "Error finding file $file";
print "Content-type: image/jpeg\n";
#$len = (stat($file))[7]; # does not seem to work on jpegs
$line = `ls -al $file`;
@stuff = split(/\s+/,$line);
$name = $stuff[2]; # primitive nametag
$len = $stuff[3]; # got the file length from the ls output
print "Content-length: $len\n\n";
read(FILE, $buf, $len);
close(FILE) || die "cannot close file $file";
print $buf;
for ($a=0; $a<20; ++$a) {print "\n";}
}
Code Discussion: nph-image
This program sets up an infinite loop to show all the *.jpg
photographs in a given directory, using the general x-mixed-replace
scheme illustrated earlier.
The Perl statement
$|=1;
unbuffers the I/O-the default operating system buffering is not
used. This is done to avoid images building up in a buffer and
then being released all at once, confusingly, to the client desktop.
Another interesting feature is the use of the statement
$SIG{"ALRM"} = "exit";
This traps the signal sent by the user clicking the Stop button
in the Netscape browser. Without this trap, it might be very difficult
to stop the constant stream of rotating images, and the user might
even have to take the drastic step of killing the browser. Hopefully,
with this trap, the Stop button will halt the script in a reasonable
amount of time.
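The same guard can be sketched outside Perl. The following Python version (assuming a POSIX system with SIGALRM; the function names are mine) bounds a streaming loop with an alarm, the counterpart of the `$SIG{"ALRM"}` and `alarm 10*60` lines in the listing:

```python
# A sketch of bounding a streaming loop with SIGALRM, as nph-image does.
import signal

class TimeUp(Exception):
    """Raised by the SIGALRM handler to break out of the streaming loop."""

def stream_until_alarm(seconds, step):
    """Call step() repeatedly until the alarm fires; return iteration count."""
    def on_alarm(signum, frame):
        raise TimeUp
    signal.signal(signal.SIGALRM, on_alarm)
    signal.alarm(seconds)       # hard stop after this many seconds
    n = 0
    try:
        while True:
            step()              # ... write one image to the client here ...
            n += 1
    except TimeUp:
        signal.alarm(0)         # cancel any pending alarm on the way out
    return n
```

In nph-image the interval is ten minutes (`alarm 10*60`); here it is a parameter, so a caller can choose a tighter bound.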
The other requirement that is important to note is the fact that
I must measure each image, in bytes, before writing it to the
client's stdout. Otherwise, images can overlay sections of the
preceding ones incompletely and haphazardly. As you see in Listing
20.15, the line
#$len = (stat($file))[7]; # does not seem to work on jpegs
is commented out. This is the simplest way to get a length, which
I use on *.gif images, but
the function did not work for me on *.jpgs.
I had to use a workaround as shown in the code. At any rate, after
the length is known, exactly that amount is read into an input
buffer and then is written out. The result is a smooth series
of images (thanks to the unbuffered I/O and the care taken to
measure image lengths). After the program executes, it pushes
image data at the client indefinitely, and this is a significant
network load. A more sensible approach is to end the image rotation
after a certain time interval or maximum number of images.
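The length-measuring step itself can be sketched compactly. In this Python sketch (the `send_file` name is mine), a plain stat call returns the exact byte count for any file type, avoiding the `ls -al` workaround the listing had to use for JPEGs:

```python
# Measure a file exactly before sending it: stat gives the byte count,
# and exactly that many bytes are read and written after the headers.
import os

def send_file(out, path):
    length = os.stat(path).st_size           # exact size in bytes
    out.write(b"Content-type: image/jpeg\n")
    out.write(f"Content-length: {length}\n\n".encode())
    with open(path, "rb") as f:
        out.write(f.read(length))            # exactly that many bytes
    return length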
Figure 20.9 shows the URL http://edgar.stern.nyu.edu/mgtest/x.html
shortly after it is loaded into the Netscape client.
Figure 20.9 : An infinite series of repeating images is presented, with each image replacing the preceding one.
To complete the discussion of this animation application, Listing
20.16 shows the first few lines of the file x.html;
note how the nph-image is
embedded inconspicuously in the HTML img
src tag near the top.
Listing 20.16. x.html.
<HTML><BODY bgcolor="#000070"
text="#30ebe0" link="#d0d000" vlink="#ffffff">
<center>
<img height=125 width=125 src="http://edgar.stern.nyu.edu/mgbin/nph-image?1">
</center>
<TITLE>The Department of Information Systems Homepage</TITLE>
<H2>
The Department of Information Systems</H2>
(etc.)
Code Debugging
Debugging is a normal part of the developer's life. The first
line of defense is syntax checking. For example, in Perl, I can
type
perl -c <progname>
to check the Perl code for syntactic correctness. If the Perl
interpreter likes the code, but the http server doesn't, there
is more work to be done. Fortunately, the CGI environment is flexible
enough to give the developer several options for discovering the
source of code problems.
When a CGI program crashes, the uninformative 500
Server Error message is displayed on the client screen.
If the developer has access to the server's error log, that might
provide a clue. A common error is not printing a proper header.
A script without a blank line after the Content-type
statements follows:
print "Content-type: text/html\n";
print "<TITLE>A Bad Script</TITLE>\n";
This causes the following to show up in an NCSA server's error_log
file:
[Tue May 16 20:19:04 1995] httpd: malformed header from script
When this shows up by itself, check the headers.
If command-line syntax checking has not been done and the script
has a syntax error, usually these errors will show up in the error_log:
syntax error in file /web/httpd/cgi-bin/bourne/break_something.pl
at line 8, next 2 tokens "priint "GET form:""
Execution of /web/httpd/cgi-bin/bourne/break_something.pl aborted due to compilation errors.
[Tue May 16 20:29:39 1995] httpd: malformed header from script
In this case, the print statement has a typo, which was duly reported
in the error_log.
The server error logs might not provide enough information, though,
or the developer might not have direct access to the logs.
In that case, my first suggestion is to test the gateway program
on the host machine command line. Runtime data, such as values
for environment variables, or stdin, will have to be provided.
Supplying runtime data was described in the section "How
the Server Passes the Data to the Gateway Program."
If the script runs without errors on the command line, but the
output is still not what is expected, the problem might lie in
how the gateway program is looking at the incoming data sent by
the server or how the gateway program is outputting data. Developers
might find it useful to generate their own log files-that is,
to insert code into the gateway program to write input and output
to temporary files. A basic technique in Perl is to create a dump
file with code such as this:
open(DUMP, ">>my_debug_file.tmp") || die "cannot open dump file";
Then, you can write any variables that need to be examined:
read(STDIN, $input_line, $ENV{CONTENT_LENGTH});
print(DUMP "$input_line\n");
This is a very useful method of debugging; the full range of stdin,
command-line arguments, and environment variables can be examined.
In addition, a separate file for each transaction can be created
by including the process ID in the file name. In Perl, this is
$$:
open(DUMP, ">>my_debug_file.$$.tmp") || die "cannot open dump file";
This code creates a new file for each run of the script.
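The same per-transaction trick translates directly to other languages. A Python sketch (the `open_dump` name is mine): fold the process ID into the dump file's name so concurrent requests never clobber one another's logs.

```python
# One debug file per process: the PID in the name keeps concurrent
# CGI transactions from interleaving their dump output.
import os

def open_dump(base="my_debug_file"):
    path = f"{base}.{os.getpid()}.tmp"      # one file per process
    return path, open(path, "a")            # append, as '>>' does in Perl
```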
Fetching the Contents of a URL: The http_get.c Program
Listing 20.17 shows the http_get.c
code that I used in the Perl socket example shown in Figure 20.8.
You'll see this program again, in Chapter 22's
discussion of text search tools.
Listing 20.17. The http_get.c code.
/* http_get - fetch the contents of an http URL
**
** Originally based on a simple version by Al Globus <globus@nas.nasa.gov>.
** Debugged and prettified by Jef Poskanzer <jef@acme.com>.
*/
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>
static char* argv0;
/* Gets the data at a URL and returns it.
** Caller is responsible for calling 'free' on returned *data.
** Returns -1 if something is wrong.
*/
long
getURLbyParts( void** data, char* machine, int port, char* file )
{
struct hostent *he;
struct servent *se;
struct protoent *pe;
struct sockaddr_in sin;
int sock;
int bytes;
char buf[10000];
char* results;
size_t size, maxsize;
char getstring[2000];
he = gethostbyname( machine );
if ( he == (struct hostent*) 0 )
{
(void) fprintf( stderr, "%s: unknown host\n", argv0 );
return -1;
}
se = getservbyname( "telnet", "tcp" );
if ( se == (struct servent*) 0 )
{
(void) fprintf( stderr, "%s: unknown server\n", argv0 );
return -1;
}
pe = getprotobyname( se->s_proto );
if ( pe == (struct protoent*) 0 )
{
(void) fprintf( stderr, "%s: unknown protocol\n", argv0 );
return -1;
}
bzero( (caddr_t) &sin, sizeof(sin) );
sin.sin_family = he->h_addrtype;
sock = socket( he->h_addrtype, SOCK_STREAM, pe->p_proto );
if ( sock < 0 )
{
perror( "socket" );
return -1;
}
if ( bind( sock, (struct sockaddr*) &sin, sizeof(sin) ) < 0 )
{
perror( "bind" );
return -1;
}
bcopy( he->h_addr, &sin.sin_addr, he->h_length );
sin.sin_port = htons( port );
if ( connect( sock, (struct sockaddr*) &sin, sizeof(sin) ) < 0 )
{
perror( "connect" );
return -1;
}
/* Send GET message to http. */
sprintf( getstring, "GET %s\n", file );
if ( write( sock, getstring, strlen( getstring ) ) != strlen( getstring ) )
{
perror( "write(GET)" );
return -1;
}
/* Get data. */
size = 0;
maxsize = 10000;
results = (char*) malloc( maxsize );
if ( results == (char*) 0 )
{
(void) fprintf( stderr, "%s: failed mallocing %d bytes", argv0, maxsize );
return -1;
}
for (;;)
{
bytes = read( sock, &results[size], maxsize - size );
if ( bytes < 0 )
{
perror( "read" );
return -1;
}
if ( bytes == 0 )
break;
size += bytes;
if ( size >= maxsize )
{
maxsize *= 2;
results = (char*) realloc( (void*) results, maxsize );
if ( results == (char*) 0 )
{
(void) fprintf( stderr, "%s: failed reallocing %d bytes", argv0, maxsize );
return -1;
}
}
}
*data = (void*) results;
return size;
}
/* Get the data at a URL and return it.
** Caller is responsible for calling 'free' on returned *data.
** url must be of the form http://machine-name[:port]/file-name
** Returns -1 if something is wrong.
*/
long
getURL( void** data, char* url )
{
char* s;
long size;
char machine[2000];
int machineLen;
int port;
char* file = 0;
char* http = "http://";
int httpLen = strlen( http );
if ( url == (char*) 0 )
{
(void) fprintf( stderr, "%s: null URL\n", argv0 );
return -1;
}
if ( strncmp( http, url, httpLen ) )
{
(void) fprintf( stderr, "%s: non-HTTP URL\n", argv0 );
return -1;
}
/* Get the machine name. */
for ( s = url + httpLen; *s != '\0' && *s != ':' && *s != '/'; ++s )
;
machineLen = s - url;
machineLen -= httpLen;
strncpy( machine, url + httpLen, machineLen );
machine[machineLen] = '\0';
/* Get port number. */
if ( *s == ':' )
{
port = atoi( ++s );
while ( *s != '\0' && *s != '/' )
++s;
}
else
port = 80;
/* Get the file name. */
if ( *s == '\0' )
file = "/";
else
file = s;
size = getURLbyParts( data, machine, port, file );
return size;
}
void
main( int argc, char** argv )
{
void* data;
long size;
argv0 = argv[0];
if ( argc != 2 )
{
(void) fprintf( stderr, "usage: %s URL\n", argv0 );
exit( 1 );
}
size = getURL( &data, argv[1] );
if ( size < 0 )
exit( 1 );
write( 1, data, size );
exit( 0 );
}
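For readers more comfortable above the socket layer, the C program's logic can be sketched in a few lines of Python (the function names `http_get` and `split_url` are mine). The request format mirrors the simple `GET file` line the C code sends, and `split_url` mirrors the machine/port/file parsing in `getURL()`:

```python
# A Python sketch of http_get.c: resolve, connect, send a bare GET line,
# and read until the server closes the connection.
import socket

def http_get(machine, port, path):
    data = bytearray()
    with socket.create_connection((machine, port)) as sock:
        sock.sendall(f"GET {path}\n".encode())   # same request the C code sends
        while True:
            chunk = sock.recv(10000)
            if not chunk:
                break                            # server closed: transfer done
            data.extend(chunk)
    return bytes(data)

def split_url(url):
    """Split http://machine[:port]/file into its parts, as getURL() does."""
    rest = url[len("http://"):]
    host, _, path = rest.partition("/")
    machine, _, port = host.partition(":")
    return machine, int(port) if port else 80, "/" + path
```

As in the C version, the default port is 80 and a missing file name becomes "/".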
Gateway Programming Fundamentals Check
The developer should understand the importance
of MIME headers, how to implement them in Perl and the Bourne
Shell, and how to use standard Perl toolkits to ensure proper
MIME headers.
The developer should be able to quickly
prototype code that uses standard input (forms, with METHOD=POST)
or environment variables such as PATH_INFO
and QUERY_STRING.
Debugging skills are essential. The developer
should be able to match the most commonly encountered errors with
likely causes and then take appropriate action.
Techniques such as using the Location
header to redirect the client to another URL, as well as server-side
includes, should be standard tools in the developer's arsenal.
The developer always should apply good
programming practice to a Web project-providing easy-to-read and
well-documented code, using subroutines to avoid redundancy (modularity),
and most important, not reinventing the wheel! Surf the Net and
scan the Usenet newsgroups to see how other sites have solved
similar problems.
Footnotes
The NCSA httpd distribution includes
the handy cgi-handlers.pl
set of useful subroutines, available via anonymous FTP at ftp://ftp.ncsa.uiuc.edu/Web/httpd/Unix/ncsa_httpd/cgi/cgi_handlers.pl.Z.
There is a similar package from Steve Brenner called cgi-lib.pl,
and it is retrievable from http://www.bio.cam.ac.uk/web/cgi-lib.pl.txt.
On-line documentation describing
environment variables is at http://hoohoo.ncsa.uiuc.edu/cgi/env.html.
On-line documentation describing
server-side include techniques and available variables is located
at http://www.webtools.org/counter/ssi/step-by-step.html
and, more specific to the NCSA httpd server, http://hoohoo.ncsa.uiuc.edu/docs/tutorials/includes.html.
You can find the Un-CGI
package at http://www.hyperion.com/~koreth/uncgi.html.
The Perl newsgroups-for example,
comp.lang.perl.announce
and comp.lang.perl.misc-have
frequent guest appearances from author Larry Wall.
Lynx is available from http://www.cc.ukans.edu/
and offers a browser which, if the client can live without graphics,
is a quick and handy way to browse the Web.
Thanks to Aleksey Shaposhnikov
for his programming labor on the Perl sockets application.
Lisa Ma, Peter Cheng, Peggy Liu,
and Heshy Shayovitz worked with me to create the resumé
application in the Spring 1995 Telecommunications class, Stern
School of Business, Information Systems Department, New York University.
Instructor: Professor Ajit Kambil.