For those of you with HTML experience, please bear with me. I may actually provide some information you don't already have. I will undoubtedly have some opinions and you might disagree with them.
The first step in authoring an HTML document is to acquire an information base. Unlike the other things we have covered in this class, there are no real man pages for HTML. However, there are lots of on-line resources. NCSA is the National Center for Supercomputing Applications at the University of Illinois, Urbana-Champaign. Their Beginner's Guide to HTML is a very nice starting point. I also use a site that is more like man pages, called HTML Elements. These two get me through most days. Perhaps the best reference, though, is the organization that sets the HTML standards, at www.w3.org. I use the HTML 4.0 specification when I need to find out esoterica. Be warned: just because it is the current specification does not mean that every browser supports it.
This is another thing that must be understood to construct good HTML documents. Not every browser implements all of any particular version of HTML. And the specifications allow things like this:
The PRE element tells visual user agents that the enclosed text is "preformatted". When handling preformatted text, visual user agents:
- May leave white space intact.
- May render text with a fixed-pitch font.
- May disable automatic word wrap.
- Must not disable bidirectional processing.
You can see that there is a lot of leeway in how the agents (browsers) can handle things and still be HTML 4.0 compliant. You will also encounter documents that don't react "properly" when you access them. This is because someone used browser-specific tags and never bothered to check what the document looks like in some other browser. It is especially noticeable with documents created by automatic authoring tools, which tend to use Explorer or Netscape extensions and completely ignore how other browsers might display the document. Unfortunately, although it is discouraged, this use of browser-specific tags is not prohibited in a "compliant" document.
In the normal course of things, tags are paired. That means that there is a tag which begins some type of format/display and another tag specifying the end of the format/display. Usually the ending tag is the same as the beginning tag except it has a / (slash) in front of the text. For example, to display a section of text in bold font use <b>, and to stop the bold use </b>. The text enhancements like bold, italics, and teletype require an ending tag. Many other tags can be used without ending tags; they end implicitly when some other tag is encountered. An example is <p>, paragraph. Paragraph forces a line break after the tag to separate paragraphs of text. Any particular paragraph ends when a new one begins, so the ending tag, although not incorrect, is not necessary.
I think this is a good place to discuss how text is normally displayed. The browser places one blank space between words regardless of how many whitespace characters separate them. For instance, in the source of this page these words are each on a separate line and there are two blank lines here. This makes it easy to type without worrying about such things as how many words are on a line. It makes it a little more difficult to come up with a specific text format.
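The collapsing rule can be imitated in a couple of lines of Python. This is only a rough model of what a browser does with ordinary text, and the function name is made up for illustration.

```python
def collapse_whitespace(source: str) -> str:
    """Approximate how a browser lays out ordinary text:
    every run of spaces, tabs and newlines becomes one space."""
    return " ".join(source.split())


typed = "these   words\nare\n\n\non separate\tlines"
print(collapse_whitespace(typed))  # these words are on separate lines
```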
We have already discussed the paragraph, <p>, tag. These fancy larger sized text segments that are used to mark sections are begun with <hn> where n is a number between 1 and 6. The idea is that each number indicates a sub-section heading; 1 is the "largest" and 6 is the "smallest". These do require an ending tag.
There are several ways to format text. The easiest is preformatted, <pre>, which means keep the text between the beginning and ending tags exactly the way it was typed in. This means that if the text is long or the browser window is resized, the preformatted text will not wrap. If you are just looking for indentation, the best way is with lists. There is an ordered list, <ol>, and an unordered list, <ul>. Ordered lists are numbered, and if nested the numbers restart from the beginning. Netscape has an extension that would allow you to change the type of numbering to correspond to an outline style.
Including images such as GIF and JPEG is possible using <img src=> and placing the name of the source file after the src=.
Finally, special characters. Special characters are specified using either &name; or &#value;, where value is the character's numeric code. Please include the semi-colon. It is part of the specification and many browsers require it to display the characters correctly. Characters that require this format include the ones HTML itself uses as markup: <, > and &.
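If you generate HTML from a program, a library can do this substitution for you. As a small sketch, Python's standard html module converts in both directions; the sample strings are invented for illustration.

```python
import html

# Escape the characters that would otherwise be read as markup.
escaped = html.escape("if a < b && b > c", quote=False)
print(escaped)  # if a &lt; b &amp;&amp; b &gt; c

# Decode a named and a numeric reference back into characters.
decoded = html.unescape("&copy; and &#169;")
print(decoded)  # © and ©
```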
Last comment. When editing HTML documents, make sure that you save the document after each edit and reload it to see the changes. One problem is that browsers like Netscape cache the pages they have read recently and update the cache only intermittently, so you need to find out how to make your browser refresh its cache. You can change the cache preferences, and you can usually force an update for a particular page. With Netscape, holding the shift key while selecting Reload forces the cache update.
CGI stands for Common Gateway Interface. This is supposedly a 'standard' interface that lets web servers execute other programs, and this execution is 'transparent' to the client. By this I mean that the person who uses a browser like Netscape to access a site has no direct knowledge of what the server there is doing, and needs to know only that if he clicks on what looks like a link, something will happen. The client needs to know none of the gory details.
There is a penalty for this 'transparency'. That is security. Our www-home directories, and all the files in them, need to be world readable and executable, and the same applies to the CGI files (commonly called scripts). This allows the potential evil hacker to cause problems using even the simplest scripts if you don't take precautions. Assume the worst and always err on the side of safety. To begin with, read the following page on CGI security. For the moment, make sure that your cgi-bin directory, and in fact all your www-home directories, are NOT writable by anyone but you.
1. Read the user's form input.
2. Do what you want with the data.
3. Write the HTML response to STDOUT.
Enough of the paranoia. We need to move on to how this thing works. First, our server, like many others, expects to find the CGI scripts in a directory named cgi-bin. On our system this is a subdirectory of the user's www-home directory and needs to have world readable and executable permission (0755). Some systems have only one cgi-bin directory for everyone, and some use some other common directory. So to start you will need a www-home directory in your home directory and a cgi-bin directory inside that, and both need to be world readable and executable (0755).
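Most people would set this up with mkdir and chmod from the shell. The same steps can be sketched in Python, which also makes it easy to check that nobody else can write to the directories. The function names are hypothetical, and a temporary directory stands in for your real home directory so the sketch touches nothing important.

```python
import os
import stat
import tempfile


def make_cgi_dirs(home: str) -> str:
    """Create home/www-home/cgi-bin with mode 0755 and return the cgi-bin path."""
    cgi_bin = os.path.join(home, "www-home", "cgi-bin")
    os.makedirs(cgi_bin, exist_ok=True)
    # chmod explicitly: the mode given to makedirs is filtered by the umask.
    for path in (os.path.dirname(cgi_bin), cgi_bin):
        os.chmod(path, 0o755)
    return cgi_bin


def writable_by_others(path: str) -> bool:
    """True if group or world has write permission on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


# A temporary directory is an assumed stand-in for your home directory.
with tempfile.TemporaryDirectory() as fake_home:
    cgi_bin = make_cgi_dirs(fake_home)
    print(oct(stat.S_IMODE(os.stat(cgi_bin).st_mode)), writable_by_others(cgi_bin))
```

On a Unix system, mode 0755 means the owner can read, write and execute, while everyone else can only read and execute.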
Now what? Well, before you write your first script there is some background information you will need. First, an overview. When the server receives a request, it gets some data and executes a script to which it passes that data. The script reads in the data, does something, sends some data back to the server and exits. The server takes this data from the script and performs some appropriate action. A simplistic view, but good enough.
For the first part, when the request is sent through the CGI, the server puts data into predefined holding areas and formats the data. The holding areas are in the form of environment variables set for the script. There are a large number of these environment variables set for your script. They fall into three groups: server-specific, client-specific and request-specific. Server-specific variables are about the server itself: SERVER_NAME, SERVER_PORT, SERVER_PROTOCOL, SERVER_SOFTWARE and GATEWAY_INTERFACE. Client-specific variables are all prefixed with HTTP_ and include ACCEPT (the MIME types the client will accept), REFERER (URL of the document that gave the link to the current document) and USER_AGENT (client software). There are more, but they are not really interesting or often used/supplied.
Of more interest are the request-specific variables. The 3 most important are REQUEST_METHOD, CONTENT_LENGTH and QUERY_STRING. The combination of these tells how the request was sent and how much information is available, and can provide the information itself. There are more of these variables too; we will take a quick look at how many more later.
REQUEST_METHOD : either POST or GET; any other string should be treated as invalid.
CONTENT_LENGTH : number of bytes passed via STDIN from a POST method.
QUERY_STRING : data passed as part of the URL, anything after ? in the URL.
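As a sketch of how a script might pick up these three variables, here is a small Python function. The function name and the returned dictionary layout are my own invention, not part of CGI; a real script would simply use the process environment, which is the default here.

```python
import os


def describe_request(environ=os.environ) -> dict:
    """Collect the three request-specific variables discussed above."""
    method = environ.get("REQUEST_METHOD", "")
    if method not in ("GET", "POST"):
        # Anything other than GET or POST is treated as invalid.
        raise ValueError(f"unsupported REQUEST_METHOD: {method!r}")
    try:
        length = max(int(environ.get("CONTENT_LENGTH") or 0), 0)
    except ValueError:
        length = 0  # a missing or malformed length means nothing to read
    return {
        "method": method,
        "content_length": length,
        "query_string": environ.get("QUERY_STRING", ""),
    }


# A plain dictionary stands in for the environment the server would set.
print(describe_request({"REQUEST_METHOD": "GET", "QUERY_STRING": "name=Ann+Lee"}))
# {'method': 'GET', 'content_length': 0, 'query_string': 'name=Ann+Lee'}
```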
We have seen that information is stored in these environment variables that are made available to the script when it executes. But not all the information is always in the variables. Sometimes we need to read from STDIN. This is as obvious as it sounds. When the method is POST, data is piped to STDIN, where it can be read in the normal fashion. CAUTION!! Some people who are naturally malicious try things like writing 4k to a 4 character field. The server doesn't know anything about how much data your program wants, so it is up to you to make sure that you read only as many bytes as you need. After you have gathered the input from whatever source and processed it in any manner you like, you should be ready to provide some output. The easiest form of this is writing back to STDOUT, which the server handles, interprets and returns to the client.
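One way to act on the caution about oversized input is to refuse any request whose declared length exceeds what your form could legitimately send, and to read no more than the declared length. The sketch below is in Python; the 4096-byte cap and the function name are assumptions, and in a real script the stream would be sys.stdin.buffer rather than the in-memory stand-in used here.

```python
import io

MAX_BODY_BYTES = 4096  # assumed cap; choose one that fits your form


def read_post_body(environ, stream, max_bytes=MAX_BODY_BYTES) -> bytes:
    """Read exactly CONTENT_LENGTH bytes, rejecting oversized requests."""
    try:
        declared = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        declared = 0
    if declared < 0 or declared > max_bytes:
        raise ValueError("request body too large or length invalid")
    # Never read past the declared length, whatever else was sent.
    return stream.read(declared)


fake_stdin = io.BytesIO(b"name=Ann&extra=ignored-bytes")
print(read_post_body({"CONTENT_LENGTH": "8"}, fake_stdin))  # b'name=Ann'
```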
To send a response, first write the line
Content-type: text/html
plus another blank line, to STDOUT. After that, write your HTML response page to
STDOUT, and it will be sent to the user when your script is done.
That's all there is to it. Yes, you're generating HTML code on the fly.
It's not hard; it's actually pretty straightforward. HTML was designed to be
simple enough to generate this way.
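Here is a minimal sketch of that output step in Python. The function name and the page layout are invented for illustration; the only parts CGI cares about are the Content-type line and the blank line after it.

```python
import html
import sys


def build_response(title: str, body_html: str) -> str:
    """Return the header, the blank line, and then a small HTML page."""
    header = "Content-type: text/html\n\n"  # "\n\n" ends the line and adds the blank one
    page = (
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body_html}\n</body></html>\n"
    )
    return header + page


sys.stdout.write(build_response("Hello", "<h1>Hello, world</h1>"))
```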
There is a huge amount of other information available, and in time you may even need it, but for now let's try some simple things. Look at the link hw.pl. The first line specifies to the client what it is reading. This is the MIME (Multipurpose Internet Mail Extensions) format specifier. In this case the client should interpret everything as an HTML document. The blank line between the header and the body keeps things from coming out as garbage. The rest is just a simple HTML document.
Well that was simple enough but you did not see any of the fancy stuff about the variables. Now look at varlist.pl. So this just does some of the things we have already talked about in the Perl lecture. Particularly see that the %ENV variable is just a large associative array. I will leave you to figure out the exact details.
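For comparison, the same idea can be sketched in Python, where os.environ plays the role of %ENV. This is not a transcription of varlist.pl, only an illustration of walking the environment and printing it as an HTML list; the function name and sample values are made up.

```python
import html
import os


def render_env(environ=os.environ) -> str:
    """Return a CGI response listing every environment variable."""
    rows = [
        f"<li>{html.escape(name)} = {html.escape(value)}</li>"
        for name, value in sorted(environ.items())
    ]
    return (
        "Content-type: text/html\n\n"
        "<html><body><ul>\n" + "\n".join(rows) + "\n</ul></body></html>\n"
    )


# A small dictionary stands in for what the server would provide.
print(render_env({"SERVER_NAME": "example.edu", "QUERY_STRING": "a=1&b=2"}))
```

The values are escaped before printing because they come from the client and may contain characters that HTML treats as markup.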
There are a couple more things you need to know about CGI before you get started. The first is the format of the data that is passed via QUERY_STRING. It consists of
variable1=data1&variable2=data2&...
where the variables are simple names and the data are strings in the format
str1+str2+str3+...
The strings have NO spaces. All spaces must be represented by the + sign.
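To make the format concrete, here is a small Python sketch that splits such a string into names and values. The function name is hypothetical. It relies on unquote_plus from the standard urllib.parse module, which turns + back into a space and also decodes the %XX escapes used for other special characters.

```python
from urllib.parse import unquote_plus


def parse_query(query: str) -> dict:
    """Split variable1=data1&variable2=data2 into a dictionary."""
    pairs = {}
    for field in query.split("&"):
        if not field:
            continue  # tolerate a stray or trailing &
        name, _, value = field.partition("=")
        # If a name repeats, this sketch keeps only the last value.
        pairs[unquote_plus(name)] = unquote_plus(value)
    return pairs


print(parse_query("name=Ann+Lee&note=50%25+off"))
# {'name': 'Ann Lee', 'note': '50% off'}
```

The same module also offers parse_qs, which does all of this in one call and keeps repeated names as lists.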
This may seem odd but it works. Additionally, you need to know that the scripts are normally executed as the user nobody. So there is no shell .cshrc or .profile, no $HOME, nothing but what is set by the server.
When the user submits the form, your script receives the form data as a set of name-value pairs. The names are what you defined in the INPUT tags (or SELECT or TEXTAREA tags), and the values are whatever the user typed in or selected. This set of name-value pairs is given to you as one long string, which you need to parse. It's not very complicated, and there are plenty of existing routines to do it for you. Here's one in Perl
Characters whose ASCII values are above 127 or below 33 must be specially represented, using the percent sign followed by the two-digit hexadecimal value of the character. Of course this means that %, &, and + must all be escaped as well, because they represent specific things other than their character. I know you don't know the hex value of characters, but if you man ascii you can find these values.
So now to show you how to put this together. I would like to say that I have some original thought on the subject, but others have done much of the work already, so I won't reinvent the wheel. Please refer to the tutorial document. This was prepared by Luojian Chen, a research assistant and TA for Dr. Berry. If you have any questions, send me an email and we will go over it in class next week.