Composing Good HTML

[HTML 2.0 Checked!]

Note: This document is available as both a single document (suitable for printing) and a multi-part document (more appropriate to hypertext). There is also a PostScript version available via FTP, at jupiter.willamette.edu, as /outgoing/jtilton/strict-html.ps. These multiple views are automatically generated with a Perl script called "multiview".

----------------------------------------------------------------------------

What's New with this document

New (June 2, 1995): The current edition of this document is available online at http://www.cs.cmu.edu/~tilt/cgh/. This is a new address, reflecting my current status as a graduate student in Computer Science at Carnegie Mellon University. This new URL will be stable until the year 2000, I expect. The old URL will remain available at least until December 1995, but is not guaranteed to be current.

New (April 26, 1995): This document will appear in substantially revised form this fall as a chapter in a new book from Addison-Wesley, Web Weaving (ISBN 0-201-48959-7), by Tilton, Steadman, and Jones; the book will address issues of creating and maintaining Webs as infostructures that are relevant, usable, and maintainable. Watch this space for upcoming details.

----------------------------------------------------------------------------

Introduction

As the Web continues to explode in its own inimitable fashion, it is becoming more and more important to write HTML that conforms to certain guidelines. Specifically, with the current diversity of clients for the Web (and we can only expect to see more!), it's become important to write HTML that will look good on any client, and not just on the specific client which the author may have access to.

To that end, there are a few solutions. One approach is this one -- documents which point out common errors one might make in the composition of HTML.
The other approach is software-based -- a "lint"-like program for catching semantic errors in HTML, and perhaps even correcting them (for this, you should examine either HalSoft's HTML Validation Service or WebLint, two services which I have failed to list here for far too long).

Several astute observers have noted that "Composing Good HTML" contains some HTML errors -- although there's a reason for that.

The thing to bear in mind is that, if you follow these guidelines, your document may not look as good as it possibly could on a particular browser. However, it also will not look ugly on any browser, which is the risk you take by disregarding these recommendations and tweaking your HTML for, say, Mosaic. Unfortunately, Mosaic may render things differently from Lynx, which may render things differently from TkWWW, etc, etc, etc. These guidelines, in essence, should ensure the best fit across the space of all possible browsers, if you get my drift.

This document does not purport to be a style guide, or a beginner's manual to HTML. Fine documents already exist for these purposes.

(Note: This document is fairly stable, but still open to amendment. Please feel free to comment on that which is missing, wrong, right, or silly. Especially, please point out anywhere that I don't follow my own guidelines -- I'll slink back and fix it, I promise! Thanks to everyone who's already done so!)

----------------------------------------------------------------------------

Contents of this Document

* What's New with this document
* Introduction
* Contents of This Document (Douglas R. Hofstadter, Please...)
* Good Practices
    o Signing Documents, and Time-Stamps
    o (anything else?)
* Common Errors
    o Paragraph Break Errors
    o Character and Entity Reference Errors
    o URL Errors
        + Directory Reference Errors
        + Not Using Fully Qualified Domain Names
        + Improper Use of Relative URLs
    o Missing Quotes in Start Tags
    o Missed End Tags
* Things to Avoid
    o Mixing HEAD and BODY Elements
    o Using White Space Around Element Tags
    o Heading Usage
    o Meaningless Link Text
    o Physical vs. Logical Character Emphasis
* Deprecated and Obsolete Elements
* For More Information
* Acknowledgments

Good Practices

Things contained in this section are good practices for the generation of any HTML document. Specifically, this would include anything which should routinely be done in the creation of documents for the benefit of both reader and author.

Signing Documents, and Time-Stamps

It is a good idea to sign and date all documents served on the Web, so that people viewing the documents can form some impression of the authority of the document (i.e. how recent it is, and how reliable the information provider is). For example, this document has been signed.

Also, when dating a document, try to avoid ambiguous formats. For example, both the month/day/year and day/month/year formats are used on the Web -- so is "4/2/94" April 2 or February 4? A solution to this is to use the name of the month (or an abbreviation).

Finally, the best way to sign a document is to include a LINK element of type "made" in your HEAD element. For example:
<P> element is that it signals an end-of-paragraph, rather than a paragraph break. According to the specification, "<P> is used between two pieces of text which otherwise would be flowed together". In most cases this is not important -- functionally, the <P> serves as an end-of-paragraph marker. However, in certain contexts, use of <P> should be avoided, such as directly before or after any other element which already implies a paragraph break. To wit, the <P> element should not be placed either before or after the headings, HR (can I get a ruling on this? people don't handle HR consistently... X Mosaic has no white space before or after, and Lynx appears to put white space after), ADDRESS, BLOCKQUOTE, or PRE. It should also not be placed immediately before or after a list element of any stripe. That is, a <P> should not be used to mark the end-of-text for
<P> in order to fix white space problems, please think twice and avoid it if you can.

Also, when using the glossary list (DL), please try to avoid using multiple DD's (definitions of terms) in order to provide multiple entries for a term (DT). Instead, use a <P> marker between paragraphs in a definition. The use of a DD (definition) without a matching DT (term) is illegal, although a DT without a DD can be used without dire consequences. All clear now?

Character and Entity Reference Errors

Simply put, a character reference and an entity reference are ways to represent information that might otherwise be interpreted as a markup tag. For instance, in order to represent <P> in this text, I had to use &lt;P&gt; in my raw HTML. There are currently five entities for this purpose in HTML, as well as several entities which allow encoding of the ISO Latin-1 Character Set.

The most common error in the use of references is to leave off the trailing semicolon. Also, no additional spaces are needed before or after the entity/character reference.

URL Errors

Another misunderstood aspect of HTML is in the composition of URLs.

Directory Reference Errors

One grey area involves references to directories. It is possible to request an index of a directory from an HTTP server. The typical response from the server is to either return a pregenerated index document (which is often the document "index.html" in the referenced directory), or to construct an HTML document on the fly which contains a listing of all files in the directory.

However, when making such a directory reference, it is important to make sure to have a trailing slash on the URL. That is, if you were to request the index of the directory which this document resides in, you would want to refer to it as http://www.cs.cmu.edu/~tilt/, not as http://www.cs.cmu.edu/~tilt. Some servers are able to catch these errors, and provide redirection to the proper URL, but it's best to get the URL right in the first place -- notably because not all browsers support transparent redirection.

Not Using Fully Qualified Domain Names

Problems can arise when the hostnames in URLs aren't fully qualified. In local networks, you can usually refer to your own machines simply by their names -- for instance, here at Willamette we refer to our local WWW server as "www". However, the server's FQDN (fully qualified domain name) is "www.cs.cmu.edu". The FQDN provides enough information that any host, anywhere on the Internet, can find this particular machine. (It's like trying to find all the Vermeers in New York :).
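One rough way to catch unqualified hostnames before publishing is to check whether a link's hostname contains a dot. The sketch below uses Python's standard urllib.parse module; the function name and the sample URLs are hypothetical, and the dot test is only a heuristic, not a real DNS lookup.

```python
from urllib.parse import urlsplit


def looks_fully_qualified(url: str) -> bool:
    """Heuristic: treat a hostname containing a dot as fully qualified.

    This never consults DNS; it only flags bare names such as "www".
    """
    host = urlsplit(url).hostname or ""
    # Ignore a trailing root dot ("www.cs.cmu.edu.") before testing.
    return "." in host.rstrip(".")


# Hypothetical links: the first would only work inside the local network.
print(looks_fully_qualified("http://www/page.html"))             # False
print(looks_fully_qualified("http://www.cs.cmu.edu/page.html"))  # True
```

A relative URL has no hostname at all, so this check reports False for it; relative links are a separate question.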
What happens is that an HTML author might construct a link that looks like this: Metanoia -- A Change In Spirit which produces a link to Metanoia -- A Change In Spirit that will only work for people in the local network that that machine is on. A correct link would look like this, instead: Metanoia which would allow all of you who are interested in Metanoia to actually follow the link.

This leads almost directly into:

Improper Use of Relative URLs

Finally, a brief section on relative URLs. It is possible to construct a "relative" URL, which gives you the following advantages:

* It's shorter.
* It makes a collection of documents which are linked together more portable (easier to move from directory to directory, or server to server).

However, relative URLs can also break things. A relative URL is a URL which doesn't contain all the necessary parts of a "full" URL (scheme, host, path information). There are a large number of things which might fit this description! The browser will try to assume the parts that have been "left out" by using the information from the URL of the document which contains the link. However, not all browsers will make these assumptions in the same way.

Here's a short list of what's "safe" and "unsafe" (based on experience, and not on a specification anywhere -- unfortunately).

Safe: Same directory relative URLs
    A reference to a document in the same logical directory (such as Good Practices) is safe. This kind of reference, roughly speaking, contains no "/"'s.

Safe: Same server relative URLs
    A reference to a document in the same server (such as Eric's Hyplan) is also safe. This kind of reference, roughly speaking, will begin with a "/". (It will also be semi-absolute, in that it starts at the top of that server's directory structure...)
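How the missing parts get filled in for the two safe cases above can be sketched with Python's standard urllib.parse.urljoin. The file names below are hypothetical, and urljoin applies one particular set of resolution rules; it is an assumption that any given browser behaves the same way.

```python
from urllib.parse import urljoin

# URL of the document that contains the links (hypothetical file name).
base = "http://www.cs.cmu.edu/~tilt/cgh/index.html"

# Same-directory reference: no "/" in it, so only the last path segment changes.
same_dir = urljoin(base, "good-practices.html")

# Same-server reference: the leading "/" keeps scheme and host, replaces the path.
same_server = urljoin(base, "/~tilt/hyplan.html")

print(same_dir)     # http://www.cs.cmu.edu/~tilt/cgh/good-practices.html
print(same_server)  # http://www.cs.cmu.edu/~tilt/hyplan.html
```

In both cases the scheme and host come from the containing document, which is why these links keep working when the whole collection moves to another server.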
Unclear: Most other kinds of relative URLs
    References such as can be dangerous -- sometimes browsers will interpret that as meaning "go up one directory level, find the directory '~tilt', and then find 'euphonium.html' in it." And sometimes they won't. Currently, I don't understand this problem well enough to speak about it. I will try to get a canonical answer when next I have the energy to update this document.

Unsafe: "file://localhost/..."
    It's also possible to have a reference to "file://localhost/some/file/pathname". This references the file described on the local host of whoever is browsing the document. Which is why a reference to will display the message of the day on your machine, not the message of the day on my machine. Unless you know what you are doing, these references will really mess up your documents.

(This sub-section isn't written very well, I fear. If anyone has any better copy, I'll gladly put it here instead. -et/April 7, 1994)

Missing Quotes in Start Tags

One common error that I used to make all the time (I use Marc Andreessen's html-mode.el for Emacs these days -- I had to learn Emacs, but now it's so much easier to write HTML!) was to leave off a quote in my start tags. For example, this reference to the euphonium, king of instruments should look like: but I would often use CZeCh THIZ 0uT would be rendered as CZeCh THIZ 0uT . On some browsers, there may be white space around the anchor, which adds unwanted unsightliness to the rendering, and may lessen the impact of the document. (This comment really applies to white space immediately following start tags, and immediately preceding end tags).

Heading Usage

The HTML specification points out that a heading should not be more than one level below the heading which preceded it. That is,