Re: IETF HTML V. 2.0 Question

Murray Altheim (murray@spyglass.com)
Wed, 12 Jun 1996 14:24:17 -0500


Message-Id: <v02110114ade49b11eb44@[140.186.34.50]>
Date: Wed, 12 Jun 1996 14:24:17 -0500
To: "Solko, Dave (SOLKODE)" <SOLKODE@exchange.uc.edu>
From: murray@spyglass.com (Murray Altheim)
Subject: RE: IETF HTML V. 2.0 Question
Cc: www-html@w3.org

Dave Solko <SOLKODE@exchange.uc.edu> writes:
>>From:  murray@spyglass.com
>>Sent:  Thursday, June 6, 1996 9:39PM
>>
>>Figure 3 on page 360 of Goldfarb's "The SGML Handbook" describes the
>>SGML markup characters. Those that are processed in a document instance
>>need to be 'escaped'. In HTML these include STAGO ('<'), TAGC ('>'),
>>LIT ('"') and ERO ('&').
>
>So, in order for my HTML to be proper SGML, I have to use the escape
>characters for all quotes, ampersands and angle-brackets? I have only
>found the need to escape from the angle-brackets (and actually, only the
>less than sign) in order for my HTML to be displayed correctly and
>validate.

Dave,

You are correct. You only have to be sure your content doesn't
"accidentally" create HTML markup: it's a matter of context. In proper
context, *none* need to be replaced in a system using an accurate SGML
parsing engine (one that can always figure out context), but on the Web,
where both browsers and documents are frequently broken, replacing them is
simply safer. I said "need to" because on the Web, browser parsing engines
may or will make mistakes, and some mistakes can be costly. And judging from
the percentage of invalid documents on the Web, it's better to err on the
safe side.
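
To illustrate what "a matter of context" means, here is a minimal Python
sketch that flags only those "&" and "<" characters that sit where a parser
could take them for markup. The name find_risky and the pattern are
hypothetical, and the rule is a rough approximation of the SGML
delimiter-in-context rules, not a full implementation of them.

```python
import re

# Rough approximation (an assumption, not the full SGML rules):
# "&" is risky before a letter or "#"; "<" is risky before a letter,
# "/", "!" or "?". Anywhere else they are treated as plain data.
RISKY = re.compile(r"&(?=[A-Za-z#])|<(?=[A-Za-z/!?])")

def find_risky(text: str) -> list[tuple[int, str]]:
    """Return (offset, character) for each delimiter found in a risky context."""
    return [(m.start(), m.group()) for m in RISKY.finditer(text)]

# "&" in "AT&T" and "<" in "<b>" are flagged; "3 < 4" and "& more" are not.
print(find_risky("AT&T sells 3 < 4 widgets & more <b>"))
```

A broken browser may not apply even this much context, which is why blanket
replacement is the safer course.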

I've seen whole sections of text disappear from home pages of prominent
companies because the pages were never validated. Often, simply replacing
those four characters with their entity equivalents would have made the
problem visible, even if the document authors didn't validate their
document.

I advocate (particularly for machine translators) simply replacing all the
"sensitive" characters with their entity equivalents, with some reasoning:

    "&"
         not followed by a space may accidentally be read as the start of
         an entity reference
    "<"
         followed by a character may be read as the start of a tag. In some
         browsers, any "<" is simply parsed as the beginning of a tag.
    ">"
         if there has been an erroneously created start tag, then the
         presence of ">" may accidentally cause a parser to hide a section
         of text it considers markup. Replacing ">" with &gt; will at least
         make this much less likely to pass unnoticed.
    quotes
         as above, a browser may attempt to resolve bad markup by
         searching forward to the next quote character. Such an error is
         much less likely to pass unnoticed if quotes only appear in markup.
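
The replacement described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not any particular
translator's implementation: the names SENSITIVE and escape_sensitive are
hypothetical, and &quot; is assumed as the entity for the double quote.

```python
# Hypothetical table: the four "sensitive" characters and their entities.
SENSITIVE = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
}

def escape_sensitive(text: str) -> str:
    """Replace &, <, > and " in character data with entity references."""
    # str.translate looks at each input character exactly once, so the "&"
    # inside an already emitted "&lt;" is never escaped a second time.
    return text.translate(str.maketrans(SENSITIVE))

print(escape_sensitive('if a < b && c > "d"'))
```

This must be applied to content only, never to text that already contains
intended markup, since it would escape the real tags as well.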

I also recommend validating your documents. But even validation won't
substitute for a good proofreading, given that a document that is
technically valid might still have bad markup (i.e., the parser might
interpret particular combinations of erroneous markup in a valid way, hence
my stressing entity replacement).

Murray

```````````````````````````````````````````````````````````````````````````````
     Murray Altheim, Program Manager
     Spyglass, Inc., Cambridge, Massachusetts
     email: <mailto:murray@spyglass.com>
     http:  <http://www.stonehand.com/murray/murray.html>