- From: Murray Altheim <murray@spyglass.com>
- Date: Wed, 12 Jun 1996 14:24:17 -0500
- To: "Solko, Dave (SOLKODE)" <SOLKODE@exchange.uc.edu>
- Cc: www-html@w3.org
Dave Solko <SOLKODE@exchange.uc.edu> writes:
>>From: murray@spyglass.com
>>Sent: Thursday, June 6, 1996 9:39PM
>>
>>Figure 3 on page 360 of Goldfarb's "The SGML Handbook" describes the
>>SGML markup characters. Those that are processed in a document instance
>>need to be 'escaped'. In HTML these include STAGO ('<'), TAGC ('>'),
>>LIT ('"') and ERO ('&').
>
>So, in order for my HTML to be proper SGML, I have to use the escape
>characters for all quotes, ampersands and angle-brackets? I have only
>found the need to escape from the angle-brackets (and actually, only the
>less than sign) in order for my HTML to be displayed correctly and
>validate.

Dave,

You are correct. You only have to be sure your content doesn't
"accidentally" create HTML markup: it's a matter of context. In proper
context, *none* of these characters need to be replaced in a system using
an accurate SGML parsing engine (one that can always determine context),
but on the Web, where both browsers and documents are frequently broken,
it's simply safer.

I stated "need to" because on the Web, browser parsing engines may, and
will, make mistakes, and some mistakes can be costly. And judging from the
percentage of invalid documents on the Web, it's better to err on the safe
side. I've seen whole sections of text disappear from the home pages of
prominent companies because the pages were never validated, and often
simply replacing those four characters with their entity equivalents would
have made the problem visible, even if the document authors didn't
validate their documents.

I advocate (particularly for machine translators) simply replacing all the
"sensitive" characters with their entity equivalents, with some reasoning:

  "&"     not followed by a space accidentally becomes an entity
          reference.

  "<"     followed by a character becomes a start tag. In some browsers,
          this is simply parsed as the beginning of a tag.

  ">"     if there has been an erroneously created start tag, then the
          presence of ">" may accidentally cause a parser to hide a
          section of text it considers markup. Replacing ">" with "&gt;"
          will at least make this much less likely to pass unnoticed.

  quotes  as previously, a browser may attempt to resolve bad markup by
          searching forward to the next quote character. That is much
          less likely to happen if quotes appear only in markup.

I also recommend validating your documents. But even validation won't
substitute for a good proofreading, given that a document that is
technically valid might still have bad markup (i.e., the parser might
interpret particular combinations of erroneous markup in a valid way,
hence my stressing entity replacement).

Murray

```````````````````````````````````````````````````````````````````````````````
     Murray Altheim, Program Manager
     Spyglass, Inc., Cambridge, Massachusetts
     email: <mailto:murray@spyglass.com>
     http:  <http://www.stonehand.com/murray/murray.html>
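
The entity replacement recommended in the message above could be sketched roughly as follows. This is only an illustration, not code from the original post: the names ENTITIES and escape_sensitive are hypothetical, and the sketch assumes the input is plain character data that contains no intentional markup and no existing entity references.

```python
# Hypothetical helper: replace the four "sensitive" characters
# (ERO '&', STAGO '<', TAGC '>', LIT '"') with entity references.
ENTITIES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
}


def escape_sensitive(text: str) -> str:
    """Return text with &, <, > and " replaced by entity references.

    Each input character is examined exactly once, so the '&' that
    begins a generated entity is never escaped a second time.
    """
    return "".join(ENTITIES.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(escape_sensitive('if a < b && c > d, say "ok"'))
```

Note that an existing reference such as "&lt;" in the input would be escaped again to "&amp;lt;", which is why the sketch assumes unescaped character data. Python's standard html.escape function performs a similar replacement (and also escapes the apostrophe by default).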
Received on Wednesday, 12 June 1996 14:26:11 UTC