Re: Parsing methods

Murray Altheim (murray@spyglass.com)
Wed, 10 Jul 1996 21:03:21 -0500


Message-Id: <v0211010aae09ffbf5c1a@[140.186.34.50]>
Date: Wed, 10 Jul 1996 21:03:21 -0500
To: Lee Daniel Crocker <lee@piclab.com>
From: murray@spyglass.com (Murray Altheim)
Subject: Re: Parsing methods
Cc: www-html@w3.org, connolly@beach.w3.org

Lee Daniel Crocker <lee@piclab.com> writes:
>Daniel W. Connolly <connolly@beach.w3.org> writes:
>> >IE: should the parser see
>> >     <hello%^ myname=foo>
>> >as a TAG that was messed up........
>> >     OR
>> >as plain text?
>> >
>> >i say as a messed up tag.....
>>
>> And you'd be right.
>>
>> If you want to be sure, check with a validating SGML parser.
>
>As much as I would like to see producers use validation, and
>as useful as general-purpose SGML is to unambiguous communication
>of structured information, I must express a fundamental
>disagreement with Dan and some other SGML-heads on how to handle
>"invalid" SGML.
>
>I think that a human-written text-based format like SGML _should
>not have errors_, period.  I.e., the language should be a way
>to interpret whatever the hell the writer throws at you. [...]

I don't know why you assume that HTML is necessarily "human-written." You
might take a look at a good quality editor that enforces content models,
such as FrameMaker+SGML or SoftQuad's HoTMetaL Pro. The rest of your
statement seems to contradict your other points, i.e., that
SGML (of which HTML is an application) "should not have errors." I assume
you actually mean "should not hold to the principle of valid markup", but
then I must disagree. [But I suppose I'm as much an "SGML-head" as anyone.]

>  If it
>is clear and unambiguous HTML, great--interpret it that way and
>go on.  If not, I believe a reader should try to be flexible,
                             [browser?]
>and in most cases, just print questionable markup as is.  It is
>far more useful for a reader to see something like &emdas; on
>his screen when the author meant &emdash; than to see some
>meaningless error.  And for a parser to throw up its hands and
>refuse to parse <.. width=50%> rather than have some rules for
>dealing with markup like this.

Obviously, current browsers attempt this. It's simply unfortunate that
users aren't aware when the document displayed contains errors that may
affect the displayed content. And the kinds of errors I've seen have much
more profound consequences than a misspelled entity name.

We recently saw a Humana Inc. 8-K report containing 220 HTML errors. The
original document is 40 printed pages long. Due to errors, most of the
document does not even *appear* in some browsers. Another corporate home
page is missing its entire product announcement section. I could cite
dozens of examples.

You can't make parsing rules about how to consistently handle inconsistent
content. If the browser developer has to guess as to what broken markup
means, how can anyone be sure that two developers make the same guess?
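
To make this concrete, here is a minimal Python sketch of two recovery
strategies applied to the malformed tag quoted earlier in this thread. It is
not taken from any real browser: the function names and both strategies are
assumptions chosen for illustration. Each is a defensible guess, yet the two
display different text for the same input.

```python
import re

# Anything shaped like <...> counts as "tag-like" in this toy model.
ANY_TAG = re.compile(r"<[^<>]*>")
# A rough notion of a well-formed tag: the name starts with a letter and
# continues with letters, digits, '.' or '-'. This is an illustrative
# approximation, not the full SGML name rules.
VALID_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9.-]*(\s[^<>]*)?>")


def render_drop_broken_tags(source: str) -> str:
    """Hypothetical strategy A: treat every <...> as a tag and hide it."""
    return ANY_TAG.sub("", source)


def render_show_broken_tags(source: str) -> str:
    """Hypothetical strategy B: hide well-formed tags, print broken ones."""
    def keep_if_broken(match: re.Match) -> str:
        text = match.group(0)
        return "" if VALID_TAG.fullmatch(text) else text
    return ANY_TAG.sub(keep_if_broken, source)


sample = "<b>Greetings</b> <hello%^ myname=foo>world"
shown_a = render_drop_broken_tags(sample)
shown_b = render_show_broken_tags(sample)
print(repr(shown_a))
print(repr(shown_b))
```

Strategy A silently discards the broken tag; strategy B shows it to the
reader. Neither is "wrong", which is exactly the problem.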

>If that means two separate sets of rules for readers and writers,
>then so be it.

It's really not a matter of two sets of rules; it's a matter of being able
to guarantee that readers are actually viewing what authors are producing.
Invalid content almost guarantees that those readers not using the same
"browser used to validate the document" as the author will NOT view the
document the way the author intended.

If an airline, hospital or bank can't guarantee that documentation isn't
missing information, how could they use the web? How would you like it if
your bank or hospital used a browser to view documentation, and missed some
important information (such as yesterday's transactions, or some
prescription warning text)? Or if your airline mechanic missed part of a
repair because the document was invalid? How can anyone ever use HTML for
"serious" content if there isn't any guarantee that it's being displayed
correctly?

Inconsistent behavior can be much more than an annoyance; it could cost
someone a lot of money, or worse. And even valid markup is not necessarily
what the author intended, but validation usually points out the errors.
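
As a rough illustration of reporting errors rather than guessing at them,
the Python sketch below flags tag-like text whose name is not well formed and
reports the line it occurs on. It is nowhere near a validating SGML parser
(it knows nothing about DTDs or content models, and it would also flag
comments and declarations); the function name and the regular expressions
are assumptions made for this example.

```python
import re

# Anything shaped like <...> on a single line.
TAG_LIKE = re.compile(r"<[^<>]*>")
# Illustrative approximation of a well-formed start or end tag.
WELL_FORMED = re.compile(r"</?[A-Za-z][A-Za-z0-9.-]*(\s[^<>]*)?>")


def report_suspect_tags(source: str) -> list[tuple[int, str]]:
    """Return (line number, text) for each tag-like span that is not
    a well-formed tag under the rough rule above."""
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in TAG_LIKE.finditer(line):
            if not WELL_FORMED.fullmatch(match.group(0)):
                problems.append((lineno, match.group(0)))
    return problems


# The two broken tags are the examples quoted in this thread.
page = "<p>Welcome</p>\n<hello%^ myname=foo>\n<.. width=50%>"
problems = report_suspect_tags(page)
for lineno, tag in problems:
    print(f"line {lineno}: suspect tag {tag}")
```

The author sees both problems before publishing, instead of each reader's
browser quietly making its own guess.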

The case has been made over and over for valid content markup. I don't know
why this is so hard for folks to understand. It certainly doesn't make you
an "SGML-head" to want to guarantee your content is being seen.

Murray

```````````````````````````````````````````````````````````````````````````````
     Murray Altheim, Program Manager
     Spyglass, Inc., Cambridge, Massachusetts
     email: <mailto:murray@spyglass.com>
     http:  <http://www.stonehand.com/murray/murray.html>