Re: SGML and HTML

Hi,

> > Recently I realized that some constructs I considered a syntax
> > error, are actually allowed in SGML.
> 
> Such things come as surprises to the great majority of Web authors.

I'm not a web author. If I were, I probably wouldn't care -- I'd just
stick to XHTML and be happy. But I'm a browser programmer, so I have
to care :-(

> > and the HTML standard while *recommending* usage only of a few
> > widely accepted SGML constructs, doesn't really seem to forbid
> > anything allowed in SGML...
> 
> Well, the HTML 4.01 "standard" claims that HTML is an application of
> SGML, but extremely few browsers have even tried to implement HTML
> that way. Maybe it would be better if the HTML "standard" simply
> defined HTML as an SGML application _without certain SGML features_
> that can be dropped out when an SGML application is defined. It's a
> mystery to me why those features were included in the first place.

Well, now that I know how handy the shorttag features are, I really do
understand why the original creators of HTML didn't want to leave them
out... I only wish early HTML authors had actually used those
features, so browser programmers would have been forced to implement
them. It doesn't look like that's terribly hard, either.

On the other hand, I really wish they had turned datatag off. It's
error-prone, hard to implement, confuses syntax highlighters, and
doesn't offer any real benefit either. Same for start-tag omission --
this is really nasty from an implementer's point of view.

> Perhaps in a spirit of optimism, or ignorance (of what they really
> mean).

Well, this isn't the first time I've wondered whether the W3C is in
touch with reality at all...

> Should the "standard" be changed, then?

I believe so. The HTML 4 standard text explicitly mentions somewhere
that the aim is to write down "best current practice" -- it would be
only consistent to include the practice that no browsers implement
certain SGML features. (Not to mention how useful such a move would
be...)

> Well, they just did. XHTML makes the verbose canonical syntax a must.

Sure. From a technical point of view, XHTML is the solution. From a
practical one, it's a flop. (A "marketing" flop, specifically.)

Hardly anyone is using XHTML. To the average web author, it looks like
something completely new to learn, which is meant as a replacement for
HTML, but doesn't really offer any benefits, and isn't even supported by
all browsers. So why should they switch?

How many web authors know that it's actually just a new version of HTML
with somewhat stricter syntax rules? (How many even know that XHTML
exists?...)

Instead of this XML/XHTML buzzword shit, they should have just released
HTML 5 (or, better still, some earlier version) with stricter syntax
rules years ago. People would have known that it's the natural
successor, and accepted the (really small) update.

I still harbor a faint hope that the convenient improvements in XHTML 2
will arouse some interest... But for the most part, I've given up hope
that good old HTML, with all these lovely problems, will ever go away :-(

(I guess it would require considerably greater *practical* advances to
switch people over.)

> > Is this really the right conclusion: Is everything that's legal in
> > SGML/accepted by the validator automatically valid HTML,
> 
> By definition, yes. Unless you wish to assign confusing meanings to
> "valid" in this context, but then everyone and his dog can assign a
> new meaning to "valid".

I see now that my wording was wrong. I've read the thread on the new
validator beta in the archive. (BTW I do not think that checking browser
compatibility in addition to validity is inherently a bad idea; though I
completely agree that the way it is implemented now, it's totally bogus
and would be better left out...)

What I really meant to ask: Can an HTML document be called "correct"
(without assigning any specific technical meaning to that term), if it's
formally valid, but doesn't follow the recommendations about SGML usage
mentioned in the standard?...

Or, from a more practical angle: Should a browser, in the situation we
are in, try to implement as much of SGML as possible, even if nobody can
use it anyway? And is it OK to report constructs that are handled
incorrectly by most browsers as "errors"?

(Note that in practice, they almost certainly *are* errors on the
author's side, as no author would deliberately use features that he
knows will not have the desired effect in existing browsers.
Unescaped '<' or '&' characters, for example, generally mean that the
author either has forgotten to escape them, or doesn't know that
browsers handle this inconsistently. The question is only whether it is
OK also to *call* them "errors", although formally the document is
valid...)
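
To make the kind of warning I mean concrete, here is a rough Python
sketch that flags '&' and '<' characters that do not look like the start
of a reference or a tag. The function name and the two patterns are
purely illustrative assumptions, not taken from any real browser or
validator, and the check naively scans the whole text (it does not skip
comments, CDATA elements or attribute values):

```python
import re

# '&' that is not followed by something shaped like a character or
# entity reference (&#38; &#x26; &amp;). Illustrative heuristic only.
_BARE_AMP = re.compile(
    r"&(?!#[0-9]+;|#[xX][0-9A-Fa-f]+;|[A-Za-z][A-Za-z0-9]*;)"
)

# '<' that is not followed by a character that could begin a tag,
# an end tag, a declaration or a processing instruction.
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def find_suspicious(text):
    """Return sorted (offset, message) pairs for probably unescaped chars."""
    hits = []
    for pattern, label in (
        (_BARE_AMP, "unescaped '&'"),
        (_BARE_LT, "unescaped '<'"),
    ):
        for match in pattern.finditer(text):
            hits.append((match.start(), label))
    return sorted(hits)


if __name__ == "__main__":
    for offset, message in find_suspicious("<p>1 < 2 & 3 &amp; 4</p>"):
        print(offset, message)
```

Whether such hits get reported as "errors" or only as "warnings" is
exactly the open question; the sketch only shows that detecting them is
cheap.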

Well, I'm aware this list is not really the right place to ask such
questions... I just can't think of any forum where I could get an answer
that is authoritative in any sense :-(

> > Another interesting issue is error recovery. The validator for
> > example seems to immediately stop tag parsing and continue in
> > content parsing when encountering an illegal character, but skips
> > misformed attribute values or invalid declarations.
> 
> That depends on how the validator has been coded. It could be improved
> in many ways, but intelligent error recovery is really difficult.

You needn't tell me that. I've already spent way too much time
implementing error handling AI :-(

I'll probably end up with the smartest and most SGML-compliant HTML
parser, and still won't have made the browser any more usable by it
:-( I seriously wonder whether I wouldn't be better off just ripping
out all warnings and not caring about a few broken pages -- like all
other existing browsers...

> > Is there some kind of specification or recommendation for such
> > behaviour, or is the browser free to handle syntax errors as it
> > likes?
> 
> The HTML 4.01 specification imposes no requirements on error handling.

I was thinking of the SGML standard here...

One of my fundamental problems is that I have no access to the SGML
standard, and so far I haven't been able to find any other useful
resource about SGML syntax on the web either.

I was able to figure out a lot from various hints and by feeding tricky
test cases to the validator. But I still have no idea why an SGML parser
will accept <hr/>, for example -- according to the BNF productions (the
only part of the standard I could find on the web), a net-enabling start
tag is never explicitly closed, so shouldn't the > be treated as
content?...
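
To make the question concrete, here is a small Python sketch of that
reading of the productions: "<name/" ends the start tag, the next "/"
(if any) acts as the end tag, and for an element declared EMPTY nothing
further is consumed. All names here are hypothetical, the EMPTY set is
just a sample, and it ignores attributes and nesting; it illustrates the
reading only, not what any real SGML parser does:

```python
import re

# Sample of elements declared EMPTY in HTML (assumption: not exhaustive).
EMPTY_ELEMENTS = {"hr", "br", "img"}

# A net-enabling start tag without attributes: "<name/".
_NET_START = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/")


def expand_net(text, empty=EMPTY_ELEMENTS):
    """Rewrite "<name/.../" shorthand into verbose "<name>...</name>"."""
    out = []
    pos = 0
    while True:
        match = _NET_START.search(text, pos)
        if match is None:
            out.append(text[pos:])
            break
        out.append(text[pos:match.start()])
        name = match.group(1)
        if name.lower() in empty:
            # EMPTY element: the "/" ends the start tag, no end tag follows.
            out.append("<%s>" % name)
            pos = match.end()
            continue
        close = text.find("/", match.end())
        if close == -1:
            # No closing "/": leave the remainder untouched.
            out.append(text[match.start():])
            break
        out.append("<%s>%s</%s>" % (name, text[match.end():close], name))
        pos = close + 1
    return "".join(out)


if __name__ == "__main__":
    print(expand_net("<em/short/ form"))
    print(expand_net("<hr/>"))
```

Under this reading, "<hr/>" comes out as "<hr>" followed by a literal
">" in the content, which is what makes the validator's acceptance
puzzling to me.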

> If you ask me, that's a mistake, but that's the HTML tradition. XML
> has different rules, but we'll see how seriously they will be taken. I
> don't think browsers even try to play by XML rules for XHTML
> documents.

That's not really true. I don't know about other browsers, but Mozilla
at least does the correct thing when a document has the right MIME type.
(I.e. a local document with an .xhtml suffix or an HTTP page declared as
application/xhtml+xml.)
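
The decision described above can be sketched in a few lines of Python.
This is only a minimal illustration under my own assumptions (the
function name and the "xml"/"html" return values are made up, and it
covers just the two cases mentioned); it is not how Mozilla actually
implements it:

```python
def choose_parser(content_type=None, filename=None):
    """Pick "xml" or "html" parsing from a MIME type or a file suffix."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/xhtml+xml":
            return "xml"
        # Anything else, including XHTML served as text/html,
        # goes to the lenient HTML parser.
        return "html"
    if filename and filename.lower().endswith(".xhtml"):
        # Local document: fall back to the suffix.
        return "xml"
    return "html"


if __name__ == "__main__":
    print(choose_parser("application/xhtml+xml; charset=utf-8"))
    print(choose_parser("text/html"))
    print(choose_parser(filename="page.xhtml"))
```

Note that a declared MIME type always wins over the suffix here, so an
XHTML page served as text/html never reaches the strict parser.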

The problem is that web servers usually deliver XHTML documents as
normal HTML -- which is probably a good idea, as otherwise older
browsers wouldn't be able to handle them at all, even though they are
otherwise compatible...

The only question here is whether it would be better if browsers aware
of XHTML ignored the MIME type in such a situation. But only the major
vendors are really in a position to decide on this -- as long as the
popular browsers accept broken (so-called) XHTML, every browser not
doing so will shut itself out completely :-(

I recently visited some moron's site, who claims that his pages are
XHTML (as he has "moved on"), and that if some browser doesn't display
them correctly it's not his fault -- but in the next sentence states
that he doesn't care that his pages do not validate (actually, they
aren't even well-formed), as he "ceased to be a perfectionist"...

-Olaf-

-- 
Don't buy away your freedom -- GNU/Linux

Received on Sunday, 7 September 2003 12:49:22 UTC