- From: <olafBuddenhagen@web.de>
- Date: Sun, 7 Sep 2003 18:48:21 +0200
- To: www-validator@w3.org
Hi,

> > Recently I realized that some constructs I considered a syntax
> > error, are actually allowed in SGML.
>
> Such things come as surprises to the great majority of Web authors.

I'm not a web author. If I was, I probably wouldn't care -- I'd just
stick to XHTML and be happy. But I'm not a web author. I'm a browser
programmer. So I have to care :-(

> > and the HTML standard while *recommending* usage only of a few
> > widely accepted SGML constructs, doesn't really seem to forbid
> > anything allowed in SGML...
>
> Well, the HTML 4.01 "standard" claims that HTML is an application of
> SGML, but extremely few browsers have even tried to implement HTML
> that way. Maybe it would be better if the HTML "standard" simply
> defined HTML as an SGML application _without certain SGML features_
> that can be dropped out when an SGML application is defined. It's a
> mystery to me why those features were included in the first place.

Well, now that I know how handy the shorttag features are, I really do
understand why the original creators of HTML didn't want to leave them
out... I only wish early HTML authors had actually used those
features, so browser programmers would have been forced to implement
them. It doesn't look like that's terribly hard, either.

On the other hand, I really wish they had turned datatag off. It's
error prone, hard to implement, confuses syntax highlighters, and
doesn't offer any real benefit either. Same for omitted start tags --
this is really nasty from an implementer's point of view.

> Perhaps in a spirit of optimism, or ignorance (of what they really
> mean).

Well, that's not the first time I wonder whether the W3C keeps any
touch with reality...

> Should the "standard" be changed, then?

I believe so. The HTML 4 standard text explicitly mentions somewhere
that the aim is to write down "best current practice" -- it would only
be consistent to include the practice that no browsers implement
certain SGML features.
(Not even mentioning how useful such a move would be...)

> Well, they just did. XHTML makes the verbose canonical syntax a must.

Sure. From a technical point of view, XHTML is the solution. From a
practical one, it's a flop. (A "marketing" flop, specifically.) Hardly
anyone is using XHTML. To the average web author, it looks like
something completely new to learn, which is meant as a substitute for
HTML, but doesn't really offer any benefits, and isn't even supported
by all browsers. So why should they switch? How many web authors know
that it's actually just a new version of HTML with somewhat stricter
syntax rules? (How many even know that XHTML exists?...)

Instead of this XML/XHTML buzzword shit, they should have just
released HTML 5 (or better, even some earlier version) with stricter
syntax rules years ago. People would know that it's the natural
successor, and accept the (really small) update.

I still hold a faint hope that the convenient improvements in XHTML 2
will arouse some interest... But for the most part, I've given up hope
that good old HTML, with all these lovely problems, will ever go away
:-( (I guess it would require considerably greater *practical*
advances to switch people over.)

> > Is this really the right conclusion: Is everything that's legal in
> > SGML/accepted by the validator automatically valid HTML,
>
> By definition, yes. Unless you wish to assign confusing meanings to
> "valid" in this context, but then everyone and his dog can assign a
> new meaning to "valid".

I see now that my wording was wrong. I've read the thread on the new
validator beta in the archive. (BTW I do not think that checking
browser compatibility in addition to validity is inherently a bad
idea; though I completely agree that the way it is implemented now,
it's totally bogus and would be better left out...)

What I really meant to ask: Can an HTML document be called "correct"
(without assigning any specific technical meaning to that term), if
it's formally valid, but doesn't follow the recommendations about SGML
usage mentioned in the standard?...

Or from a more practical view: Should a browser, in the situation we
are in, try to implement as much of SGML as possible, even if nobody
can use it anyway? And is it OK to report constructs that are handled
incorrectly by most browsers as "errors"?

(Note that in practice, they almost certainly *are* errors on the
author's side, as no author would deliberately use features that he
knows will not have the desired effect in existing browsers. Unescaped
'<' or '&' characters, for example, generally mean that the author has
either forgotten to escape them, or doesn't know that browsers handle
this inconsistently. The question is only whether it is OK also to
*call* them "errors", although formally the document is valid...)

Well, I'm aware this list is not really the right place to ask such
questions... I just can't think of any forum where I could get an
answer that is authoritative in any sense :-(

> > Another interesting issue is error recovery. The validator for
> > example seems to immediately stop tag parsing and continue in
> > content parsing when encountering an illegal character, but skips
> > malformed attribute values or invalid declarations.
>
> That depends on how the validator has been coded. It could be improved
> in many ways, but intelligent error recovery is really difficult.

You needn't tell me that. I've already spent way too much time
implementing error handling AI :-( I'll probably end up with the
smartest and most SGML-compliant HTML parser, and still won't have
made the browser any more usable by this :-(

I seriously wonder whether I wouldn't be better off just ripping out
all warnings and not caring about a few broken pages -- like all other
existing browsers...
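
To make the unescaped '<' / '&' case above concrete, here is a rough
sketch in Python of a check that reports such characters as warnings
rather than as validity errors. Everything in it is an assumption made
for illustration: the function name, the decision to scan raw source,
and the simplification that a reference must end in ';' (SGML allows
the semicolon to be omitted in some contexts). It is not how any
existing browser or the validator does it.

```python
import re

# Hypothetical patterns, chosen for illustration only.
# An '&' is treated as intentional when it begins a named or numeric reference.
_REFERENCE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")
# A '<' is treated as markup when followed by a name start, '/', '!' or '?'.
_MARKUP_START = re.compile(r"<[A-Za-z/!?]")


def find_suspicious_chars(source):
    """Return (offset, char) pairs for '&' and '<' that look unescaped.

    Such characters may be formally valid character data, so the result
    is meant to feed warnings, not hard errors.
    """
    warnings = []
    for offset, ch in enumerate(source):
        if ch == "&" and not _REFERENCE.match(source, offset):
            warnings.append((offset, "&"))
        elif ch == "<" and not _MARKUP_START.match(source, offset):
            warnings.append((offset, "<"))
    return warnings


if __name__ == "__main__":
    for offset, ch in find_suspicious_chars("<p>a < b & c &amp; d</p>"):
        print("warning: possibly unescaped %r at offset %d" % (ch, offset))
```

Keeping the check separate from the parser proper means the document
can still be processed as the standard says, while the author gets a
hint that the construct is probably a mistake.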

> > Is there some kind of specification or recommendation for such
> > behaviour, or is the browser free to handle syntax errors as it
> > likes?
>
> The HTML 4.01 specification imposes no requirements on error handling.

I was thinking of the SGML standard here...

One of my fundamental problems is that I have no access to the SGML
standard, and so far I also wasn't able to find any other useful
resource about SGML syntax on the web. I was able to figure out a lot
from various hints and by feeding tricky test cases to the validator.
But I still have no idea why an SGML parser will accept <hr/>, for
example -- according to the BNF productions (the only part of the
standard I could find on the web), a net-enabling start tag is never
explicitly closed, so shouldn't the > be treated as content?...

> If you ask me, that's a mistake, but that's the HTML tradition. XML
> has different rules, but we'll see how seriously they will be taken. I
> don't think browsers even try to play by XML rules for XHTML
> documents.

That's not really true. I don't know about other browsers; at least
Mozilla does the correct thing when a document has the right MIME
type. (I.e. a local document with an .xhtml suffix or an HTTP page
declared as application/xhtml+xml.) The problem is that web servers
usually deliver XHTML documents as normal HTML -- which is probably a
good idea, as otherwise older browsers wouldn't be able to handle them
at all, although the markup is compatible otherwise...

The only question here is whether it wouldn't be better if
XHTML-aware browsers ignored the MIME type in such a situation.
But only the major vendors are really in a position to decide on this
-- as long as the popular browsers accept broken (so-called) XHTML,
every browser not doing so will shut itself out completely :-(

I recently visited some moron's site, who claims that his pages are
XHTML (as he has "moved on"), and if some browser doesn't display them
correctly it's not his fault -- but in the next sentence stated that
he doesn't care that his pages do not validate (actually, they aren't
even well-formed), as he "ceased to be a perfectionist"...

-Olaf-

-- 
Don't buy away your freedom -- GNU/Linux
Received on Sunday, 7 September 2003 12:49:22 UTC