- From: Peter Flynn <pflynn@imbolc.ucc.ie>
- Date: 28 Aug 1997 21:55:41 +0100
- To: arafalov@socs.uts.EDU.AU
- Cc: www-html@w3.org
Alexandre Rafalovitch writes:

> I am doing some small testing of current web browsers trying to
> understand how far are they from standard/people wishes (I know those
> two might contradict sometimes). I have several points which I would
> like to discuss. Why? I am writing a YAWB (yet another web browser)
> in Java and I am trying to make it follow standards as much as
> possible.

For a formal SGML parse and validate, use nsgmls (part of the SP suite
of SGML tools, see http://www.jclark.com).

> The things that puzzle me in my work:
>
> 1) I found basic html/sgml parser at
> <http://www.w3.org/MarkUp/SGML/sgml-lex/sgml-lex> and was going to
> use it as a base of my lexer/parser. I don't do lex so I don't know
> if this follows SGML fully or not. But I was testing some of the
> things that should be tags/text/errors on current web browsers and
> saw very different behaviour. Eg. Netscape3 would treat <234> as
> text, but </234> as tag (undisplayed). MSIE treats both as tags and
> ignores them.

Both <234> and </234> are garbage in terms of HTML and should be
rejected out of hand as gross errors. I think it is possible to make
them valid SGML, but only by surgery on the SGML Declaration, and I
can't think offhand of many applications that would need element names
to be all digits.

> Even more interesting things happen with the following file:

I'm sure browsers do interesting things with this, but it's so far
from being anything which resembles HTML that I wouldn't bother.

> <!----------> <BR>

C:\SP\BIN\NSGMLS.EXE:test.sgml:13:11:E: unterminated comment: found end of entity inside comment
C:\SP\BIN\NSGMLS.EXE:test.sgml:2:10: comment started here
C:\SP\BIN\NSGMLS.EXE:test.sgml:13:11:E: no document element

> <!> <BR>
> Text(6): <BR>
> <! doctype> <BR>
> <!,doctype> <BR>
> <!23> <BR>
> <!- xxx -> <BR>
> <!-> <BR>
> <!-!> <BR>

C:\SP\BIN\NSGMLS.EXE:test.sgml:2:3:E: document type does not allow element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:4:7:E: document type does not allow element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:6:0:E: character data is not allowed here
C:\SP\BIN\NSGMLS.EXE:test.sgml:6:12:E: document type does not allow element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:8:15:E: document type does not allow element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:9:15:E: document type does not allow element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:10:9:E: document type does not allow element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:11:14:E: document type does not allow element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:12:8:E: document type does not allow element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:13:9:E: document type does not allow element "BR" here

> MSIE would not even open the file, Netscape opens it but only
> displays Text(6) line considering everything else tags even though
> html/sgml document said it is not.

That sounds about right: when browsers encounter garbage they are
expected to degrade gracefully.

> 2) How should
> <UL>some text <LI> some more text </LI> even more text </UL>
> be treated by a PROPER browser.

C:\SP\BIN\NSGMLS.EXE:test.sgml:6:12:E: start tag for "LI" omitted, but its declaration does not permit this
C:\SP\BIN\NSGMLS.EXE:test.sgml:6:48:E: character data is not allowed here

> All the ones I have tested, treat non-LIed text as normal text with
> offset to the right. Reading SGML book seem to indicate that it
> should be treated as
> <UL> <LI>some text <LI> some more text <LI> some more text </UL>
> (by tag minimization logic). Which way is proper/more desirable.

The way you describe it is correct.

> 3) Entities: What should a browser do when it meets unknown entity as
> in &foo;. Should it display it, skip it or put some default character
> there?

HTML declares no default entity, so it should (IMHO) display &foo; and
complain that the author has not provided a declaration for it.

> 4) Ignoring NL before/after tag (from HTML4 whitespace handling
> section). I understand the concept in general, but I don't understand
> what should happen when there is NL+Whitespace in position where NL
> by itself would be ignored. Should it still ignore it all together or
> should it eat NL, but Whitespace would become collapsed space. Also,
> I am not sure whether any browsers do anything about such situation
> and whether it is seriously needed (it could mean some overhead on
> parser/lexer :-} ).

This is a FAQ. The handling of white-space in SGML is up to the
application (at the level of the browser) but the formality of it at
the level of validation is trickier.

The easiest way to explain it is to divide the elements in your DTD
into two groups:

1. those that permit character data inside them, possibly along with
   other markup; typically these are elements like <P>

2. those that permit ONLY other elements, such as <OL> or <UL>

Elements in class 2 are sometimes called "structural" elements, along
with elements in class 1, BUT NOT the elements which occur INSIDE
elements in class 1. These latter are sometimes called "inline"
elements. Users of formatting systems usually differentiate these on
the basis that the former cause vertical white-space in formatting
where the latter do not.

White-space which occurs BETWEEN the elements inside a class 2 element
is called "insignificant" white-space, and it can be ignored or
discarded without damage to the integrity of the document, eg

    <OL>
    <LI>foo</LI>
    <LI>bar</LI>
    </OL>

is equivalent to

    <OL><LI>foo<LI>bar</OL>

for all practical purposes (it's actually not, if you dig into the
murkier depths of ISO 8879, but it'll do for now).

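For a browser writer, the insignificant white-space rule above can be sketched roughly as follows. This is only an illustration under stated assumptions: the ELEMENT_ONLY set, the WhitespaceFilter class and the parse_events function are names made up for the example, and Python's html.parser is used purely as a tokenizer. It is not an SGML parser, does no validation and does not infer omitted end-tags, which is why the example keeps the explicit </LI>.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked set of "class 2" elements (element content only).
# A real browser would derive this from the content models in the DTD.
ELEMENT_ONLY = {"ol", "ul"}


class WhitespaceFilter(HTMLParser):
    """Collect parse events, dropping white-space-only text that sits
    directly inside a class 2 element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_elements = []  # stack of currently open element names
        self.events = []         # (kind, value) tuples for the application

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Pop back to the matching start-tag, if one is open.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass
        self.events.append(("end", tag))

    def handle_data(self, data):
        parent = self.open_elements[-1] if self.open_elements else None
        if parent in ELEMENT_ONLY and data.strip() == "":
            return  # insignificant white-space: discard it
        self.events.append(("text", data))


def parse_events(markup):
    """Tokenize markup and return the filtered list of events."""
    parser = WhitespaceFilter()
    parser.feed(markup)
    parser.close()
    return parser.events


spaced = "<OL>\n<LI>foo</LI>\n<LI>bar</LI>\n</OL>"
compact = "<OL><LI>foo</LI><LI>bar</LI></OL>"
print(parse_events(spaced) == parse_events(compact))
```

With the linebreaks inside <OL> discarded, the spaced and compact forms produce the same event list, while text inside a class 1 element such as <P> is passed through untouched.
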
White-space that occurs anywhere INSIDE elements in class 1, and
usually inside elements inside them, is "significant" white-space: it
should be passed by the parser/validator to the application untouched,
as to remove it (eg between words) would damage the integrity of the
document. These elements are said to have "mixed content" because they
can contain both text and markup interspersed.

Browsers are by convention expected to silently remove all leading
white-space occurring immediately after the start-tag of an element in
class 1, and immediately before the end-tag as well, eg

    <P>   foo bar   </P>
       ^^^       ^^^
       here      and here

but not inside any deeper-nested elements, eg

    <P>foo <em>   bar</em> ...
               ^^^
               this should be left alone

unless they are class 2 elements permitted to occur inside class 1
elements (see example below).

In the case of some DTDs (early versions of HTML), the content model
of <P> permitted <OL> and <UL> actually inside paragraphs. This meant
some careful application of the rules:

    <P>foo bar<OL>
    <LI>blort
    <LI>boggle
    </OL>
    and more of the same...

Which linebreaks and spaces are significant here and which are
insignificant?

The final problem derives from this: some DTDs permit elements to
occur inside other elements in a sequence which makes a linebreak
illegal. For example, if <P> could contain text (parsed character
data) followed by <OL> or <UL> (only), then the above example without
the last phrase:

    <P>foo bar<OL>
    <LI>blort
    <LI>boggle
    </OL>
    </p>

would be an error, because the linebreak between </OL> and </p> would
be interpreted as character data, and the model said that character
data can only be followed by <OL> or <UL> and nothing else after that.
You will hear this called "pernicious mixed content" and it is usually
regarded as evil.

The above contains deliberate simplifications (and I hope no errors:
if there are, someone please shout, it's late and I need caffeine).

///Peter

Received on Thursday, 28 August 1997 17:13:51 UTC