Re: YAWB: trying to follow TFM (long reply)

Alexandre Rafalovitch writes:
   I am doing some small testing of current web browsers trying to understand 
   how far are they from standard/people wishes (I know those two might
   contradict sometimes). I have several points which I would like to
   discuss. Why? I am writing a YAWB (yet another web browser) in Java and I
   am trying to make it follow standards as much as possible. 

For a formal SGML parse and validate, use nsgmls (part of the SP suite
of SGML tools, see http://www.jclark.com).

   The things that puzzle me in my work:

   1) I found basic html/sgml parser at
   <http://www.w3.org/MarkUp/SGML/sgml-lex/sgml-lex> and was going to use it
   as a base of my lexer/parser. 

I don't do lex so I don't know if this follows SGML fully or not.

   But I was testing some of the things that
   should be tags/text/errors on current web browsers and saw very different
   behaviour. Eg. Netscape3 would treat <234> as text, but </234> as
   tag(undisplayed). MSIE, treat both as tags and ignore them. 

Both <234> and </234> are garbage in terms of HTML and should be
rejected out of hand as gross errors. I think it is possible to make
them valid SGML, but only by surgery on the SGML Declaration, and I
can't think offhand of many applications that would need element names
to be all digits.
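
As a rough sketch of how a lexer might make that decision (in Python
for brevity; the function names are made up for illustration, and I am
assuming the name rules of the reference concrete syntax, i.e. a letter
followed by letters, digits, "." or "-"). As I understand the delimiter
rules, "<" or "</" followed by something that cannot start a name is
not recognised as a tag at all, so both <234> and </234> come out as
character data:

```python
import re

# Name syntax assumed from the reference concrete syntax: a letter first,
# then letters, digits, '.' or '-'. A different SGML declaration could
# change this, so treat the pattern as an assumption.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_valid_generic_identifier(name):
    """Return True if `name` could be an element name at all."""
    return NAME.fullmatch(name) is not None


def classify_tag_open(text):
    """Classify text beginning with '<' as 'start-tag', 'end-tag' or 'data'."""
    m = re.match(r"<(/?)([^\s>/]*)", text)
    if m is None or not is_valid_generic_identifier(m.group(2)):
        return "data"  # not a tag: hand it on as character data
    return "end-tag" if m.group(1) else "start-tag"
```

Whether the browser then displays that data or reports it as an error
is a separate decision for the application.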

   Even more
   interesting things happen with the following file:

I'm sure browsers do interesting things with this, but it's so far
from being anything which resembles HTML that I wouldn't bother.

   <!----------> <BR>

C:\SP\BIN\NSGMLS.EXE:test.sgml:13:11:E: unterminated comment: found
end of entity inside comment
C:\SP\BIN\NSGMLS.EXE:test.sgml:2:10: comment started here
C:\SP\BIN\NSGMLS.EXE:test.sgml:13:11:E: no document element

   <!> <BR>

   Text(6): <BR>

   <! doctype> <BR>
   <!,doctype> <BR>
   <!23> <BR>
   <!- xxx -> <BR>
   <!-> <BR>
   <!-!> <BR>

C:\SP\BIN\NSGMLS.EXE:test.sgml:2:3:E: document type does not allow
element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:4:7:E: document type does not allow
element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:6:0:E: character data is not allowed
here
C:\SP\BIN\NSGMLS.EXE:test.sgml:6:12:E: document type does not allow
element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:8:15:E: document type does not allow
element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:9:15:E: document type does not allow
element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:10:9:E: document type does not allow
element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:11:14:E: document type does not allow
element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:12:8:E: document type does not allow
element "BR" here
C:\SP\BIN\NSGMLS.EXE:test.sgml:13:9:E: document type does not allow
element "BR" here

   MSIE would not even open the file, Netscape opens it but only displays
   Text(6) line considering everything else tags even though html/sgml
   document said it is not. 

That sounds about right: when browsers encounter garbage they are
expected to degrade gracefully.

   2) How should <UL>some text <LI> some more text </LI> even more text </UL>
   be treated by a PROPER browser. 

C:\SP\BIN\NSGMLS.EXE:test.sgml:6:12:E: start tag for "LI" omitted, but
its declaration does not permit this
C:\SP\BIN\NSGMLS.EXE:test.sgml:6:48:E: character data is not allowed
here

   All the ones I have tested treat non-LIed
   text as normal text with offset to the right. Reading the SGML book seems to
   indicate that it should be treated as <UL> <LI>some text <LI> some more
   text <LI> some more text </UL> (by tag minimization logic). Which way is
   proper/more desirable?

The way you describe it is correct.

   3) Entities: What should a browser do when it meets unknown entity as in
   &foo;. Should it display it, skip it or put some default character there?

HTML declares no default entity, so it should (IMHO) display &foo; and
complain that the author has not provided a declaration for it.
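
A minimal sketch of that behaviour, in Python for brevity. The entity
table is a tiny illustrative subset and the function name is my own
invention; a real browser would load the full entity sets from the DTD:

```python
import re

# A tiny, illustrative subset of the HTML entity set (not the full list).
KNOWN_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "nbsp": "\u00a0"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def expand_entities(text):
    """Expand known entity references; leave unknown ones as typed.

    Returns (expanded_text, warnings).
    """
    warnings = []

    def replace(match):
        name = match.group(1)
        if name in KNOWN_ENTITIES:
            return KNOWN_ENTITIES[name]
        warnings.append("undeclared entity: &%s;" % name)
        return match.group(0)  # display the reference verbatim

    return ENTITY_REF.sub(replace, text), warnings
```

For the input "a &lt; b &foo; c" this gives back "a < b &foo; c" plus
one warning naming &foo;, which the browser can show or log as it
sees fit.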

   4) Ignoring NL before/after tag (from HTML4 whitespace handling section).
   I understand the concept in general, but I don't understand what should
   happen when there is NL+Whitespace in position where NL by itself would be
   ignored. Should it still ignore it all together or should it eat NL, but
   Whitespace would become collapsed space. Also, I am not sure whether any
   browsers do anything about such situation and whether it is seriously
   needed (it could mean some overhead on parser/lexer :-} ).

This is a FAQ. The handling of white-space in SGML is up to the
application (at the level of the browser) but the formality of it at
the level of validation is trickier. The easiest way to explain it is
to divide the elements in your DTD into two classes:

   1. those that permit character data inside them, possibly along
      with other markup; typically these are elements like <P>

   2. those that permit ONLY other elements, such as <OL> or <UL>

Elements in class 2 are sometimes called "structural" elements, and so
are the elements in class 1 themselves, BUT NOT the elements which
occur INSIDE elements in class 1. These latter are sometimes called
"inline" elements. Users of formatting systems usually differentiate
the two on the basis that structural elements cause vertical
white-space in formatting where inline elements do not.

White-space which occurs BETWEEN elements in class 2 is called
"insignificant" white-space, and it can be ignored or discarded
without damage to the integrity of the document, eg 

     <OL>     <LI>foo</LI>
           <LI>bar</LI>
                 </OL>

is equivalent to

   <OL><LI>foo<LI>bar</OL>

for all practical purposes (it's actually not, if you dig into the
murkier depths of ISO 8879, but it'll do for now).
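
To make that concrete, here is a small sketch in Python. The tree
layout (str for character data, (tag, children) tuples for elements)
and the ELEMENT_ONLY set are assumptions for illustration only; a real
implementation would take the list of class 2 elements from the
content models in the DTD:

```python
# Hypothetical set of "class 2" elements: element content only.
ELEMENT_ONLY = {"OL", "UL"}


def drop_insignificant_whitespace(tag, children):
    """Remove whitespace-only text children of an element-only element.

    `children` is a list whose items are either str (character data) or
    (tag, children) tuples for nested elements.
    """
    cleaned = []
    for child in children:
        if isinstance(child, str):
            if tag in ELEMENT_ONLY and child.strip() == "":
                continue  # white-space between elements: discard
            cleaned.append(child)
        else:
            child_tag, grandchildren = child
            cleaned.append(
                (child_tag, drop_insignificant_whitespace(child_tag, grandchildren))
            )
    return cleaned
```

Fed the children of the spaced-out <OL> above, this leaves just the
two <LI> elements, the same as for the compact form. Text that is not
white-space is deliberately kept, so that a validator can still
complain about it.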

White-space that occurs anywhere INSIDE elements in class 1, and
usually inside elements inside them, is "significant" white-space: it
should be passed by the parser/validator to the application untouched,
as to remove it (eg between words) would damage the integrity of the
document. These elements are said to have "mixed content" because they
can contain both text and markup interspersed.

Browsers are by convention expected to silently remove all white-space
occurring immediately after the start-tag of an element in class 1,
and immediately before its end-tag as well, eg

     <P>   foo bar   </P>
        ^^^       ^^^
        here and  here

but not inside any deeper-nested elements, eg

     <P>foo <em>   bar</em> ...
                ^^^
              this should be left alone

unless they are class 2 elements permitted to occur inside class 1
elements (see example below).
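
A sketch of the simple case, in Python. The layout of `children` (str
for character data, (tag, children) tuples for nested elements) is an
assumption for this example, and the class 2 exception just mentioned
is not handled:

```python
def trim_edges(children):
    """Strip white-space just after the start-tag and just before the end-tag.

    `children` uses a layout assumed for this sketch: str items are
    character data, (tag, children) tuples are nested elements, which are
    returned untouched.
    """
    trimmed = list(children)
    if trimmed and isinstance(trimmed[0], str):
        trimmed[0] = trimmed[0].lstrip()
    if trimmed and isinstance(trimmed[-1], str):
        trimmed[-1] = trimmed[-1].rstrip()
    # Drop text items that became empty after trimming.
    return [c for c in trimmed if c != ""]
```

Only the outermost edges of the class 1 element are touched; the
spaces inside <em> in the second example survive because the nested
tuple is passed through as it stands.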

In the case of some DTDs (early versions of HTML), the content model
of <P> permitted <OL> and <UL> actually inside paragraphs. This meant
some careful application of the rules:

     <P>foo bar<OL>
       <LI>blort
       <LI>boggle
       </OL>
       and more of the same...

Which linebreaks and spaces are significant here and which are
insignificant?

The final problem derives from this: some DTDs permit elements to
occur inside other elements in a sequence which makes a linebreak
illegal. For example, if <P> could contain text (parsed character
data) followed by <OL> or <UL> (only), then the above example without
the last phrase:

     <P>foo bar<OL>
       <LI>blort
       <LI>boggle
       </OL>
     </p>

would be an error, because the linebreak between </OL> and </p> would
be interpreted as character data, and the model said that character
data can only be followed by <OL> or <UL> and nothing else after that.
You will hear this called "pernicious mixed content" and it is usually
regarded as evil.
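
The effect can be shown with a toy checker (Python; a deliberately
crude sketch, not how a real SGML parser evaluates content models).
It encodes the hypothetical model described above, text followed by at
most one list, as a regular expression over one-letter tokens:

```python
import re


def matches_model(children):
    """Check children against the hypothetical model: text, then at most one list.

    Each child is a str (character data, including white-space) or a tag
    name wrapped in a 1-tuple such as ("OL",).
    """
    tokens = "".join(
        "T" if isinstance(c, str) else {"OL": "L", "UL": "L"}.get(c[0], "?")
        for c in children
    )
    # T = character data, L = a list element, ? = anything else.
    return re.fullmatch(r"T*L?", tokens) is not None
```

With children ["foo bar", ("OL",)] the model is satisfied, but add the
trailing "\n" that sits before </p> and the same check fails, which is
exactly the trap.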

The above contains deliberate simplifications (and I hope no errors:
if there are, someone please shout, it's late and I need caffeine).

///Peter

Received on Thursday, 28 August 1997 17:13:51 UTC