YAWB: trying to follow TFM

Alexandre Rafalovitch (arafalov@socs.uts.EDU.AU)
Wed, 27 Aug 1997 15:45:14 +1000 (EST)


Date: Wed, 27 Aug 1997 15:45:14 +1000 (EST)
From: Alexandre Rafalovitch <arafalov@socs.uts.EDU.AU>
To: www-html@w3.org
Message-ID: <Pine.SOL.3.95.970827151137.28107B-100000@charlie>
Subject: YAWB: trying to follow TFM

Hi,

I am doing some small-scale testing of current web browsers, trying to
understand how far they are from the standards and from what people want
(I know those two sometimes contradict each other). I have several points
which I would like to discuss. Why? I am writing a YAWB (yet another web
browser) in Java and I am trying to make it follow the standards as
closely as possible.

The things that puzzle me in my work:

1) I found a basic HTML/SGML parser at
<http://www.w3.org/MarkUp/SGML/sgml-lex/sgml-lex> and was going to use it
as the base of my lexer/parser.

But when I tested how current web browsers classify various inputs as
tags, text or errors, I saw very different behaviour. E.g. Netscape 3
treats <234> as text, but </234> as an (undisplayed) tag. MSIE treats
both as tags and ignores them. Even more interesting things happen with
the following file:
--- START  ----

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> 

<!----------> <BR>
<!> <BR>

Text(6): <BR>

<! doctype> <BR>
<!,doctype> <BR>
<!23> <BR>
<!- xxx -> <BR>
<!-> <BR>
<!-!> <BR>

--- END ---

MSIE would not even open the file. Netscape opens it but displays only
the Text(6) line, treating everything else as tags even though the
HTML/SGML document cited above says they are not.
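
To make the question concrete, here is a rough sketch of the check as I
read it: "<", "</" and "<!" only open markup when the right kind of
character follows. The class and method names are made up for this
example, only ASCII letters count as name start characters, and the
short-tag forms "<>" and "</>", "<?" and "<![" are left out.

```java
// Sketch only: decides whether a '<' at position i opens markup or is
// plain text. Names are hypothetical, not taken from sgml-lex.
final class MarkupStart {

    enum Kind { TEXT, START_TAG, END_TAG, DECLARATION }

    // Assumption: only ASCII letters may start a name.
    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static Kind classify(String s, int i) {
        if (i >= s.length() || s.charAt(i) != '<') return Kind.TEXT;
        if (i + 1 >= s.length()) return Kind.TEXT;
        char c1 = s.charAt(i + 1);
        if (isNameStart(c1)) return Kind.START_TAG;            // <P
        if (i + 2 >= s.length()) return Kind.TEXT;
        char c2 = s.charAt(i + 2);
        if (c1 == '/' && isNameStart(c2)) return Kind.END_TAG; // </P
        if (c1 == '!') {
            // <!DOCTYPE ... or the empty declaration <!>
            if (isNameStart(c2) || c2 == '>') return Kind.DECLARATION;
            // <!-- opens a comment declaration
            if (c2 == '-' && i + 3 < s.length() && s.charAt(i + 3) == '-') {
                return Kind.DECLARATION;
            }
        }
        return Kind.TEXT;                                      // <234>, <!23>, <!->
    }
}
```

Under these assumptions <234>, </234>, <! doctype>, <!,doctype>, <!23>,
<!- xxx ->, <!-> and <!-!> all come out as TEXT, while <!> and <!--
come out as DECLARATION.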

2) How should <UL>some text <LI> some more text </LI> even more text </UL>
be treated by a PROPER browser? All the ones I have tested treat the
non-LI text as normal text, indented to the right. My reading of an SGML
book seems to indicate that it should be treated as <UL> <LI>some text
<LI> some more text <LI> even more text </UL> (by tag minimization
logic). Which way is proper/more desirable?
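
For illustration, the second reading could be sketched like this. It is
a toy with made-up names, not a DTD-driven parser: tokens are plain
strings, only one non-nested UL is handled, and whether the DTD really
allows the <LI> to be implied is exactly what I am asking.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: inserts an implied <LI> before text that appears inside
// a UL but outside any list item. Names are hypothetical.
final class ImpliedListItems {

    static List<String> imply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean inList = false;  // between <UL> and </UL>
        boolean inItem = false;  // a list item is currently open
        for (String t : tokens) {
            String u = t.trim();
            if (u.equalsIgnoreCase("<UL>")) {
                inList = true;
                inItem = false;
            } else if (u.equalsIgnoreCase("</UL>")) {
                inList = false;
                inItem = false;
            } else if (u.equalsIgnoreCase("<LI>")) {
                inItem = true;
            } else if (u.equalsIgnoreCase("</LI>")) {
                inItem = false;
            } else if (inList && !inItem && !u.isEmpty()) {
                out.add("<LI>");  // stray text: open an implied item
                inItem = true;
            }
            out.add(t);
        }
        return out;
    }
}
```

Fed the tokens of the example above, this puts an <LI> in front of
"some text" and in front of "even more text", and leaves the explicit
item alone.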

3) Entities: What should a browser do when it meets an unknown entity, as
in &foo;? Should it display it as is, skip it, or put some default
character there?
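
The three options can be written down as a policy switch. This is only
a sketch: the class, the enum and the "?" placeholder are my own
inventions, and the table holds just four well-known entities.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: resolves an entity name, with a selectable fallback for
// names that are not in the table. All names here are hypothetical.
final class EntityResolver {

    enum UnknownPolicy { SHOW_SOURCE, SKIP, PLACEHOLDER }

    private static final Map<String, String> KNOWN = new HashMap<>();
    static {
        KNOWN.put("lt", "<");
        KNOWN.put("gt", ">");
        KNOWN.put("amp", "&");
        KNOWN.put("quot", "\"");
    }

    // name is the text between '&' and ';', e.g. "foo" for &foo;
    static String resolve(String name, UnknownPolicy policy) {
        String value = KNOWN.get(name);
        if (value != null) return value;
        switch (policy) {
            case SHOW_SOURCE: return "&" + name + ";"; // display as typed
            case SKIP:        return "";               // drop it
            default:          return "?";              // assumed placeholder
        }
    }
}
```

With SHOW_SOURCE, &foo; shows up on screen literally as "&foo;", which
at least lets the page author see the mistake.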

4) Ignoring NL before/after a tag (from the HTML4 whitespace handling
section). I understand the concept in general, but I don't understand
what should happen when there is NL+whitespace in a position where NL by
itself would be ignored. Should the browser still ignore all of it, or
should it eat the NL and let the whitespace become a collapsed space?
Also, I am not sure whether any browsers do anything about this
situation, or whether it is seriously needed (it could mean some overhead
in the parser/lexer :-} ).
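
The two readings differ by one space in the output. A small sketch,
assuming the text chunk is what directly follows a start tag; the class
and method names are made up, and I do not know which reading is the
intended one.

```java
// Sketch only: two candidate treatments of "NL + whitespace" right
// after a start tag. Names are hypothetical.
final class LeadingNewline {

    // Collapse every run of spaces, tabs and line breaks to one space.
    static String collapse(String s) {
        return s.replaceAll("[ \\t\\r\\n]+", " ");
    }

    // Length of a single leading line break (CR LF, LF or CR), else 0.
    static int newlineLength(String s) {
        if (s.startsWith("\r\n")) return 2;
        if (s.startsWith("\n") || s.startsWith("\r")) return 1;
        return 0;
    }

    // Reading A: eat only the NL; the indent collapses to one space.
    static String dropNewlineOnly(String text) {
        return collapse(text.substring(newlineLength(text)));
    }

    // Reading B: eat the NL together with the spaces/tabs after it.
    static String dropNewlineAndIndent(String text) {
        int i = newlineLength(text);
        if (i > 0) {
            while (i < text.length()
                    && (text.charAt(i) == ' ' || text.charAt(i) == '\t')) {
                i++;
            }
        }
        return collapse(text.substring(i));
    }
}
```

For the chunk "\n   Hello  world", reading A gives " Hello world" and
reading B gives "Hello world". Either way it is a single pass over the
leading characters, so the overhead looks small.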

Thanks for any help,
   Alex.
P.S. Any RTFM <URL> sent by private email before it is too late would
also be gratefully accepted. The same goes for "That feature is NEEDED"
(I will try to implement LINK elements as menus and other obvious things,
of course).