Date: Wed, 27 Aug 1997 15:45:14 +1000 (EST)
From: Alexandre Rafalovitch <email@example.com.EDU.AU>
To: firstname.lastname@example.org
Message-ID: <Pine.SOL.3.95.970827151137.28107B-100000@charlie>
Subject: YAWB: trying to follow TFM

Hi,

I am doing some small-scale testing of current web browsers, trying to understand how far they are from the standards and from what people want (I know those two sometimes contradict each other). I have several points which I would like to discuss.

Why? I am writing a YAWB (yet another web browser) in Java and I am trying to make it follow the standards as much as possible.

The things that puzzle me in my work:

1) I found a basic HTML/SGML parser at <http://www.w3.org/MarkUp/SGML/sgml-lex/sgml-lex> and was going to use it as the base of my lexer/parser. But when I tested how current web browsers classify various constructs (tag, text, or error), I saw very different behaviour. E.g. Netscape 3 treats <234> as text, but </234> as an (undisplayed) tag. MSIE treats both as tags and ignores them. Even more interesting things happen with the following file:

--- START ----
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!----------> <BR>
<!> <BR>
Text(6): <BR>
<! doctype> <BR>
<!,doctype> <BR>
<!23> <BR>
<!- xxx -> <BR>
<!-> <BR>
<!-!> <BR>
--- END ---

MSIE would not even open the file. Netscape opens it but displays only the Text(6) line, treating everything else as tags even though the HTML/SGML document says they are not.

2) How should

<UL>some text <LI> some more text </LI> even more text </UL>

be treated by a PROPER browser? All the ones I have tested treat the text outside the LI elements as normal text indented to the right. Reading an SGML book seems to indicate that it should be treated as

<UL> <LI>some text <LI> some more text <LI> even more text </UL>

(by tag minimization logic). Which way is proper/more desirable?

3) Entities: What should a browser do when it meets an unknown entity, as in &foo;? Should it display it, skip it, or put some default character there?

4) Ignoring NL before/after a tag (from the HTML4 whitespace handling section). I understand the concept in general, but I don't understand what should happen when there is NL+whitespace in a position where NL by itself would be ignored. Should the browser still ignore all of it, or should it eat the NL and let the remaining whitespace become a collapsed space? Also, I am not sure whether any browsers do anything about this situation, or whether it is seriously needed (it could mean some overhead in the parser/lexer :-} ).

Thanks for any help,
Alex.

P.S. Any RTFM <URL> sent by private email before it is too late would also be gratefully accepted. The same goes for "That feature is NEEDED" (I will try to implement LINK elements as menus and other obvious things, of course).
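
To make point 1 concrete, here is a minimal sketch in Java of one way a lexer could decide whether a "<" starts markup or is plain text. It assumes the SGML delimiter-in-context rule that "<" and "</" only count as tag openers when a letter follows, and that "<!" only opens a declaration when followed by a letter, "--", or ">". That reading should be checked against the sgml-lex document. The class and method names are hypothetical and do not come from sgml-lex or from any browser. The sketch deliberately ignores empty tags such as "<>" and "</>", marked sections ("<![") and processing instructions ("<?").

```java
// Hypothetical helper for illustration only; not part of sgml-lex.
final class DelimiterSketch {

    enum Kind { TEXT, START_TAG, END_TAG, DECLARATION, COMMENT_DECLARATION }

    /** Classifies the character sequence starting at index i of s. */
    static Kind classify(String s, int i) {
        if (i < 0 || i >= s.length() || s.charAt(i) != '<') {
            return Kind.TEXT;
        }
        char c1 = charAt(s, i + 1);
        if (isNameStart(c1)) {
            return Kind.START_TAG;                    // e.g. <BR>
        }
        if (c1 == '/') {
            // "</" counts only when a letter follows, so </234> is text here.
            return isNameStart(charAt(s, i + 2)) ? Kind.END_TAG : Kind.TEXT;
        }
        if (c1 == '!') {
            char c2 = charAt(s, i + 2);
            if (isNameStart(c2)) {
                return Kind.DECLARATION;              // e.g. <!DOCTYPE ...>
            }
            if (c2 == '>') {
                return Kind.COMMENT_DECLARATION;      // <!> (empty declaration)
            }
            if (c2 == '-' && charAt(s, i + 3) == '-') {
                return Kind.COMMENT_DECLARATION;      // <!-- ... -->
            }
        }
        return Kind.TEXT;                             // anything else is data
    }

    /** Returns the character at i, or NUL when i is past the end of s. */
    private static char charAt(String s, int i) {
        return i < s.length() ? s.charAt(i) : '\0';
    }

    /** Assumption: only ASCII letters may start a name. */
    private static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    public static void main(String[] args) {
        String[] samples = {
            "<234>", "</234>", "<!---------->", "<!>", "<! doctype>",
            "<!,doctype>", "<!23>", "<!- xxx ->", "<!->", "<!-!>"
        };
        for (String sample : samples) {
            System.out.println(sample + " : " + classify(sample, 0));
        }
    }
}
```

Under these assumptions the six lines after "Text(6):" in the test file all come out as text, while "<!---------->" and "<!>" are recognised as declarations. The sketch only looks at how a construct opens; it does not check whether the comment in "<!---------->" is properly closed.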