- From: <jose.kahan@w3.org>
- Date: Mon, 15 Dec 1997 15:08:30 +0100 (MET)
- To: www-html@w3.org
In our previous episode, Peter Flynn said:

From www-html-request@w3.org Mon Dec 15 14:57 MET 1997
Date: Mon, 15 Dec 1997 08:56:22 -0500 (EST)
X-Envelope-From: www-html-request@www10.w3.org Mon Dec 15 08:55:54 1997
Old-Date: 15 Dec 1997 13:56:19 +0000 (GMT)
From: Peter Flynn <pflynn@imbolc.ucc.ie>
In-reply-to: <Pine.SOL.3.95.971207192903.26097C-100000@sally>
        (message from Alexandre Rafalovitch on 07 Dec 1997 19:40:32 +1100 (EST))
To: arafalov@socs.uts.EDU.AU
Cc: www-html@w3.org
Message-id: <199712151356.NAA12179@imbolc.ucc.ie>
X-Envelope-to: www-html@w3.org
Content-transfer-encoding: 7BIT
X-Diagnostic: Not on the accept list
Subject: [Spam?] Re: Fun with ignorable whitespace definition.
X-Diagnostic: Mail coming from a daemon, ignored
X-Envelope-To: www-html
Content-Type: text
Content-Length: 3456

> Consider how the following html will be parsed.
>
> <html>
> <title>Title text</title>
> <meta foo=bar>
> Now we have the body.
> </html>
>
> In here, we have some whitespace after </title>. It is not quite
> ignorable, so it should be treated as text. Therefore, it should
> close the head element and start the body element. As a result,
> 'meta' would go into the body and not the head and will be ignored.

Not quite. The spec is very clear on this: you are missing <HEAD>, so
assuming your DTD allows this, the presence of <TITLE> means you are
inside the <HEAD> at that stage. Now <HEAD> is allowed to contain ONLY
element content, never any character data (the only allowable
character data is inside <TITLE>, <SCRIPT> or <STYLE>). The
white-space between </TITLE> and <META> is therefore still within the
<HEAD> and MUST therefore be discarded by the parser as insignificant.
This is not an option, it is compulsory. The character data after
<META> implies </HEAD> and the start of <BODY>, which has mixed
content in most HTML DTDs, so any white space between there and
</HTML> is _significant_ and must be retained.
The result of parsing your example would therefore give (normalized,
eliding the invalid foo=bar, and making some rather large assumptions
about the DTD):

   <html><head><title>Title text</title><meta foo=bar></head><body> Now we have the body. </body></html>

I strongly advise you to use real HTML and not your imagination,
otherwise when XML becomes usable you will be left with a load of
untranslatable pseudo-HTML.

> That presents IMHO a problem, since the meaning was clearly to
> ignore all whitespace in the head, but having optional end of head
> and optional start for body messes it up.

No, the HTML spec is real SGML and parses correctly unless you mess it
up.

> On the other hand, a parser cannot just ignore that whitespace, as
> it does not know (in a generic html parsing world) if the content of
> html (or head) can be displayed, and CSS might declare it to have
> non-collapsible whitespace (like in PRE).
>
> It looks to me that either this requires a heavy special case or the
> html4 draft is missing a section on how whitespace is treated in
> non-displayable optional start/end elements... :=}
>
> I hope I am missing something, because it sure got me thinking. :-}

The rules on white-space in SGML are tricky, but basically

1. in element content (i.e. places where only more markup is allowed,
   never any character data), all white-space must be removed.

2. in mixed content (i.e. places where intermingled markup and
   character data are allowed), white-space is preserved because it
   is a part of the character data.

3. line-breaks are also character data in mixed content.

I know it's moot while HTML-only browsers continue to ignore SGML, but
XML has much simpler rules and from the look of last week's
SGML/XML'97 Conference, it's going to arrive quickly (I already have a
couple of very neat beta XML editors and there are at least a dozen
more on the way, and even some stylesheet editors for XSL).
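As a rough illustration of rules 1 and 2 above, here is a minimal
Python sketch. The ELEMENT_CONTENT table and the normalize_children
function are hypothetical names made up for this example; the table is
a drastically simplified stand-in for a DTD's content models, and the
sketch ignores SGML's finer points about record ends and tag
inference.

```python
# Hypothetical, simplified content models (an assumption, not a real DTD):
# True  = element content (only child elements allowed)
# False = mixed content (markup and character data intermingled)
ELEMENT_CONTENT = {"html": True, "head": True, "body": False}


def normalize_children(parent, children):
    """Apply the white-space rules to the direct children of `parent`.

    `children` is a list of strings, each either a tag such as "<meta>"
    or a run of character data.
    """
    if not ELEMENT_CONTENT[parent]:
        # Rules 2 and 3: in mixed content, white-space (line-breaks
        # included) is part of the character data and is kept as-is.
        return list(children)
    # Rule 1: in element content, white-space-only runs are discarded.
    return [child for child in children if child.strip() != ""]


head = normalize_children("head", ["<title>", "\n", "<meta>", "\n"])
body = normalize_children("body", ["\nNow we have the body.\n"])
print(head)
print(body)
```

Under these assumptions the white-space between the title and meta
elements disappears from the head, while the line-breaks around the
body text survive untouched.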
If you currently create HTML, I do recommend that you start to shift
NOW to creating only valid, parsable HTML, so that if/when you want to
move into XML, you can translate your files automatically. Otherwise
you are going to have an appalling manual job to do (you may already
be facing one if your existing HTML is currently invalid).

///Peter
Received on Monday, 15 December 1997 09:08:59 UTC