- From: Arjun Ray <aray@nmds.com>
- Date: Sun, 15 Sep 1996 00:11:29 -0400
- To: w3c-sgml-wg@w3.org
At 04:05 PM 9/14/96 -0400, David G. Durand wrote:
>At 12:33 AM 9/14/96, Arjun Ray wrote:
>>In the context of essentially free-form text and markup, exactly what does
>> a record-end/end-of-line/whatchamacallit mean?
>>
>> 1. Is it part of the instance text?
>> 2. Is it (processable) markup?
>> 3. Is it an artifact of the storage strategy the environment was too
>>    brain-dead not to have encapsulated?
>>
>>In some ways, #3 is a special case of (in the sense of imposition on) #2.
>>And #1 is clearly unworkable. So, if we want simple and strong rules [...]
>> the rational approach IMHO is to treat these animals as markup always.
>>When needed as instance text, an inline escape mechanism should suffice
>>(how about '\' as MSSCHAR?). The problem is reduced to one of lexical
>>tokenization, which is where I believe it always belonged.
>
>Funny, I think #1 is clearly correct. Treat CR, LF, and their kin as SGML
>has always treated tab and space (ignored in element content, parsed as
>data elsewhere).

There's no elsewhere without a DTD: something that tells the parser what
"mode" to be in. The issue here is lexical tokenization per se. Treating
these as data will require a lexer to *report* a different sequence of
tokens for

    <foo><bar>blah

as opposed to

    <foo> <bar>blah

which would defeat the purpose of freeform in markup entirely.

OTOH, my intuitive view of what "freeform" means is that all whitespace
between tags should not be significant *unless explicitly indicated*.
(There's also an issue of leading and trailing ws in data that I elide just
now.) Since most of the time such whitespace will be a &newline-indicator;,
it makes sense to treat them as markup, apply a canonical rule that
transforms it to ws, and then apply some commonsense rules regarding
whitespace.

>I think that the byte-stream file has won, and we should just treat CR and
>LF as more whitespace bytes.

Agreed, whole-heartedly.
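To make the lexical point concrete, here is a minimal sketch in Python of the kind of tokenizer described above. It is only an illustration under stated assumptions, not a proposal for the actual rules: the function name `tokenize`, the `TAG`/`DATA` token kinds, and the simplistic "anything between `<` and `>` is a tag" pattern are all hypothetical, and the "unless explicitly indicated" escape is not modeled. It applies a canonical rule that turns every line end into a space, then declines to report whitespace-only runs between tags.

```python
import re

# Hypothetical token pattern: a tag is anything between '<' and '>',
# and data is any run of characters up to the next '<'.
TOKEN = re.compile(r"<[^<>]*>|[^<]+")


def tokenize(text):
    """Return a list of (kind, value) tokens for free-form markup."""
    # Canonical rule: every line end (CRLF, CR, or LF) becomes one space.
    text = re.sub(r"\r\n|\r|\n", " ", text)
    tokens = []
    for match in TOKEN.finditer(text):
        value = match.group(0)
        if value.startswith("<"):
            tokens.append(("TAG", value))
        elif value.strip():
            # Data containing something other than whitespace is reported as-is.
            tokens.append(("DATA", value))
        # Otherwise the run is whitespace only (cosmetic): report nothing.
    return tokens


if __name__ == "__main__":
    print(tokenize("<foo><bar>blah"))
    print(tokenize("<foo>\n<bar>blah"))
```

Under these assumptions the lexer reports the same token sequence whether the two tags are separated by nothing, by a space, or by a line end, which is the property argued for here. Leading and trailing whitespace inside data is deliberately left untouched.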
So the point is to have simple and strong treatment of "free-standing"
whitespace -- between tags, before and after text data. (I'm sorta coming
around to the view that all whitespace is better viewed as markup rather
than data :-))

But as to how all this interacts with RS/RE gotchas and the mixed content
problem, I confess I'm still confused. (SGML is the only system I've seen
where innocent cosmetic whitespace could lead one off the Path of Sufficient
Virtue.)

Regards,
Arjun
Received on Sunday, 15 September 1996 00:09:52 UTC