Re: Are Heresies Allowed? (Was: RS/RE) from Arjun Ray on 1996-09-15 (w3c-sgml-wg@w3.org from September 1996)

From: Arjun Ray <aray@nmds.com>
Date: Sun, 15 Sep 1996 00:11:29 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <1.5.4.32.19960915041129.002e48fc@www.nmds.com>

At 04:05 PM 9/14/96 -0400, David G. Durand wrote:
>At 12:33 AM 9/14/96, Arjun Ray wrote:

>>In the context of essentially free-form text and markup, exactly what does
>> a record-end/end-of-line/whatchamacallit mean?
>>
>>   1. Is it part of the instance text?
>>   2. Is it (processable) markup?
>>   3. Is it an artifact of the storage strategy the environment was too
>>      brain-dead to have encapsulated?
>>
>>In some ways, #3 is a special case of (in the sense of imposition on) #2.
>>And #1 is clearly unworkable. So, if we want simple and strong rules [...]
>> the rational approach IMHO is to treat these animals as markup always. 
>>When needed as instance text, an inline escape mechanism should suffice 
>>(how about '\' as MSSCHAR?). The problem is reduced to one of lexical
>>tokenization, which is where I believe it always belonged.
>
>Funny, I think #1 is clearly correct. Treat CR, LF, and their kin as SGML
>has always treated tab and space (ignored in element content, parsed as
>data elsewhere). 

There's no elsewhere without a DTD: something that tells the parser what
"mode" to be in. The issue here is lexical tokenization per se. Treating
these as data will require a lexer to *report* a different sequence of
tokens for

   <foo><bar>blah

as opposed to

   <foo>
   <bar>blah

which would defeat the purpose of freeform in markup entirely. OTOH, my
intuitive view of what "freeform" means is that all whitespace between tags
should not be significant *unless explicitly indicated*. (There's also an
issue of leading and trailing ws in data that I elide just now.) Since most 
of the time such whitespace will be a &newline-indicator;, it makes sense
to treat it as markup, apply a canonical rule that transforms it to ws,
and then apply some commonsense rules regarding whitespace. 
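
To make the tokenization point concrete, here is a minimal sketch in Python.
It is only an illustration under stated assumptions, not a proposal for the
actual rules: the TOKEN pattern, the tokenize() name, and the choice to map
every newline form to a single space are all hypothetical. It shows a lexer
that reports the same token sequence for both layouts of the <foo><bar>
example above.

```python
import re

# Hypothetical token pattern: a tag is anything between '<' and '>',
# and data is any run of characters that contains no '<'.
TOKEN = re.compile(r"<[^>]*>|[^<]+")

# Assumed canonical rule: every newline form (CR LF, CR, LF) becomes one space.
NEWLINE = re.compile(r"\r\n|\r|\n")


def tokenize(text):
    """Return (kind, value) tokens, dropping whitespace-only runs between tags."""
    text = NEWLINE.sub(" ", text)
    tokens = []
    for match in TOKEN.finditer(text):
        chunk = match.group(0)
        if chunk.startswith("<"):
            tokens.append(("TAG", chunk))
        elif chunk.strip():
            tokens.append(("DATA", chunk))
        # Otherwise the run is whitespace only: treated as markup, not reported.
    return tokens


one_line = "<foo><bar>blah"
two_lines = "<foo>\n<bar>blah"
assert tokenize(one_line) == tokenize(two_lines)
```

Leading and trailing whitespace inside data is deliberately left alone in
this sketch, since that is the separate issue elided above.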

>I think that the byte-stream file has won, and we should just treat CR and
>LF as more whitespace bytes. 

Agreed, whole-heartedly. So the point is to have simple and strong treatment
of "free-standing" whitespace -- between tags, before and after text data.
(I'm sorta coming around to the view that all whitespace is better viewed
as markup rather than data:-))

But as for how all this interacts with RS/RE gotchas and the mixed content
problem, I confess I'm still confused. (SGML is the only system I've seen where
innocent cosmetic whitespace could lead one off the Path of Sufficient Virtue.)

Regards,
Arjun

Received on Sunday, 15 September 1996 00:09:52 UTC