
Re: Are Heresies Allowed? (Was: RS/RE)



At 04:05 PM 9/14/96 -0400, David G. Durand wrote:
>At 12:33 AM 9/14/96, Arjun Ray wrote:

>>In the context of essentially free-form text and markup, exactly what does
>> a record-end/end-of-line/whatchamacallit mean?
>>
>>   1. Is it part of the instance text?
>>   2. Is it (processable) markup?
>>   3. Is it an artifact of the storage strategy the environment was too
>>      brain-dead not to have encapsulated?
>>
>>In some ways, #3 is a special case of (in the sense of imposition on) #2.
>>And #1 is clearly unworkable. So, if we want simple and strong rules [...]
>> the rational approach IMHO is to treat these animals as markup always. 
>>When needed as instance text, an inline escape mechanism should suffice 
>>(how about '\' as MSSCHAR?). The problem is reduced to one of lexical
>>tokenization, which is where I believe it always belonged.
>
>Funny, I think #1 is clearly correct. Treat CR, LF, and their kin as SGML
>has always treated tab and space (ignored in element content, parsed as
>data elsewhere). 

There's no elsewhere without a DTD: something that tells the parser what
"mode" to be in. The issue here is lexical tokenization per se. Treating
these as data will require a lexer to *report* a different sequence of
tokens for

   <foo><bar>blah

as opposed to

   <foo>
   <bar>blah

which would defeat the purpose of freeform in markup entirely. OTOH, my
intuitive view of what "freeform" means is that all whitespace between tags
should not be significant *unless explicitly indicated*. (There's also an
issue of leading and trailing ws in data that I elide just now.) Since most 
of the time such whitespace will be a &newline-indicator;, it makes sense
to treat it as markup, apply a canonical rule that transforms it to ws,
and then apply some commonsense rules regarding whitespace.
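
To make that concrete, here is a rough sketch in Python of the kind of lexer I have in mind. It is only an illustration under stated assumptions: the names (tokenize, TOKEN), the toy tag syntax (anything between '<' and '>'), and the choice of a single space as the canonical form are all mine, not anything SGML defines. Leading and trailing ws inside data is deliberately left alone.

```python
import re

# Hypothetical toy syntax: a tag is anything between '<' and '>';
# everything else up to the next '<' is a run of character data.
TOKEN = re.compile(r"<[^<>]+>|[^<]+")

def tokenize(text):
    """Return a list of ('TAG', name) and ('DATA', text) tokens."""
    # Canonical rule (assumed): every newline indicator, whether
    # CRLF, CR or LF, is transformed to a single space.
    text = re.sub(r"\r\n|\r|\n", " ", text)
    tokens = []
    for match in TOKEN.finditer(text):
        piece = match.group(0)
        if piece.startswith("<"):
            tokens.append(("TAG", piece[1:-1]))
        elif not piece.isspace():
            # Whitespace-only runs between tags are not reported;
            # data containing any other character is passed through as-is.
            tokens.append(("DATA", piece))
    return tokens

one_line = tokenize("<foo><bar>blah")
two_lines = tokenize("<foo>\n<bar>blah")
```

With these assumptions both inputs produce the same three tokens (TAG foo, TAG bar, DATA blah), which is the point: the lexer's report does not depend on where the line breaks happen to fall.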

>I think that the byte-stream file has won, and we should just treat CR and
>LF as more whitespace bytes. 

Agreed, whole-heartedly. So the point is to have simple and strong treatment
of "free-standing" whitespace -- between tags, before and after text data.
(I'm sorta coming around to the view that all whitespace is better viewed
as markup rather than data:-))

But I confess I'm still confused about how all this interacts with RS/RE
gotchas and the mixed content problem. (SGML is the only system I've seen where
innocent cosmetic whitespace could lead one off the Path of Sufficient Virtue.)


Regards,
Arjun