Re: Are Heresies Allowed? (Was: RS/RE) from David G. Durand on 1996-09-14 (w3c-sgml-wg@w3.org from September 1996)

From: David G. Durand <dgd@cs.bu.edu>
Date: Sat, 14 Sep 1996 16:05:16 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <v02130500ae60b980c602@[128.148.157.46]>
At 12:33 AM 9/14/96, Arjun Ray wrote:
>At 12:24 PM 9/13/96 -0400, David Durand wrote:
>>I'd like to make sure that they are _never_ invoked.
>
>Aside from the (religious) issue of 8879-compliance, can anybody explain why
>they are necessary? For instance, exactly what are RS and RE in a byte
>stream over TCP? Sorry, RS/RE has "fuddy-duddy" written all over it (can
>you say "punch card is all I grok"?)

Well, if you want to be SGML-parsable, you want the content of your
elements to be the same whether the document is interpreted as SGML or XML.
You therefore need either to preserve the RS/RE rules or to ensure that
they are never invoked. Fuddy-duddy or not, the definition in 8879 is a real
issue, and gratuitous incompatibility, especially in the matter of what is
element content, is a problem we should avoid.

>....
>May I propose a focus on relevant concepts rather than rules. In the
>context of essentially free-form text and markup, exactly what does a
>record-end/end-of-line/whatchamacallit mean?
>
>   1. Is it part of the instance text?
>   2. Is it (processable) markup?
>   3. Is it an artifact of the storage strategy that the environment was
>      too brain-dead to have encapsulated?
>
>In some ways, #3 is a special case of (in the sense of imposition on) #2.
>And #1 is clearly unworkable. So, if we want simple and strong rules (the
>kind that establish a two-way relation between ease of programming and
>ease of understanding) the rational approach IMHO is to treat these animals
>as markup always. When needed as instance text, an inline escape mechanism
>should suffice (how about '\' as MSSCHAR?). The problem is reduced to one
>of lexical tokenization, which is where I believe it always belonged.

Funny, I think #1 is clearly correct. Treat CR, LF, and their kin as SGML
has always treated tab and space (ignored in element content, parsed as
data elsewhere). We may need to extend stylesheets so that they can
recognize CR, LF, or CRLF as special formatting in some contexts. This
would simplify both entity management and parsing. It would mean that
line-ending convention transformations would no longer be no-ops, as they
change the byte stream, but this is not such a bad idea.
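
A minimal sketch of the rule proposed here, in Python. The names
WHITESPACE and keep_text are made up for illustration and come from no
SGML or XML tool; the sketch assumes the parser already knows whether it
is inside element content.

```python
# Hypothetical names for illustration only.
# CR and LF are treated exactly like space and tab.
WHITESPACE = frozenset(" \t\r\n")

def keep_text(text: str, in_element_content: bool) -> str:
    """Return the character data a parser would report under the proposal.

    In element content, a run consisting only of whitespace is ignored.
    Anywhere else, every character (including CR and LF) is data.
    """
    if in_element_content and all(ch in WHITESPACE for ch in text):
        return ""
    return text
```

With this, keep_text("\r\n  ", True) yields the empty string, while
keep_text("a\r\nb", False) returns its input unchanged, CRLF included.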

The worst problem would be dealing with old IBM iron. A couple of methods
would work well enough, but none of them would be completely natural with
the standard tools (as far as my old mainframe memories go, at any rate).
In exchange, you gain character-stream address stability.
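
To illustrate both points (line-ending conversion is no longer a no-op,
and addresses are stable for a given byte stream), here is a small Python
sketch. The sample document and the helper name offset_of are invented
for the example.

```python
def offset_of(data: bytes, needle: bytes) -> int:
    """Byte offset of the first occurrence of needle in data (-1 if absent)."""
    return data.find(needle)

# The same document stored with LF and with CRLF line endings.
unix_doc = b"<p>one\ntwo</p>"
dos_doc = unix_doc.replace(b"\n", b"\r\n")

# Every byte counts, so an address is stable within one stream,
# but converting the line endings shifts everything after the first break.
unix_offset = offset_of(unix_doc, b"two")  # 7
dos_offset = offset_of(dos_doc, b"two")    # 8
```

The two streams are different documents under this proposal, which is
why the conversion has to be treated as a real transformation.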

I think that the byte-stream file has won, and we should just treat CR and
LF as more whitespace bytes. People who are creating "example" or
"verbatim" tags just need to use the stylesheet to declare what's up, or
leave it to the application to decide which whitespace differences are
significant. Note that this is how tabs are already treated, and despite a
little pain, the world manages to muddle on.
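
As a rough sketch of what "leave it to the application" could look like,
assuming a hypothetical stylesheet that simply lists the elements whose
whitespace is significant (VERBATIM_ELEMENTS and render_text are invented
names, not part of any existing stylesheet language):

```python
import re

# Hypothetical stylesheet declaration: elements whose whitespace matters.
VERBATIM_ELEMENTS = {"example", "verbatim"}

def render_text(element_name: str, text: str) -> str:
    """Preserve whitespace in declared elements; collapse it elsewhere.

    CR, LF, tab, and space are all treated as the same kind of whitespace.
    """
    if element_name in VERBATIM_ELEMENTS:
        return text
    return re.sub(r"[ \t\r\n]+", " ", text)
```

Here render_text("verbatim", "a\r\n\tb") keeps the line break and tab,
while render_text("p", "a\r\n\tb") collapses them to a single space.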

  -- David


--------------------------------------------+--------------------------
David Durand                  dgd@cs.bu.edu | david@dynamicDiagrams.com
Boston University Computer Science          | Dynamic Diagrams
http://www.cs.bu.edu/students/grads/dgd/    | http://dynamicDiagrams.com/
Received on Saturday, 14 September 1996 16:02:01 UTC