- From: David G. Durand <dgd@cs.bu.edu>
- Date: Sat, 14 Sep 1996 16:05:16 -0400
- To: w3c-sgml-wg@w3.org
At 12:33 AM 9/14/96, Arjun Ray wrote:
>At 12:24 PM 9/13/96 -0400, David Durand wrote:
>>I'd like to make sure that they are _never_ invoked.
>
>Aside the (religious) issue of 8879-compliance, can anybody explain why
>they are necessary? For instance, exactly what are RS and RE in a byte
>stream over TCP? Sorry, RS/RE has "fuddy-duddy" written all over it (can
>you say "punch card is all I grok"?)

Well, if you want to be SGML-parsable, you want the content of your
elements to be the same whether the document is interpreted as SGML or
XML. You therefore need either to preserve the RS/RE rules, or to find a
way to ensure that they are never invoked. Fuddy-duddy or not, the
definition in 8879 is a real issue, and gratuitous incompatibility,
especially in the matter of what is element content, is a problem we
should avoid.

>....
>May I propose a focus on relevant concepts rather than rules. In the
>context of essentially free-form text and markup, exactly what does a
>record-end/end-of-line/whatchamacallit mean?
>
> 1. Is it part of the instance text?
> 2. Is it (processable) markup?
> 3. Is it an artifact of the storage strategy the environment was too
>    brain-dead not to have encapsulated?
>
>In some ways, #3 is a special case of (in the sense of imposition on) #2.
>And #1 is clearly unworkable. So, if we want simple and strong rules (the
>kind that establish a two-way relation between ease of programming and
>ease of understanding) the rational approach IMHO is to treat these animals
>as markup always. When needed as instance text, an inline escape mechanism
>should suffice (how about '\' as MSSCHAR?). The problem is reduced to one
>of lexical tokenization, which is where I believe it always belonged.

Funny, I think #1 is clearly correct. Treat CR, LF, and their kin as SGML
has always treated tab and space (ignored in element content, parsed as
data elsewhere). We may need to amend the stylesheet mechanism so that a
stylesheet can recognize CR, LF, or CRLF as special formatting in some
contexts.
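To make the rule concrete, here is a minimal sketch in Python of "ignored in element content, parsed as data elsewhere" with CR and LF added to the whitespace set. The names WHITESPACE, is_whitespace_only, and keep_text are hypothetical, and the in_element_content flag stands in for whatever a real parser would work out from the DTD. It illustrates the idea only and is not how any existing parser behaves.

```python
# Hypothetical sketch: CR and LF get no special RS/RE status; they are
# simply two more whitespace characters alongside space and tab.
WHITESPACE = frozenset(" \t\r\n")


def is_whitespace_only(text: str) -> bool:
    """True if every character in the run is space, tab, CR, or LF."""
    return all(ch in WHITESPACE for ch in text)


def keep_text(text: str, in_element_content: bool) -> bool:
    """Decide whether a text run is passed to the application as data.

    in_element_content is an assumed flag meaning the parent element is
    declared to contain only child elements. Whitespace-only runs there
    are dropped; everywhere else the run is kept byte for byte, line
    ends included. (The sketch does not validate the content model.)
    """
    if in_element_content and is_whitespace_only(text):
        return False
    return True


if __name__ == "__main__":
    samples = [
        ("\r\n  ", True),                  # indentation between child elements
        ("line one\r\nline two", False),   # data containing a CRLF
        ("\n", False),                     # a lone LF in mixed content
    ]
    for text, flag in samples:
        print(repr(text), flag, keep_text(text, flag))
```

Under this rule the parser never has to decide whether a given line end "counts": outside element content it always does, and any verbatim-versus-flowed distinction is left to the stylesheet or the application.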
Treating line ends as plain whitespace would simplify both entity
management and parsing. It would mean that line-ending convention
transformations would no longer be no-ops, as they change the byte stream,
but this is not such a bad idea. The worst problem would be dealing with
old IBM iron. You could do OK by a couple of methods, but none of them
would be completely natural with the standard tools (as far as my old
mainframe memories go, at any rate). And you gain character-stream address
stability.

I think that the byte-stream file has won, and we should just treat CR and
LF as more whitespace bytes. People who are creating "example" or
"verbatim" tags just need to use the stylesheet to declare what's up, or
leave it to the application to decide which whitespace differences are
significant. Note that this is how tabs are already treated, and despite a
little pain, the world manages to muddle on.

  -- David

--------------------------------------------+--------------------------
David Durand                  dgd@cs.bu.edu | david@dynamicDiagrams.com
Boston University Computer Science          | Dynamic Diagrams
http://www.cs.bu.edu/students/grads/dgd/    | http://dynamicDiagrams.com/
Received on Saturday, 14 September 1996 16:02:01 UTC