Are Heresies Allowed? (Was: RS/RE)
At 12:24 PM 9/13/96 -0400, David Durand wrote:
>I think that the rules for RS/RE processing in SGML are confusing.
I prefer Dan Connolly's "phase-of-the-moon" characterisation, myself:-)
>I'd like to make sure that they are _never_ invoked.
Aside the (religious) issue of 8879-compliance, can anybody explain why
they are necessary? For instance, exactly what are RS and RE in a byte
stream over TCP? Sorry, RS/RE has "fuddy-duddy" written all over it (can
you say "punch card is all I grok"?)
The basic problem has always been to allow relatively "free-form" markup,
mainly for neatness and readability. If RS/RE had really been an issue
(as opposed to grandfathering ante-diluvian kludgery) the solution was to
allow only *one tag per record*, at the beginning (modulo leading WS?):
this would have given relevant if not functional meaning to the term
"record". But all we have now is a bunch of rules just getting in the way.
May I propose a focus on relevant concepts rather than rules. In the
context of essentially free-form text and markup, exactly what does a
1. Is it part of the instance text?
2. Is it (processable) markup?
3. Is it an artifact of the storage strategy the environment was too
brain-dead not to have encapsulated?
In some ways, #3 is a special case of (in the sense of imposition on) #2.
And #1 is clearly unworkable. So, if we want simple and strong rules (the
kind that establish a two-way relation between ease of programming and
ease of understanding) the rational approach IMHO is to treat these animals
as markup always. When needed as instance text, an inline escape mechanism
should suffice (how about '\' as MSSCHAR?). The problem is reduced to one
of lexical tokenization, which is where I believe it always belonged.
>for SGML is that we could assign them codes (in the delcaration) that the
>entity manager would _never_ see or produce. For instance an 8-bit
>entity manager would assing RS to code 256 and RE to code 257. The
>codes for CR and LF would be declared to be the same class as TAB and
>SPACE. For SGML this fits the letter of the standard, but parsers will
>currently choke, or assume that a wider character set is in use.
> For XML we can simplify things by just treating LF and CR (well,
>actually their 10646 equivalents) as whitespace. We can simply skip
>the entire notion of non-significant and significant RS/RE entirely.
> -- David