Re: XML character sets: the hard-minimalist manifesto from Steven J. DeRose on 1996-09-13 (w3c-sgml-wg@w3.org from September 1996)

From: Steven J. DeRose <sjd@ebt.com>
Date: Fri, 13 Sep 1996 17:45:43 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <2.2.32.19960913214543.00974fe4@kirk.ebt.com>

At 08:24 PM 09/13/96 GMT, Gavin Nicol wrote:
>I have never argued that all XML parsers needs to support all
>encodings. I have only said that we should give XML parsers freedom to
>support the encodings they choose to (I could even go as far as to say
>that we might perhaps require that all XML parser be able to at least
>*accept* data in UTF-8 and UCS-2, though that would also prescribe
>certain implementation details, which I feel to be undesirable).

What if we were to:

   a) require parsers to accept UTF-8 input
   b) recommend supporting UCS-2 input
   c) permit supporting any other encoding as input
   d) require parsers to be able to write out a UTF-8, structurally equivalent,
      canonical XML for their input (they may provide any other output/API
forms 
      they want, but this one would be required)

Some big wins from a canonical form, especially one that is in a specific
character set and encoding:

a) Once you've parsed an XML document once, all further parses must produce
an absolutely byte-identical copy of the XML document that came back from
that first parse. I believe the right term for this is "idempotence". This
makes parser conformance testing trivial.

b) Authenticating parser output becomes a reliable way to authenticate the
data. A recipient can tell if it got the right information, even if the
sender expressed it somehow differently but it reduces to the same thing
(even if we eschew all minimization, we'll probably allow multiple spaces
between attribute values, and other non-information-bearing alternatives).

c) It forces us to enumerate exactly what aspects of the *representation*
consitute information, and what aspects do not. For example, if SGML had a
canonical form in which spaces around the "=" in attribute specification
lists were gone, it could easily state that no process may depend upon such
spaces to make some processing distinction: all it need state is that
"processig semantics must be driven only by information in the canonical form". 

<digression type="nerd.linguist">(c) is almost exactly like Chomsky's early
insistence that semantics be determined at the level of deep syntactic
structure, and that syntactic transformations do not change the formal
meaning of sentences. This proved untenable in natural language for
discourse and other reasons, but it can easily be maintained in formal
languages (like all computer languages).</digression>

Steve

Received on Friday, 13 September 1996 17:47:42 UTC