- From: Steven J. DeRose <sjd@ebt.com>
- Date: Fri, 13 Sep 1996 17:45:43 -0400
- To: w3c-sgml-wg@w3.org
At 08:24 PM 09/13/96 GMT, Gavin Nicol wrote: >I have never argued that all XML parsers needs to support all >encodings. I have only said that we should give XML parsers freedom to >support the encodings they choose to (I could even go as far as to say >that we might perhaps require that all XML parser be able to at least >*accept* data in UTF-8 and UCS-2, though that would also prescribe >certain implementation details, which I feel to be undesirable). What if we were to: a) require parsers to accept UTF-8 input b) recommend supporting UCS-2 input c) permit supporting any other encoding as input d) require parsers to be able to write out a UTF-8, structurally equivalent, canonical XML for their input (they may provide any other output/API forms they want, but this one would be required) Some big wins from a canonical form, especially one that is in a specific character set and encoding: a) Once you've parsed an XML document once, all further parses must produce an absolutely byte-identical copy of the XML document that came back from that first parse. I believe the right term for this is "idempotence". This makes parser conformance testing trivial. b) Authenticating parser output becomes a reliable way to authenticate the data. A recipient can tell if it got the right information, even if the sender expressed it somehow differently but it reduces to the same thing (even if we eschew all minimization, we'll probably allow multiple spaces between attribute values, and other non-information-bearing alternatives). c) It forces us to enumerate exactly what aspects of the *representation* consitute information, and what aspects do not. For example, if SGML had a canonical form in which spaces around the "=" in attribute specification lists were gone, it could easily state that no process may depend upon such spaces to make some processing distinction: all it need state is that "processig semantics must be driven only by information in the canonical form". <digression type="nerd.linguist">(c) is almost exactly like Chomsky's early insistence that semantics be determined at the level of deep syntactic structure, and that syntactic transformations do not change the formal meaning of sentences. This proved untenable in natural language for discourse and other reasons, but it can easily be maintained in formal languages (like all computer languages).</digression> Steve
Received on Friday, 13 September 1996 17:47:42 UTC