- From: David G. Durand <dgd@cs.bu.edu>
- Date: Wed, 22 Jan 1997 00:04:20 -0500
- To: w3c-sgml-wg@www10.w3.org
At 1:36 PM 1/21/97, Christopher R. Maden wrote: >[Michael Sperberg-McQueen] > >> The spec says "The specific SGML declaration needed to enable SGML >> systems to process XML documents will vary from document to >> document" depending on the character encoding of the document's >> entities. It may also vary with the SGML system's understanding and >> implementation of 8879's rules for character handling, which seem to >> give rise to wildly varying interpretations. > >I consider this a serious flaw in the spec. I think it's kind of unavoidable, since SGML character handling is such a mess (in practice) that I doubt we could find any 16 bit solution that would work on all systems -- possibly even any single declaration that would work on all systems. The elegant idea of a declaration as a document specific data specification has long been replaced by the ugly practice of the declaration as dependent on both processor and input, in my experience. I may be clueless here, but I thought the idea was that it should be possible to process XML with an SGML processor -- I don't remember any _requirement_ that you do it the same way with different SGML processors. Obviously it's not a good idea to have too many ways, but two declarations seems acceptable. <soapbox>Of course, this is a new way to see the basic point that dealing with character numbers at all is inherently fragile. We could solve this problem as well by structured use of SDATA (as we have made structured use of PI).</soapbox> >> Systems with no internal representation for strings of 16 and 32 >> bits would appear not to be capable of handling XML. > >I don't believe so. All the markup characters can be recommended in 8 >bits - unsupportable data characters would have some fallback behavior >when the document is translated from its native encoding to an 8-bit >encoding. There would be data loss, but it should not be silent. I'm being clueless again, but isn't there a way out for UTF-8 encoded files, if we stay away from numeric character references? As long as every character that can occur in UTF-8 is allowed in NAMECHAR we would be home free, wouldn't we? Is that possible, or do UTF-8 strings include all possible 8-bit character codes? It's clear that such a processor would not be able to detect references to Unicode characters not allopwed in the 16 bit declaration, but if we have only to worry about adding 8-bit high chars to the declaration, we might be OK. -- David I am not a number. I am an undefined character. _________________________________________ David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com Boston University Computer Science \ Sr. Analyst http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams --------------------------------------------\ http://dynamicDiagrams.com/ MAPA: mapping for the WWW \__________________________
Received on Tuesday, 21 January 1997 23:57:07 UTC