Re: questions on XML sgml decl's charsets from David G. Durand on 1997-01-22 (w3c-sgml-wg@w3.org from January 1997)

From: David G. Durand <dgd@cs.bu.edu>
Date: Wed, 22 Jan 1997 00:04:20 -0500
To: w3c-sgml-wg@www10.w3.org
Message-Id: <v02130502af0b4ab55f64@[205.181.197.81]>

At 1:36 PM 1/21/97, Christopher R. Maden wrote:
>[Michael Sperberg-McQueen]
>
>> The spec says "The specific SGML declaration needed to enable SGML
>> systems to process XML documents will vary from document to
>> document" depending on the character encoding of the document's
>> entities.  It may also vary with the SGML system's understanding and
>> implementation of 8879's rules for character handling, which seem to
>> give rise to wildly varying interpretations.
>
>I consider this a serious flaw in the spec.

I think it's kind of unavoidable, since SGML character handling is such a
mess (in practice) that I doubt we could find any 16-bit solution that
would work on all systems -- possibly even any single declaration that
would work on all systems. The elegant idea of a declaration as a
document-specific data specification has long been replaced by the ugly
practice of the declaration as dependent on both processor and input, in my
experience.

  I may be clueless here, but I thought the idea was that it should be
possible to process XML with an SGML processor -- I don't remember any
_requirement_ that you do it the same way with different SGML processors.
Obviously it's not a good idea to have too many ways, but having two
declarations seems acceptable.

<soapbox>Of course, this is a new way to see the basic point that dealing
with character numbers at all is inherently fragile. We could solve this
problem as well by structured use of SDATA (as we have made structured use
of PI).</soapbox>

>> Systems with no internal representation for strings of 16 and 32
>> bits would appear not to be capable of handling XML.
>
>I don't believe so.  All the markup characters can be recommended in 8
>bits - unsupportable data characters would have some fallback behavior
>when the document is translated from its native encoding to an 8-bit
>encoding.  There would be data loss, but it should not be silent.

I'm being clueless again, but isn't there a way out for UTF-8 encoded
files, if we stay away from numeric character references? As long as every
character that can occur in UTF-8 is allowed in NAMECHAR we would be home
free, wouldn't we? Is that possible, or do UTF-8 strings include all
possible 8-bit character codes? It's clear that such a processor would not
be able to detect references to Unicode characters not allowed in the 16
bit declaration, but if we have only to worry about adding 8-bit high chars
to the declaration, we might be OK.
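
The byte-value question can be checked mechanically. What follows is a
minimal sketch, not a statement about any SGML system: it encodes every
Unicode scalar value as UTF-8 under the current definition (RFC 3629, which
is later than this message and caps sequences at four bytes) and collects
the byte values that turn up. The function name utf8_byte_values is made up
for illustration.

```python
def utf8_byte_values():
    """Collect every byte value that occurs in the UTF-8 encoding
    of some Unicode scalar value (assumes the RFC 3629 definition)."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        seen.update(chr(cp).encode("utf-8"))
    return seen


seen = utf8_byte_values()
unused = sorted(set(range(256)) - seen)
print(len(seen), "byte values occur")
print("never occur:", [hex(b) for b in unused])
```

Under that assumption, 243 of the 256 byte values occur; the ones that never
do are 0xC0, 0xC1 and 0xF5 through 0xFF. So UTF-8 strings do not include all
possible 8-bit codes, but they include nearly all of the high ones: every
value from 0x80 to 0xBF and from 0xC2 to 0xF4. Under the older definition
with sequences of up to six bytes, lead bytes up to 0xFD were also possible
and only 0xFE and 0xFF were excluded outright.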


  -- David

I am not a number. I am an undefined character.
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
Received on Tuesday, 21 January 1997 23:57:07 UTC