Re: I18N issue needs consideration

It would appear that I'm the only one in the world who thinks it
would be desirable to specify 16-bit quanta for character passing,
and the use of the Unicode past-BMP scheme.  Oh, well, not quite: the
people who build Java, Netscape, and Windows/NT also take that view.
Now perhaps we work to a higher standard of purity here, but it seems
highly questionable to send XML charging off in a direction
that's incompatible with actual industry practice.

Several have asserted that we should just say nothing.  While we
have not undertaken the task of an XML API, specifying character 
quanta is a very small API chunk with a huge reward in 
interoperability.  The supposed benefit of saying nothing is increased
abstraction.
We don't want abstraction, we want lightweight, working, interoperable
applications.  The #1 difference between SGML and XML is that
we abandoned abstract syntax.  There is no stronger case for abstract
syntax in characters than in markup delimiters (for XML - for 
SGML, abstraction is obviously the way to go).

Some have asserted that for past-BMP chars, the char references
should be in one chunk (e.g. �, which is from the Unicode
surrogate area, but is not a real example because there are no such
characters yet). This seems indubitably correct, but for interoperability 
the processor should still pass two 16-bit surrogates to the app.
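
For concreteness, the surrogate arithmetic is dead simple.  Here is a
minimal sketch in Java (the arithmetic comes straight from the Unicode
surrogate mechanism; the class and method names are mine, purely for
illustration):

    // Map a past-BMP scalar value, e.g. from a char reference like
    // &#x10000;, to the surrogate pair a processor would pass to the
    // application as two 16-bit quanta.
    public class SurrogateSketch {
        static char[] toSurrogatePair(int scalar) {
            if (scalar < 0x10000 || scalar > 0x10FFFF)
                throw new IllegalArgumentException("not past-BMP");
            int v = scalar - 0x10000;                  // 20-bit value
            char hi = (char) (0xD800 + (v >> 10));    // top 10 bits
            char lo = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits
            return new char[] { hi, lo };
        }
        public static void main(String[] args) {
            char[] p = toSurrogatePair(0x10000);
            System.out.println(Integer.toHexString(p[0]) + " "
                + Integer.toHexString(p[1]));          // d800 dc00
        }
    }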

[Note for fans of *really* exotic characters - the Unicode surrogate
mechanism sets aside 131K character positions for
private-use characters.]

There have been several assertions, without supporting arguments, that
we should go the ISO flat-31-bit-space route.  Unless I hear some
good reasons, wasting 50% of the character-passing bandwidth in
order to support 0.00005% of characters - characters that have never
heretofore been available to computers - just seems like rank stupidity.
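
To put numbers on that 50%: in a flat 31-bit space every character
ships in a 32-bit quantum, and for BMP-only text (i.e. virtually all
real text) the upper half of each quantum is always zero.  A
back-of-the-envelope sketch (the document size is made up):

    public class Bandwidth {
        public static void main(String[] args) {
            long chars = 1000000L;     // a million-character document
            long bytes16 = chars * 2;  // 16-bit quanta
            long bytes32 = chars * 4;  // 32-bit quanta (flat 31-bit space)
            // Half of the 32-bit stream is guaranteed-zero padding.
            System.out.println(bytes32 - bytes16);  // 2000000 wasted bytes
        }
    }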

Having said all this, let's take some points in context.

Dave Peterson:
>It only makes sense to represent
>high-order 10646 characters via a single long numeral, such as up to
>eight digits hex.

I agree.

Dave again:
>I heartily agree that we should not be prescribing the representation
>of characters used internally within a software system, including
>between its components (like between the XML-processor and an application
>coupled thereto).

James Clark:
>It should be able to pass any representation of the character it finds
>convenient.

As assertions, these aren't good enough.  The benefit of specifying
the character representations is an immense increase in international 
interoperability.  Is there a remotely comparable cost?

James again:
>Not all scripts have combining characters.   If I am working with a script
>that doesn't have combining characters and does use a lot of characters
>outside the BMP (Chinese for example), then it would make sense to use
>internally a 32-bit fixed width encoding.

No, because you will be wasting 50% of all your internal
character buffers.  Maybe memory is just not an issue in these apps?
It is not in fact the case that Chinese texts "use a lot of characters
outside the BMP" - the Chinese apps of today use none, and they
seem to get by.
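
A quick sanity check, in Java (whose String type is exactly the
16-bit-quanta model under discussion; the two sample characters are
arbitrary everyday Han ideographs):

    public class Han {
        public static void main(String[] args) {
            String s = "\u4E2D\u6587";  // two common Han characters
            // Both live in the BMP: one 16-bit char each, no surrogates.
            System.out.println(s.length());         // 2
            System.out.println((int) s.charAt(0));  // 20013 (0x4E2D)
        }
    }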

Anyhow, a similar argument could run: "I know I'm processing English,
so I'll just use 7-bit ASCII for everything".  I think that neither
this nor the Chinese-only attitude described above is in the spirit of
what we've done so far, and we want to build a powerful disincentive
to this kind of sloppiness into the spec.

James again:
>The place where this needs to be addressed is when you do a binding of the
>DOM to a particular programming language.  

Right... but I had hoped for XML apps to be interoperable through
APIs other than the DOM.

Dan Connolly:
>A character encoding scheme over some repertoire is an algorithm or
>function that maps a sequence of octets into a sequence of characters 
>in the repertoire.  On the other hand, a coded character set C over some 
>repertoire maps each character H in the repertoire to a non-negative 
>integer called a code...

Dan is catching me in a gross error - in my original post on this
I discussed "encodings", which is bogus... I wasn't talking about
the actual encodings in the entity, I was talking about what the
processor, having read the entity, passes to the app.  Dan's 
discussion of encoding is way more precise than what's in the spec;
it is a useful question (but not the one we're having here) whether
the spec should be recast this way, or left as it is, where
it basically punts on encodings and says the processor should do the
best it can, but pass Unicode/10646 chars to the application.
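
To make that distinction concrete, here is a sketch in Java, with
UTF-8 standing in for the entity's encoding (the spec prescribes none
of the names here; this is just one way a processor might behave):

    import java.nio.charset.StandardCharsets;

    public class PassToApp {
        public static void main(String[] args) {
            // What's in the entity: an encoding, i.e. a sequence of octets.
            byte[] octets = { (byte) 0xC3, (byte) 0xA9 };  // UTF-8 for U+00E9

            // What the processor does: decode the octets however it must...
            String decoded = new String(octets, StandardCharsets.UTF_8);

            // ...and what it passes to the app: Unicode/10646 chars,
            // as 16-bit quanta.
            System.out.println(decoded.length());         // 1
            System.out.println((int) decoded.charAt(0));  // 233 (0xE9)
        }
    }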

Dan again:
>A character is an atomic unit of communication;
>it is not composed of bits. A character
>can be encoded by a sequence of octets, or represented by
>a code position (an integer) in a coded character set.

That's the key disagreement.  The analogy to SGML is clear; SGML
says an element is an abstract thingie in a document that can be
delimited by any of an infinite number of different syntaxes, or
not delimited at all in the case of minimization. XML says an
element is something that is delimited by tags with a fixed
syntax.  The position I'm advancing is that XML make the same
deliberate abandonment of abstraction at the character level,
saying characters are indeed the bit patterns described in
Unicode, with the semantics and processing characteristics
described in Unicode, and that's all there is to it.  I would
*not* support this position for SGML.

Gavin Nicol:
>Also, intuitively, this makes
>sense, because a character *is* an abstract object.

In XML it doesn't have to be.

Postscript: It would be kind of nice if the representatives of companies 
on this list, who have collectively invested billions of dollars in 
Unicode-compliant APIs, would step forward to explain why they think this 
is a good idea.

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-708-9592
