Re: I18N issue needs consideration from Gavin Nicol on 1997-06-13 (w3c-sgml-wg@w3.org from June 1997)

From: Gavin Nicol <gtn@eps.inso.com>
Date: Fri, 13 Jun 1997 14:38:31 -0400
To: tbray@textuality.com
CC: w3c-sgml-wg@w3.org
Message-Id: <199706131838.OAA25791@nathaniel.eps.inso.com>
[It is interesting to me to note that you respond to many of the points
I made first, within the content of quotations from others]

>It would appear that I'm the only one in the world who thinks it
>would be desirable to specify 16-bit quanta for character passing,
>and the use of the Unicode past-BMP scheme.  Oh, well, not quite, the 
>people who build Java, Netscape, and Windows/NT also take that view.  
>Now perhaps we work to a higher standard of purity here, but it seems 
>highly questionable to take XML charging off in a direction 
>that's incompatible with actual industry practice.

It is not incompatible at all. The model I outlined (characters as
abstract objects and a flat 31 bit code set) allows Netscape et
al. to do exactly what they are doing today, and to continue doing so
even once some surrogate planes have been defined.
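
A minimal sketch of that compatibility, in Python. The language and the
function names are illustrative assumptions, not anything proposed in this
thread. It shows a flat code point being split into 16-bit units with the
surrogate scheme and rebuilt again, so a 16-bit implementation and a flat
code space describe the same characters. Note that surrogates only reach
code points up to 0x10FFFF, not the whole 31-bit space.

```python
def to_16bit_units(code_point):
    """Split one flat code point into 16-bit units (surrogate scheme)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range reachable through surrogates")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]            # BMP: one unit, value unchanged
    offset = code_point - 0x10000      # 20 bits remain to be split
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def from_16bit_units(units):
    """Rebuild flat code points from 16-bit units.

    Malformed input is not diagnosed in this sketch: an unpaired high
    surrogate is dropped and an unpaired low surrogate is passed through.
    """
    code_points = []
    pending_high = None
    for unit in units:
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit        # wait for the low half
        elif 0xDC00 <= unit <= 0xDFFF and pending_high is not None:
            code_points.append(
                0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
            )
            pending_high = None
        else:
            code_points.append(unit)
            pending_high = None
    return code_points


units = to_16bit_units(0x20000)        # a code point beyond the BMP
print([hex(u) for u in units], [hex(c) for c in from_16bit_units(units)])
```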

>Several have asserted that we should just say nothing.  While we
>have not undertaken the task of an XML API, specifying character 
>quanta is a very small API chunk with a huge reward in 
>interoperability. 

I do not think so, because it precludes certain forms of
implementation. 

>There have been several assertions, without supporting arguments, that
>we should take the ISO flat-31-bit-space mode.  Unless I hear some
>good reasons, wasting 50% of the character-passing bandwidth in
>order to support 0.00005% of characters, which characters have never 
>heretofore been available to computers, just seems like rank
>stupidity.

Only to people who do not understand that this HAS NO EFFECT
WHATSOEVER ON THE CHARACTER PASSING BANDWIDTH. Wake up, and realise
that even though my character code points are 31 bit, my application
DOES NOT have to have an internal 31 bit representation. For companies
that expect a large majority of their customer base to be working in
Latin scripts, UTF-8 would be the encoding of choice to use across all
parts of the system. For Japanese companies, a 16 or 31 bit interface
might be better. I would argue that forcing people to return 16 bits
across the interfaces would, in some cases, lead to a larger memory
footprint.
There are tradeoffs in implementations, and people should be allowed
to weigh them, and decide for themselves what is best.
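
To make the tradeoff concrete, here is a small Python sketch. The helper
names and the sample strings are assumptions chosen for illustration, and
UTF-32 stands in for a fixed 4-byte form. The code points of a string do not
change, while the number of octets needed to carry them depends entirely on
the encoding chosen.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def code_points(text):
    """The abstract code points, independent of any encoding."""
    return [ord(ch) for ch in text]


def encoded_sizes(text):
    """Octets needed to carry the same code points in each encoding."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


latin_sample = "markup"
japanese_sample = "\u65e5\u672c\u8a9e"   # three kanji, all inside the BMP

for sample in (latin_sample, japanese_sample):
    print([hex(cp) for cp in code_points(sample)], encoded_sizes(sample))
```

For the Latin sample UTF-8 is the smallest of the three; for the Japanese
sample the 16-bit form is. Neither choice alters which characters are
being passed.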

>Dave again:
>>I heartily agree that we should not be prescribing the representation
>>of characters used internally within a software system, including
>>between its components (like between the XML-processor and an application
>>coupled thereto).
>
>James Clark
>>It should be able to pass any representation of the character it finds
>>convenient.
>
>As assertions, these aren't good enough.  The benefit of specifying
>the character representations is an immense increase in international 
>interoperability.  Is there a remotely comparable cost?

It is interesting to note again that you didn't add me to the list of
people making assertions, as I was the one Dave was agreeing with. To
answer your assertion: the cost is in implementation. Some people may
wish to implement a system in a particular way (say for example, they
want to add XML support to their email program). The idea that we
*require* them to deal with 16 bits will probably be quite abhorrent
to them, and could force a far greater cost upon them than they would
otherwise have (for example, they could limit the number of possible
encodings they support to only those found within the lower 8 bits of
ISO 10646, and thereby, stick with a char* interface). Similar
cost/benefit scenarios come up for almost any piece of software I can
imagine. At some point, most systems will probably *use* 16 bits or 31
bits, but we should not force them to do so. This is a decision that
should be left in the hands of the developers, who are far more
capable of analysing the cost/benefit ratio.
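
The email-program case can be sketched in Python as follows. This reads
"the lower 8 bits of ISO 10646" as the first 256 code points, which
coincide with ISO 8859-1 (Latin-1); that reading and both function names
are assumptions made for illustration. Such a processor can keep a
one-byte-per-character interface and simply refuse anything beyond it.

```python
def fits_byte_interface(text):
    """True if every code point lies in the first 256 positions."""
    return all(ord(ch) < 0x100 for ch in text)


def to_byte_string(text):
    """Return one byte per character, as a char*-style interface would."""
    if not fits_byte_interface(text):
        raise ValueError("text needs more than 8 bits per character")
    # For these code points the Latin-1 byte equals the code point value.
    return text.encode("latin-1")


print(to_byte_string("caf\u00e9"))
print(fits_byte_interface("\u65e5\u672c\u8a9e"))
```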

>James again
>>Not all scripts have combining characters.   If I am working with a script
>>that doesn't have combining characters and does use a lot of characters
>>outside the BMP (Chinese for example), then it would make sense to use
>>internally a 32-bit fixed width encoding.
>
>No, because you will be wasting 99.999999% of all your internal
>character buffers.  Maybe memory is just not an issue in these apps?
>It is not in fact the case that Chinese texts "use lots of characters
>outside the BMP" - in fact, all the Chinese apps of today use none, and
>they seem to get by.  

Well, actually, sometimes 31 bits would be perfectly valid (for
example, when processing ancient Chinese texts), and you would not be
"wasting" any memory at all. The application designers can choose the
cost/benefit ratio. Also, your last statement is quite incorrect. 
Current Chinese usage includes thousands of characters outside the
BMP, and mathematical works include many more. To quote one fellow
from the last Unicode conference "there will never be enough
ideographic characters".

>Anyhow, a similar argument could say "I know I'm processing English,
>thus I'll just use 7-bit ASCII for everything".  I think that neither
>this nor the Chinese-only attitude described above is in the spirit of
>what we've done so far, and we want to build a powerful disincentive
>to this kind of sloppiness into the spec.

I do not consider this sloppiness. A few years ago, I was like you,
and really wanted to start moving the WWW over to Unicode as soon as
possible. I still see that as a worthy goal, but I have come to
appreciate that in some cases, Unicode is too much, and in others,
nowhere near enough. Do you wish to require that *all* processors
return 16 bit? How about a parser in TCL, Python, etc?

>James again
>>The place where this needs to be addressed is when you do a binding of the
>>DOM to a particular programming language.  
>
>Right... but I had hoped for XML apps to be interoperable for other
>APIs than the DOM.

And as you know, I suggested we just have the DOM say "string",
and nothing more. This is perfectly reasonable, I believe.

>it is a useful argument (but not the one we're having here) whether
>the spec should be recast this way, or left in the way it is, where
>it basically punts encodings and says the processor should do the
>best it can, but pass Unicode/10646 chars to the application.

I hope you have not recast the spec to conform to your viewpoint
without taking note of the people who dissent.

>>A character is an atomic unit of communication;
>>it is not composed of bits. A character
>>can be encoded by a sequence of octets, or represented by
>>a code position (an integer) in a coded character set.
>
>That's the key disagreement.  The analogy to SGML is clear; SGML
>says an element is an abstract thingie in a document that can be
>delimited by any of an infinite number of different syntaxes, or
>not delimited at all in the case of minimization. XML says an
>element is something that is delimited by tags with a fixed
>syntax.  The position I'm advancing is that XML do the same
>deliberate abandonment of abstraction at the character level,
>saying characters are indeed the bit patterns described in
>Unicode, with the semantics and processing characteristics
>described in Unicode, and that's all there is to it.  I would
>*not* support this position for SGML.

But then you have moved away from "character". This is then called a
code point, from which it will soon degrade into a discussion of the
encoding of those code points.

>Gavin Nicol:
>>Also, intuitively, this makes
>>sense, because a character *is* an abstract object.
>
>In XML it doesn't have to be.

Well then you are redefining what the Unicode and ISO folks have all
agreed upon as a character.

>Postscript: It would be kind of nice if the representatives of companies 
>on this list, who have collectively invested billions of dollars in 
>Unicode-compliant APIs, would step forward to explain why they think this 
>is a good idea.

I will tell you that Inso has invested heavily in Unicode, at a
considerable cost. This support is still far from perfect, partly
because underlying system support is very spotty. We made the
transition because we were interested in moving to a more robust
architecture for International content delivery. I guess this is the
prime motivation for everyone. I should note that we are already
finding some cases where Unicode is not enough. The costs and
difficulties of the transition should certainly not be underestimated
either. 

I am not saying that I do not think Unicode is a good idea. I am a
strong Unicode supporter. I do not think a *language* specification
has any business placing requirements on *implementations*. Even Java
doesn't *require* implementations to pass Unicode through the FFI.
In many cases, Unicode may be the best thing to have, in others, it is
not. Let the people who can make the decisions, make them in freedom.

Finally, I should note that this discussion mirrors a discussion we
had a *long* time ago on HTML WG. I'm sure that many people here
remember that episode (Dan's analysis was very useful then too, and I
had many interesting discussions with Rick), and do not wish to repeat
it.
Received on Friday, 13 June 1997 14:39:15 UTC