Re: I18N issue needs consideration

From: James Clark <jjc@jclark.com>
Date: Sat, 14 Jun 1997 12:32:25 +0700
Message-Id: <199706140539.XAA07810@jclark.com>
To: <w3c-sgml-wg@w3.org>

> For example, if a script is iterating or counting the characters in a text
> object that was retrieved from the DOM, doesn't the result depend on the
> encoding of the characters in the text object as presented by the DOM (which
> may be different from their representation internally)?  If the DOM doesn't
> specify a more specific encoding, doesn't it open the way for one
> implementation to say that it uses UTF-8 encoding for text content returned
> from the DOM, and another say that it uses Unicode code points, and a third
> DOM implementation to have its strings composed of 31 bit characters?  Won't
> the scripts executing on the different implementations have radically
> different behavior?

If a script is supposed to be iterating over *characters* then the encoding
of the characters is completely irrelevant.  Whether a character is encoded
as UTF-8 or UTF-16 or UCS-4, it's still a single character.  Iterating over
a sequence of characters is not the same as iterating over the objects that
encode the characters (bytes or 16-bit words or whatever).
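To make the distinction concrete, here is a minimal sketch in Python, chosen only because its str type is a sequence of code points, which is roughly the sense of "character" used above. The helper name code_unit_counts and the sample string are hypothetical and not part of any DOM binding.

```python
def code_unit_counts(text: str) -> dict:
    """Count characters and the code units each encoding needs for them."""
    return {
        # A Python 3 str is indexed by code point, not by byte or 16-bit word.
        "characters": len(text),
        # One code unit is 1 byte in UTF-8, 2 bytes in UTF-16, 4 in UCS-4.
        "utf8_units": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-be")) // 2,
        "ucs4_units": len(text.encode("utf-32-be")) // 4,
    }


sample = "na\u00efve"  # "naive" with a diaeresis on the i: five characters
counts = code_unit_counts(sample)
```

The character count stays at five whichever encoding is chosen; only the number of encoding objects changes (six bytes in UTF-8, five units in UTF-16 and UCS-4). A script that iterates over characters sees the same sequence in every case.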

If a character is outside the BMP and so requires two 16-bit objects to
encode it in UTF-16, it's still one character, not two.  It should be
completely invisible to a DOM user whether a character is inside or outside
the BMP.  An object model that pretended that a character outside the BMP
was two "characters" would, in my view, be totally broken.  The result of
such an object model would be that many applications would fail to work
properly on characters outside the BMP.
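As an illustration of the failure mode, the sketch below (Python again, with hypothetical helper names) contrasts a view that exposes UTF-16 words with one that exposes characters. The sample character U+1D11E is just one convenient example of a character outside the BMP.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the 16-bit words that encode text in big-endian UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def characters(text: str) -> list:
    """Return the characters (code points) of text, one item per character."""
    return list(text)


clef = "\U0001D11E"  # a character outside the BMP
text = "a" + clef

units = utf16_units(text)  # three 16-bit words: 0x0061, 0xD834, 0xDD1E
chars = characters(text)   # two characters: "a" and the clef
```

An object model that handed out the three 16-bit words as "characters" would let a script split the pair 0xD834, 0xDD1E in the middle, for example when truncating or reversing the text. The character view has no such seam to cut along.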

I'm not sure I agree with Gavin when he says that all that is needed is a
String type.  I think you need a Character type as well.  I suppose you
could say that a Character will be represented by a String containing a
single character, but I think it would be better to allow each DOM language
binding to choose whether, for that language, a Character is represented by
a one-character string or by a separate data type.  For example, if I
wanted to use the DOM for DSSSL, I would want DOM characters to be
represented as DSSSL characters, not strings.

James
 
Received on Saturday, 14 June 1997 01:39:15 EDT
