Re: I18N issue needs consideration

From: Dave Peterson <davep@acm.org>
Date: Sat, 14 Jun 1997 20:38:25 -0400
Message-Id: <v01540b01afc8acf46104@[207.60.235.15]>
To: <w3c-sgml-wg@w3.org>
At 12:32 PM 6/14/97, James Clark wrote:
>> For example, if a script is iterating or counting the characters in a
>> text object that was retrieved from the DOM,

>If a script is supposed to be iterating over *characters* then the encoding
>of the characters is completely irrelevant.  Whether a character is encoded
>as UTF-8 or UTF-16 or UCS-4, it's still a single character.  Iterating over
>a sequence of characters is not the same as iterating over the objects that
>encode the characters (bytes or 16-bit words or whatever).
>
>If a character is outside the BMP and so requires 2 16-bit objects to
>encode it in UTF-16, it's still one character not two.  It should be
>completely invisible to a DOM user whether a character is inside or outside
>the BMP.  An object model that pretended that a character outside the BMP
>was two "characters" would, in my view, be totally broken.  The result of
>such an object model would be that many applications would fail to work
>properly on characters outside the BMP.

Precisely.  If you're counting the houses on your street, you shouldn't
be counting the bricks in each house.  You shouldn't care whether the
houses are all built from the same number of bricks, and you shouldn't
even care whether the houses are made of brick or of some other
construction material.
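
To make the houses-versus-bricks distinction concrete, here is a minimal
sketch in Python 3. It assumes that "character" means a Unicode code point
(which is what a Python 3 str iterates over). The function names and the
sample string are invented for illustration and are not part of any DOM
API.

```python
def count_characters(text: str) -> int:
    """Count the houses: one per Unicode code point in the string."""
    return sum(1 for _ in text)


def count_code_units(text: str, encoding: str, unit_size: int) -> int:
    """Count the bricks: fixed-size storage units in a given encoding."""
    return len(text.encode(encoding)) // unit_size


# Hypothetical sample: 'a', one character outside the BMP (U+1D11E), 'b'.
sample = "a\U0001D11Eb"

houses = count_characters(sample)

# The "-le" codec variants are used so that no byte-order mark is counted.
bricks = {
    "UTF-8": count_code_units(sample, "utf-8", 1),      # 8-bit units
    "UTF-16": count_code_units(sample, "utf-16-le", 2),  # 16-bit units
    "UCS-4": count_code_units(sample, "utf-32-le", 4),   # 32-bit units
}

print(houses)  # 3
print(bricks)  # {'UTF-8': 6, 'UTF-16': 4, 'UCS-4': 3}
```

The character count stays at 3 whichever encoding is chosen; only the
number of storage units changes, and that is the number a character-level
interface should keep hidden.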

Dave Peterson
SGMLWorks!

davep@acm.org
Received on Saturday, 14 June 1997 20:38:39 EDT
