Re: I18N issue needs consideration

> From: Dave Peterson <davep@acm.org>
 
> At 8:05 PM 6/11/97, Gavin Nicol wrote:
> 
> >I would favor using ISO 10646 as coded character set to use for the
> >SGML declaration for XML, and to specify that the character
> >*repertoire* available within XML, is that of ISO 10646. I could be
> >convinced to line up with Unicode in this regard.
> 
> A character is represented indirectly via a numeric character reference
> using a single numeral per character.  It only makes sense to represent
> high-order 10646 characters via a single long numeral, such as up to
> eight digits hex.

I agree with Dave 100%, and have suggested something similar before.
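
To make the writing side concrete, something along these lines is all that is
needed (a rough Java sketch, since Java comes up below; the method name and the
code position in the comment are mine, just for illustration, not a proposal):

    // Rough sketch: any ISO 10646 code position becomes one numeric
    // character reference, i.e. one hex numeral per character, up to
    // eight digits, and never a surrogate pair.
    static String charRef(int codePosition) {
        // e.g. a made-up high position 0x2000B comes out as "&#x2000B;"
        return "&#x" + Integer.toHexString(codePosition).toUpperCase() + ";";
    }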

Let's not sell our birthright for a mess of pottage.  We can thumb our noses
at all the character set people!  They spend all their lives on a futile and
pathetic exercise of trying to stuff more and more characters into character
sets: they make everything break whenever they discover they have run out
of room and need new encodings.

But rather than saying "oh, perhaps we are complete idiots, and there
is a better way", they keep on going: more encodings, always ending up
turning a character set into yet another encoding, rather than allowing
unbounded expansion using markup.

This question already arose when Mr Makoto alerted us to the JIS level three 
and four character set proposals. At that time I proposed:

* XML 1.0 should only use 16-bit Unicode characters as the document character set;
* all other characters, including the surrogate characters, should be references.
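
On the reading side the consequence is pleasantly dull: character data in the
document is always plain 16-bit characters, and anything bigger only ever
arrives as a reference, whose value the parser can read into an int. A rough
sketch (Java again; the method name is made up):

    // Sketch: resolve the digits of a hex character reference,
    // e.g. the "2000B" from a hypothetical "&#x2000B;", into a code
    // position.  The value may exceed 0xFFFF; what to hand the
    // application is then the implementor's business.
    static int resolveHexCharRef(String hexDigits) {
        return Integer.parseInt(hexDigits, 16);
    }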

Let's *target* XML 1.0 at Java 16-bit character systems, as far as parsing goes. And
let's try to have the parser deal in full characters, not partial characters, by not
allowing surrogate-range characters.  (So no surrogate characters in markup.)
I am not saying we should mandate or specify 16-bit for processors, just for documents.
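
The check I have in mind is nothing more than this (a minimal Java sketch; the
method is hypothetical and only illustrates rejecting surrogate-range code
units in markup):

    // Sketch: the parser deals only in whole characters, so any
    // surrogate-range code unit (0xD800..0xDFFF) appearing in markup
    // is simply an error.
    static void checkMarkupChar(char c) {
        if (c >= 0xD800 && c <= 0xDFFF) {
            throw new RuntimeException("surrogate code unit 0x"
                + Integer.toHexString(c) + " not allowed in markup");
        }
    }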

> >However, I most certainly do *NOT* think that we have any business
> >defining what the processor hands back. 

Yes. This is another issue: as long as the input uses references for all non-Unicode 2.0
characters, what the parser passes to the application is a matter for implementors.

Rick Jelliffe

Received on Thursday, 12 June 1997 02:48:33 UTC