- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Thu, 12 Jun 1997 16:45:36 +1000
- To: <w3c-sgml-wg@w3.org>, "Dave Peterson" <davep@acm.org>
> From: Dave Peterson <davep@acm.org>
> At 8:05 PM 6/11/97, Gavin Nicol wrote:
>
> >I would favor using ISO 10646 as coded character set to use for the
> >SGML declaration for XML, and to specify that the character
> >*repertoire* available within XML, is that of ISO 10646. I could be
> >convinced to line up with Unicode in this regard.
>
> A character is represented indirectly via a numeric character reference
> using a single numeral per character. It only makes sense to represent
> high-order 10646 characters via a single long numeral, such as up to
> eight digits hex.

I agree with Dave 100%, and have suggested similar before.

Let's not sell our birthright for a mess of pottage. We can thumb our
noses at all the character set people! They spend all their lives on a
futile and pathetic exercise of trying to stuff more and more characters
into character sets: they make everything break whenever they discover
they have run out of room and need new encodings. But rather than saying
"oh, perhaps we are complete idiots, and there is a better way" they
keep on going... more encodings, always ending up turning a character
set into another encoding, rather than allowing unbounded expansion
using markup.

This question already arose when Mr Makoto alerted us to the JIS level
three and four character set proposals. At that time I proposed:

 * XML 1.0 should use only 16-bit Unicode characters as the document
   character set;
 * all other characters, including the surrogate characters, should be
   references.

Let's *target* XML 1.0 at Java 16-bit character systems, as far as
parsing goes. And let's try to have the parser deal in full characters,
not partial characters, by not allowing surrogate-range characters. (So
no surrogate characters in markup.) I am not saying mandate or specify
16-bit for processors, just for documents.

> >However, I most certainly do *NOT* think that we have any business
> >defining what the processor hands back.

Yes. This is another issue: as long as the input uses references to all
non-Unicode 2.0 characters, what the parser passes to the application is
a matter for implementors.

Rick Jelliffe
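P.S. To make the proposal concrete, here is a rough sketch of the rule I
have in mind. The class and method names are mine, and it leans on Java
code-point APIs that postdate this message, so treat it as illustration
rather than specification: characters in the 16-bit range pass through
as-is, anything above U+FFFF is written as a hex numeric character
reference of up to eight digits, and surrogate-range code units are
never legal document characters on their own.

    // Sketch only: keep the serialized document within 16-bit Unicode;
    // everything above U+FFFF becomes a numeric character reference.
    public class Bmp16Serializer {

        static String serialize(String text) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < text.length(); ) {
                int cp = text.codePointAt(i);      // joins surrogate pairs
                if (cp > 0xFFFF) {
                    // Non-BMP character: emit a hex reference (up to
                    // eight digits) so the document itself contains
                    // only 16-bit characters.
                    out.append("&#x")
                       .append(Integer.toHexString(cp).toUpperCase())
                       .append(';');
                } else {
                    out.append((char) cp);         // plain 16-bit character
                }
                i += Character.charCount(cp);      // advance 1 or 2 units
            }
            return out.toString();
        }

        // Parser-side check matching the proposal: a lone surrogate-range
        // code unit is not a legal document character.
        static boolean isLegalDocChar(char c) {
            return c < 0xD800 || c > 0xDFFF;
        }

        public static void main(String[] args) {
            // U+10300 OLD ITALIC LETTER A, written here as a surrogate pair.
            System.out.println(serialize("A\uD800\uDF00B")); // A&#x10300;B
        }
    }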
Received on Thursday, 12 June 1997 02:48:33 UTC