Re: I18N issue needs consideration

At 8:05 PM 6/11/97, Gavin Nicol wrote:

>I would favor using ISO 10646 as coded character set to use for the
>SGML declaration for XML, and to specify that the character
>*repertoire* available within XML, is that of ISO 10646. I could be
>convinced to line up with Unicode in this regard.

A numeric character reference represents a character indirectly, using
a single numeral per character.  So the only sensible way to represent
high-order 10646 characters is with a single long numeral, for example
up to eight hex digits.
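
To make the "single long numeral" point concrete, here is a rough sketch
in Python.  It assumes a hexadecimal reference syntax of the form
`&#xHHHH;` and the 31-bit UCS-4 code space of ISO 10646; the names
`to_char_ref`, `from_char_ref`, and `MAX_UCS4` are made up for
illustration and come from no specification.

```python
import re

# Assumption: ISO 10646 UCS-4 is a 31-bit code space, so eight hex
# digits are always enough for one character.
MAX_UCS4 = 0x7FFFFFFF

# Assumption: references look like "&#x" + 1 to 8 hex digits + ";".
_REF = re.compile(r"&#x([0-9A-Fa-f]{1,8});")


def to_char_ref(code_point: int) -> str:
    """Write one character as one numeral, however large it is."""
    if not 0 <= code_point <= MAX_UCS4:
        raise ValueError(f"outside the 31-bit code space: {code_point!r}")
    return f"&#x{code_point:X};"


def from_char_ref(ref: str) -> int:
    """Read the single numeral back as a code point."""
    match = _REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a hex character reference: {ref!r}")
    value = int(match.group(1), 16)
    if value > MAX_UCS4:
        raise ValueError(f"outside the 31-bit code space: {ref!r}")
    return value
```

The point of the sketch is only that a high-order character such as
0x10000 comes out as one reference, `&#x10000;`, rather than being
split across two or more numerals.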

>However, I most certainly do *NOT* think that we have any business
>defining what the processor hands back. This is purely an
>implementation issue, and not one that belongs in XML-lang. I can
>return a stream of 31-bit characters coded in any number of different
>encodings. I might return them as UTF in my application, or as UCS, or
>as a string encoded using hex digits.
>
>There is one more issue, and that is the question of how the
>application represents/interprets characters. I personally like to
>view characters as a purely abstract object, thereby leaving the
>widest possible choice of implementation strategies, though this does
>not seem to be the model favoured by SGML (this *is* the model for
>HTML).

In fact, this *is* the "new" SGML model.  Personally, I'd like to see
it made official with the TC, not even waiting for the revision.  As
you say, it's the model for HTML--which is one reason that the "new"
SGML model came up for discussion in the first place.  It's highly
appropriate.

(For the record, the "new" SGML model *permits* you to use the document
character set to describe storage and processing representations, but
does not require it.)

I heartily agree that we should not be prescribing the representation
of characters used internally within a software system, including
between its components (like between the XML-processor and an application
coupled thereto).
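
As a small illustration of why the internal representation can be left
open, the sketch below (Python, with hypothetical helper names
`as_utf8`, `as_ucs4`, and `as_hex`) hands back the same abstract
characters in three different forms.  It is only an example of the
idea, not a proposal for what a processor should do.

```python
import struct

# The abstract characters, identified only by their 10646 code points.
code_points = [0x41, 0xE9, 0x4E2D]


def as_utf8(cps):
    # Python's chr() stops at 0x10FFFF, so this form covers only that range.
    return "".join(chr(cp) for cp in cps).encode("utf-8")


def as_ucs4(cps):
    # Four big-endian octets per character.
    return b"".join(struct.pack(">I", cp) for cp in cps)


def as_hex(cps):
    # A string of eight hex digits per character, separated by spaces.
    return " ".join(f"{cp:08X}" for cp in cps)


def from_ucs4(data):
    # Recover the code points from the four-octet form.
    return [struct.unpack(">I", data[i:i + 4])[0]
            for i in range(0, len(data), 4)]
```

All three results describe the same sequence of characters; an
application that receives any one of them can recover the code points,
for instance with `from_ucs4(as_ucs4(code_points))`.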

Dave Peterson
SGMLWorks!

davep@acm.org

Received on Wednesday, 11 June 1997 21:19:21 UTC