
Re: XML character sets: a proposal



>I don't see the problem.  A 7-bit ASCII file is also a UTF-8 file.
>Isn't that the whole point of UTF-8?

What I meant was that people will deal only in 7 bits, and
applications will (generally) deal only in 7-bit character codes.
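
To make the quoted point concrete, here is a small sketch (the sample strings and the helper name is_seven_bit are just made up for illustration). It shows that a 7-bit ASCII byte stream decodes to the same text as UTF-8, while the reverse does not hold, which is where 7-bit-only applications get into trouble:

```python
# A byte sequence restricted to 7-bit values is valid ASCII and valid UTF-8,
# and decodes to the same text either way.
ascii_bytes = b"<greeting>Hello, world</greeting>"  # illustrative sample

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
same_text = as_ascii == as_utf8


def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte fits in 7 bits (hypothetical helper)."""
    return all(b < 0x80 for b in data)


# The reverse does not hold: UTF-8 text outside the ASCII range uses
# bytes with the high bit set, which a 7-bit-only application cannot handle.
non_ascii_bytes = "日本語".encode("utf-8")
```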

>>    2) People who do not use ASCII (SJIS, EUC, JOHAB etc.) will
>>       *ignore* this requirement and implement systems that handle the
>>       encodings they use every day.
>
>XML isn't intended to be convenient to create by hand with a text
>editor.  People are mostly expected to be using SGML/XML editors to
>create.  Maybe for the next years, until support for Unicode comes
>widespread, people who want to create documents containing Asian
>characters with plain text editors will need to run a filter on their
>files after editing them.  That doesn't seem like a big deal to me.
>The important issue seems to me to be that they can represent the
>characters that they need to.

Well, that is quite the opposite to what others have been saying, and
also seems a great way of slowing down XML adoption. I think you'll
find that it will be quite some time before we see a large number of
good, free XML editors (or more importantly, word processors). I
would argue that for the immediate future (or perhaps, for *the*
future of XML), we need to make it as easy as possible for people to
create content *today*. That means making it easy for them to just use
notepad, wordpad, write, emacs, vi, or cat > myfile.xml.

>I don't see how allowing multiple encodings is going to help with
>things like classical Buddhist texts.  Since you've agreed that the
>character repertoire should be ISO 10646, you're not going to be able
>to represent any characters in your encoding that you can't represent
>in UTF-8 or UTF-16.  Support for multiple encodings can't be buying
>you anything more than convenience.

Sure, and convenience is everything. People are lazy.

For characters that are not in ISO 10646 (and there is a large and
increasing number of them) there needs to be a convenient way of
representing and manipulating them.

Obviously, if they are not part of ISO 10646, then one way of mapping
them to ISO 10646 is through the private use area, which can be
encoded using UTF-8 and UTF-16, but then some semantics are lost. This
is theoretically what an XML application for such data would do,
though in actual practice, the application might choose a slightly
different processing model. This extra flexibility is the important
point. 
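
As a rough sketch of the private use area route (the character names and the PRIVATE_USE_MAP table are invented for illustration, not taken from any real registry), one could assign each unencoded character a code point in U+E000..U+F8FF and encode it like any other:

```python
import unicodedata

# Hypothetical private agreement: names of characters missing from
# ISO 10646, each assigned a code point in the private use area.
PRIVATE_USE_MAP = {
    "variant-glyph-001": 0xE000,
    "variant-glyph-002": 0xE001,
}


def to_private_use(name: str) -> str:
    """Return the private use character assigned to a (made-up) name."""
    code_point = PRIVATE_USE_MAP[name]
    if not 0xE000 <= code_point <= 0xF8FF:
        raise ValueError("code point is outside the private use area")
    return chr(code_point)


ch = to_private_use("variant-glyph-001")

# Both encodings carry the code point without trouble.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")

# But the semantics are lost: the character has no standard name, and its
# general category is just "Co" (private use). Only the private table
# says what it means.
category = unicodedata.category(ch)
standard_name = unicodedata.name(ch, None)
```

The bytes round-trip fine, but a receiver without the same table has no way to know what the character is, which is the loss of semantics mentioned above.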



