Re: character sets - a summary and a proposal

>performed, but everyone in the discussion seems willing to make a leap
>of faith here and believe Gavin Nicol when he says it's not very hard.)

At some point, I will try to dig out the wchar_t TEI parser I produced.

>3.  Let 100 Flowers Bloom:  Gavin Nicol and Todd Bauman have argued for a
>third position which I understand to have the following salient points:
>  - XML data streams can be in any known or documentable encoding
>  - XML systems may accept data streams in any format(s) they choose
>    to support; they are encouraged but not required to accept UTF-8

I have no objection to UTF-8 being required as a compromise position.

>  - all XML systems must implement and rely on external specification of
>    the coded character set / encoding, such as MIME or attributes on
>    an FSI
>  - each XML system must support content negotiation so clients and
>    servers can avoid sending or receiving XML data in unsupported
>    encodings

Only for systems that are client/server.

>This position seems, in some ways, to be even more minimalist than Tim
>Bray's, since there is *no* coded character set or encoding which *all*
>XML systems are required to support. 

ISO 10646 would need to be "supported", though there are many levels of

>4.  The Hard Maximalist Position:  this is what I originally understood
>Nicol and Bauman to be arguing for; it's not wholly unlike the apparent
>intent of ISO 8879, as I understand it, though there are some obvious
>differences of detail.

This is actually my preferred path, but it is probably too hard for
most people to implement (I don't think it overly difficult though...) 

>5.  The Eclectic Compromise (DeRose):  a slight extension of the
>Dual-Track approach:

This is a reasonable approach.

>Q1 should there be any minimal function required of all conforming XML
>systems, any coded character set or character encoding they are all
>required to accept as input, whether across the net or from disk?

The coded character set should be ISO 10646. I am willing to accept
UTF-8 as required (I argued exactly that position on HTML-WG a long
time ago).
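
To make the distinction concrete, here is a minimal sketch (in Python, purely for illustration; the variable names are my own) of one ISO 10646 character with a single code point but different byte sequences depending on the encoding chosen:

```python
# Illustration only: one coded character, several encodings.
ch = "\u8a9e"                      # U+8A9E, a CJK ideograph

code_point = ord(ch)               # position in the coded character set
utf8 = ch.encode("utf-8")          # three bytes in UTF-8
utf16 = ch.encode("utf-16-be")     # two bytes in big-endian UTF-16

print(f"U+{code_point:04X}", utf8.hex(), utf16.hex())
```

Requiring UTF-8 fixes only the last step (bytes on the wire or on disk); the coded character set stays ISO 10646 either way.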

>Q2 should conforming XML systems be prohibited from accepting any
>input format they are not required to accept?

Absolutely NOT!

>Q3 if XML systems may accept different sets of input formats (whether
>or not these sets overlap), can we ensure interoperability
>in some way, or is that a lost cause?

Interoperability is something to be greatly desired, and in fact, the
primary reason I got involved in HTML I18N was precisely
that. However, I do not believe that we can, at this time, get to a
point where all XML systems will be able to process all XML
documents. At some point in the future (3-5 years), perhaps. Now, no.

>Q4 if XML systems may *only* accept Unicode (whether just UTF-8 or
>also UTF-16), is there anything that can be done to make life
>easier for users of current systems which rely on Ascii, ISO 8859-1
>or 8859-*, JIS, Shift-JIS, EUC, etc.?

No. Converters can be (perhaps even easily) written. However, as the
old saying goes "You can lead a horse to water, but you can't make him
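
As a rough sketch of how small such a converter can be, assuming the source encoding is already known from an external label (the function name and sample text are hypothetical, and Python's standard codecs stand in for whatever conversion tables a real tool would use):

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Convert bytes in a known legacy encoding to UTF-8.

    Raises UnicodeDecodeError if the data is not valid in the
    stated source encoding.
    """
    text = data.decode(source_encoding)   # legacy bytes -> ISO 10646 characters
    return text.encode("utf-8")           # characters -> UTF-8 bytes


# The same text stored in two legacy encodings converges on one UTF-8 form.
sample = "\u65e5\u672c\u8a9e"
from_sjis = transcode_to_utf8(sample.encode("shift_jis"), "shift_jis")
from_euc = transcode_to_utf8(sample.encode("euc_jp"), "euc_jp")
print(from_sjis == from_euc)
```

The hard part is not the conversion itself but knowing the source encoding reliably and getting people to run the converter.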

>Note on autodetection of character sets.

Autodetection fails abysmally as soon as you get more than a few

>Perhaps we should say that network transmissions (or http transmission)
>should always be in UTF-8, and the other supported formats are only
>for local use on disk ...)

With content negotiation, your application can enforce this if it

From a practical point of view, no matter what we say, I expect the
following to happen:

  1) People will use ISO 10646. Entity sets for ISO 10646 (like
     the SPREAD set) and numeric character references will use it.
  2) People will tend (at least initially) to author by hand. There
     will be a large number of documents in ASCII, ISO 8859-?, SJIS,
     EUC-JP, ISO-2022-?, EUC-KR, KSC, BIG5.
  3) Initial browsers will be written to support Western languages or
  4) Localised versions of such browsers will become available (e.g. pure
     SJIS, EUC or ISO-2022).
  5) Client/server interaction will initially be primarily in the native
  6) Over time, a transition will be made to UTF-8/UTF-16 (i.e. as more
     and better tools become available).

We should recognise and accept this.
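
Points 1 and 2 can coexist, as the following sketch suggests: a document stored in a legacy encoding can still carry numeric character references that resolve to ISO 10646 code points. The function names are hypothetical, the encoding label is assumed to arrive from outside the document, and a real parser would of course handle references during parsing rather than by regex:

```python
import re

# Matches decimal (&#35486;) and hexadecimal (&#x8A9E;) references.
_NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_ncrs(text: str) -> str:
    """Replace numeric character references with the characters they name."""
    def repl(match: "re.Match[str]") -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code_point)   # ValueError if outside the code space
    return _NCR.sub(repl, text)


def load(data: bytes, encoding: str) -> str:
    """Decode with an externally supplied encoding label, then expand references."""
    return expand_ncrs(data.decode(encoding))


# A Shift-JIS document and a pure-ASCII one yield the same characters.
doc_sjis = "\u65e5\u672c&#x8A9E;".encode("shift_jis")
doc_ascii = b"&#26085;&#26412;&#35486;"
print(load(doc_sjis, "shift_jis") == load(doc_ascii, "ascii"))
```

The references mean the same thing whatever the storage encoding, which is why the coded character set can be fixed even while the encodings in use remain varied.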