
Re: Concrete syntax, character sets



>Having just gone through a big struggle in WG8 and X3V1 over the ERCS
>proposal, I would feel pretty strange about limiting markup to
>something that not even Western Europeans could use the way they want
>to.  I would like to see some serious discussion of this point.

I would feel somewhat strange about not supporting native language
markup, particularly as we're going to have to use a variant concrete
syntax to support native language content. It seems to me that the
most reasonable thing to do would be to decide upon a syntax that
used ISO 10646 for both data and markup... 

We had this same discussion in HTML-WG, and I pushed for a syntax that
used ISO 10646 as the document character set. This and other
discussions led to the HTML I18N draft, which is moving towards
Proposed Standard (and will probably be adopted by W3C in some HTML
revision). It seems that in the interest of compatibility, we should
have a similar concrete syntax, though with an extended markup
character repertoire.

The ERCS work that Rick did is very important, and I do not think it
is a great burden for XML browsers to support it, at least to a
minimal degree. In fact, given that we also have content negotiation
in the WWW, and that HTTP 1.1 is becoming somewhat stricter on content
labelling requirements, XML browsers would not need to support any
encodings other than those deemed important by the companies producing
them. 
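
To illustrate the labelling point, here is a minimal sketch in Python of
a receiver that trusts the charset label and supports only the encodings
its vendor chose. The names decode_labelled and SUPPORTED are
hypothetical, and the set of encodings is just an example, not a
recommendation.

```python
from email.message import Message

# Hypothetical: the encodings one particular browser vendor chose to support.
SUPPORTED = {"utf-8", "iso-8859-1", "shift_jis"}

def decode_labelled(body: bytes, content_type: str) -> str:
    """Decode a body using the charset named in its Content-Type label."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased name, or None if absent
    if charset is None:
        raise ValueError("content is not labelled with a charset")
    if charset not in SUPPORTED:
        raise LookupError("unsupported encoding: " + charset)
    return body.decode(charset)
```

A body labelled with an encoding outside SUPPORTED is rejected rather
than guessed at, which is the behaviour stricter labelling makes
possible.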

>It's certainly thinkable to me.  Is it thinkable to say that "all
>markup is in UTF8" as well?

No, it's not, because then you'd also require all content to be in
UTF-8, and many users have no way of creating such data. Even in the
best case, producing such data usually involves a conversion somewhere.

Again, we had the same discussion in HTML-WG. There are many good
reasons for selecting a single document character set, and then just
looking upon SJIS and whatnot as encodings.
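
As a rough sketch of what that means in practice (Python, with a
hypothetical helper name to_code_points): the same text stored as SJIS
or as UTF-8 is different bytes on the wire, but decodes to the same
sequence of code points in the single document character set.

```python
def to_code_points(data: bytes, encoding: str) -> list[int]:
    """Decode bytes from the given encoding into document-character-set code points."""
    return [ord(ch) for ch in data.decode(encoding)]

text = "\u65e5\u672c\u8a9e"          # three Japanese characters
sjis_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")

# The byte sequences differ, but the abstract characters are identical.
assert sjis_bytes != utf8_bytes
assert to_code_points(sjis_bytes, "shift_jis") == to_code_points(utf8_bytes, "utf-8")
```

Markup and data can then both be defined in terms of code points, and
the choice of encoding becomes a transport detail.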

