Re: Concrete syntax, character sets

>>syntax to support native language content. It seems to me that the
>>most reasonable thing to do would be to decide upon a syntax that
>>used ISO 10646 for both data and markup... 
>Well, data for sure.  The problem is, if we keep markup in 7-bit land,
>I'm pretty darn sure I can build an efficient parsing/validating system
>out of flex & yacc in a week or so (because I have), and utf8 data, for
>example, won't break anything.  Now, I'm not sure I *can't* do this if
>we let markup out of the ASCII box, but it's a question that we have to
>think about carefully, because we're in danger of compromising a central
>design goal.

If you don't restrict yourself to pure ASCII, you can still write a
minimal parser in a week, or less, using flex and yacc. Yes, I have
done this, and no, it is not difficult at all (in fact, I used a
variant of the TEI grammars).

>> you'd also require all content to be in
>>UTF8, and many users have no way of creating such data. In the best
>>cases, producing such data usually involves a conversion somewhere.
>Hmm, for us Westerners there's no problem, because the standard SGML
>repertoire is utf8 as it sits (right?).  You're right that your average
>Japanese editing system doesn't emit UTF8... but I still think that it
>might be reasonable to say that "in XML, it must be utf8" - so the
>conversion gets applied on the way in.  Question: conversions such as
>{S,N,EUC}JIS<->UTF8 and Big5<->UTF8 look very easy in principle - are they, 
>in practice?  And does software exist?

You are getting closer to what I am thinking, and indeed, what you
propose would be one scenario in what I have in mind. There are 2
problems here: 1) that some people confuse coded character sets and
encodings, and 2) it depends on what you see an SGML parser

I imagine a system whose parser deals in a single normalised form
(the internal encoding). Whatever the data is actually encoded in
gets converted into this normalised form, and if we have a single
document character set, this conversion can be done blindly. I
recommend that the syntax use ISO 10646 because it covers a great
number of languages. Note that I do not believe it wise to prescribe
what the *internal* representation is, nor the number of possible
encodings an XML parser should handle.
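
A rough sketch of that arrangement, not a real parser: the names
to_internal and tokenize are made up for illustration, Python's
built-in codecs stand in for whatever converters an implementation
would ship, and Python's own string type plays the role of the
internal form. The point is only that the tokenizer never sees the
external encoding.

```python
import re

def to_internal(raw: bytes, declared_encoding: str) -> str:
    """Convert externally encoded bytes to the single internal form.

    Assumption: the external encoding is known up front (declared or
    configured). The internal form here is a Python str, i.e. a
    sequence of ISO 10646 / Unicode code points.
    """
    return raw.decode(declared_encoding)

def tokenize(text: str) -> list:
    """Toy tokenizer: splits text into simple tags and character data.

    It works on code points only, so markup names outside ASCII need
    no special handling at this stage.
    """
    return re.findall(r"</?\w+>|[^<]+", text)

if __name__ == "__main__":
    source = "<題>日本語</題>"
    for encoding in ("shift_jis", "euc_jp", "utf-8"):
        raw = source.encode(encoding)  # simulate a file on disk
        print(encoding, tokenize(to_internal(raw, encoding)))
```

Each of the three inputs differs byte for byte, yet the tokenizer
receives identical input, which is what lets the conversion be done
blindly.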

Note that the tables for writing converters from SJIS, EUC, etc. to
Unicode are publicly available, and it is easy to use them to
automatically generate converters for whatever you wish to handle.
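
To illustrate how little is involved, here is a minimal sketch of a
table-driven converter. The four-line table is an illustrative
excerpt I typed in by hand, laid out as "source code, Unicode code
point, comment"; a real converter would load a complete published
table instead. The names load_table and make_sjis_converter are
hypothetical, and the lead-byte ranges are the usual Shift JIS ones.

```python
# Illustrative excerpt only, not a complete mapping table.
TABLE_TEXT = """
# source   unicode   comment
0x41       0x0041    # LATIN CAPITAL LETTER A
0x93FA     0x65E5    # CJK ideograph
0x967B     0x672C    # CJK ideograph
0x8CEA     0x8A9E    # CJK ideograph
"""

def load_table(text: str) -> dict:
    """Parse 'source unicode # comment' lines into a lookup dict."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        src, dst = line.split()[:2]
        table[int(src, 16)] = int(dst, 16)
    return table

def make_sjis_converter(table: dict):
    """Build a bytes -> str converter from a mapping table."""
    def convert(data: bytes) -> str:
        out = []
        i = 0
        while i < len(data):
            b = data[i]
            # Shift JIS lead bytes introduce a two-byte character.
            if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
                if i + 1 >= len(data):
                    raise ValueError("truncated two-byte character")
                code = (b << 8) | data[i + 1]
                i += 2
            else:
                code = b
                i += 1
            if code not in table:
                raise ValueError("no mapping for 0x%X" % code)
            out.append(chr(table[code]))
        return "".join(out)
    return convert

if __name__ == "__main__":
    sjis_to_unicode = make_sjis_converter(load_table(TABLE_TEXT))
    print(sjis_to_unicode(b"A\x93\xfa\x96\x7b\x8c\xea"))
```

Swapping in a different table (and the matching lead-byte rule) gives
a converter for another encoding without touching the parser.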

>I think there are many good reasons - overwhelmingly good - for the
>SGML case. For XML, we should think about sweeping all the
>charset/encoding subtleties under the rug: it's UTF8 and that's all
>there is to it.

UTF8 doesn't solve the world's problems. I think we can fix the
character repertoire, but fixing the encoding is arbitrary, and
prescribes certain implementation details. It also complicates usage.
