- From: Gavin Nicol <gtn@ebt.com>
- Date: Tue, 10 Sep 1996 19:55:42 GMT
- To: tbray@textuality.com
- CC: w3c-sgml-wg@w3.org
>>syntax to support native language content. It seems to me that the
>>most reasonable thing to do would be to decide upon a syntax that
>>used ISO 10646 for both data and markup...
>
>Well, data for sure. The problem is, if we keep markup in 7-bit land,
>I'm pretty darn sure I can build an efficient parsing/validating system
>out of flex & yacc in a week or so (because I have), and utf8 data, for
>example, won't break anything. Now, I'm not sure I *can't* do this if
>we let markup out of the ASCII box, but it's a question that we have to
>think about carefully, because we're in danger of compromising a central
>design goal.

If you don't restrict yourself to pure ASCII, you can still write a
minimal parser in a week or less using flex and yacc. Yes, I have done
this, and no, it is not difficult at all (in fact, I used a variant of
the TEI grammars).

>>you'd also require all content to be in
>>UTF8, and many users have no way of creating such data. In the best
>>cases, producing such data usually involves a conversion somewhere.
>
>Hmm, for us Westerners there's no problem, because the standard SGML
>repertoire is utf8 as it sits (right?). You're right that your average
>Japanese editing system doesn't emit UTF8... but I still think that it
>might be reasonable to say that "in XML, it must be utf8" - so the
>conversion gets applied on the way in. Question: conversions such as
>{S,N,EUC}JIS<->UTF8 and Big5<->UTF8 look very easy in principle - are
>they, in practice? And does software exist?

You are getting closer to what I am thinking, and indeed, what you
propose would be one scenario in what I have in mind. There are two
problems here: 1) some people confuse coded character sets with
encodings, and 2) it depends on what you see an SGML parser
manipulating. I imagine a system with a parser that deals in a single
normalised form (internal encoding).
Whatever the data is actually encoded in gets converted into this
normalised form, and if we have a single document character set, this
conversion can be done blindly. I recommend that the syntax use ISO
10646 because it covers a great number of languages. Note that I do not
believe it wise to prescribe what the *internal* representation is, nor
the number of possible encodings an XML parser should handle. Note also
that the tables for writing converters from SJIS, EUC, etc. to Unicode
are publicly available, and it is easy to use them to automatically
generate converters for whatever you wish to handle.

>I think there are many good reasons - overwhelmingly good - for the
>SGML case. For XML, we should think about sweeping all the
>charset/encoding subtleties under the rug: it's UTF8 and that's all
>there is to it.

UTF8 doesn't solve the world's problems. I think we can fix the
character repertoire, but fixing the encoding is arbitrary, and
prescribes certain implementation details. It also complicates usage.
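[Editorial illustration: the "normalise on the way in" pipeline Gavin describes can be sketched in modern Python, whose codecs are built from exactly the kind of publicly available mapping tables he mentions. The function names and the legacy-codec list here are illustrative assumptions, not anything from the thread.]

```python
# Sketch: decode whatever encoding the document arrives in (Shift-JIS,
# EUC-JP, Big5, ...) into a single internal normalised form -- Python's
# str, i.e. a sequence of ISO 10646 code points -- then emit UTF-8 for
# interchange. This keeps the internal representation unprescribed: only
# the character repertoire (ISO 10646) is fixed.

# Hypothetical set of external encodings this converter handles; a real
# parser could register as many or as few as it wishes.
LEGACY_CODECS = {"shift_jis", "euc_jp", "big5", "iso_8859_1", "utf_8"}

def normalise(raw: bytes, source_encoding: str) -> str:
    """Convert external data into the internal form (code points)."""
    name = source_encoding.lower().replace("-", "_")
    if name not in LEGACY_CODECS:
        raise ValueError(f"no converter registered for {source_encoding}")
    return raw.decode(name)

def to_utf8(text: str) -> bytes:
    """Re-encode the internal form as UTF-8 for interchange."""
    return text.encode("utf-8")

# A Shift-JIS byte sequence for the Japanese word "日本語" round-trips
# through the internal form and comes out as UTF-8:
sjis_bytes = "日本語".encode("shift_jis")
internal = normalise(sjis_bytes, "shift_jis")
assert to_utf8(internal) == "日本語".encode("utf-8")
```

The point the sketch makes is the one argued above: once the document character set is fixed, the external encoding is just a blind, table-driven conversion applied at the boundary, and the internal representation stays an implementation detail.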
Received on Tuesday, 10 September 1996 15:56:55 UTC