- From: Tim Bray <tbray@textuality.com>
- Date: Tue, 10 Sep 1996 09:27:01 -0700
- To: w3c-sgml-wg@w3.org
I feel I should apologize in advance here; I am keenly aware of the fact that the 7-bit-centric-ness of contemporary computer systems is redolent of a lot of really bad Western attitudes, so arguing against complete i18n generality makes me really uncomfortable. And in fact some of the things that Henry & Gavin & Martin have been saying are very important. But, let's review the design principles. XML has to be *easy to write programs for* - I draw your attention to http://www.textuality.com/sgml-erb/dd-1996-0001.html, clause 4, which says that anyone competent ought to be able to cook up an XML processor in a few days. For this reason, we are probably going to discard a bunch of pieces of SGML that are really potentially very valuable, such as abstract syntax and SHORTREF and LINK - simply because we are happy with the trade-off of increased ease of implementation for loss of functional richness. So even though I may end up agreeing with Gavin, for the moment I'm going to argue the minimalist position, because someone needs to. My informal interpretation of our clause 4 ("It shall be easy to write programs which process XML documents") has been more or less, "I can do it with flex and yacc". Thus, what we need to do is avoid the extremes of (a) retreating into a 7-bit grass hut, and (b) trying to build something handles all the arcana of bit-patterns/glyphs/charsets/encodings. Rather, we want something that hits the right trade-off along the spectrum. Having said all that: At 02:31 PM 9/10/96 GMT, Gavin Nicol wrote: >I would feel somewhat strange about not supporting native language >markup, particularly as we're going to have to use a variant concrete >syntax to support native language content. It seems to me that the >most reasonable thing to do would be to decide upon a syntax that >used ISO 10646 for both data and markup... Well, data for sure. The problem is, if we keep markup in 7-bit land, I'm pretty darn sure I can build an efficient parsing/validating system out of flex & yacc in a week or so (because I have), and utf8 data, for example, won't break anything. Now, I'm not sure I *can't* do this if we let markup out of the ASCII box, but it's a question that we have to think about carefully, because we're in danger of compromising a central design goal. > you'd also require all content to be in >UTF8, and many users have no way of creating such data. In the best >cases, producing such data usually involves a conversion somwhere. Hmm, for us Westerners there's no problem, because the standard SGML repertoire is utf8 as it sits (right?). You're right that your average Japanese editing system doesn't emit UTF8... but I still think that it might be reasonable to say that "in XML, it must be utf8" - so the conversion gets applied on the way in. Question: conversions such as {S,N,EUC}JIS<->UTF8 and Big5<->UTF8 look very easy in principle - are they, in practice? And does software exist? >Again, we had the same discussion in HTML-WG. There are many good >reasons for selecting a single document character set, and then just >looking upon SJIS and whatnot as encodings. I think there are many good reasons - overwhelmingly good - for the SGML case. For XML, we should think about sweeping all the charset/encoding subtleties under the rug: it's UTF8 and that's all there is to it. Once again, if it's not crystal clear - we want anyone who wants to write a program to process XML to be able to go to one place and look at one simple set of rules, and we want the program, once written, to have a very high degree of probability of processing any XML doc in the world *as it sits*. If we get this, and as a consequence there is some proportion of documents that XML cannot address, that's a fair price to pay. E.g., a trade-off such as "If you *REALLY* *MUST* have the documents in the repository in native Shift-JIS, and can't do conversions in & out, then you have to use real SGML, because it has the machinery for that." Once again, I'm *really* not a monoglot cave-dwelling bigot. Cheers, Tim Bray tbray@textuality.com http://www.textuality.com/ +1-604-488-1167
Received on Tuesday, 10 September 1996 12:23:53 UTC