Re: Concrete syntax, character sets

I feel I should apologize in advance here; I am keenly aware of the fact that
the 7-bit-centric-ness of contemporary computer systems is redolent of a lot
of really bad Western attitudes, so arguing against complete i18n generality
makes me really uncomfortable.  And in fact some of the things that Henry
& Gavin & Martin have been saying are very important.

But, let's review the design principles.  XML has to be *easy to write
programs for* - I draw your attention to 
http://www.textuality.com/sgml-erb/dd-1996-0001.html, clause 4, which says
that anyone competent ought to be able to cook up an XML processor in a
few days.  For this reason, we are probably going to discard a bunch of pieces 
of SGML that are really potentially very valuable, such as abstract syntax
and SHORTREF and LINK - simply because we are happy with the trade-off 
of increased ease of implementation for loss of functional richness.  So even
though I may end up agreeing with Gavin, for the moment I'm going to argue
the minimalist position, because someone needs to.

My informal interpretation of our clause 4 ("It shall be easy to write
programs which process XML documents") has been more or less, "I can 
do it with flex and yacc".

Thus, what we need to do is avoid the extremes of (a) retreating into a
7-bit grass hut, and (b) trying to build something that handles all the arcana
of bit-patterns/glyphs/charsets/encodings.  Rather, we want something that
hits the right trade-off along the spectrum.

Having said all that:

At 02:31 PM 9/10/96 GMT, Gavin Nicol wrote:
>I would feel somewhat strange about not supporting native language
>markup, particularly as we're going to have to use a variant concrete
>syntax to support native language content. It seems to me that the
>most reasonable thing to do would be to decide upon a syntax that
>used ISO 10646 for both data and markup... 

Well, data for sure.  The problem is, if we keep markup in 7-bit land,
I'm pretty darn sure I can build an efficient parsing/validating system
out of flex & yacc in a week or so (because I have), and UTF8 data, for
example, won't break anything.  Now, I'm not sure I *can't* do this if
we let markup out of the ASCII box, but it's a question that we have to
think about carefully, because we're in danger of compromising a central
design goal.
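
To illustrate why UTF8 data should not break a 7-bit markup scanner, here is
a minimal sketch in Python (an assumption for illustration; the tools under
discussion are flex and yacc).  It relies on one property of UTF8: every
byte of a multi-byte sequence is 0x80 or above, so the ASCII delimiters "<"
and ">" can never turn up inside a non-ASCII character.  The names
split_markup and multibyte_bytes_are_high are hypothetical, and the scanner
ignores comments, attribute quoting and everything else a real parser needs.

```python
def split_markup(data: bytes):
    """Split raw bytes into ('tag', ...) and ('text', ...) pieces.

    Only the ASCII bytes '<' and '>' are inspected; the data in between
    is passed through untouched and never decoded.
    """
    pieces = []
    pos = 0
    while pos < len(data):
        lt = data.find(b"<", pos)
        if lt == -1:
            # No more markup: the rest is character data.
            pieces.append(("text", data[pos:]))
            break
        if lt > pos:
            pieces.append(("text", data[pos:lt]))
        gt = data.find(b">", lt)
        if gt == -1:
            raise ValueError("unterminated tag")
        pieces.append(("tag", data[lt:gt + 1]))
        pos = gt + 1
    return pieces


def multibyte_bytes_are_high(text: str) -> bool:
    """True if every byte of every non-ASCII character is >= 0x80 in UTF-8."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )


doc = "<p>日本語 and café</p>".encode("utf-8")
pieces = split_markup(doc)
```

The scanner never decodes the content, which is the point: markup stays in
the ASCII box and the data rides through as opaque bytes.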

> you'd also require all content to be in
>UTF8, and many users have no way of creating such data. In the best
>cases, producing such data usually involves a conversion somewhere.

Hmm, for us Westerners there's no problem, because the standard SGML
repertoire is UTF8 as it sits (right?).  You're right that your average
Japanese editing system doesn't emit UTF8... but I still think that it
might be reasonable to say that "in XML, it must be UTF8" - so the
conversion gets applied on the way in.  Question: conversions such as
{S,N,EUC}JIS<->UTF8 and Big5<->UTF8 look very easy in principle - are they, 
in practice?  And does software exist?
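
As a rough sketch of what "conversion on the way in" could look like, the
following uses the codecs that ship with Python's standard library
(shift_jis, euc_jp, big5).  It is an illustration under that assumption, not
an answer to the question above: the helper names to_utf8 and from_utf8 are
hypothetical, the sample strings are made up, and real-world data may use
vendor-specific variants of these encodings that do not round-trip cleanly.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Convert bytes in a legacy encoding to UTF-8 (conversion on the way in)."""
    return raw.decode(source_encoding).encode("utf-8")


def from_utf8(raw: bytes, target_encoding: str) -> bytes:
    """Convert UTF-8 bytes back to a legacy encoding (conversion on the way out)."""
    return raw.decode("utf-8").encode(target_encoding)


# Illustrative sample text per encoding; codec names are Python's own.
samples = {
    "shift_jis": "日本語のテキスト",
    "euc_jp": "日本語のテキスト",
    "big5": "中文文字",
}

results = {}
for encoding, text in samples.items():
    legacy = text.encode(encoding)        # what a native editor might emit
    utf8 = to_utf8(legacy, encoding)      # what would be stored
    restored = from_utf8(utf8, encoding)  # what would be handed back
    results[encoding] = (legacy, utf8, restored)
```

Both helpers raise UnicodeDecodeError or UnicodeEncodeError on input they
cannot map, so a failed conversion is reported instead of silently
corrupting the text.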

>Again, we had the same discussion in HTML-WG. There are many good
>reasons for selecting a single document character set, and then just
>looking upon SJIS and whatnot as encodings.

I think there are many good reasons - overwhelmingly good - for the SGML case. 
For XML, we should think about sweeping all the charset/encoding subtleties
under the rug: it's UTF8 and that's all there is to it.  Once again, if
it's not crystal clear - we want anyone who wants to write a program to
process XML to be able to go to one place and look at one simple set of
rules, and we want the program, once written, to have a very high degree
of probability of processing any XML doc in the world *as it sits*.  If we
get this, and as a consequence there is some proportion of documents that
XML cannot address, that's a fair price to pay.

E.g., a trade-off such as "If you *REALLY* *MUST* have the documents
in the repository in native Shift-JIS, and can't do conversions in & out, 
then you have to use real SGML, because it has the machinery for that."

Once again, I'm *really* not a monoglot cave-dwelling bigot.

Cheers, Tim Bray
tbray@textuality.com http://www.textuality.com/ +1-604-488-1167

Received on Tuesday, 10 September 1996 12:23:53 UTC