Re: Concrete syntax, character sets
>1. Document *data* is (mostly) for people to read, and thus of course
> has to support the languages they write in. Document *markup* is
> (mostly) for computer programs to read, plus the occasional unfortunate
> document designer. Given that these things are already monocased,
> and by industry habit that I doubt XML will break, in short, it's not
> clear that expressing GI's & attribute names in Cyrillic or Chinese is all
> that important to the market.
For document designers, my experience has been that about 50% of the
Japanese people I talk to wish for Japanese markup. The people who are
happy with ASCII markup usually feel that it is better for
interoperability. However, the people who want native-language markup
usually cite usability as the prime reason: it is much more
understandable to have "bunsho" in a Japanese document, and in
stylesheets it becomes even more desirable, they say.
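For what it's worth, XML as it is shaping up does permit ideographic
characters in element and attribute names, so the "bunsho" case is
expressible directly. A minimal sketch (my own example, not from the
original poster; 文書 "bunsho" = document, 題名 = title, 本文 = body):

```python
# Parse a document whose element and attribute names are Japanese.
# Standard XML name rules admit ideographic characters, so this is legal.
import xml.etree.ElementTree as ET

doc = "<文書 題名='例'>本文</文書>"
root = ET.fromstring(doc)
assert root.tag == "文書"
assert root.attrib["題名"] == "例"
assert root.text == "本文"
```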
For Japanese, it is not an overly large problem, because they have a
phonetic spelling of Japanese that uses ASCII (romaji), but for other
languages, ASCII phonetics as markup don't win.
>2. Supporting bigger & more complex encodings in markup brings the benefit
> of making life easier & friendlier for document designers who want to
>  use them. Restricting the markup character set down to 7 bits brings
> the benefit of making it quicker & easier to generate software that
> processes such markup. If I didn't already think that the second
> of these two incompatible benefits was more important, I wouldn't
> be working on XML.
This is a fallacy. If you are going to support native-language
content, you will have to have some way of decoding the octet stream
in order to correctly parse the document (otherwise you run into
problems with bits of character codes that could be mistaken for
markup delimiters). If you have a decoding module on the stream (or a
bit-combination transformation filter), you will also be able to
support native-language markup.
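The classic illustration of the decoding problem (my example, not the
poster's) is Shift_JIS, where the second octet of a double-byte
character may land in the ASCII range. A scanner that hunts for
delimiter octets without decoding first will find false hits inside
characters:

```python
# In Shift_JIS, the kanji 表 (U+8868) encodes as the two octets
# 0x95 0x5C -- and 0x5C is ASCII backslash. An encoding-unaware
# scanner working on raw octets would "see" a backslash that is
# really half of a character; decoding first removes the false hit.
raw = "表".encode("shift_jis")
assert raw == b"\x95\x5c"          # trailing octet is ASCII 0x5C
assert b"\\" in raw                # naive octet scan finds a backslash
assert "\\" not in raw.decode("shift_jis")  # decoded text has none
```

The same reasoning applies to any delimiter character: once the stream
must be decoded to parse content safely, the cost of also allowing
native-language markup is marginal.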