generally not in favour

(I posted this also to XML-DEV.)

1. What is XML 1.0's native-language markup for?

There is a hierarchy of suitability for native-language markup:

   - free text *must* support native scripts
   - choices presented to users *must* have it
   - things which users name *should* have it (e.g. directory names and
     filenames)
   - things which are made for regional/in-house/personal use *may* support
     it as much as possible
   - things which are named by central authorities, and over which the
     users have no control, *must* be limited to the national standard
     characters or official scripts or Latin script or English, depending
     on circumstances
   - names for things that are needed by a translingual usership (where
     these users include alphabet users) *should not* use it
   - things which are standard keywords *must not* have it (e.g. the
     "ELEMENT" keyword in XML, or "const" in C++)

The advent of XML Schemas:Datatypes has changed how we might apply these
principles (presuming we accept them).

In XML 1.0, the only way of providing enumerations was through a DTD. An
enumeration is an XML name. Therefore XML 1.0 names *must* support
native scripts thoroughly.

But now we have XML Schemas: Datatypes, and we can use it ourselves to make
our own token types.  So that removes the only *must* from our list.
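
As a rough illustration of that point, here is a minimal Python sketch. It
assumes a hypothetical simple type called "colour" whose enumeration values
are arbitrary Ethiopic-script strings (a script that XML 1.0 names do not
cover). The type name, the values and the helper functions are my own
inventions for the example, and the check is a plain membership test, not
real schema validation.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema: the type name and the values are illustrative only.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="colour">
    <xs:restriction base="xs:token">
      <xs:enumeration value="ቀይ"/>
      <xs:enumeration value="ሰማያዊ"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""


def enumeration_values(schema_text, type_name):
    """Collect the xs:enumeration values declared for one named simple type."""
    root = ET.fromstring(schema_text)
    for simple_type in root.iter(f"{{{XS}}}simpleType"):
        if simple_type.get("name") == type_name:
            return [e.get("value")
                    for e in simple_type.iter(f"{{{XS}}}enumeration")]
    return []


def is_allowed(value, schema_text=SCHEMA, type_name="colour"):
    """Membership test only; a real validator does far more than this."""
    return value in enumeration_values(schema_text, type_name)
```

The enumerated values live in attribute values, which are ordinary character
data, so they are not constrained by the XML 1.0 Name production.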

So I do not believe the proposed Blueberry changes fall into the category of
"must" (i.e. if XML is unsuitable for some end-users) but into "should"
(i.e. if XML is unsuitable for some programmers) or even "may".

Indeed, I believe some of the characters in question *must not* be allowed
as name characters.  The purpose of markup is to allow data to be clear for
humans to read.  An obscure character, or one which an ordinary programmer
(who uses the script involved) will find difficult to read, write,
pronounce or comprehend, is positively bad markup.

So, I believe there is no current urgency to make the Blueberry changes as
far as XML Name characters are concerned.  XML Schemas: Datatypes allows us
to define native-script enumerations, so there is no end-user requirement.
Obscure characters are bad markup, so there is no programmer requirement for
most of the scripts in question.

I would rather the following approach was adopted:

   An erratum to XML 1.0 2e should be published saying
   "It is not a reportable error for a character >= U+10000
   (e.g. 𐀀) to appear as a name character."
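
A minimal Python sketch of how a lenient name check along these lines might
behave. It assumes the intended range starts at U+10000 itself (the example
character above is U+10000), and it uses a deliberately tiny stand-in
predicate, not the real XML 1.0 NameChar production. Both function names
are hypothetical.

```python
def ascii_name_char(ch):
    """Stand-in predicate: NOT the real XML 1.0 NameChar production."""
    return ch.isascii() and (ch.isalnum() or ch in "-_.:")


def first_reportable_char(name, is_name_char=ascii_name_char):
    """Return the index of the first character a parser would report,
    or None if the name passes under the proposed erratum."""
    for index, ch in enumerate(name):
        if ord(ch) >= 0x10000:
            continue  # tolerated: outside the Basic Multilingual Plane
        if not is_name_char(ch):
            return index
    return None
```

An old parser patched this way would let through names that a later
revision of XML might define, while still reporting ordinary errors such as
a space inside a name.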

This opens the door for a future revision to XML
(e.g. a more thoroughgoing one) by reducing cases
where new XML documents (with naming rules
such as the ones suggested) are rejected by old
parsers.   Of course, the number of these is likely
to be almost 0, so this seems like a case of
people creating work for themselves (not
that it is a bad thing...it is important for the
right message to be given, etc.)

The issue of the IBM line-end character is
a different issue.   Personally, I think it should
be magic-ed away by entity management.
The "unnecessary translation phases before and
after XML parses and generation" are the
most straightforward way out for everyone.
If it does not meet IBM's supposed requirement
entirely, that is the price of interoperability:
XML does not guarantee round-tripping of
newline characters (it cannot, because whenever
data is sent as text/*, intermediate proxies can
change them to local conventions).
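
A small Python sketch of what such translation phases could look like,
assuming the character in question is NEL (U+0085). The function names are
hypothetical, and this is only one way an entity manager might do it.

```python
NEL = "\u0085"  # assumed to be the IBM line-end character under discussion


def to_xml_newlines(text):
    """Inbound phase: map NEL to LF before the text reaches an XML parser."""
    return text.replace(NEL, "\n")


def to_ibm_newlines(text):
    """Outbound phase: map LF to NEL after XML generation."""
    return text.replace("\n", NEL)


if __name__ == "__main__":
    original = "<a>" + NEL + "  x\n</a>"
    round_tripped = to_ibm_newlines(to_xml_newlines(original))
    # Mixed line ends do not survive: every line end comes back as NEL.
    print(round_tripped == original)
```

The round trip is lossy whenever a document mixes line-end conventions,
which is the price of interoperability mentioned above.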

Cheers
Rick Jelliffe

Received on Friday, 22 June 2001 01:43:46 UTC