Re: Fwd: I18N Last call comments on Schema Part 2 from C. M. Sperberg-McQueen on 2000-07-14 (www-xml-schema-comments@w3.org from July to September 2000)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Thu, 13 Jul 2000 22:50:56 -0600
To: "Martin J. Duerst" <duerst@w3.org>, www-xml-schema-comments@w3.org
Message-Id: <4.1.20000713161428.01895f10@tigger.cc.uic.edu>

(I should note that in this posting I am speaking only for myself
as an individual, not on behalf of any WG.)

At 14:22 00/05/31 +0900, Martin J. Duerst wrote:

>>[2] The spec says that length for strings is measured in terms of [Unicode]
>>     codepoints. This is technically correct, but it should say
>>     it's measured in terms of characters as used in the XML Recommendation
>>     (production [2]).

Why?  The relationship between XML characters and UCS characters is not
going to change soon, is it?  Adding a level of indirection won't help make
things more maintainable -- if XML characters cease to correspond to 
UCS characters, changing this passage will be the very least of our
worries.  All we would accomplish with this change is to make things
less clear to those who don't have production [2] of XML 1.0 in their
heads.
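
To make the point concrete, here is a minimal Python sketch of what
measuring length in code points means. The sample string is an assumption
chosen for illustration, not taken from the spec. Every character matched
by production [2] is exactly one code point, so both wordings give the
same count; only encoding-dependent measures differ.

```python
# Illustration only: one string, three different "lengths".
s = "na\u00efve \U0001D11E"  # "naïve " followed by U+1D11E MUSICAL SYMBOL G CLEF

code_points = len(s)                           # Python's str is a sequence of code points
utf16_units = len(s.encode("utf-16-le")) // 2  # the clef needs a surrogate pair
utf8_bytes = len(s.encode("utf-8"))            # storage size in one particular encoding

print(code_points, utf16_units, utf8_bytes)
```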

>>[4] related to [3]: XML is based on Unicode and therefore allows to
>>     represent a huge range of characters. However, XML explicitly
>>     excludes most control characters in the C0 range. There are fields
>>     in databases and programming languages that allow and potentially
>>     contain these characters. A user of XML and XML Schema has various
>>     alternatives, all not very satisfactory:
>>     1) Drop the forbidden characters
>>     2) Using XML Schema 'binary' with an encoding: This does not
>>        encode characters, but bytes, and therefore loses all i18n
>>        features of XML. There is a serious danger that this is used
>>        even when the data item in question or even the whole database
>>        does not contain a single such character.
>>     3) Invent a private convention
>>     This is a serious problem, and should be duly addressed by
>>     XML Schema.

I agree that alternatives 1 and 2 are not terribly satisfactory for
users with such characters in their data; I am less certain that 3 is
a bad idea.  Either the control characters in question have some
particular significance, in which case they can be represented as markup
(perhaps as appropriate entity references), or they have no significance
(in which case alternative 1 doesn't look so bad after all, but encoding
the characters as entity references is probably safer and would prevent
the discovery, too late, that they did mean something important
after all).
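
A minimal sketch of what alternative 3 could look like, in Python. The
`ctl` element and its `code` attribute are hypothetical names invented for
this illustration (an element is used here rather than entity references);
nothing in XML 1.0 or XML Schema defines them.

```python
import re
from xml.sax.saxutils import escape

# C0 control characters that XML 1.0 production [2] excludes
# (tab, line feed and carriage return are allowed and left alone).
_C0_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def controls_to_markup(text: str) -> str:
    """Represent each forbidden control character as a hypothetical <ctl/> element."""
    escaped = escape(text)  # escape &, < and > first so real markup stays unambiguous
    return _C0_FORBIDDEN.sub(
        lambda m: f'<ctl code="{ord(m.group()):02X}"/>', escaped
    )

print(controls_to_markup("a\x1fb & c"))
```

Because the control character survives as markup, a later consumer can
still discover that it was there and decide what it meant.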

It is not clear how a schema language can usefully address this problem.
A schema language cannot change the definition of XML 1.0.  It cannot
change the current state of telematics technology, or eliminate the risk
that control characters will be taken as requests to control the state of
a line over which a document is being sent, instead of being taken as 
part of the document -- this risk is smaller now than it used to be, but
I doubt that it has disappeared completely.  A schema language cannot change 
the fact that the control characters not allowed in XML 1.0 serve primarily
functions which have been rendered obsolete by changes in data transmission
practices or database management systems.
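
For reference, the restriction comes from production [2] (Char) of XML 1.0
itself. The following Python sketch checks a string against that
production; the function names and the sample record are assumptions made
up for this illustration.

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches XML 1.0 production [2] (Char)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def forbidden_chars(text: str) -> list:
    """Return (index, code point) pairs for characters XML 1.0 does not allow."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ch)
    ]

# Hypothetical database field containing a unit separator and a NUL.
record = "field one\x1fField two\x00"
print(forbidden_chars(record))
```

A check of this kind can report the problem, but it sits below the level
of any schema: the document is not well-formed XML 1.0 to begin with.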

So even if we take as a given the claim that the existence of such
data is a serious problem, I'm not clear what it is you think that
a schema language can do about it.  

I'm also, I should say, a little hazy on the proposition that it *is*
a serious problem.  

Can you supply any further information about the kinds of databases
or programs and programming languages which may have this problem?  I
am handicapped a little by not being able to think of any examples of
data with this problem except MARC records, which don't have the problem
because they are not (and should not be) transmitted verbatim within
XML documents.  My inability to imagine real examples may make me
less sympathetic to the problem than I would be otherwise, so if the
preceding paragraphs sound harsh, please bear with me.

>>[7] Make sure that functionality for locale-independent representation
>>     and locale-dependent information is clearly distinguished.
>>     This is the only way to assure both appropriate localization
>>     (which we consider to be very important) and worldwide data exchange.
>>     The specification is rather close to half of this goal, namely to
>>     provide locale-independent datatypes. Some serious improvements
>>     however are still possible and necessary (see below).
>>     It is clearly desirable that W3C also address locale-dependent
>>     data representations. We think that these go beyond simple datatyping/
>>     exchange issues and include:
>>     - Conversion from locale-independent to locale-dependent representation
>>     - Conversion from locale-dependent to locale-independent representation
>>     - Association between locale-dependent and locale-independent information
>>     - ...
>>     These issues therefore have to be examined on a wider level
>>     involving various groups such as the XML Schema WG, the XSL WG,
>>     the CSS WG, the XForms WG, and so on, and this should be done as
>>     soon as possible.

Support for locale-dependent representations will be made harder to add
by the i18n WG's active opposition to the proposal for defining a set of
abstract types at some level in the type hierarchy above each member of the 
current set of built-in types.  Such abstract types would provide a natural
representation, within the type hierarchy, of the relationships among
types with locale-dependent variations in their lexical spaces.  Without
such abstract types, I see no way of supporting locale-dependent lexical
spaces that does not amount to simply saying "These five types, which are
all completely independent of each other in the type hierarchy, are --
by fiat -- known to be linked."  I don't think a wise WG would accept such
a design even for five locale-dependent types, let alone for the numbers of
types needed in reality.

Since the negative reaction of the i18n WG has had, I believe, an effect
on the views of the XML Schema WG, I can only say that I believe you have
made it rather difficult for the XML Schema spec to meet the requirements
you describe here.

>>[10] Several datatypes have more than one lexical representation for
>>     a single value. This gives the impression that these lexical
>>     representations actually allow some kind of localization or
>>     variation of representation. However, as explained above,
>>     such an impression is a dangerous misunderstanding, and has
>>     to be avoided at all costs.

The variations in lexical forms accepted in the Last Call draft are, as
far as I can see, all of them related to arithmetic facts (leading
zeroes do not change the value of numbers) and the like, not to any
attempt to prefer some locale-specific form or other.  Can you explain
how anyone could form the conclusion that allowing leading zeroes or
optional plus signs is a way to allow localization?  If the multiple
lexical forms allowed for both '.' and &middot;, but not ',', as
decimal points, I could see how people might reach such a conclusion.
Are there locales which differ from each other in whether or not they 
allow leading zeroes or plus signs in integers?
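
A small Python sketch of the arithmetic point. The regular expression is
an assumed approximation of an integer lexical space (optional sign, then
ASCII digits), not the normative definition from the draft, and the
function name is hypothetical.

```python
import re

# Assumed lexical space: optional sign followed by one or more ASCII digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

def integer_value(lexical: str) -> int:
    """Map a lexical form to its value, rejecting anything outside the assumed space."""
    if not INTEGER_LEXICAL.fullmatch(lexical):
        raise ValueError(f"not in the lexical space: {lexical!r}")
    return int(lexical)

forms = ["42", "042", "+42", "+00042"]
values = {integer_value(f) for f in forms}
print(values)  # all four forms denote the same value
```

None of the accepted variations involves grouping separators, decimal
marks, or digit shapes, which are the things that actually differ between
locales.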

>>     We therefore strongly request that all duplicate lexical
>>     representations be removed. 

If any of the duplicate lexical forms appear to favor some locales
over others, then I can see why you are making this suggestion.  But
can you provide any examples of multiple lexical forms which favor
some locales over others in ways that single lexical forms do not
favor the same locales?  (If you say that allowing leading zeroes
favors those who are accustomed to writing numbers using the Western
form of the Indo-Arabic digits, for example, I will reply that the 
favoritism lies in the choice of numeric digits and not in the allowance
of the leading zero.  So removing the leading zero doesn't seem to
help remove, or even minimize, the bias toward Western locales that 
I agree it would be nice to avoid.)  Is there another reason to avoid
multiple lexical representations?

>>[11] 3.2.2 'boolean': There are currently four lexical reps. for
>>     two values. This has to be reduced to two lexical reps. The
>>     I18N WG/IG here has a clear preference:
>>     most desirable:   0/1
>>     less desirable:   true/false

0 = true, 1 = false, or vice versa?

>>[12] 3.2.3.1 'float' allows multiple representations. This must be fixed,
>>     e.g. as follows:
...

Why?  Your argument elsewhere in your comments is, I believe, that allowing
multiple lexical forms for the same value favors, or could appear to
favor, some locales over others.  (Strictly speaking, you seem to be saying
that allowing multiple lexical forms might make it easier for humans to
read the data, and that you'd like to keep that at a minimum, so as to
force the creators of user interfaces to translate from a hideous
interchange form to a more readable display form.  This clearly should apply 
only when a readable interchange form is biased in favor of a particular
locale.)  What locale is favored by allowing both "1" and "1E0" as lexical
forms for the same floating-point value?

I note in passing that your position on lexical forms seems to imply a model
of processing which applies only in those cases where a value may legitimately
be transformed into a different representation without loss of integrity.
This is often true with databases and forms data.  It is almost never true
for documents.  Your position on lexical representations, that is, seems to
me to be based on the false assumption that schemas are relevant only for
database-type data, and not for documents.  It would be a grave mistake to
allow such a wrong-headed belief to influence the design.
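
Both points can be seen in a few lines of Python. This is only a sketch:
Python's float() is a stand-in that accepts some forms a schema processor
would not, and the sample forms are assumptions for illustration.

```python
lexical_forms = ["1", "1E0", "1.0", "10E-1"]

# Several lexical forms, one value: nothing locale-specific distinguishes them.
values = [float(form) for form in lexical_forms]

# Serializing the value again yields a single form, so the author's
# original spelling is lost in the round trip.
round_tripped = [repr(v) for v in values]

for original, again in zip(lexical_forms, round_tripped):
    print(f"{original!r} -> {again!r}")
```

For a database field the round trip is harmless. For a document, where
the characters the author wrote are part of the record, it is a silent
change to the text.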

>>     [Some people may claim that e.g. the free choice of exponent or the
>>      use of leading digits is necessary to be able to mark up existing
>>      data; we would like to point out that if such claims should be made,
>>      we would have to request that not only such variations, but also
>>      other variations, e.g. due to the use of a different series of
>>      digits (Arabic-Indic, Devanagari,... Thai,..., Tibetan,...,
>>      ideographic,...) and so on be dealt with at the same level.]

The ability to provide such support in the long term is one motivation for 
the proposal for a set of abstract types, which the i18n WG has opposed.

In the short term, I would have guessed that the volume of electronic legacy 
data in the forms now allowed would be significantly higher than the volume of
legacy data in the other forms known to humankind, by a margin large enough
to motivate support for the current set of lexical representations as an
80/20 engineering choice.
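
To show what mapping other digit series onto the same value space would
involve, here is a hedged Python sketch using the Unicode decimal-digit
property. The function name and sample strings are assumptions made for
illustration; nothing here is proposed by either WG.

```python
import unicodedata

def decimal_value(digits: str) -> int:
    """Interpret a string of Unicode decimal digits (any one series) as an integer."""
    value = 0
    for ch in digits:
        # unicodedata.decimal raises ValueError for characters that are not decimal digits
        value = value * 10 + unicodedata.decimal(ch)
    return value

samples = {
    "Western": "2000",
    "Arabic-Indic": "\u0662\u0660\u0660\u0660",
    "Devanagari": "\u0968\u0966\u0966\u0966",
    "Thai": "\u0e52\u0e50\u0e50\u0e50",
}
for name, text in samples.items():
    print(name, decimal_value(text))
```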

>>[24] ISO 8601 is based on the Gregorian calendar, but there
>>     seems to be no indication as to whether this is applicable
>>     before 1582, nor how exactly it would be applied. Also,
>>     it is unclear how far into the future the Gregorian calendar
>>     will be used without corrections. A representation purely
>>     based on days and seconds would avoid these problems; if
>>     this is not possible, then the spec needs some additional
>>     explanations or references.

The Gregorian calendar is a notation for writing dates.  The range
of its applicability or expressive power is independent of the 
dates at which the calendar was adopted by civil or other authorities
in various regions.  This point applies both with regard to the past
(of course the Gregorian calendar provides notations for dates
before 1582 -- it was designed, after all, to ensure that its notations
and those of the Julian calendar would match for dates in the
fourth century, which would have been difficult to manage if it
had no notations for the fourth century) and with regard to the 
future (the future adoption of a different calendar with different
correction practices will not render the Gregorian calendar
meaningless or ill-defined, any more than the current widespread
use of the Gregorian calendar means that the Julian, Revolutionary,
Mayan, or Roman calendars have no notation for the day we write
as 14 July 2000).
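
A short Python sketch of the "calendar as notation" point. Python's
datetime.date uses the Gregorian calendar extended indefinitely in both
directions, and the two helper functions (hypothetical names) use the
standard integer-arithmetic conversions to Julian Day Numbers, so that
the same day can be named in either calendar.

```python
from datetime import date, timedelta

def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a (proleptic) Gregorian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

def julian_to_jdn(year: int, month: int, day: int) -> int:
    """Julian Day Number of a Julian calendar date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

# The Gregorian notation has no gap in 1582: day follows day as usual.
day_after = date(1582, 10, 4) + timedelta(days=1)
print(day_after.isoformat())

# A fourth-century date is perfectly writable.
print(date(325, 1, 1).isoformat())

# One day, two notations: 14 July 2000 (Gregorian) is 1 July 2000 (Julian).
print(gregorian_to_jdn(2000, 7, 14), julian_to_jdn(2000, 7, 1))
```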

>>[27] The lexical representation of 'hex' encoding (2.4.2.12)
>>     must be changed to allow only one case (e.g. only upper
>>     case) for hex digits.

Why?  What possible advantage accrues to anyone from such a change?
Is it easier to write?  Is it easier to read? Is it easier to
parse and process? Is it less locale-dependent?  I don't know enough
about locales to be sure about the last question (I am not aware
of any culture in which hex notation as we define it is a native
cultural tradition, unless IBM mainframe systems geeks count as
a distinctive culture), but the answer to the first three questions
is quite simply No.
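
On the parsing question specifically, a minimal Python illustration (the
sample value is an assumption): a stock hex decoder already accepts either
case, and producing a single-case form on output is one method call.

```python
upper = bytes.fromhex("0FB7")
lower = bytes.fromhex("0fb7")
mixed = bytes.fromhex("0Fb7")

# All three lexical forms decode to the same two octets.
print(upper == lower == mixed)

# A writer that wants one case can simply pick it when serializing.
canonical = upper.hex().upper()
print(canonical)
```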

In general, I have to say that while the i18n WG has been quite clear
in expressing its desire to have only a single lexical representation
for each value in the value space of a built-in type, you have been
somewhat less clear in explaining how such a rule would, in practice
or in theory, assist in furthering the ability of human beings to
read and write data in the forms common in their culture, or the
ability of software developers to write software that can be localized
or internationalized more easily, or any of the other interests
associated with internationalization.

As I understand the account you give of this issue in your point [8],
you would like the transfer form for every type to be as unreadable 
as possible, in order to ensure that ALL systems are built to include
a translation between the transfer syntax and the form displayed to
the user.  As you may know, Donald Knuth designed the output of 
his TANGLE processor to be unreadable by humans, in order to encourage 
programmers to maintain the .WEB documents, not the TANGLE output.  If
I understand your position, it is very similar (mutatis mutandis) to 
Knuth's.  Do I understand your position, or have I misunderstood?

As I mentioned above, I think the view I've just attributed to you is too 
narrow a view of the applications to be supported by XML Schema -- in 
particular, it seems either to have no place for documents, or else to 
countenance behavior in connection with document display which I believe 
is not always appropriate, and is in some cases simply illegitimate.

Could you expound?  

>>[39] Upgrade the reference to ISO 10646 to the year 2000 version,
>>     removing the reference to the amendments.
>>
>>[40] Upgrade the reference to Unicode to version 3.0.

Since we are required by charter to apply to XML 1.0, it is not clear
to me that it's a good idea to refer to versions different from those
to which XML 1.0 refers.  It appears rather that it would be a bad idea.

-C. M. Sperberg-McQueen
Received on Friday, 14 July 2000 00:48:46 UTC