- From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
- Date: Thu, 13 Jul 2000 22:50:56 -0600
- To: "Martin J. Duerst" <duerst@w3.org>, www-xml-schema-comments@w3.org
(I should note that in this posting I am speaking only for myself as
an individual, not on behalf of any WG.)

At 14:22 00/05/31 +0900, Martin J. Duerst wrote:

>>[2] The spec says that length for strings is measured in terms of [Unicode]
>> codepoints. This is technically correct, but it should say
>> it's measured in terms of characters as used in the XML Recommendation
>> (production [2]).

Why? The relationship between XML characters and UCS characters is not
going to change soon, is it? Adding a level of indirection won't help
make things more maintainable -- if XML characters cease to correspond
to UCS characters, changing this passage will be the very least of our
worries. All we would accomplish with this change is to make things
less clear to those who don't have production [2] of XML 1.0 in their
heads.

>>[4] related to [3]: XML is based on Unicode and therefore allows one to
>> represent a huge range of characters. However, XML explicitly
>> excludes most control characters in the C0 range. There are fields
>> in databases and programming languages that allow and potentially
>> contain these characters. A user of XML and XML Schema has various
>> alternatives, none of them very satisfactory:
>> 1) Drop the forbidden characters
>> 2) Using XML Schema 'binary' with an encoding: This does not
>> encode characters, but bytes, and therefore loses all i18n
>> features of XML. There is a serious danger that this is used
>> even when the data item in question or even the whole database
>> does not contain a single such character.
>> 3) Invent a private convention
>> This is a serious problem, and should be duly addressed by
>> XML Schema.

I agree that alternatives 1 and 2 are not terribly satisfactory for
users with such characters in their data; I am less certain that 3 is
a bad idea. Either the control characters in question have some
particular significance, in which case they can be represented as
markup (perhaps as appropriate entity references), or they have no
significance (in which case alternative 1 doesn't look so bad after
all, but encoding the characters as entity references is probably
safer and would prevent the discovery, too late, that they did mean
something important after all).

It is not clear how a schema language can usefully address this
problem. A schema language cannot change the definition of XML 1.0.
It cannot change the current state of telematics technology, or
eliminate the risk that control characters will be taken as requests
to control the state of a line over which a document is being sent,
instead of being taken as part of the document -- this risk is smaller
now than it used to be, but I doubt that it has disappeared
completely. A schema language cannot change the fact that the control
characters not allowed in XML 1.0 serve primarily functions which have
been rendered obsolete by changes in data transmission practices or
database management systems.

So even if we take as a given the claim that the existence of such
data is a serious problem, I'm not clear what it is you think a schema
language can do about it.

I'm also, I should say, a little hazy on the proposition that it *is*
a serious problem. Can you supply any further information about the
kinds of databases, programs, and programming languages which may have
this problem? I am handicapped a little by not being able to think of
any examples of data with this problem except MARC records, which
don't have the problem because they are not (and should not be)
transmitted verbatim within XML documents.
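For what it's worth, a convention of the sort alternative 3
contemplates need not be elaborate. The following sketch (Python, with
an escape token of my own invention, purely for illustration; note
that it makes no provision for the token occurring literally in the
data) round-trips the forbidden characters through an XML-safe form:

    import re

    # XML 1.0 forbids the C0 controls other than TAB (0x09), LF (0x0A),
    # and CR (0x0D); these are the characters a private convention
    # would have to smuggle through.
    FORBIDDEN = {chr(c) for c in range(0x20)} - {'\t', '\n', '\r'}
    TOKEN = re.compile(r'\{ctl-([0-9A-F]{2})\}')

    def escape_controls(text):
        # Replace each forbidden control with a private '{ctl-NN}' token.
        return ''.join('{ctl-%02X}' % ord(c) if c in FORBIDDEN else c
                       for c in text)

    def unescape_controls(text):
        # Reverse the convention after the document has been parsed.
        return TOKEN.sub(lambda m: chr(int(m.group(1), 16)), text)

    record = 'field one\x1cfield two\x1d'   # 0x1C, 0x1D: legacy separators
    assert unescape_controls(escape_controls(record)) == record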
My inability to imagine real examples may make me less sympathetic to
the problem than I would be otherwise, so if the preceding paragraphs
sound harsh, please bear with me.

>>[7] Make sure that functionality for locale-independent representation
>> and locale-dependent information is clearly distinguished.
>> This is the only way to assure both appropriate localization
>> (which we consider to be very important) and worldwide data exchange.
>> The specification is rather close to half of this goal, namely to
>> provide locale-independent datatypes. Some serious improvements
>> however are still possible and necessary (see below).
>> It is clearly desirable that W3C also address locale-dependent
>> data representations. We think that these go beyond simple datatyping/
>> exchange issues and include:
>> - Conversion from locale-independent to locale-dependent representation
>> - Conversion from locale-dependent to locale-independent representation
>> - Association between locale-dependent and locale-independent information
>> - ...
>> These issues therefore have to be examined on a wider level
>> involving various groups such as the XML Schema WG, the XSL WG,
>> the CSS WG, the XForms WG, and so on, and this should be done as
>> soon as possible.

Support for locale-dependent representations will be made harder to
add by the i18n WG's active opposition to the proposal for defining a
set of abstract types at some level in the type hierarchy above each
member of the current set of built-in types. Such abstract types
would provide a natural representation, within the type hierarchy, of
the relationships among types with locale-dependent variations in
their lexical spaces. Without such abstract types, I see no way of
supporting locale-dependent lexical spaces that does not amount to
simply saying "These five types, which are all completely independent
of each other in the type hierarchy, are -- by fiat -- known to be
linked." I don't think a wise WG would accept such a design even for
five locale-dependent types, let alone for the numbers of types
needed in reality.

Since the negative reaction of the i18n WG has had, I believe, an
effect on the views of the XML Schema WG, I can only say that I
believe you have made it rather difficult for the XML Schema spec to
meet the requirements you describe here.
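To make the shape of the rejected proposal concrete, here is a sketch
of the relationship such abstract types would record (Python, with
invented names; nothing here is XML Schema syntax): a single value
space, with locale-dependent lexical spaces as siblings under one
abstract ancestor, rather than as unrelated types linked by fiat.

    from abc import ABC, abstractmethod
    from decimal import Decimal

    class AbstractDecimal(ABC):
        # One shared value space; subtypes differ only in lexical space.
        @staticmethod
        @abstractmethod
        def parse(lexical):
            ...

    class PointDecimal(AbstractDecimal):
        # Lexical space with '.' as the decimal separator: '1234.56'
        @staticmethod
        def parse(lexical):
            return Decimal(lexical)

    class CommaDecimal(AbstractDecimal):
        # Lexical space with ',' as the decimal separator: '1234,56'
        @staticmethod
        def parse(lexical):
            return Decimal(lexical.replace(',', '.'))

    # The hierarchy itself records that the two types denote the
    # same values.
    assert PointDecimal.parse('1234.56') == CommaDecimal.parse('1234,56')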
>>[10] Several datatypes have more than one lexical representation for
>> a single value. This gives the impression that these lexical
>> representations actually allow some kind of localization or
>> variation of representation. However, as explained above,
>> such an impression is a dangerous misunderstanding, and has
>> to be avoided at all costs.

The variations in lexical forms accepted in the Last Call draft are,
as far as I can see, all of them related to arithmetic facts (leading
zeroes do not change the value of numbers) and the like, not to any
attempt to prefer some locale-specific form or other. Can you explain
how anyone could form the conclusion that allowing leading zeroes or
optional plus signs is a way to allow localization? If the multiple
lexical forms allowed both '.' and '·', but not ',', as decimal
points, I could see how people might reach such a conclusion. Are
there locales which differ from each other in whether or not they
allow leading zeroes or plus signs in integers?

>> We therefore strongly request that all duplicate lexical
>> representations be removed.

If any of the duplicate lexical forms appear to favor some locales
over others, then I can see why you are making this suggestion. But
can you provide any examples of multiple lexical forms which favor
some locales over others in ways that single lexical forms do not
favor the same locales? (If you say that allowing leading zeroes
favors those who are accustomed to writing numbers using the Western
form of the Indo-Arabic digits, for example, I will reply that the
favoritism lies in the choice of numeric digits and not in the
allowance of the leading zero. So removing the leading zero doesn't
seem to help remove, or even minimize, the bias toward Western
locales that I agree it would be nice to avoid.) Is there another
reason to avoid multiple lexical representations?

>>[11] 3.2.2 'boolean': There are currently four lexical reps. for
>> two values. This has to be reduced to two lexical reps. The
>> I18N WG/IG here has a clear preference:
>> most desirable: 0/1
>> less desirable: true/false

0 = true, 1 = false, or vice versa?

>>[12] 3.2.3.1 'float' allows multiple representations. This must be fixed,
>> e.g. as follows: ...

Why? Your argument elsewhere in your comments is, I believe, that
allowing multiple lexical forms for the same value favors, or could
appear to favor, some locales over others. (Strictly speaking, you
seem to be saying that allowing multiple lexical forms might make it
easier for humans to read the data, and that you'd like to keep that
to a minimum, so as to force the creators of user interfaces to
translate from a hideous interchange form to a more readable display
form. This clearly should apply only when a readable interchange form
is biased in favor of a particular locale.) What locale is favored by
allowing both "1" and "1E0" as lexical forms for the same
floating-point value?

I note in passing that your position on lexical forms seems to imply
a model of processing which applies only in those cases where a value
may legitimately be transformed into a different representation
without loss of integrity. This is often true of databases and forms
data. It is almost never true of documents. Your position on lexical
representations, that is, seems to me to be based on the false
assumption that schemas are relevant only for database-type data, and
not for documents. It would be a grave mistake to allow such a
wrong-headed belief to influence the design.

>> [Some people may claim that e.g. the free choice of exponent or the
>> use of leading digits is necessary to be able to mark up existing
>> data; we would like to point out that if such claims should be made,
>> we would have to request that not only such variations, but also
>> other variations, e.g. due to the use of a different series of
>> digits (Arabic-Indic, Devanagari,... Thai,..., Tibetan,...,
>> ideographic,...) and so on be dealt with at the same level.]

The ability to provide such support in the long term is one
motivation for the proposal for a set of abstract types, which the
i18n WG has opposed. In the short term, I would have guessed that the
volume of electronic legacy data in the forms now allowed would be
significantly higher than the volume of legacy data in the other
forms known to humankind, by a margin large enough to motivate
support for the current set of lexical representations as an 80/20
engineering choice.
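The point about value identity is easy to demonstrate (here Python's
built-in parsing merely stands in for a schema processor; a sketch,
not a claim about any particular implementation):

    # Leading zeroes, optional plus signs, and optional exponents are
    # facts of arithmetic notation, not localization devices: they
    # all collapse to a single value at parse time.
    assert int('7') == int('07') == int('+7') == 7
    assert float('1') == float('1E0') == float('0.1e1')

    # Unicode-aware parsing already accepts other series of decimal
    # digits (Devanagari and Arabic-Indic shown here), so banning
    # '07' would do nothing about the deeper bias toward Western
    # digit forms.
    assert int('१२३') == 123
    assert int('٤٢') == 42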
>>[24] ISO 8601 is based on the Gregorian calendar, but there
>> seems to be no indication as to whether this is applicable
>> before 1582, nor how exactly it would be applied. Also,
>> it is unclear how far into the future the Gregorian calendar
>> will be used without corrections. A representation purely
>> based on days and seconds would avoid these problems; if
>> this is not possible, then the spec needs some additional
>> explanations or references.

The Gregorian calendar is a notation for writing dates. The range of
its applicability or expressive power is independent of the dates at
which the calendar was adopted by civil or other authorities in
various regions. This point applies both with regard to the past (of
course the Gregorian calendar provides notations for dates before
1582 -- it was designed, after all, to ensure that its notations and
those of the Julian calendar would match for dates in the fourth
century, which would have been difficult to manage if it had no
notations for the fourth century) and with regard to the future (the
future adoption of a different calendar with different correction
practices will not render the Gregorian calendar meaningless or
ill-defined, any more than the current widespread use of the
Gregorian calendar means that the Julian, Revolutionary, Mayan, or
Roman calendars have no notation for the day we write as 14 July
2000).

>>[27] The lexical representation of 'hex' encoding (2.4.2.12)
>> must be changed to allow only one case (e.g. only upper
>> case) for hex digits.

Why? What possible advantage accrues to anyone from such a change? Is
it easier to write? Is it easier to read? Is it easier to parse and
process? Is it less locale-dependent? I don't know enough about
locales to be sure about the last question (I am not aware of any
culture in which hex notation as we define it is a native cultural
tradition, unless IBM mainframe systems geeks count as a distinctive
culture), but the answer to the first three questions is quite simply
No.
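The parsing half of that answer is easy to demonstrate (again with
Python's built-ins standing in, as a sketch only): case in hex digits
carries no information, and parsers treat both cases, and even
mixtures of them, identically.

    # Hex parsing is case-insensitive; mandating one case buys nothing.
    assert int('ff', 16) == int('FF', 16) == 255
    assert bytes.fromhex('deadBEEF') == bytes.fromhex('DEADbeef')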
In general, I have to say that while the i18n WG has been quite clear
in expressing its desire to have only a single lexical representation
for each value in the value space of a built-in type, you have been
somewhat less clear in explaining how such a rule would, in practice
or in theory, assist in furthering the ability of human beings to
read and write data in the forms common in their culture, or the
ability of software developers to write software that can be
localized or internationalized more easily, or any of the other
interests associated with internationalization.

As I understand the account you give of this issue in your point [8],
you would like the transfer form for every type to be as unreadable
as possible, in order to ensure that ALL systems are built to include
a translation between the transfer syntax and the form displayed to
the user. As you may know, Donald Knuth designed the output of his
TANGLE processor to be unreadable by humans, in order to encourage
programmers to maintain the .WEB documents, not the TANGLE output. If
I understand your position, it is very similar (mutatis mutandis) to
Knuth's. Do I understand your position, or have I misunderstood?

As I mentioned above, I think the view I've just attributed to you is
too narrow a view of the applications to be supported by XML Schema
-- in particular, it seems either to have no place for documents, or
else to countenance behavior in connection with document display
which I believe is not always appropriate, and is in some cases
simply illegitimate. Could you expound?

>>[39] Upgrade the reference to ISO 10646 to the year 2000 version,
>> removing the reference to the amendments.
>>
>>[40] Upgrade the reference to Unicode to version 3.0.

Since we are required by charter to apply to XML 1.0, it is not clear
to me that it's a good idea to refer to versions different from those
to which XML 1.0 refers. It appears rather that it would be a bad
idea.

-C. M. Sperberg-McQueen