- From: Martin J. Duerst <duerst@w3.org>
- Date: Wed, 31 May 2000 14:22:49 +0900
- To: www-xml-schema-comments@w3.org
Forwarded, on request of C. M. Sperberg-McQueen.

>Date: Mon, 29 May 2000 18:57:09 +0900
>From: "Martin J. Duerst" <duerst@w3.org>
>Subject: I18N Last call comments on Schema Part 2
>
>Dear Schema WG,
>
>[This mail is crossposted to the I18N IG to allow for further discussion.
>Please feel free to forward these comments to another list, including
>a public list, but please make sure that you don't reveal the mail
>addresses of the various groups.]
>
>These are the last call comments on XML Schema Part 2: Datatypes
>from the I18N WG/IG.
>
>The comments are numbered by [n], but their order does not
>reflect their importance.
>
>
>[1] The definition of 'match' has been copied from XML 1.0. There
>    are proposals for clarifying XML 1.0. The Schema WG should
>    work together with the XML Core WG and the I18N IG to
>    make sure everything is in sync.
>
>[2] The spec says that length for strings is measured in terms of [Unicode]
>    codepoints. This is technically correct, but it should say
>    it's measured in terms of characters as used in the XML Recommendation
>    (production [2]).
>
>[3] In 2.4.2.12, it says 'For example, "20" is the hex encoding for the
>    US-ASCII space character'. It should say something like '"20" encodes
>    a byte value represented e.g. in C as 0x20, which may stand for the
>    space character if US-ASCII (or UTF-8) is used to encode it.'
>    But actually this is a bad example, because encoding text with
>    base64 is a bad idea and is against the spirit of XML.
>
>[4] related to [3]: XML is based on Unicode and can therefore
>    represent a huge range of characters. However, XML explicitly
>    excludes most control characters in the C0 range. There are fields
>    in databases and programming languages that allow and potentially
>    contain these characters.
>    A user of XML and XML Schema has various
>    alternatives, none of them very satisfactory:
>    1) Drop the forbidden characters
>    2) Use XML Schema 'binary' with an encoding: This does not
>       encode characters, but bytes, and therefore loses all i18n
>       features of XML. There is a serious danger that this is used
>       even when the data item in question or even the whole database
>       does not contain a single such character.
>    3) Invent a private convention
>    This is a serious problem, and should be duly addressed by
>    XML Schema.
>    [There is a related problem with respect to names (GIs in
>    SGML terminology), but this is more an XML 1.0 problem than
>    an XML Schema problem, and there is no danger of losing all
>    i18n information just because of a single character.]
>
>[5] related to [4]: 3.2.1 seems to allow all Unicode/ISO 10646 characters;
>    this is not true (see [4]).
>
>[6] 3.2.1: Expand 'UCS' to Universal Character Set.
>
>[7] Make sure that functionality for locale-independent representation
>    and locale-dependent information is clearly distinguished.
>    This is the only way to assure both appropriate localization
>    (which we consider to be very important) and worldwide data exchange.
>    The specification is rather close to half of this goal, namely to
>    provide locale-independent datatypes. Some serious improvements
>    however are still possible and necessary (see below).
>    It is clearly desirable that W3C also address locale-dependent
>    data representations. We think that these go beyond simple datatyping/
>    exchange issues and include:
>    - Conversion from locale-independent to locale-dependent representation
>    - Conversion from locale-dependent to locale-independent representation
>    - Association between locale-dependent and locale-independent information
>    - ...
>    These issues therefore have to be examined on a wider level
>    involving various groups such as the XML Schema WG, the XSL WG,
>    the CSS WG, the XForms WG, and so on, and this should be done as
>    soon as possible.
>
>    We would like to repeat that any mixup between locale-independent
>    and locale-dependent data representation will lead to confusion and
>    will hurt, and not benefit, internationalization and localization.
>    (This point is further addressed in detail in some of the points below:
>    [8], [9], [10]-[16], [20]).
>
>[8] Say explicitly in the specification and in the primer that the lexical
>    representations you provide for various datatypes (in particular
>    things such as date, numbers,...) are designed for locale-independent
>    data exchange, and that they are inappropriate for locale-dependent
>    data representation. In the primer, an example such as
>    <date value='2000-05-16'>Tuesday, 16th of May, 2000</date>
>    (or even just something like
>    <date value='2000-05-16'>next Tuesday</date>)
>    with value defined as a date and the <date> content as string, would
>    help.
>    Also, explicitly warn that where there is some similarity between
>    localized representations and the locale-independent representation,
>    this must not be exploited when presenting the data to a user,
>    and that similarities are due to
>    - Having to choose *some* kind of representation
>    - Making this representation somewhat manageable in raw text
>      for when raw text is needed (debugging, plain text editing,...)
>    and that the fact that some representations are more similar to
>    some locales than others is accepted reluctantly, and is not meant to
>    disadvantage certain users. [Indeed, where possible, we would prefer
>    representations that avoid any similarity to any existing locale.]
>
>[9] As said above and explained below, addressing localized representations
>    as a whole is a huge problem.
>    The one contribution that seems most
>    appropriate and relevant from XML Schema is to associate
>    locale-independent and locale-dependent representations. Taking the
>    example above,
>    <date value='2000-05-16'>Tuesday, 16th of May, 2000</date>,
>    the association between the locale-independent 'value' and the
>    locale-dependent element content is implicit; XML Schema should
>    provide a way to make this association explicit. Including in the
>    association some way to indicate the local format used / the conversion
>    functions necessary seems also desirable, although we are not yet
>    aware of an interoperable way to do so.
>
>[10] Several datatypes have more than one lexical representation for
>     a single value. This gives the impression that these lexical
>     representations actually allow some kind of localization or
>     variation of representation. However, as explained above,
>     such an impression is a dangerous misunderstanding, and has
>     to be avoided at all costs.
>     We therefore strongly request that all duplicate lexical
>     representations be removed. The following points ([11]-[16], [20],
>     [22], [27]) give details for each affected datatype. For each
>     datatype, we indicate where duplicate representations exist, and
>     how they may be removed. Unless otherwise indicated, we do
>     not have any particular preferences of how to remove the
>     duplicates; we just explain one way to do so to allow you
>     to reuse the analysis we (mostly Mark Davis) have already done.
>     We would like to point out that reducing the lexical representations
>     to a single one for each value also makes using digital signatures
>     on such data a lot easier, and to a large extent and at very
>     little cost, avoids the creation of another WG and spec like
>     in the case of XML Canonicalization.
>
>[11] 3.2.2 'boolean': There are currently four lexical reps. for
>     two values. This has to be reduced to two lexical reps.
>     The I18N WG/IG here has a clear preference:
>       most desirable: 0/1
>       less desirable: true/false
>       clearly absolutely undesirable: 0/1/true/false
>
>[12] 3.2.3.1 'float' allows multiple representations. This must be fixed,
>     e.g. as follows:
>
>     Float values have a single standard lexical representation consisting
>     of a mantissa, followed by the character "E" (upper case only),
>     followed by an exponent. The exponent must be an integer. The
>     mantissa must be a decimal number. The representations for exponent
>     and mantissa must follow the lexical rules for integer and decimal
>     numbers discussed above[below?]. The absolute value of the mantissa
>     must be either zero, or greater than or equal to 1 and less than 10.
>     If the mantissa is zero, then the exponent must be zero. For example:
>     Valid: "-1.23E5", "9.9999E14", "1.0000001E-14", "0E0", "1E0"
>     Invalid: "+1.23E5", "100000.0E3", "1.0E3", "1.0E0", "012.E3", "0E1"
>     [This leaves one issue open, namely the issue of too high precision.
>     One way to solve this is to define that the lexical rep. chosen is
>     the one with the shortest lexical rep of the mantissa that corresponds
>     to the desired value according to [Clinger/Gay], or if two lexical
>     reps with the same shortest mantissa correspond, then the closer
>     one should be chosen, and if both are equally close, then the
>     one with an even end digit is chosen. [This should cover all cases,
>     but there may be more accurate or more easy to calculate alternatives,
>     and this should be checked by experts.]]
>     [Some people may claim that e.g. the free choice of exponent or the
>     use of leading digits is necessary to be able to mark up existing
>     data; we would like to point out that if such claims should be made,
>     we would have to request that not only such variations, but also
>     other variations, e.g. due to the use of a different series of
>     digits (Arabic-Indic, Devanagari,... Thai,..., Tibetan,...,
>     ideographic,...)
>     and so on be dealt with at the same level.]
>
>[13] 3.2.4.1 'double' allows multiple representations. This must be fixed.
>     The solution outlined in [12] can be applied.
>
>[14] 3.2.5.1 'decimal' allows multiple representations. This must be fixed,
>     e.g. as follows:
>
>     Decimal values have a single, unique, lexical representation. This
>     consists of a string of digits (x30 to x39) with a period (x2E) as a
>     decimal indicator (in accordance with the scale and precision facets),
>     and a leading minus sign (x2D) to indicate a negative number. The
>     decimal indicator must be omitted if there are no fraction digits.
>     Leading and trailing zeros are illegal, except for zero itself (which
>     is written as "0"). For example:
>     Valid: "-1.23", "100000", "12678967.543233", "0"
>     Invalid: "+1.23", "100000.0", "12,678,967.543233",
>              "0.0", "012."
>
>[15] Lexical representation of derived datatypes: The lexical
>     representation of all datatypes derived (directly or indirectly)
>     from 'decimal' (13 types from 'integer' to 'positiveInteger')
>     must be changed to be unique. The easiest and most consistent way
>     to do this is to just specify for each datatype that the lexical
>     representation for all the values of the type is the same as for
>     'decimal'.
>     If you want to be specific, you can find some details at:
>     http://lists.w3.org/Archives/Member/w3c-i18n-wg/1999Nov/0007.html
>     (members only). In any case, disallowing a '+' (done on some types,
>     but not consistently) and disallowing leading zeroes should do
>     the job.
>
>[16] For elementary types, there may be a desire to allow whitespace
>     around the actual data. To be clear, the spec should explicitly
>     say that this is disallowed (except for cases where it has to
>     be allowed for XML/SGML conformance, i.e. ENTITY, ID,...).
>     Another way of expressing this comment is to say that the
>     spec should make clear for which datatypes CDATA attribute-value
>     normalization should be chosen, and for which datatypes not.
>
>[17] The time-related datatypes (timeDuration and recurringDuration
>     and derived datatypes) need to be redesigned to avoid a number
>     of serious problems. For details, please see points [18]-[25].
>
>[18] The specification assumes that usual arithmetic can be done
>     with TimePeriod, but due to the representation chosen, this
>     is not the case. For example, it is absolutely unclear which of
>     P3.01M or P90.5D is greater, or whether they are equal. There are
>     two ways to solve this, either to choose a different representation
>     or to remove orderedness and min/maxIn/Exclusive. The former
>     is clearly desirable for additional reasons; please see [19].
>
>[19] The use of culture-specific time length units is highly problematic.
>     This in particular applies to years and months in timeDuration.
>     Various calendars use different month and year lengths; the main
>     distinction being the one between lunar calendars and solar calendars.
>     The Islamic, Hebrew, and Chinese months and years, for example, are
>     all different from the corresponding western units.
>     A system either has to be able to represent these units in all
>     calendars (extremely difficult) or should be limited to representations
>     that are to an extremely high degree culturally neutral. In order
>     to deal with [18], too, we propose to do the latter.
>
>[20] Unique representation of timeDuration: There must be only one
>     lexical representation for each timeDuration. This can be achieved
>     as follows:
>     Based on the representation of ISO 8601, only PnDTnHnMnS is used
>     (i.e. no years or months). If any unit is zero, the number and the
>     letter after it are removed, except for the zero duration, which
>     is represented as P0D. If only Days are present, 'T' is omitted.
>     Overflows in lower units have to be converted to higher units
>     (i.e. PT24H -> P1D, PT60M -> PT1H, PT60S -> PT1M; except for
>     leap second cases). Decimal fractions are only allowed for
>     seconds, and do not allow trailing zeroes.
>     [A serious alternative to this would be to remove timeDuration
>     altogether.]
>
>[21] The problems with timeDuration ([18]-[20]) heavily affect
>     recurringDuration and all datatypes derived from it. In addition
>     to the arguments above, recurringDuration is clearly of very limited
>     use even for the areas of the world that use the Gregorian calendar
>     for all their activities. Being able to specify e.g. the 5th
>     of May every year is only of limited value; most events are
>     decided according to a much more complex pattern. The 3rd
>     Wednesday of each month, a certain date if it is not a Sunday,
>     otherwise the Monday after it, and so on, are easier examples,
>     and things can get more complex. With the current solution
>     only a small part of the actual requirements can be addressed.
>     Therefore, the datatype 'recurringDuration' must be removed.
>     Several derived datatypes will be removed as a consequence
>     (e.g. timePeriod, recurringDate, recurringDay, time,...).
>     [The only viable alternative to this is to work on a more
>     powerful representation that can address both various cultures
>     and more complicated rules.]
>
>[22] Having a datatype for timeInstant is clearly desirable. The current
>     derived type should be promoted to a base type. Ideally, the
>     representation should be based only on days (and seconds within the
>     day) from an arbitrary but clearly specified base time instant (this
>     would greatly simplify conversions to internal representation of all
>     kinds of OSs and libraries). If this is judged to be not readable
>     enough in plain text, the current scheme based on ISO 8601 may be
>     kept (but should be verified to be absolutely clean of double
>     lexical representations).
>     Please note that while the representation
>     in this case would not be culturally neutral, each timeInstant
>     can with appropriate calculations be represented in a different
>     calendar without problems.
>
>[23] It may be reasonable to consider a datatype 'date', which
>     is related to timeInstant but most probably best defined as
>     a separate base type. 'month', 'year', and 'century' have
>     to be removed for the reasons given above. It may be worth
>     defining a 'composite' datatype 'actualTimePeriod', which
>     consists of a start timeInstant and an end timeInstant.
>     This would cover a lot more (and a lot more useful) cases
>     in a much more uniform manner than what is currently possible,
>     and could even replace 'date'.
>
>[24] ISO 8601 is based on the Gregorian calendar, but there
>     seems to be no indication as to whether this is applicable
>     before 1582, nor how exactly it would be applied. Also,
>     it is unclear how far into the future the Gregorian calendar
>     will be used without corrections. A representation purely
>     based on days and seconds would avoid these problems; if
>     this is not possible, then the spec needs some additional
>     explanations or references.
>
>[25] Several details in appendix D have to be fixed. It has to
>     be clear that leading zeroes for months and days are needed.
>     Hours obviously go from 0 to 23, minutes from 0 to 59.
>     Seconds indeed can go to 60 in the case of leap seconds,
>     but only in that case.
>
>[26] For international data interchange, a uniform way to transmit
>     measurements not only for time lengths and time instants,
>     but all kinds of other units, seems highly desirable. If
>     this cannot be provided in the first version of XML Schema,
>     it clearly should be taken up soon for the next version.
>
>[27] The lexical representation of 'hex' encoding (2.4.2.12)
>     must be changed to allow only one case (e.g. only upper
>     case) for hex digits.
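As a rough illustration of comment [27], here is a minimal Python sketch of what a single-case rule for hex digits could look like. The names `is_canonical_hex` and `to_canonical_hex` are hypothetical and come from neither the mail nor the XML Schema draft, and upper case is assumed only because the comment offers it as an example.

```python
import re

# Assumption: the canonical form is zero or more byte pairs written
# with upper-case hex digits only (0-9, A-F).
_CANONICAL_HEX = re.compile(r"(?:[0-9A-F]{2})*")


def is_canonical_hex(lexical: str) -> bool:
    """Return True if `lexical` consists solely of upper-case hex digit pairs."""
    return _CANONICAL_HEX.fullmatch(lexical) is not None


def to_canonical_hex(data: bytes) -> str:
    """Produce the single upper-case lexical form for a byte sequence."""
    return data.hex().upper()
```

Under this rule "2F" is accepted and "2f" is rejected, so a given byte sequence has exactly one lexical form.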
>
>[28] String length: There should be a note saying that string
>     length as defined here does not always coincide with string
>     length as perceived by the user or with an actual amount
>     of storage units in some digital representation, and that
>     therefore care should be taken both when specifying some
>     bounds as well as when using these bounds to try to derive
>     some storage requirements.
>     [Although this is not an i18n issue, our group also found
>     the simultaneous availability of 'length', 'minLength',
>     and 'maxLength' highly confusing.]
>
>[29] String ordering: This feature seems to be present for no
>     real use, and should be removed. User-oriented string
>     ordering is highly complex and locale-dependent, and is
>     dealt with in other standards (ISO/IEC 14651 and Unicode TR #10).
>     Locale-independent ordering only makes sense if it is usable
>     for something. This might actually be the case if it were
>     possible to specify that all subelements of a given element
>     have to appear in a given order (just to avoid variation).
>     If this is possible with XML Schema, the orderedness of
>     string may be kept. If not, orderedness as a facet should
>     be removed altogether.
>     In any case, the related facets min/maxIn/Exclusive must be
>     removed, because they never lead to any useful subset of
>     strings. (E.g. assume minInclusive='a' and maxExclusive='b'.
>     This makes sure the first letter is a lower case 'a', but
>     allows any letter whatsoever (from the whole Unicode repertoire)
>     after the 'a'. This is most probably not what a naive user
>     is expecting (but as good as we can get), and for an advanced
>     user, this (and many other useful things) are much more easily
>     specified by patterns).
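The range example in [29] can be tried out with a small Python sketch. It assumes plain code point order (which is how Python compares `str` values) as a stand-in for whatever ordering the spec defines; `in_range` and the sample strings are hypothetical.

```python
def in_range(value: str, min_inclusive: str, max_exclusive: str) -> bool:
    """Range check in code point order, mimicking minInclusive/maxExclusive."""
    return min_inclusive <= value < max_exclusive


# Sample strings: some start with 'a' followed by arbitrary characters,
# others start with a different letter.
samples = ["a", "apple", "a\u65e5\u672c\u8a9e", "a\U0001F600", "b", "B", "\u00e1"]

# Everything that begins with lower-case 'a' falls inside ['a', 'b'),
# whatever follows; strings starting with 'b', 'B' or U+00E1 do not.
accepted = [s for s in samples if in_range(s, "a", "b")]
```

The facet pair constrains only the first character here, which is the point the comment makes: a pattern would state the same restriction more directly.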
>
>[30] URI Reference: This definition must be changed to allow for
>     characters not allowed in URI References, in order to be in
>     accordance with the relevant section of the W3C Character Model
>     (http://www.w3.org/TR/charmod/#URIs) and all the W3C Recommendations
>     and upcoming Recommendations in accordance with it (HTML 4.0,
>     XML 1.0, RDF, XPointer, XLink,...).
>     [While at it, please also remove the definitions of 'absolute
>     uriReference' and 'relative uriReference' if you don't use them,
>     and make sure you mention that RFC 2396 has been updated by
>     RFC 2732: Format for Literal IPv6 Addresses in URL's
>     R. Hinden, B. Carpenter, L. Masinter, December 1999.
>     e.g. at http://www.ietf.org/rfc/rfc2732.txt]
>
>[31] 3.3.1 language: The 'LanguageID' production in XML 1.0 is too
>     narrow. It fits the currently allowed languageIDs of RFC 1766
>     tightly, but RFC 1766 is being upgraded (see
>http://search.ietf.org/internet-drafts/draft-alvestrand-lang-tags-v2-01.txt).
>     The I18N WG/IG are working together with the XML Core WG to
>     make sure XML can be adjusted appropriately, and that no
>     premature overly restrictive decisions are taken. The XML
>     Schema WG should work together with the above WG to coordinate
>     this issue.
>
>[32] The 'length/minLength/maxLength' facets on 'language' are
>     highly doubtful; they do not correspond to any useful
>     concepts in the value domain of this datatype.
>
>[33] It is unclear why certain datatypes are derived from
>     'string' (e.g. language, nmtoken, name, ncname), but not
>     others (e.g. ID, idref, entity, notation, qname).
>
>[34] Pattern combinations: Section 5.2.4 says that multiple patterns
>     in a derivation of a single type are combined as if they were
>     separate branches of a regular expression. Branches result
>     in an 'OR' combination, i.e. the actual string can conform
>     to either branch. It seems much better to change this to
>     an 'AND' condition, i.e. the actual string has to conform
>     to BOTH regular expressions.
>     There are several reasons for this:
>     - Restrictions on all kinds of facets, on the same derivation
>       or on subsequent derivations, can very generally be modeled
>       as AND conditions (i.e. for a derived simple type, all
>       conditions on that type and any base types apply simultaneously).
>       This makes it possible to deal uniformly with all such
>       restrictions, and to avoid special cases. E.g. instead of saying
>       that having both a minInclusive and a minExclusive on the same
>       derivation is illegal, one of them just becomes redundant.
>     - The regular expression syntax does not allow AND conditions.
>       However, such conditions are frequently used in programming.
>       In programming, they don't have to be part of the regexp
>       syntax, because they can be modelled as two subsequent checks.
>       In XML Schema, there is no device for subsequent checks.
>     - AND conditions on regular expressions are in particular
>       important for i18n (see point [35]).
>
>[35] It has to be possible to specify various restrictions on
>     a string simultaneously. In particular, we expect that
>     combining a restriction regarding the character repertoire
>     (e.g. to deal with encoding restrictions in legacy systems)
>     and a restriction on the structure of a string will be
>     quite frequent. See also point [36].
>
>[36] Some of the regular expressions needed will be quite long.
>     As an example, the regular expression to limit the repertoire
>     to those characters expressible in the traditional Japanese
>     encodings results in a character class with about 6000
>     characters. To make this reasonably possible, we suggest:
>     - To allow XML spaces in regular expressions (including
>       character classes) in the same way they are allowed in the
>       newer Perl versions. This will lead to greater readability
>       for many other applications, too.
>     - To allow character classes or regular expressions
>       in general to be defined as objects of their own that can be
>       referenced either in a 'pattern' element or directly in a regular
>       expression or character class.
>     - If the point just above is not possible, in any case
>       to make sure that patterns are combined by 'AND' in
>       the derivation hierarchy.
>
>[37] In appendix E, remove the 'CS Surrogate' character property.
>     Surrogates do not appear on the level that XML Schema is working.
>
>[38] For character sequence '\w', please make sure that the character
>     class does not end at  but at , and that this
>     is consistent in the primer.
>
>[39] Upgrade the reference to ISO 10646 to the year 2000 version,
>     removing the reference to the amendments.
>
>[40] Upgrade the reference to Unicode to version 3.0.
>
>[41] Make sure whether/that block escapes are normative (i.e.
>     change the various 'may' in their definition to something more
>     appropriate).
>
>[42] Try to give less US-centric examples.
>
>[43] Make sure that the character property categories and block
>     escape classes for Unicode characters are not bound to a single
>     version of Unicode. This would create an update problem as soon
>     as Unicode is updated, which is sure to happen rather soon.
>     XML Schema should be independent of such upgrades, otherwise
>     this part of it will soon be less and less useful. The pointer
>     to version 3.0.0 of the Unicode Database should be changed to
>     a generic pointer to the latest version.
>
>[44] The current regular expression syntax does not easily take into
>     account combinations of base characters and combining marks.
>     This can be inconvenient for certain scripts, and will become
>     more and more inappropriate because the encoding of precomposed
>     characters has been stopped. There should be a note pointing
>     out this problem, and the XML Schema WG should have a plan of
>     how and when to address this (i.e.
>     the upgrade to the next level
>     of regular expressions according to Unicode TR #18).
>
>[45] Several examples could be less US-centric. In particular, the
>     example in 5.2.11 should be changed from Fahrenheit to Celsius.
>
>[46] In appendix A, all prose should fall under xml:lang='en'.
>
>[47] There are a number of inconsistencies and typos, but given
>     the large number of needs for changes as discussed above, it
>     seems more appropriate to check and report such problems on
>     a second reading after an update.
>
>Regards, Martin.
Received on Wednesday, 31 May 2000 01:53:50 UTC