I18n comments on XML Schema: Datatypes

Summary of disposition

13 October 2000
Paul V. Biron
C. M. Sperberg-McQueen

This document reproduces the comments made by the W3C i18n WG on the 7 April 2000 last call draft of XML Schema, and provides a quick summary, for each point, of what the XML WG and editors of the spec have done (or in some cases are doing) in response.

Examination of the datatypes spec by the editors and chair showed that some changes which the editors thought had been made to the spec had either not been made or had fallen off at some point; the editors will make these changes as soon as possible, and certainly before any CR publication.

Message-Id: <4.2.0.58.J.20000531142142.03430280@sh.w3.mag.keio.ac.jp>
Date: Wed, 31 May 2000 14:22:49 +0900
To: www-xml-schema-comments@w3.org
From: "Martin J. Duerst" <duerst@w3.org>
Subject: Fwd: I18N Last call comments on Schema Part 2

Forwarded, on request of C. M. Sperberg-McQueen.

Date: Mon, 29 May 2000 18:57:09 +0900
From: "Martin J. Duerst" <duerst@w3.org>
Subject: I18N Last call comments on Schema Part 2

Dear Schema WG,

[This mail is crossposted to the I18N IG to allow for further discussion. Please feel free to forward these comments to another list, including a public list, but please make sure that you don't reveal the mail addresses of the various groups.]

These are the last call comments on XML Schema Part 2: Datatypes from the I18N WG/IG.

The comments are numbered by [n], but their order does not reflect their importance.

[1] The definition of 'match' has been copied from XML 1.0. There are proposals for clarifying XML 1.0. The Schema WG should work together with the XML Core WG and the I18N IG to make sure everything is in sync.

This is part of LC-207

Agreed; will change to agree with the definition in XML 1.0 2e.

[2] The spec says that length for strings is measured in terms of [Unicode] codepoints. This is technically correct, but it should say it's measured in terms of characters as used in the XML Recommendation (production [2]).

This is part of LC-207

Agreed; will change.

[3] In 2.4.2.12, it says 'For example, "20" is the hex encoding for the US-ASCII space character'. It should say something like '"20" encodes a byte value represented e.g. in C as 0x20, which may stand for the space character if US-ASCII (or UTF-8) is used to encode it.' But actually this is a bad example, because encoding text with base64 is a bad idea and is against the spirit of XML.

This is part of LC-207

Agreed; will substitute hex encoding of a 16-bit binary integer 4023.
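
As an illustration of the replacement example, the following Python sketch hex-encodes the integer 4023 as two bytes. It assumes that 4023 is a decimal value and that the 16 bits are written most significant byte first; neither the byte order nor the helper name `hex_encode_uint16` comes from the spec.

```python
def hex_encode_uint16(value: int) -> str:
    """Return the upper-case hex encoding of a 16-bit unsigned integer."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("value does not fit in 16 bits")
    # Assumption: big-endian byte order (most significant byte first).
    return value.to_bytes(2, "big").hex().upper()


print(hex_encode_uint16(4023))  # prints 0FB7
```

The encoded string stands for two bytes, not for any character, which is the distinction comment [3] asks for.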

[4] related to [3]: XML is based on Unicode and therefore allows a huge range of characters to be represented. However, XML explicitly excludes most control characters in the C0 range. There are fields in databases and programming languages that allow and potentially contain these characters. A user of XML and XML Schema has various alternatives, none of them very satisfactory:

  1. Drop the forbidden characters
  2. Use XML Schema 'binary' with an encoding: This does not encode characters, but bytes, and therefore loses all i18n features of XML. There is a serious danger that this is used even when the data item in question or even the whole database does not contain a single such character.
  3. Invent a private convention

This is a serious problem, and should be duly addressed by XML Schema.

[There is a related problem with respect to names (GIs in SGML terminology), but this is more an XML 1.0 problem than an XML Schema problem, and there is no danger of losing all i18n information just because of a single character.]

This is part of LC-218.

Out of scope, sorry.

Without disagreeing with you on the importance of the problem for those whose legacy (or new) data contains characters which are not legal in XML, we believe that this problem arises as a result of XML's definition of its legal character set, and not as a result of anything in or not in XML Schema. We are required by charter to work with XML 1.0 documents and may not change it whether to mend or to mar.
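
To make the constraint concrete: production [2] of XML 1.0 admits tab, line feed, carriage return, and the ranges #x20-#xD7FF, #xE000-#xFFFD, and #x10000-#x10FFFF. A minimal Python sketch of a check for characters outside that set follows; the function names `is_xml_char` and `illegal_xml_chars` are hypothetical.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch is matched by the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_xml_chars(text: str) -> list:
    """Return (index, character) pairs for characters XML 1.0 forbids."""
    return [(i, ch) for i, ch in enumerate(text) if not is_xml_char(ch)]


# NUL and UNIT SEPARATOR are C0 controls that XML 1.0 excludes.
print(illegal_xml_chars("a\x00b\x1f"))
```

A legacy field that yields a non-empty list here is exactly the case for which comment [4] lists the three unsatisfactory alternatives.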

[5] related to [4]: 3.2.1 seems to allow all Unicode/ISO 10646 characters, this is not true (see [4]).

This is part of LC-207

Good catch, thanks; will change to reference production 2 in the XML spec.

[6] 3.2.1: Expand 'UCS' to Universal Character Set.

This is part of LC-207

Agreed; will do.

[7] Make sure that functionality for locale-independent representation and locale-dependent information is clearly distinguished. This is the only way to assure both appropriate localization (which we consider to be very important) and worldwide data exchange. The specification is rather close to half of this goal, namely to provide locale-independent datatypes. Some serious improvements however are still possible and necessary (see below). It is clearly desirable that W3C also address locale-dependent data representations. We think that these go beyond simple datatyping/exchange issues and include:

These issues therefore have to be examined on a wider level involving various groups such as the XML Schema WG, the XSL WG, the CSS WG, the XForms WG, and so on, and this should be done as soon as possible.

We would like to repeat that any mixup between locale-independent and locale-dependent data representation will lead to confusion and will hurt, and not benefit, internationalization and localization. (This point is further addressed in detail in some of the points below: [8], [9], [10]-[16], [20]).

This is part of LC-219. See discussion in response on that issue.

As you suggest, this problem appears not to be directly addressable by XML Schema working alone. We believe our spec does not confuse locale-dependent and locale-independent information, if only because it applies neither characterization to any data.

[8] Say explicitly in the specification and in the primer that the lexical representations you provide for various datatypes (in particular things such as date, numbers,...) are designed for locale-independent data exchange, and that they are inappropriate for locale-dependent data representation. In the primer, an example such as <date value='2000-05-16'>Tuesday, 16th of March, 2000</date> (or even just something like <date value='2000-05-16'>next Tuesday</date>) with value defined as a date and the <date> content as string, would help. Also, explicitly warn that where there is some similarity between localized representations and the locale-independent representation, this must not be exploited when presenting the data to a user, and that similarities are due to

and that the fact that some representations are more similar to some locales than others is done reluctantly, and not explicitly to disadvantage certain users. [Indeed, where possible, we would prefer representations that avoid any similarity to any existing locale.]

This is part of LC-219, see discussion there.

[9] As said above and explained below, addressing localized representations as a whole is a huge problem. The one contribution that seems most appropriate and relevant from XML Schema is to associate locale-independent and locale-dependent representations. Taking the example above, <date value='2000-05-16'>Tuesday, 16th of March, 2000</date>, the association between the locale-independent 'value' and the locale-dependent element content is implicit; XML Schema should provide a way to make this association explicit. Including in the association some way to indicate the local format used / the conversion functions necessary seems also desirable, although we are not yet aware of an interoperable way to do so.

This is part of LC-219, see discussion there.

No progress on this in 1.0, sorry.

There is currently no location in the conceptual framework to allow a systematic association between lexical forms representing the same values; the abstract simple types proposal which had this as a goal was defeated owing in part to the negative comments by members of the i18n WG.

[10] Several datatypes have more than one lexical representation for a single value. This gives the impression that these lexical representations actually allow some kind of localization or variation of representation. However, as explained above, such an impression is a dangerous misunderstanding, and has to be avoided at all costs. We therefore strongly request that all duplicate lexical representations be removed. The following points ([11]-[16],[20], [22], [27]) give details for each affected datatype. For each datatype, we indicate where duplicate representations exist, and how they may be removed. Unless otherwise indicated, we do not have any particular preferences of how to remove the duplicates; we just explain one way to do so to allow you to reuse the analysis we (mostly Mark Davis) have already done. We would like to point out that reducing the lexical representations to a single one for each value also makes using digital signatures on such data a lot easier, and to a large extent and at very little cost, avoids the creation of another WG and spec like in the case of XML Canonicalization.

This is part of LC-220.

Agreed in part, in part not agreed.

We have tried to remove duplicate representations of the same value wherever possible; they remain in limited cases in the numerics (e.g. leading zero), in time instance (because of time zones), and in hex (because both lower and upper case are legal). The variation we allow is in general that which we know to be supported by common libraries (so that requiring that it be disallowed would complicate implementation rather than simplifying it).
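
The kind of variation that remains can be observed in an ordinary library. The Python sketch below only shows that the standard parsers of one language map several lexical forms to one value; it does not claim that Python accepts exactly the lexical space of the spec, and the variable names are chosen for the illustration.

```python
from decimal import Decimal

# Each set collects the values obtained from several lexical forms.
# Leading zeros and an optional plus sign do not change a numeric value.
int_values = {int(s) for s in ("7", "007", "+7")}
decimal_values = {Decimal(s) for s in ("12.5", "012.50", "+12.5")}

# Floats: the split between mantissa and exponent is free, as is the
# case of the exponent marker.
float_values = {float(s) for s in ("1000", "1.0E3", "100e1")}

# Hex: upper- and lower-case digits decode to the same bytes.
hex_values = {bytes.fromhex(s) for s in ("0fb7", "0FB7", "0Fb7")}

# Every set collapses to a single value.
print(len(int_values), len(decimal_values), len(float_values), len(hex_values))
```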

[11] 3.2.2 'boolean': There are currently four lexical reps. for two values. This has to be reduced to two lexical reps. The I18N WG/IG here has a clear preference:

This is part of LC-220

We have settled on a single representation. Unfortunately, the Anglophones won, and the representation is "true" and "false". (The informal proposal to use "sic" and "non" appears not even to have been minuted.)

[12] 3.2.3.1 'float' allows multiple representations. This must be fixed, e.g. as follows:

Float values have a single standard lexical representation consisting of a mantissa, followed by the character "E" (upper case only), followed by an exponent. The exponent must be an integer. The mantissa must be a decimal number. The representations for exponent and mantissa must follow the lexical rules for integer and decimal numbers discussed above[below?]. The absolute value of the mantissa must be either zero, or greater than or equal to 1 and less than 10. If the mantissa is zero, then the exponent must be zero. For example:

  Valid: "-1.23E5", "9.9999E14", "1.0000001E-14", "0E0", "1E0"
  Invalid: "+1.23E5", "100000.0E3", "1.0E3", "1.0E0", "012.E3", "0E1"

[This leaves one issue open, namely the issue of too high precision. One way to solve this is to define that the lexical rep. chosen is the one with the shortest lexical rep of the mantissa that corresponds to the desired value according to [Clinger/Gay], or if two lexical reps with the same shortest mantissa correspond, then the closer one should be chosen, and if both are equally close, then the one with an even end digit is chosen. [This should cover all cases, but there may be more accurate or more easy to calculate alternatives, and this should be checked by experts.]]

[Some people may claim that e.g. the free choice of exponent or the use of leading digits is necessary to be able to mark up existing data; we would like to point out that if such claims should be made, we would have to request that not only such variations, but also other variations, e.g. due to the use of a different series of digits (Arabic-Indic, Devanagari,... Thai,..., Tibetan,..., ideographic,...) and so on be dealt with at the same level.]

This is part of LC-220

We have chosen not to require implementors to check for and reject lexical forms accepted by standard libraries for handling floats.
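
For reference, the single-representation rule proposed in [12] (which the WG did not adopt) is regular enough to be checked with one pattern. The Python sketch below encodes that proposal only, not the lexical space of the spec; the name `is_canonical_float` is hypothetical, and the precision question raised in the bracketed note of the comment is not handled.

```python
import re

# Either the single form for zero, or:
#   optional minus, one non-zero digit, optional fraction without a
#   trailing zero, "E", and an exponent without plus sign or leading zeros.
_CANONICAL_FLOAT = re.compile(
    r"0E0|-?[1-9](?:\.[0-9]*[1-9])?E(?:0|-?[1-9][0-9]*)"
)


def is_canonical_float(text: str) -> bool:
    """True if text has the unique form proposed in comment [12]."""
    return _CANONICAL_FLOAT.fullmatch(text) is not None


for form in ("-1.23E5", "0E0", "+1.23E5", "1.0E3", "0E1"):
    print(form, is_canonical_float(form))
```

A check of this kind is the extra step the WG chose not to impose on implementors, since standard float parsers accept the rejected forms as well.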

[13] 3.2.4.1 'double' allows multiple representations. This must be fixed. The solution outlined in [12] can be applied.

This is part of LC-220

(As for previous point: forbidding the remaining variation seems to have a higher cost for implementors than allowing it.)

[14] 3.2.5.1 'decimal' allows multiple representations. This must be fixed, e.g. as follows:

Decimal values have a single, unique, lexical representation. This consists of a string of digits (x30 to x39) with a period (x2E) as a decimal indicator (in accordance with the scale and precision facets), and a leading minus sign (x2D) to indicate a negative number. The decimal indicator must be omitted if there are no fraction digits. Leading and trailing zeros are illegal, except for zero itself (which is written as "0"). For example: Valid: "-1.23", "100000", "12678967.543233", "0" Invalid: "+1.23", "100000.0", "12,678,967.543233", "12,678,967.543233", "0.0", "012."

This is part of LC-220

As for the previous points. We do not accept responsibility for the fact that in standard notation leading zeroes do not change the value of a number, and believe it is more graceful to accept this fact than to try to change or ignore it.
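
The unique form proposed in [14] can be sketched in the same way. This Python pattern reflects the commenters' proposal, not the adopted lexical space, and it ignores the scale and precision facets. One point is an assumption of the sketch: the proposal does not say how values between -1 and 1 are written, so a single "0" before the decimal point (as in "0.5") is accepted here. The name `is_canonical_decimal` is hypothetical.

```python
import re

_CANONICAL_DECIMAL = re.compile(
    r"0"                                  # zero itself
    r"|-?(?:[1-9][0-9]*(?:\.[0-9]*[1-9])?"  # no leading or trailing zeros
    r"|0\.[0-9]*[1-9])"                   # assumption: "0.5", not ".5"
)


def is_canonical_decimal(text: str) -> bool:
    """True if text has the unique form proposed in comment [14]."""
    return _CANONICAL_DECIMAL.fullmatch(text) is not None


for form in ("-1.23", "100000", "0", "+1.23", "100000.0", "012."):
    print(form, is_canonical_decimal(form))
```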

[15] Lexical representation of derived datatypes: The lexical representation of all datatypes derived (directly or indirectly) from 'decimal' (13 types from 'integer' to 'positiveInteger') must be changed to be unique. The easiest and most consistent way to do this is to just specify for each datatype that the lexical representation for all the values of the type is the same as for 'decimal'. If you want to be specific, you can find some details at: http://lists.w3.org/Archives/Member/w3c-i18n-wg/1999Nov/0007.html (members only). In any case, disallowing a '+' (done on some types, but not consistently) and disallowing leading zeroes should do the job.

This is part of LC-220

We can make the lexical representation similar, and have done so. We cannot say the lexical representation is "the same", though, because (for example) the lexical space of decimal includes numbers with the plus sign; the lexical space of negative integer doesn't include any such numbers; such variations need (in our view) to be mentioned.

[16] For elementary types, there may be a desire to allow whitespace around the actual data. To be clear, the spec should explicitly say that this is disallowed. (except for cases where it has to be allowed for XML/SGML conformance, i.e. ENTITY, ID,...). Another way of expressing this comment is to say that the spec should make clear for which datatypes CDATA attribute-value normalization should be chosen, and for which datatypes not.

This is part of LC-207

Agreed (mostly). We have made explicit which types get which kind of whitespace normalization, but the rules are not quite those you propose.

All simple types not derived from string have tokenized behavior: i.e., leading and trailing whitespace is stripped and internal whitespace is collapsed to single blanks. Three builtin types of strings are defined: "string" (which has no whitespace normalization at all), CDATA (which replaces each kind of whitespace character with a blank, as for CDATA attributes in XML 1.0), and token (which performs CDATA normalization, then strips leading and trailing blanks and collapses sequences of blanks to single blanks).
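
The three behaviors described above can be sketched as follows. This is an illustration in Python, assuming that "whitespace" means the four XML 1.0 whitespace characters (space, tab, line feed, carriage return); the function name and the mode labels "preserve", "replace", and "collapse" are chosen for the sketch.

```python
import re


def normalize_whitespace(value: str, mode: str) -> str:
    """Apply one of the three whitespace treatments to a lexical value."""
    if mode == "preserve":   # "string": no normalization at all
        return value
    # CDATA-style normalization: each tab, LF, or CR becomes one blank.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if mode == "replace":    # CDATA
        return replaced
    if mode == "collapse":   # token, and all types not derived from string
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError(f"unknown mode: {mode!r}")


raw = " a\t\tb \n"
print(repr(normalize_whitespace(raw, "replace")))   # ' a  b  '
print(repr(normalize_whitespace(raw, "collapse")))  # 'a b'
```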

[17] The time-related datatypes (timeDuration and recurringDuration and derived datatypes) need to be redesigned to avoid a number of serious problems. For details, please see points [18]-[25].

This is part of LC-221

Partly agreed and partly not agreed.

[18] The specification assumes that usual arithmetic can be done with TimePeriod, but due to the representation chosen, this is not the case. For example, it is absolutely unclear which of P3.01M or P90.5D is greater, or whether they are equal. There are two ways to solve this, either to choose a different representation or to remove orderedness and min/maxIn/Exclusive. The former is clearly desirable because of additional reasons, please see [19].

This is part of LC-221

(N.B. We believe you are talking about timeDuration (e.g. "4 seconds"), not timePeriod (e.g. "30 minutes beginning 1:30 Tuesday 17 October 2000").)

The order relation on timeDuration has been a sticky problem for the WG; opinions on how best to model the reality of time durations have varied, as have opinions on what the reality of time durations actually is. We have explored a number of options, including:

Of these, the WG has with a certain amount of resignation concluded that partial order appears the least worst. Perhaps it is only that each of the other proposals has quickly attracted vociferous opponents.

We expect feedback on this topic from implementors and users during the CR period, and we hope that that feedback will help illuminate which of the available choices works best in practice.
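
Why a total order is hard to define can be seen from a small calculation: the number of days covered by a span of months depends on where the span starts. The Python sketch below uses whole months (3 months against 90 days) rather than the fractional values in the comment, and the helper name `days_spanned_by_months` is hypothetical. It illustrates the problem only and is not the comparison rule of the spec.

```python
from datetime import date


def days_spanned_by_months(start: date, months: int) -> int:
    """Days between start and the same day-of-month `months` later.

    Assumes the day-of-month exists in the target month
    (always true for the 1st, which the examples use).
    """
    total = start.month - 1 + months
    end = date(start.year + total // 12, total % 12 + 1, start.day)
    return (end - start).days


# Three months can be equal to, longer than, or shorter than 90 days.
print(days_spanned_by_months(date(2000, 2, 1), 3))  # 90
print(days_spanned_by_months(date(2000, 6, 1), 3))  # 92
print(days_spanned_by_months(date(2001, 2, 1), 3))  # 89
```

Under a partial order, a pair such as "3 months" and "90 days" is simply left incomparable.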

[19] The use of culture-specific time length units is highly problematic. This in particular applies to years and months in timeDuration. Various calendars use different month and year lengths; the main distinction being the one between lunar calendars and solar calendars. The Islamic, Hebrew, and Chinese months and years, for example, are all different from the corresponding western units. A system either has to be able to represent these units in all calendars (extremely difficult) or should be limited to representations that are to an extremely high degree culturally neutral. In order to deal with [18], too, we propose to do the latter.

This is part of LC-221

Not agreed (if we are correct in interpreting this as a request to make durations like "three months" illegal).

We would like to be able in the long run to support the full range of calendars in wide use, or even sporadic use; in the short run, Gregorian is an important calendar for many (most?) of the applications expected for XML Schema in the near future, and we believe we need to support it. We don't believe that not supporting the Gregorian calendar represents any particularly useful step forward toward supporting other calendars.

[20] Unique representation of timeDuration: There must be only one lexical representation for each timeDuration. This can be achieved as follows: Based on the representation of ISO 8601, only PnDTnHnMnS is used (i.e. no years or months). If any unit is zero, the number and the letter after it are removed, except for the zero duration, which is represented as P0D. If only Days are present, 'T' is omitted. Overflows in lower units have to be converted to higher units (i.e. PT24H -> P1D, PT60M -> PT1H, PT60S -> PT1M; except for leap second cases). Decimal fractions are only allowed for seconds, and do not allow trailing zeroes. [A serious alternative to this would be to remove timeDuration altogether.]

This is part of LC-221

Agreed (though not quite as you have suggested).

This question is related to the problem of the order relation on time durations, and also to the general question of allowing or forbidding multiple lexical representations for the same value.

The value space of time durations has been redefined as a set of n-tuples, rather than as a set of one-dimensional magnitudes. Each lexical form denotes a distinct tuple and thus a distinct value (modulo multiple forms caused by leading zeroes etc.). The duration "24 hours" thus denotes a distinct value from "1 day", because the values are distinct tuples.

Removal of the time duration type just doesn't seem a serious alternative to us. Schema authors would only invent it for themselves, in non-interoperable forms.
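
A value space of tuples can be pictured with a short Python sketch. The six-field layout (years, months, days, hours, minutes, seconds), the simplified grammar, and the names `Duration` and `parse_duration` are assumptions made for the illustration; the response above says only that values are n-tuples.

```python
import re
from decimal import Decimal
from typing import NamedTuple


class Duration(NamedTuple):
    years: int
    months: int
    days: int
    hours: int
    minutes: int
    seconds: Decimal


# Simplified PnYnMnDTnHnMnS grammar: fractions on seconds only, no sign.
_DURATION = re.compile(
    r"P(?:([0-9]+)Y)?(?:([0-9]+)M)?(?:([0-9]+)D)?"
    r"(?:T(?:([0-9]+)H)?(?:([0-9]+)M)?(?:([0-9]+(?:\.[0-9]+)?)S)?)?"
)


def parse_duration(text: str) -> Duration:
    """Map a lexical form to a tuple without converting between units."""
    match = _DURATION.fullmatch(text)
    if match is None or all(group is None for group in match.groups()):
        raise ValueError(f"not a duration: {text!r}")
    y, mo, d, h, mi, s = match.groups()
    return Duration(int(y or 0), int(mo or 0), int(d or 0),
                    int(h or 0), int(mi or 0), Decimal(s or "0"))


print(parse_duration("PT24H") == parse_duration("P1D"))  # False
print(parse_duration("P01D") == parse_duration("P1D"))   # True
```

Because no unit is converted into another, "24 hours" and "1 day" stay distinct values, while forms that differ only by leading zeroes still coincide.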

[21] The problems with timeDuration ([19]-[21]) heavily affect recurringDuration and all datatypes derived from it. In addition to the arguments above, recurringDuration is clearly of very limited use even for the areas of the world that use the Gregorian calendar for all their activities. Being able to specify e.g. the 5th of May every year is only of limited value; most events are decided according to a much more complex pattern. The 3rd Wednesday of each month, a certain date if it is not a Sunday, otherwise the Monday after it, and so on, are easier examples, and things can get more complex. With the current solution only a small part of the actual requirements can be addressed. Therefore, the datatype 'recurringDuration' must be removed. Several derived datatypes will be removed as a consequence (e.g. timePeriod, recurringDate, recurringDay, time,...). [The only viable alternative to this is to work on a more powerful representation that can address both various cultures and more complicated rules.]

This is part of LC-221

Not agreed; not removed.

We agree that recurringDuration is of limited (i.e. finite) use. But we think it's of sufficient use to be worth including.

[22] Having a datatype for timeInstant is clearly desirable. The current derived type should be promoted to a base type. Ideally, the representation should be based only on days (and seconds within the day) from an arbitrary but clearly specified base time instant (this would greatly simplify conversions to internal representation of all kinds of OSs and libraries). If this is judged to be not readable enough in plain text, the current scheme based on ISO 8601 may be kept (but should be verified to be absolutely clean of double lexical representations). Please note that while the representation in this case would not be culturally neutral, each timeInstant can with appropriate calculations be represented in a different calendar without problems.

This is part of LC-221

Agreed that the type is desirable; not agreed on the lexical form.

Legibility is indeed an issue; so is representation of time instants before the epoch used for a second or millisecond count. (We can invent a rule, but it's not clear that all systems which count seconds from an epoch will understand, which seems to mean a little less ease of conversion than might be hoped.)

The only synonymous lexical representations are those required by allowing time zones to be specified.
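
Both points of this response can be illustrated in Python. The sketch assumes Python 3.7 or later for `datetime.fromisoformat`, and it uses 1970-01-01T00:00:00Z as the epoch purely as an example; the comment asks only for "an arbitrary but clearly specified" base instant.

```python
from datetime import datetime, timezone

# Two lexical forms that differ only in the time zone used to write them.
tokyo = datetime.fromisoformat("2000-05-29T18:57:09+09:00")
utc = datetime.fromisoformat("2000-05-29T09:57:09+00:00")
print(tokyo == utc)  # True: both denote the same instant

# Counting seconds from an epoch yields negative numbers before it.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed epoch
day_before = datetime(1969, 12, 31, tzinfo=timezone.utc)
seconds = (day_before - epoch).total_seconds()
print(seconds)  # -86400.0
```

Whether every system that counts from an epoch accepts such negative counts is the interoperability doubt raised above.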

[23] It may be reasonable to consider a datatype 'date', which is related to timeInstant but most probably best defined as a separate base type. 'month', 'year', and 'century' have to be removed for the reasons given above. It may be worth defining a 'composite' datatype 'actualTimePeriod', which consists of a start timeInstant and an end timeInstant. This would cover a lot more (and a lot more useful) cases in a much more uniform manner than what is currently possible, and could even replace 'date'.

This is part of LC-221

We don't understand. We do have a date type.

The proposal appears to be to make date a primitive type instead of a derived type, to reflect its importance. But the distinction between primitive and derived types is not a measure of their practical importance.

The WG did discuss having a type analogous to the actualTimePeriod you propose (and to the corresponding ISO 8601 notation). We chose not to, because such a representation is a pair, probably best represented as the value of two attributes or two elements, and because the value space would be identical to that of timePeriod (defined by a starting time and a duration), and would thus introduce multiple lexical spaces for the same value space in a way that seemed problematic and unmotivated.

[24] ISO 8601 is based on the Gregorian calendar, but there seems to be no indication as to whether this is applicable before 1582, nor how exactly it would be applied. Also, it is unclear how far into the future the Gregorian calendar will be used without corrections. A representation purely based on days and seconds would avoid these problems; if this is not possible, then the spec needs some additional explanations or references.

This is part of LC-221

The Gregorian calendar is indeed applicable before 1582 and after 9999; its applicability is independent of its use as a civil or ecclesiastical calendar.

Appendix D now does say clearly that the calendar is applicable before 1582 and after 9999.
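
Applying the Gregorian rules to dates before 1582 is what Python's standard `calendar` and `datetime` modules do as well, so they can serve as a small illustration. This shows the behavior of one library, not the text of Appendix D, and Python's `date` type stops at year 9999, so the "after 9999" half cannot be shown this way.

```python
import calendar
from datetime import date

# 1500 was a leap year in the Julian calendar then in civil use, but not
# under the Gregorian rule (divisible by 100 and not by 400).
print(calendar.isleap(1500))  # False
print(calendar.isleap(1600))  # True

# Historically 4 October 1582 was followed directly by 15 October; with
# the Gregorian rules applied throughout, the dates are 11 days apart.
gap = (date(1582, 10, 15) - date(1582, 10, 4)).days
print(gap)  # 11
```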

[25] Several details in appendix D have to be fixed. It has to be clear that leading zeroes for months and days are needed. Hours obviously go from 0 to 23, minutes from 0 to 59. Seconds indeed can go to 60 in the case of leap seconds, but only in that case.

This is part of LC-221

Oops. Agreed; will be done.

[26] For international data interchange, a uniform way to transmit measurements not only for time lengths and time instants, but all kinds of other units, seems highly desirable. If this cannot be provided in the first version of XML Schema, it clearly should be taken up soon for the next version.

This is part of LC-221

Measurement units are a perennial favorite topic of many in the WG. They did not seem to the WG to belong to the minimum needed to declare victory and get 1.0 out the door, but it is inescapable that they will come up again in work on future versions.

[27] The lexical representation of 'hex' encoding (2.4.2.12) must be changed to allow only one case (e.g. only upper case) for hex digits.

This is part of LC-220

Not agreed.

The standard libraries we know all support both upper- and lower-case; requiring a case-fold or case-check operation seemed to be gratuitous rigor.

[28] String length: There should be a note saying that string length as defined here does not always coincide with string length as perceived by the user or with an actual amount of storage units in some digital representation, and that therefore care should be taken both when specifying some bounds as well as when using these bounds to try to derive some storage requirements. [Although this is not an i18n issue, our group also found the simultaneous availability of 'length', 'minLength', and 'maxLength' highly confusing.]

This is part of LC-207

Agreed; will be done. N.B. the choice of the word "codepoint" over "character" (see your point [2]) was motivated in part by a desire to suggest that what was being measured was something distinct from the culturally familiar objects people think of when they hear the word "character". So the warning is even more important now.

About length: several other people have also suggested dropping it, since it's just a (formally redundant) short-hand. We may drop it eventually, but there were other changes that seemed more urgent and necessary.
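
The gap between the measured length and other notions of length is easy to demonstrate. In the Python sketch below, `len` on a string counts code points, which is the measure discussed here; the sample string is chosen for the illustration.

```python
# 'e' followed by COMBINING ACUTE ACCENT, then MUSICAL SYMBOL G CLEF.
# A reader perceives two characters.
text = "e\u0301\U0001D11E"

code_points = len(text)                           # 3
utf8_bytes = len(text.encode("utf-8"))            # 1 + 2 + 4 = 7
utf16_units = len(text.encode("utf-16-le")) // 2  # 1 + 1 + 2 = 4

print(code_points, utf8_bytes, utf16_units)  # 3 7 4
```

A maxLength of 3 therefore guarantees neither three visible characters nor any particular number of storage units.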

[29] String ordering: This feature seems to be present for no real use, and should be removed. User-oriented string ordering is highly complex and locale-dependent, and is dealt with in other standards (ISO/IEC 14651 and Unicode TR #10). Locale-independent ordering only makes sense if it is usable for something. This may be actually the case if it were possible to specify that all subelements of a given element have to appear in a given order (just to avoid variation). If this is possible with XML Schema, the orderedness of string may be kept. If not, orderedness as a facet should be removed altogether. In any case, the related facets min/maxIn/Exclusive must be removed, because they never lead to any useful subset of strings. (E.g. assume minInclusive='a' and maxExclusive='b'. This makes sure the first letter is a lower case 'a', but allows any letter whatsoever (from the whole Unicode repertoire) after the 'a'. This is most probably not what a naive user is expecting (but as good as we can get), and for an advanced user, this (and many other useful things) are much easier specified by patterns).

This is part of LC-207

Agreed: string is now described as not ordered (at least, not ordered by any order relation defined by XML Schema).
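
The objection in [29] to bounds on strings can be reproduced with any code-point ordering. The Python sketch below assumes such an ordering (Python compares strings code point by code point); the function `in_range` is hypothetical and only mirrors the minInclusive='a', maxExclusive='b' example from the comment.

```python
def in_range(value: str, min_inclusive: str = "a", max_exclusive: str = "b") -> bool:
    """Range check under code-point ordering of strings."""
    return min_inclusive <= value < max_exclusive


print(in_range("apple"))           # True
print(in_range("a\u6f22\u5b57"))   # True: anything may follow the 'a'
print(in_range("Apple"))           # False: 'A' sorts before 'a'
print(in_range("b"))               # False
```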

[30] URI Reference: This definition must be changed to allow for characters not allowed in URI References, in order to be in accordance with the relevant section of the W3C Character Model (http://www.w3.org/TR/charmod/#URIs) and all the W3C Recommendations and upcoming Recommendations in accordance with it (HTML 4.0, XML 1.0, RDF, XPointer, XLink,...). [While at it, please also remove the definitions of 'absolute uriReference' and 'relative uriReference' if you don't use them, and make sure you mention that RFC 2396 has been updated by RFC 2732: Format for Literal IPv6 Addresses in URL's, R. Hinden, B. Carpenter, L. Masinter, December 1999, e.g. at http://www.ietf.org/rfc/rfc2732.txt]

This is part of LC-207

No change made, sorry.

Our local URI expert (Dan Connolly) took a slightly different interpretation of the value space of URIs and URI references. Following his interpretation, the change suggested appears to involve agreeing with the character model WD against the RFC, which the WG was unwilling to do.

Even those who would have preferred to make the suggested change felt that until the character model is a Recommendation or at least a CR, we cannot cite it normatively.

[31] 3.3.1 language: The 'LanguageID' production in XML 1.0 is too narrow. It fits the currently allowed languageIDs of RFC 1766 tightly, but RFC 1766 is being upgraded (see http://search.ietf.org/internet-drafts/draft-alvestrand-lang-tags-v2-01.txt). The I18N WG/IG are working together with the XML Core WG to make sure XML can be adjusted appropriately, and that no premature overly restrictive decisions are taken. The XML Schema WG should work together with the above WG to coordinate this issue.

This is part of LC-207

Agreed in principle.

We now cite the section, not the production. Which is good, since the production has been deleted from XML 1.0 2e. We think we are in agreement with you and with 2e.

[32] The 'length/minLength/maxLength' facets on 'language' are highly doubtful; they do not correspond to any useful concepts in the value domain of this datatype.

This is part of LC-207

Agreed that they're not particularly useful, but no change made, sorry.

Since the language type is based on the string type, and string has a length facet, language has to have it; there is no machinery to turn facets off and adding it would complicate matters without bringing any real advantage. We agree that it would be eccentric for a schema author to use them (perhaps, though, one might wish to require that locale codes always be included, which could be done with minlen=5?).

[33] It is unclear why certain datatypes are derived from 'string' (e.g. language, nmtoken, name, ncname), but not others (e.g. ID, idref, entity, notation, qname).

This is part of LC-207

Point noted.

These design decisions could have gone the other way. We chose not to derive ID, etc., from string because they have validation semantics and uniqueness constraints which go beyond questions of the lexical space (and, as some WG members see it, the value space -- but the value space of ID and similar types is a vexed issue). We are trying to minimize the use of arbitrary magic in the spec, and in particular trying to have our derivations follow the same rules as a schema user's derivations. Schema users cannot add the kinds of validation constraints imposed by (some of) these legacy types, as part of a derivation step, and so we preferred not to, either. Instead, we made the relevant types distinct primitive types, and applied the magic there (all primitive types are magic in this sense).

NMtoken and so on, by contrast, as types express purely lexical constraints on their values, and so they can be derived from string without appeal to magic of any kind.

[34] Pattern combinations: Section 5.2.4 says that multiple patterns in a derivation of a single type are combined as if they were separate branches of a regular expression. Branches result in an 'OR' combination, i.e. the actual string can conform to either branch. It seems much better to change this to an 'AND' condition, i.e. the actual string has to conform to BOTH regular expressions. There are several reasons for this:

This is part of LC-207

Multiple patterns within a derivation step combine with OR by analogy with enumerations.

Multiple patterns in separate derivation steps combine with AND. So if we understand correctly what you're looking for, you can do what you need in multiple derivation steps.
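
The combination rule can be sketched in Python as follows. The function `accepts`, the list-of-lists model of derivation steps, and the sample patterns are all hypothetical; Python's `re` syntax also differs in detail from the regular expressions of the spec, so this shows only the OR-within-a-step, AND-across-steps logic. `re.fullmatch` is used because a pattern has to match the whole value.

```python
import re


def accepts(value: str, derivation_steps: list) -> bool:
    """Each step is a list of patterns: OR within a step, AND across steps."""
    return all(
        any(re.fullmatch(pattern, value) for pattern in step)
        for step in derivation_steps
    )


steps = [
    # Step 1: a repertoire restriction (printable ASCII as a stand-in).
    [r"[\x20-\x7E]*"],
    # Step 2: two alternative structures, combined with OR.
    [r"\w{2}-[0-9]{4}", r"[0-9]{6}"],
]

print(accepts("AB-1234", steps))       # True
print(accepts("123456", steps))        # True
print(accepts("\u00e9a-1234", steps))  # False: fails the repertoire step
```

Splitting a repertoire restriction and a structural restriction over two steps in this way gives the AND combination asked for in [34] and [35].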

[35] It has to be possible to specify various restrictions on a string simultaneously. In particular, we expect that combining a restriction regarding the character repertoire (e.g. to deal with encoding restrictions in legacy systems) and a restriction on the structure of a string will be quite frequent. See also point [36].

This is part of LC-207

The various restrictions are possible, but not the simultaneity (if by that you mean all the restrictions have to be expressible in the same derivation step).

When we were told that repertoires are not currently standardized or candidates for standardization, many WG members lost their appetite for explicit support for them in XML Schema. Since repertoires can be defined by means of regexes, the effect you describe can be achieved simply:
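As a rough sketch of the idea (in Python rather than schema syntax, and not the example from the original response), a repertoire restriction and a structural restriction can be written as two separate patterns that must both hold, as they would if placed in two derivation steps. The pattern values and the name `accepts` are assumptions chosen for illustration; Python's `re` dialect is not identical to the schema regex language.

```python
import re

# Hypothetical repertoire: printable Latin-1 characters only.
REPERTOIRE = r"[\u0020-\u007E\u00A0-\u00FF]*"
# Hypothetical structure: "family, given".
STRUCTURE = r"[^,]+, [^,]+"

def accepts(value):
    # Both restrictions must hold, as with patterns in separate derivation steps.
    return (re.fullmatch(REPERTOIRE, value) is not None
            and re.fullmatch(STRUCTURE, value) is not None)

print(accepts("D\u00fcrst, Martin"))            # True: in repertoire, right shape
print(accepts("D\u00fcrst Martin"))             # False: structure not satisfied
print(accepts("\u5c71\u7530, \u592a\u90ce"))    # False: outside the repertoire
```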

[36] Some of the regular expressions needed will be quite long. As an example, the regular expression to limit the repertoire to those characters expressible in the traditional Japanese encodings results in a character class with about 6000 characters. To make this reasonably possible, we suggest:

This is part of LC-207

Your third point applies (patterns in different steps of the derivation hierarchy are indeed combined with AND). There is some sentiment for allowing whitespace in regexes in a later version of the spec.

Adding a nameable component to represent regex patterns just seems too complex. It would for instance rather complicate the description of expressions and their interpretation. (If you really need to have named patterns, use general entities in the schema document.)
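The general-entity workaround can be illustrated roughly as follows. The sketch uses Python's standard `xml.etree.ElementTree` to show that an entity declared in the internal subset is expanded inside an attribute value before any application sees it. The element and attribute names here are simplified placeholders, not actual schema vocabulary, and the entity name `digits` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A long or frequently reused regex fragment is declared once as a general entity.
DOC = """<!DOCTYPE schema [
  <!ENTITY digits "[0-9]+">
]>
<schema>
  <pattern value="&digits;-&digits;"/>
</schema>"""

root = ET.fromstring(DOC)
pattern = root.find("pattern").get("value")
print(pattern)  # [0-9]+-[0-9]+
```

The naming lives entirely at the XML level, so the regex language itself needs no notion of a named pattern.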

[37] In appendix E, remove the 'CS Surrogate' character property. Surrogates do not appear on the level that XML Schema is working.

This is part of LC-207

Agreed. Will remove the property, and add a note that this one property is not usable.

[38] For character sequence '\w', please make sure that the character class does not end at #xFFFF but at #x10FFFF, and that this is consistent in the primer.

This is part of LC-207

Agreed; oops. Will change. Sorry; thought it had been done.

[39] Upgrade the reference to ISO 10646 to the year 2000 version, removing the reference to the amendments.

This is part of LC-207

We are going to try to duck this problem if possible. We will cite the XML spec normatively, and add non-normative citations to the Unicode and 10646 versions cited by XML 1.0 2e.

[40] Upgrade the reference to Unicode to version 3.0.

This is part of LC-207

See preceding point.

[41] Make sure whether/that block escapes are normative (i.e. change the various 'may' in their definition to something more appropriate).

This is part of LC-207

Agreed; will be done.

[42] Try to give less US-centric examples.

This is part of LC-207

Point taken. But our brains are currently fried by the effort of getting the CR draft done. If you can make specific suggestions, please send and we'll incorporate them if possible.

[43] Make sure that the character property categories and block escape classes for Unicode characters are not bound to a single version of Unicode. This would create an update problem as soon as Unicode is updated, which is sure to happen rather soon. XML Schema should be independent of such upgrades, otherwise this part of it will soon be less and less useful. The pointer to version 3.0.0 of the Unicode Database should be changed to a generic pointer to the latest version.

This is part of LC-207

We are going to duck this one as far as possible. Our normative reference is via XML 1.0 2e. We do not want to make any normative reference (direct or indirect) to an unversioned spec: it's guaranteed to lead to interoperability problems, and it is (as some WG members say) buying a pig in a poke. We will add a note encouraging implementors to provide access to new versions of the db, as well as to that prescribed by the normative reference(s) in XML 1.0.

[44] The current regular expression syntax does not take into account combinations of base characters and combining marks easily. This can be inconvenient for certain scripts, and will become more and more inappropriate because the encoding of precomposed characters has been stopped. There should be a note pointing out this problem, and the XML Schema WG should have a plan of how and when to address this (i.e. the upgrade to the next level of regular expressions according to Unicode TR #18).

This is part of LC-207

We agree that this is an issue for version 2.0; we will add the note.
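The difficulty raised in [44] can be seen in a small sketch. It is written in Python with the standard `re` and `unicodedata` modules as a stand-in for a schema processor; the pattern is hypothetical, and normalizing to NFC before matching is shown only as one possible workaround, not as something the spec prescribes.

```python
import re
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT, two code points

pattern = r"[a-z\u00e9]"  # a class that lists the precomposed character

print(re.fullmatch(pattern, precomposed) is not None)  # True
print(re.fullmatch(pattern, decomposed) is not None)   # False: two code points

# Assumption: the input is normalized to NFC before matching.
normalized = unicodedata.normalize("NFC", decomposed)
print(re.fullmatch(pattern, normalized) is not None)   # True
```

Normalization only helps where a precomposed form exists; for base-plus-mark sequences with no precomposed equivalent, a code-point-level regex still sees several characters.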

[45] Several examples could be less US-centric. In particular, the example in 5.2.11 should be changed from Fahrenheit to Celsius.

This is part of LC-207

Agreed; will change.

[46] In appendix A, all prose should fall under xml:lang='en'.

This is part of LC-207

Agreed; will change.

[47] There are a number of inconsistencies and typos, but given the large number of needs for changes as discussed above, it seems more appropriate to check and report such problems on a second reading after an update.

This is part of LC-207

Ah. Part of our Persian-rug strategy ...

Regards, Martin.