- From: Fred Zemke <fred.zemke@oracle.com>
- Date: Tue, 14 Nov 2000 12:38:40 -0800
- To: www-xml-schema-comments@w3.org
Comments on XML Schema Part 2: Datatypes
Candidate recommendation dated 24 October 2000
submitted by fred.zemke@oracle.com

These are comments generated on my first read of the document. I am sorry if some of the comments reflect my ignorance. Any such comments might still be useful as signposts for areas that could be misunderstood.

2.3.1 Canonical lexical representation

The point should be made that the choice of canonical lexical representation is not actually part of the type machinery. For example, there is no statement of how to derive the canonical representation of a derived type from the canonical representation of the underlying types. To do so would be very difficult. For example, I might derive a type from float with a pattern that specifically forbids the canonical representation of float. My pattern might require a leading or trailing 0, for example. In that case, what is the canonical representation of the derived type?

Another example: the order of listing types in a union is significant because the earlier types occlude the later types. It may happen that the canonical representation of an earlier type t1 occludes the canonical representation of a later type t2, yet there exists a noncanonical representation of a t2 value that is not occluded.

Thus the canonical representation is something that the document chooses to call out for each type, but users should not expect that all types will have canonical representations. Also, there is no burden on the user to use canonical representations (when available).

2.4.1.1 Equal

Initially this section presents Equal as a property of a value space. Thus in the penultimate paragraph you present Equal as a predicate of two arguments drawn from the same value space. The last paragraph conflicts with this vision, because the last paragraph assumes that there is a notion of Equal not tied to any single value space.
Obviously such a notion cannot be a property of a value space and hence cannot be a fundamental facet of a value space. I think the answer is that the final sentence is confounding the notion of the facet called equality (which pertains to a single value space) with the mathematical notion of equality (which should be regarded as part of the metalanguage being used to define the type system without actually being part of the type system). Thus the final sentence is trying to make a statement about the mathematical disjointness of value spaces.

Perhaps you can clarify this by inserting something like the following: "The operation Equal can be viewed as a restriction of the mathematical notion of equality to a particular value space. Thus we may also speak about equality or inequality between elements of different value spaces."

As for the last paragraph, consider the types float, double and decimal. These types are not related by restriction, yet I think that you want your metanotion of equality to say that 1.0 the float equals 1.0 the decimal.

2.4.1.2 Order

I point out numerous places where you propose nontotal orderings for types. Perhaps instead of "total order" you mean "partial order".

2.4.1.3 Bounds

Your definitions of bounded above and below are acceptable for finite value spaces but not for value spaces that mathematicians would call open. For example, the open interval between 0 and 1 is bounded in both directions yet has no upper bound or lower bound in your sense. Of course the real number line is not an XML value space, but the value spaces of float and double are very similar. While it is theoretically true that the set of float values less than 1 has an upper bound, namely the largest float value just smaller than 1, it will be very difficult for the user to specify it. You have recognized this issue by providing the minExclusive and maxExclusive facets, probably for this very reason.
But in those definitions you mistakenly say that the attribute value is the upper or lower bound, which it is not (according to your definition). I think the solution is to follow the example of mathematics with definitions such as the following:

[Definition:] If S is a subset of a value space V, then an upper bound of S is a value v in V such that s <= v for all s in S.

[Definition:] If S is a subset of a value space V, then the least upper bound of S is the value v in V such that v is an upper bound of S, and for all upper bounds v2, v <= v2.

[Definition:] If S is a subset of a value space V, then a proper upper bound of S is a value v in V such that s < v for all s in S.

[Definition:] If S is a subset of a value space V, then the least proper upper bound of S is the value v in V such that v is a proper upper bound of S, and for all proper upper bounds v2, v <= v2.

and then in the definition of maxInclusive, reference least upper bound, while in the definition of maxExclusive, reference least proper upper bound.

You will also need to address bounds for types that are not totally ordered (for example, NaN in float and double is not comparable to anything, I believe). I have more to say on this under 2.4.2.7 maxInclusive.

2.4.1.4 Cardinality

The second sentence says that some value spaces are uncountably infinite. While this can occur in mathematics, you have not defined value space as a mathematical object; instead it is "the set of values for a given datatype" and it is demonstrable that all XML datatypes have a countable set of values, since they are all representable by finite strings over a finite character set. The possibility of uncountably infinite value spaces is immediately discounted in the next few paragraphs.

2.4.2.5 enumeration

Are there any plans to enable a user-defined ordering of an enumeration type?
For example, the enumeration "requirements specification design implementation test maintenance" of the life cycle of a software product has a nice order, and people may wish to define restrictions of ordered enumerations by using bounds.

2.4.2.7 maxInclusive, 2.4.2.8 maxExclusive, 2.4.2.9 minInclusive, 2.4.2.10 minExclusive

I can see two ways to go in terms of supporting bounds for partially ordered datatypes. One way is to say that the derived type consists of all values of the source type that are <=, <, >= or > the specified bound (depending on the kind of bound). Thus all values in the derived type will be comparable to the bound. The other way is to say that the derived type consists of those values, and also any values that cannot be compared to the bound. Either kind appears useful. So you could have eight facets instead of four. Or you could introduce one new facet to indicate whether incomparable values are included or excluded. Right now the latter appeals to me, mainly because I can't think of any good names for the eight facets. So I am suggesting a facet, perhaps called includeIncomparables, which might be a boolean, where true means that the incomparables are included and false means that they are not. I have no idea what the default for this facet should be.

2.5.1 Atomic vs. list vs. union

It is not specified how to derive many of the aspects of a derived type from the information supplied by the base type and constraining facets. I am thinking of the following: equality, order, lexical representation, canonical representation. I will consider ways to specify these in each of the three subdivisions of this section.

2.5.1.1 Atomic datatypes

For convenience I discuss restriction derived types here, though they are not mentioned in the heading. It is probably fair to assume that when a type is derived by restriction, then the equality, order and lexical representation are simply the restriction (in the mathematical sense) to the smaller value space.
This could be stated explicitly.

As for canonical representation, note that the restriction may involve a pattern that forbids the canonical representation in the source type. For example, derive a type from float using a pattern that requires a leading or trailing zero. In that case no canonical representation in the source type is available in the derived type. This looks like an insoluble problem to me. I conclude that not all types have a canonical representation.

2.5.1.2 List datatypes

In the case of a list type derivation, I will assume that equality is defined to require that two lists s1 and s2 have the same number of list elements, and that the elements are pair-wise equal in order left to right.

4.1 "Datatype definition" says that lists are unordered. This is acceptable, though it is not hard to induce an ordering on lists if the list element type is ordered.

The lexical representation has been stated somewhere to be a string which is the space-separated concatenation of representations of elements of the base type that do not contain embedded spaces.

You do actually define the canonical representation of list datatypes, in the obvious way. Of course this only works when the item type has a canonical representation.

2.5.1.3 Union datatypes

Union derived type is the trickiest. I conjecture that the sentence "During validation, an element or attribute's value is validated against the participating types in the order in which they appear in the definition until a match is found" is crucial. Thus to define equality, first establish the types of each value. If these types are comparable, then compare them and that gives the answer. If the types are not comparable, then the values are not equal.

To define order for a union type, all the participating types must be comparable (my term, admittedly only partially defined in these comments), in which case the common ordering of these types is used.
If the participating types are not all comparable, then it is simplest to say that the union type is not ordered. This appears to be the verdict in 4.1 "Datatype definitions" in the paragraph "If {variety} is union then...".

As for canonical representation, this will be difficult to define for unions because the first member type may have canonical representations that occlude canonical representations in the second member type. For example, let the first be strings whose pattern is the same as the canonical representation of a float. Then for example 1.1e0 will be found to be of the first type and 0.11e1 will be found to be of the second type. There is no canonical representation for 0.11e1 in this union type. My conclusion is that you should explicitly say that this standard does not define a canonical representation for unions.

3.2.2 boolean

Is this type ordered (perhaps true > false)?

3.2.2.2 constraining facets (of boolean)

If you allow pattern, you might as well allow enumeration, which will provide a more obvious way for people to indicate that only true or only false is permitted.

3.2.3 float

First paragraph, last sentence, appears to be missing the word 'and' in the phrase 'positive negative infinity'.

Your definition would seem to say that 0 = -0. I think you should say this explicitly because the statement that 0 and -0 are "special values" might lead to the conclusion that -0 < 0. Or maybe I'm wrong and -0 < 0, which would still illustrate that you need to make an explicit statement about this.

The ordering is not total. I assume that -INF is the least value and INF is the greatest value, but where does NaN fit in the ordering?

3.2.3.2 Canonical representation (of float)

This does not say whether a plus sign is permitted in the exponent of a canonical representation. Presumably it is forbidden, the same as in the mantissa, but this should be stated explicitly.

I assume that 0E0 is the canonical representation of the special value -0.
This should be stated explicitly.

3.2.5.2 canonical representation (of decimal)

This does not say whether a decimal point is required or forbidden when representing an integer.

3.2.6 timeDuration

The computation of end timeInstant t[s] is not well-defined because the phrase "handling the carry-overs correctly" is not completely specified. For example, what is Jan 30 plus one month?

Ed. Note: you asked for feedback to solve this problem. The solution in the SQL standard is that there are two kinds of durations (SQL calls them interval types), namely year-month and day-time. These two types are regarded as incompatible; they may not be combined into a single type that represents, for example, one month plus one day.

As for the thorny issue of adding years or months to arrive at an invalid date (e.g. Feb 30), SQL declared such operations to be runtime exceptions. This has the virtue of being intellectually pure and the disadvantage that raising exceptions is not user-friendly. Perhaps a better solution would have been to provide functions enabling the user to choose what to do with these cases (round up into the next month, round down to the last valid day of the month, raise an exception, etc.). When writing the OLAP Amendment to SQL:1999, the issue came up again (in a single restricted context) and it was decided that when date arithmetic arrived at an invalid date, the last valid date of the indicated month would be used instead of raising an exception.

3.2.6.1 lexical representation (of timeDuration)

Second para: "The values of Year, ... allow an arbitrary integer." I assume you mean nonnegative integer, since it seems that the sign must prefix the entire string. (We are talking about the lexical representation here. The components that are encoded may be negative, but you must use positive values to denote those negative components, with a single sign in front of the entire string.)

"Similarly the Second component allows an arbitrary decimal."
I think you mean an arbitrary nonnegative decimal.

In the discussion of truncated representations, the second bullet says "The lowest order item may have a decimal fraction". Please clarify how this works. Since this is called a truncated representation, presumably the intent is that it is a shorthand for something that has a full representation. But I don't understand what that would be for P0.3Y or P0Y0.5M. Of course one way to make this work is to allow fractions in all components, not just seconds. In that case though you should permit such things as P0.3Y0.5M. Probably a simpler solution, though, is that a decimal fraction is allowed in the lowest order time component, but not in Year or Month.

No canonical representation is defined. Presumably this can be achieved by saying that each numeric component is put in its canonical representation (after factoring the minus sign, if any, to the front of the entire string).

3.2.7 recurringDuration

The first paragraph is a vast improvement over the previous text that I saw (dated Sept. 22).

Second para, second sentence: "These facets specify the length of the duration..." but length is a word with a reserved meaning in this document. Granted you did not make a hot link back to the definition of length, but still it would be better to avoid this word, or at least clarify it, for example "length (i.e., quantity of time, not lexical length) of the duration".

Second para, third sentence: "The lexical format used to specify these facet values is the lexical format for timeDuration". Does this imply that the datatype of these facets is timeDuration? It seems that the more correct way to specify this would be to say that the datatype of the facets is timeDuration, from which it would follow that their lexical format (as well as the value spaces) would be those of timeDuration.

Next two sentences: "A value of 0 for the facet period ... . A value of 0 for the facet duration...". But 0 is not in the value space of these facets.
Probably what is meant is P0Y (more fully, P0Y0M0DT0H0M0S).

Are negative values for duration and period permitted? I conjecture that they are not, but later I offer possible meanings for them if they are. If they are not, then you can further qualify the datatype of the facets by saying that minInclusive is P0Y0M0DT0H0M0S.

Is it permitted for duration > period > P0Y? For example, a duration of one day recurring every hour. This would result in a set of overlapping 24 hour periods (a 24 hour period beginning at midnight, a 24 hour period beginning at 1 AM, ...).

Are the periods envisioned as open or closed segments of the time line? I guess closed segments because it is clear that you want a single instant as the limiting case when the duration facet goes to P0Y. However people who work with temporal information usually prefer closed-open segments (closed on the lower end, open on the upper) because it is possible to partition the time line into disjoint closed-open intervals, something you cannot do with closed intervals unless you have a granular model of time. Since you have not set a maximum number of places after the decimal point in seconds, it seems that you are avoiding a granular model. Thus you may want to consider specifying closed-open intervals. The downside is that then you will have a discontinuity at duration P0Y, since in set theoretic terms you'd like to define the closed-open interval as {x | origin <= x < origin + duration} which is the empty set when duration = P0Y, rather than a single instant as desired. So I guess that you have closed intervals in mind, mathematically {x | origin <= x <= origin + duration}. Another possibility is you could say that you are not taking a position on whether the end points of intervals are in or out, except when length is P0Y then the endpoints are in.

Something that is unclear is whether the sequence of periods radiates both forwards and backwards in time from the origin, or just forwards.
I think the words "starting from a specific point in time" hint that the direction is forward only. On the other hand, the type time, derived from this type, seems to be thinking of radiating both forward and backward. A problem, though, is that if the set of intervals radiates forward and backward from the origin, then the origin is really just an arbitrary representative of any interval's lower end point; there is no basis for distinguishing any particular origin. For example, if the recurringDuration has period 1 year (P1Y) and the origin is January 1, 2000, then I could equally well pick January 1, 1999 as the origin since both sets of intervals are the same (being January 1 in every year throughout eternity). Thus if you intend that the set of intervals radiates both forwards and backwards in time, I think you should say that there is no order relation unless period is P0Y (i.e., there is only a single instant or closed interval). In that case your order definition makes sense.

At this point I see that there may be a use for negative duration and period values. Negative duration would indicate the "specific point of time" is the upper end point of the interval rather than the lower end point. Negative period would indicate that the set of intervals radiates backwards in time rather than forwards from the origin.

3.2.7.1 Lexical representation (of recurringDuration)

First para: Must each of the fields CC, etc., be exactly two digits? This would seem to be mandatory for CC and YY, otherwise 20 could be interpreted as 0020, 0200 or 2000. What about the others, where punctuation would disambiguate?

You say that more than three digits of fractional seconds precision may be specified. What about less than three digits? If there are no fractions of a second, may the decimal point be omitted?

It is unclear whether there must be a time zone, or can the time zone be entirely omitted? The first paragraph does not even mention a time zone.
Note that the first paragraph, second sentence, says "The lexical representation is..." as if it were giving the complete specification. Then the second paragraph seems to say that you may add a suffix for the time zone. Finally, under canonical representation, you say that a final Z is mandatory.

If the time zone can be omitted, it is unclear whether that means that the datum is zoneless or has a default time zone. To illustrate the difference, consider the statement "I eat lunch at 12 noon". The statement is irrespective of time zone, and means noon at whatever time zone I am in; it does not mean 12 noon UTC when I am in San Francisco. People sometimes refer to this as common time. By the way, the SQL standard has separate types called TIMESTAMP WITH TIME ZONE and TIMESTAMP WITHOUT TIME ZONE, the latter to represent common time.

Last para, last sentence: "other derived datatypes date, time, timePeriod and recurringDate use truncated versions of this lexical representation". Wait a minute! Since when do derived types get to have a lexical representation that would be unintelligible in the source type? How do I as a user do this with my own user-defined types? I have not seen any machinery for that. If there is no machinery for the user to define an alternative representation for a derived type, then I do not think that these are really derived types. To me, a derived type means that I just compile a schema definition and the derived type works right out of the box. Later I propose an incomplete solution, which I call transform types.

3.2.7.2 Canonical representation (of recurringDuration)

If you allow days, months, hours, minutes or seconds to be single digits, you must state whether the canonical representation is two digits with a leading zero if necessary.

3.3.11.1 lexical representation (of integer)

May an integer be represented with a final period? Since this type is the restriction of decimal with scale = 0, it would seem to allow both "1" and "1."
to represent the same value. Yet this text seems to disallow the final period. If that is the intent, then the integer type also has a pattern constraint to prohibit a final period.

3.3.25 time

It is unclear what are your intentions for ordering this type. The first sentence says "time represents an instant of time that recurs every day". Since your lexical representation does not specify a concrete origin on the time line (i.e., what day the time instant is in), it seems that you intend a value of type time to represent the set of time instants, spaced periodically every 24 hours, both forwards and backwards in time. In that case there is no inherent reason for calling one time before or after another. For example, is 23:00 before or after 01:00? If you go by the digits, you might say 23:00 > 01:00. But from another perspective, 23:00 is just 2 hours short of 01:00, and from that perspective you would think 23:00 < 01:00. As for 00:00 vs 12:00, these are equally spaced on the dial and I don't see a reason to call either one before or after the other.

3.3.25.1 lexical representation (of time)

There appears to be some magic going on here because the lexical representations that are proposed here are not valid representations of recurringDuration. This violates the presumption that built-in derived types are defined with precisely the same machinery that is available to user-defined derived types. When I look at the definition of the type in appendix A "Schema for Datatype Definitions (normative)" there are no attributes or elements that would explain the change in lexical representation. Thus it is not true that an implementation can define the primitive built-in types and simply compile the definition of the derived built-in types.

On the positive side, this may indicate the way to a useful extension of XML Schema. What appears to be needed is a way to define a derived type by means of a transform of the content to the lexical representation of another type.
In the case of time, the transformation is to prefix the year, month and day (say prefix 0000-01-01T). More generally, I think you should allow transformations that are defined as a concatenation of literals and substrings of the content, using a regular expression to analyze the content into substrings.

Here is a straw man proposal: A transform type involves three facets, for the base type, a pattern, and a transform. The pattern is analyzed as a regular expression. The syntax for regular expressions will need to be enhanced with one more metacharacter that serves as a separator to divide the pattern into substrings, for example, ! . This metacharacter will not have any meaning aside from its use in the pattern facet of transform types.

The transform facet is somewhat like a regular expression, in that it has one metacharacter. I will use & as the metacharacter, but the choice is arbitrary. The sequence &1& represents the first parameter, &2& represents the second, etc., and && represents a single literal ampersand.

Before giving an algorithm I will give an example. Suppose you want a type that allows the user to represent dates in the US fashion (month slash day slash year). The definition might be

  <simpleType name="USDate">
    <transformation base="date"
                    pattern="\p{N}+!/!\p{N}+!/!\p{N}+"
                    transform="&5&-&1&-&3&"/>
  </simpleType>

In this example, the pattern facet says that the content must consist of a string of digits, a slash, another string of digits, another slash, and a final string of digits. These five components are called respectively &1& for the first string of digits, &2& for the first slash, &3& for the second string of digits, &4& for the second slash, and &5& for the final string of digits. For example, given <USDate>10/22/1999</USDate>, &1& = 10, &2& = /, &3& = 22, &4& = /, &5& = 1999. The transform facet says to assemble a new string from these parameters. Thus <USDate>10/22/1999</USDate> is reassembled as <date>1999-10-22</date>.
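
As a rough illustration of the straw man above, here is a minimal Python sketch of the USDate example. It is not part of the proposal and makes several assumptions: the function name apply_transform is hypothetical; digits are written [0-9]+ because Python's re module does not support the \p{N} property escape; every "!" in the pattern is treated as a separator (no escape is provided for a literal "!"); and when a string can be divided among the subpatterns in more than one way, the choice is simply left to Python's regex engine.

```python
import re


def apply_transform(pattern: str, transform: str, content: str) -> str:
    """Hypothetical helper: rewrite content using a '!'-separated pattern
    facet and a transform facet with &n& parameters."""
    # Split the pattern facet at the separator metacharacter and give each
    # subpattern its own named group, so that parentheses inside a
    # subpattern cannot shift the parameter numbering.
    subpatterns = pattern.split("!")
    combined = "".join(
        f"(?P<s{i}>{sub})" for i, sub in enumerate(subpatterns, start=1)
    )

    # The whole content must match the concatenated subpatterns.
    match = re.fullmatch(combined, content)
    if match is None:
        raise ValueError(f"{content!r} does not match pattern {pattern!r}")

    def substitute(ref: re.Match) -> str:
        # '&&' stands for one literal ampersand; '&n&' for the n-th substring.
        if ref.group(1) is None:
            return "&"
        return match.group(f"s{ref.group(1)}")

    return re.sub(r"&(\d+)&|&&", substitute, transform)


# The USDate example: month/day/year rewritten into the date format.
us_date_pattern = "[0-9]+!/![0-9]+!/![0-9]+"
print(apply_transform(us_date_pattern, "&5&-&1&-&3&", "10/22/1999"))
```

With these assumptions the call prints 1999-10-22, matching the reassembly described for <USDate>10/22/1999</USDate>. Content that does not fit the pattern, such as 10-22-1999, raises ValueError, which corresponds to the content being invalid for the transform type.
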
The algorithm to transform the content is as follows:

0. Let p be the pattern facet, t the transform facet and c the content to be transformed.

1. The pattern facet is split into n substrings at each occurrence of the separator metacharacter (!), forming subpatterns p1, p2, ..., pn.

2. Each subpattern must be a valid regular expression.

3. Assemble a new regular expression q = (p1)(p2)...(pn). Note that q does not have any instance of the separator metacharacter and so it is a conventional regular expression. The role of the parentheses is to respect the boundaries of the subpatterns so that the concatenation does not produce a regular expression that was not intended.

4. The content c must match q treated as a regular expression.

5. Let c1 be the shortest initial substring of the content c such that c1 matches p1 and the remainder d1 matches (p2)...(pn). By remainder I mean the portion of c that is left after deleting the initial substring c1.

6. Let c2 be the shortest initial substring of d1 such that c2 matches p2 and the remainder d2 matches (p3)...(pn).

7. In general, repeat step 6 iteratively until the content c has been analyzed into substrings c1, c2, ..., cn such that c = c1c2c3...cn, c1 matches p1, c2 matches p2, ... cn matches pn. (By the way, you cannot simply take step 7 as the definition of substrings c1, c2, ... cn, because in general there can be more than one way to parse a string into substrings to match a regular expression. The algorithm presented works left to right within the string, identifying the first acceptable substring to match each subpattern.)

8. In the transform facet, each occurrence of &1& is replaced by c1, each occurrence of &2& by c2, etc.

4.1 Datatype definition

Second paragraph under the box, last sentence: "If {variety} is union then the value space of the datatype defined will be the union of the value spaces of each datatype in {base type definition}."
More precisely, it will be some subset of the union of the value spaces of the participating datatypes. The reason is that the lexical representation of a value in one datatype may be the same as the only lexical representation in another datatype. For example, consider a union of integer and string in that order. Then "1" will be the lexical representation of an integer and not a string. Consequently there is no lexical representation for the string "1" in the union value space. By the definition in 2.2, every element of a value space must have a lexical representation. Since the string "1" has no lexical representation, it cannot be regarded as part of the value space.

Sixth paragraph: "If {variety} is atomic then bounded is true and cardinality is finite if ...". This is a confusing sentence construction because it is not immediately clear what is the precedence or scope of each logical operation. Using parentheses, some of the choices are:

a) "(If {variety} is atomic then bounded is true) and (cardinality is finite if...)"

b) "If ({variety} is atomic) then ((bounded is true) and (cardinality is finite if...))"

c) "If ({variety} is atomic) then ((bounded is true and cardinality is finite) if ...)"

As a start on avoiding this problem, never use "trailing if" as in "it is warm if the sun shines"; always use "leading if" and explicit "then" as in "If the sun shines, then it is warm".

Anyway, I think what you are trying to say in the sixth paragraph is "If {variety} is atomic then: if one of minInclusive or minExclusive and one of maxInclusive or maxExclusive are among {facets}, then bounded is true and cardinality is finite. Otherwise, if {variety} is atomic, then bounded is false and cardinality is countably infinite. If {variety} is atomic, numeric and ordered are inherited from {base type definition}."

I raise the following counterexamples to these rules:

1. boolean is finite but evidently unbounded since no ordering has been proposed for boolean.

2. For float and double types, what about using INF, -INF or NaN as bounds?

3. float and double have finite cardinality, even when there are no bounds. Thus these types and their derivatives are all bounded.

4. decimal is bounded and has finite cardinality if a precision is specified, even if no bounds are specified.

5. Some, but not all, patterns will restrict any type to finite cardinality.

6. string and binary types restricted to a specified length or maxLength are finite cardinality though not bounded since there is no ordering.

Seventh para, "If {variety} is list then...". This paragraph could also stand to be rearranged to bring all the if's to the front. I think the paragraph is trying to say the following: "If {variety} is list then if length or both minLength and maxLength are among {facets} then bounded is true and cardinality is finite. Otherwise if {variety} is list then bounded is false and cardinality is countably infinite. If {variety} is list, then numeric and ordered are false."

Assuming that is what you are trying to say, here are some counterexamples:

1. You don't need both minLength and maxLength in order to conclude that bounded is true: minLength="0" can be regarded as implicit.

2. Even if bounded is true, you can still have a countably infinite value space because the bound is only counting the number of items in the list; it is not bounding the values of those list items. For example, there are countably infinitely many lists of string with list length 1.

3. It is possible to define a type so restrictive that it is empty. In that case there are no lists of that type, so that even if the list length is unbounded, it does not matter; the value space is still empty and not countably infinite. However you are basically right that if a base type is nonempty then unbounded strings over that type will be countably infinite.

4. But don't forget that list types can have the enumeration facet, which I believe will force the value space to be finite.

E. Regular expressions

First table, under definition of regular expression. Is it really intended that the phrase "(empty string)" is the pattern for the empty string? In that case the phrase "empty string" must be recognized as a metacharacter and an escape sequence must be provided for it. But I hope what you mean is that () is the regular expression for an empty string.

The definition of metacharacter missed |.
Received on Tuesday, 14 November 2000 15:39:26 UTC