
comments on XML Schema Part 2 Datatypes

From: Fred Zemke <fred.zemke@oracle.com>
Date: Tue, 14 Nov 2000 12:38:40 -0800
Message-ID: <3A11A2CE.209346C5@oracle.com>
To: www-xml-schema-comments@w3.org
Comments on XML Schema Part 2: Datatypes
Candidate recommendation dated 24 October 2000
submitted by fred.zemke@oracle.com

These are comments generated on my first read of the document.
I am sorry if some of the comments reflect my ignorance.
Any such comments might still be useful as signposts for areas
that could be misunderstood.

2.3.1 Canonical lexical representation
The point should be made that the choice of canonical lexical
representation is not actually part of the type machinery.
For example, there is no statement of how to derive the canonical
representation of a derived type from the canonical representation
of the underlying types.  To do so would be very difficult.
For example, I might derive a type from float with a pattern that
specifically forbids the canonical representation of float.  My pattern
might require a leading or trailing 0, for example.  In that case, what
is the canonical representation of the derived type?  Another example:
the order of listing types in a union is significant because the earlier
types occlude the later types.  It may happen that the canonical
representation of an earlier type t1 occludes the canonical representation
of a later type t2, yet there exists a noncanonical representation of a t2
value that is not occluded.  Thus the canonical representation is something
that the document chooses to call out for each type, but users should not
expect that all types will have canonical representations.  Also, there is
no burden on the user to use canonical representations (when available).

Equal
Initially this section presents Equal as a property of a value space.
Thus in the penultimate paragraph you present Equal as a predicate
of two arguments drawn from the same value space.
The last paragraph conflicts with this vision, because the last
paragraph assumes that there is a notion of Equal not tied to any
single value space.  Obviously such a notion cannot be a property of
a value space and hence cannot be a fundamental facet of a value space.

I think the answer is that the final sentence is confounding the notion
of the facet called equality (which pertains to a single value space)
with the mathematical notion of equality (which should be regarded as
part of the metalanguage being used to define the type system without
actually being part of the type system).  Thus the final sentence is
trying to make a statement about the mathematical disjointness of value
spaces.

Perhaps you can clarify this by inserting something like the following:
"The operation Equal can be viewed as a restriction of the mathematical
notion of equality to a particular value space.  Thus we may also speak
about equality or inequality between elements of different value
spaces."

As for the last paragraph, consider the types float, double and
decimal.  These types are not related by restriction, yet I think that
you want your metanotion of equality to say that 1.0 the float equals
1.0 decimal.

Order
I point out numerous places where you propose nontotal orderings for
types.  Perhaps instead of "total order" you mean "partial order".

Bounds
Your definitions of bounded above and below are acceptable for finite
value spaces but not for value spaces that mathematicians would call
open.  For example, the open interval between 0 and 1 is bounded in
both directions yet has no upper bound or lower bound in your sense.
Of course the real number line is not an XML value space, but the
value spaces of float and double are very similar.  While it is
theoretically true that the set of float values less than 1 has an
upper bound, namely the largest float value just smaller than 1,
it will be very difficult for the user to specify it.
You have recognized this issue by providing the minExclusive and
maxExclusive facets, probably for this very reason.  But in those
definitions you mistakenly say that the attribute value is the upper or
lower bound, which it is not (according to your definition).
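
A minimal sketch of the difficulty, assuming Python's float (an IEEE
double) as a stand-in for the XML Schema double type and math.nextafter
(Python 3.9 or later) as the way to step to a neighbouring value.  Neither
assumption comes from the draft; the point is only that the inclusive bound
of "all values less than 1" exists but is not something a user would
naturally write down.

```python
import math

# Largest double strictly less than 1.0.  This is the value a user would
# have to supply as an inclusive upper bound to express "x < 1".
largest_below_one = math.nextafter(1.0, 0.0)

# The exclusive bound is simply 1.0, which is not itself in the set.
exclusive_bound = 1.0

print(repr(largest_below_one), "<", exclusive_bound)
```

With an exclusive facet the user writes 1.0; with an inclusive facet the
user has to know the neighbouring representable value.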

I think the solution is to follow the example of mathematics with
definitions such as the following:

[Definition:] If S is a subset of a value space V, then an upper bound
of S is a value v in V such that s <= v for all s in S.

[Definition:] If S is a subset of a value space V, then the least upper
bound of S is the value v in V such that v is an upper bound of S,
and for all upper bounds v2, v <= v2.

[Definition:] If S is a subset of a value space V, then a proper upper
bound of S is a value v in V such that s < v for all s in S.

[Definition:] If S is a subset of a value space V, then the least proper
upper bound of S is the value v in V such that v is a proper upper bound
of S, and for all proper upper bounds v2, v <= v2.

and then in the definition of maxInclusive, reference least upper bound,
while in the definition of maxExclusive, reference least proper upper
bound.

You will also need to address bounds for types that are not totally
ordered (for example, NaN in float and double is not comparable to
anything, I believe).  I have more to say on this under
maxInclusive.

Cardinality
The second sentence says that some value spaces are uncountably
infinite.  While this can occur in mathematics, you have not defined
value space as a mathematical object; instead it is "the set of values
for a given datatype" and it is demonstrable that all XML datatypes
have a countable set of values, since they are all representable by
finite strings over a finite character set.  The possibility
of uncountably infinite value spaces is immediately discounted in the
next few paragraphs.

enumeration
Are there any plans to enable a user-defined ordering of an enumeration
type?  For example, the enumeration "requirements specification design
implementation test maintenance" of the life cycle of a software
product has a nice order, and people may wish to define restrictions
of ordered enumerations by using bounds.

maxInclusive, maxExclusive, minInclusive, minExclusive
I can see two ways to go in terms of supporting bounds for partially
ordered datatypes.  One way is to say that the derived type consists
of all values of the source type that are <=, <, >= or > the specified
bound (depending on the kind of bound).  Thus all values in the
derived type will be comparable to the bound.  The other way is to
say that the derived type consists of those values, and also any
values that can not be compared to the bound.  Either kind appears
useful.  So you could have eight facets instead of four.  Or you
could introduce one new facet to indicate whether incomparable values
are included or excluded.  Right now the latter appeals to me, mainly
because I can't think of any good names for the eight facets.
So I am suggesting a facet, perhaps called includeIncomparables, which
might be a boolean, where true means that the incomparables are included
and false means that they are not.  I have no idea what the default
for this facet should be.
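
The two readings of a bound on a partially ordered type can be sketched as
follows.  This is only an illustration in Python: it assumes IEEE doubles
with NaN as the sole incomparable value, and the function name
restrict_max_inclusive and its include_incomparables parameter are
hypothetical, modelled on the facet suggested above.

```python
import math


def restrict_max_inclusive(values, bound, include_incomparables):
    """Keep values <= bound; optionally also keep values incomparable to it."""
    kept = []
    for v in values:
        # Assumption: NaN is the only value that cannot be compared.
        comparable = not (math.isnan(v) or math.isnan(bound))
        if comparable and v <= bound:
            kept.append(v)
        elif not comparable and include_incomparables:
            kept.append(v)
    return kept


sample = [-math.inf, 0.0, 1.0, 2.5, math.inf, math.nan]
strict = restrict_max_inclusive(sample, 1.0, include_incomparables=False)
loose = restrict_max_inclusive(sample, 1.0, include_incomparables=True)
print(strict)
print(loose)
```

The strict reading drops NaN from the derived type; the loose reading keeps
it alongside the values that are at or below the bound.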

2.5.1 Atomic vs. list vs. union
It is not specified how to derive many of the aspects of a derived type
from the information supplied by the base type and constraining facets.
I am thinking of the following: equality, order, lexical representation,
canonical representation.  I will consider ways to specify these in
each of the three subdivisions of this section.

Atomic datatypes
For convenience I discuss restriction derived types here, though they
are not mentioned in the heading.  It is probably fair to assume that
when a type is derived by restriction, then the equality, order
and lexical representation are simply the restriction (in the
mathematical sense) to the smaller value space.  This could be stated
explicitly.

As for canonical representation, note that the restriction may involve a
pattern that forbids the canonical representation in the source type.
For example, derive a type from float using a pattern that requires a
leading or trailing zero.  In that case no canonical representation
in the source type is available in the derived type.  This looks like
an insoluble problem to me.  I conclude that not all types have a
canonical representation.

List datatypes
In the case of a list type derivation, I will assume that equality is
defined to require that two lists s1 and s2 have the same number of
list elements, and that the elements are pair-wise equal in order left
to right.

4.1 "Datatype definition" says that lists are unordered.
This is acceptable, though it is not hard to induce an ordering on lists
if the list element type is ordered.

The lexical representation has been stated somewhere to be
a string which is the space-separated concatenation of representations
of elements of the base type that do not contain embedded spaces.

You do actually define the canonical representation of list datatypes,
in the obvious way.   Of course this only works when the item type has
a canonical representation.

Union datatypes
Union derived type is the trickiest.  I conjecture that the sentence
"During validation, an element or attribute's value is validated against
the participating types in the order in which they appear in the
definition until a match is found" is crucial.

Thus to define equality, first establish the
type of each value.  If these types are comparable, then compare the
values and that gives the answer.  If the types are not comparable, then
the values are not equal.

To define order for a union type, all the participating types must be
comparable (my term, admittedly only partially defined in these
comments), in which case the common ordering of these types is used.
If the participating types are not all comparable, then it is simplest
to say that the union type is not ordered.  This appears to be the
intent in 4.1 "Datatype definitions" in the paragraph "If {variety} is
union ...".

As for canonical representation, this will be difficult to define for
unions because the first member type may have canonical representations
that occlude canonical representations in the second member type.  For
example, let the first be strings whose pattern is the same as the
canonical representation of a float, and let the second be float.  Then
for example 1.1e0 will be found to be of the first type and 0.11e1 will
be found to be of the second type.
There is no canonical representation for 0.11e1 in this union type.
My conclusion is that you should explicitly say that this standard does
not define a canonical representation for unions.

3.2.2 boolean
Is this type ordered (perhaps true > false)?

constraining facets (of boolean)
If you allow pattern, you might as well allow enumeration,
which will provide a more obvious way for people to indicate that only
true or only false is permitted.

3.2.3 float
first paragraph, last sentence, appears to be missing the word 'and'
in the phrase 'positive negative infinity'.

Your definition would seem to say that 0 = -0.  I think you should say
this explicitly because the statement that 0 and -0 are "special values"
might lead to the conclusion that -0 < 0.  Or maybe I'm wrong and -0 < 0,
which would still illustrate that you need to make an explicit statement
about this.

The ordering is not total.
I assume that -INF is the least value and INF is the greatest
value, but where does NaN fit in the ordering?

Canonical representation (of float)
This does not say whether a plus sign is permitted in the exponent of
a canonical representation.  Presumably it is forbidden, the same as
in the mantissa, but this should be stated explicitly.

I assume that 0E0 is the canonical representation of the special value
0.  This should be stated explicitly.

canonical representation (of decimal)
This does not say whether a decimal point is required or forbidden
when representing an integer.

3.2.6 timeDuration
The computation of end timeInstant t[s] is not well-defined because
the phrase "handling the carry-overs correctly" is not completely
specified.  For example, what is Jan 30 plus one month?
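
The ambiguity can be made concrete with a small Python sketch.  The
function add_months and its on_invalid parameter are hypothetical names
introduced only for illustration; the two policies shown (reject the
result, or fall back to the last valid day of the month) are just two of
the choices a specification could make.

```python
import calendar
from datetime import date


def add_months(d, months, on_invalid="raise"):
    """Add whole months to a date; on_invalid is 'raise' or 'clamp'."""
    # Work with a zero-based month index so divmod handles year carry-over.
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    if d.day <= last_day:
        return date(year, month, d.day)
    if on_invalid == "clamp":
        # Fall back to the last valid day of the target month.
        return date(year, month, last_day)
    raise ValueError(f"{year:04d}-{month:02d}-{d.day:02d} is not a valid date")


print(add_months(date(2001, 1, 30), 1, on_invalid="clamp"))
```

Calling add_months(date(2001, 1, 30), 1) with the default policy raises
ValueError instead, which is exactly the kind of decision that "handling
the carry-overs correctly" leaves open.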

Ed. Note: you asked for feedback to solve this problem.  The solution
in the SQL standard is that there are two kinds of durations (SQL
calls them interval types), namely year-month and day-time.  These two
types are regarded as incompatible; they may not be combined into a
single type that represents, for example, one month plus one day.
As for the thorny issue of adding years or months to arrive at an
invalid date (e.g. Feb 30) SQL declared such operations to be runtime
exceptions.  This has the virtue of being intellectually pure and
the disadvantage that raising exceptions is not user-friendly.
Perhaps a better solution would have been to provide functions
enabling the user to choose what to do with these cases (round up into
the next month, round down to the last valid day of the month,
raise an exception, etc.)
When writing the OLAP Amendment to SQL:1999, the issue came up again
(in a single restricted context) and it was decided that when
date arithmetic arrived at an invalid date, use the last valid date
of the indicated month instead of raising an exception.

lexical representation (of timeDuration)
second para: "The values of Year, ... allow an arbitrary integer."
I assume you mean nonnegative integer, since it seems that the sign
must prefix the entire string.  (We are talking about the lexical
representation here.  The components that are encoded may be negative,
but you must use positive values to denote those negative components,
with a single sign in front of the entire string.)

"Similarly the Second component allows an arbitrary decimal."  I think
you mean an arbitrary nonnegative decimal.

In the discussion of truncated representations, the second bullet says
"The lowest order item may have a decimal fraction".  Please clarify
how this works.  Since this is called a truncated representation,
presumably the intent is that it is a shorthand for something that
has a full representation.  But I don't understand what that would be
for P0.3Y or P0Y0.5M.  Of course one way to make this work is to allow
fractions in all components, not just seconds.  In that case though you
permit such things as P0.3Y0.5M.  Probably a simpler solution, though,
is that a decimal fraction is allowed in the lowest order time component
but not in Year or Month.

No canonical representation is defined.  Presumably this can be achieved
by saying that each numeric component is put in its canonical
representation (after factoring the minus sign to the front of the entire
string if any).

3.2.7 recurringDuration
The first paragraph is a vast improvement over the previous text that
I saw (dated Sept. 22).

Second para, second sentence: "These facets specify the length of
the duration..." but length is a word with a reserved meaning in this
document.  Granted you did not make a hot link back to the definition
of length, but still it would be better to avoid this word, or
at least clarify it, for example "length (i.e., quantity of time,
not lexical length) of the duration"

Second para, third sentence: "The lexical format used to specify these
facet values is the lexical format for timeDuration".  Does this imply
that the datatype of these facets is timeDuration?  It seems that the
more correct way to specify this would be to say that the datatype of
the facets is timeDuration, from which it would follow that their
format (as well as the value spaces) would be those of timeDuration.

Next two sentences: "A value of 0 for the facet period ... .
A value of 0 for the facet duration...".  But 0 is not in the value
space of these facets.  Probably what is meant is P0Y (more fully,
P0Y0M0DT0H0M0S).

Are negative values for duration and period permitted?  I conjecture
that they are not, but later I offer possible meanings for them if
they are.  If they are not, then you can further qualify the datatype
of the facets by saying that minInclusive is P0Y0M0DT0H0M0S.

Is it permitted for duration > period > P0Y?  For example,
a duration of one day recurring every hour.  This would result in
a set of overlapping 24 hour periods (a 24 hour period beginning at
midnight, a 24 hour period beginning at 1 AM, ...).

Are the periods envisioned as open or closed segments of the time line?
I guess closed segments because it is clear that you want
a single instant as the limiting case when the duration facet goes to
zero.  However people who work with temporal information usually prefer
closed-open segments (closed on the lower end, open on the upper) because
it is possible to partition the time line into disjoint closed-open
intervals, something you cannot do with closed intervals unless you have
a granular model of time.  Since you have not set a maximum number of
places after the decimal point in seconds, it seems that you are avoiding
a granular model.  Thus you may want to consider specifying closed-open
intervals.  The problem is that then you will have a discontinuity at
duration P0Y, since in set theoretic terms you'd like to define the
closed-open interval as {x | origin <= x < origin + duration} which is
the empty set when duration = P0Y, rather than a single instant as desired.
So I guess that you have closed intervals in mind,
mathematically {x | origin <= x <= origin + duration}.  Another option
is that you could say that you are not taking a position on whether the
end points of intervals are in or out, except when length is P0Y then the
endpoints are in.
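
The trade-off between the two set definitions can be sketched in Python.
Instants and durations are modelled here as plain numbers of seconds, which
is an assumption made only to keep the illustration short; the function
names are hypothetical.

```python
def in_closed(x, origin, duration):
    """Membership in {x | origin <= x <= origin + duration}."""
    return origin <= x <= origin + duration


def in_closed_open(x, origin, duration):
    """Membership in {x | origin <= x < origin + duration}."""
    return origin <= x < origin + duration


# Two back-to-back one-minute periods starting at 0 and at 60.
boundary = 60
closed_hits = [in_closed(boundary, o, 60) for o in (0, 60)]
closed_open_hits = [in_closed_open(boundary, o, 60) for o in (0, 60)]

# Zero duration: the closed interval is a single instant, while the
# closed-open interval contains nothing at all.
zero_closed = in_closed(0, 0, 0)
zero_closed_open = in_closed_open(0, 0, 0)

print(closed_hits, closed_open_hits, zero_closed, zero_closed_open)
```

With closed intervals the boundary instant belongs to both periods, so the
periods do not partition the time line; with closed-open intervals it
belongs to exactly one, but the zero-duration case becomes empty.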

Something that is unclear is whether the sequence of periods
radiates both forwards and backwards in time from the origin, or just
forwards.   I think the words "starting from a specific point in time"
imply that the direction is forward only.  On the other hand, the type
time, derived from this type, seems to be thinking of radiating both
forward and backward.

A problem, though, is that if the set of intervals radiates forward and
backward from the origin, then the origin is really just an arbitrary
representative of any interval's lower end point; there is no basis for
distinguishing any particular origin.  For example, if the type
has period 1 year (P1Y) and the origin is January 1, 2000, then I could
equally well pick January 1, 1999 as the origin since both sets of
intervals are the same (being January 1 in every year throughout
time).  Thus if you intend that the set of intervals radiates both
forwards and backwards in time, I think you should say that there is no
order unless period is P0Y (i.e., there is only a single instant or closed
interval).  In that case your order definition makes sense.

At this point I see that there may be a use for negative duration and
period values.  Negative duration would indicate that the "specific point
of time" is the upper end point of the interval rather than the lower end
point.  Negative period would indicate that the set of intervals radiates
backwards in time rather than forwards from the origin.

Lexical representation (of recurringDuration)
First para: Must each of the fields CC, etc., be exactly two digits?
This would seem to be mandatory for CC and YY, otherwise 20 could be
interpreted as 0020, 0200 or 2000.  What about the others, where
punctuation would disambiguate?

you say that more than three digits of fractional seconds
precision may be specified.  What about less than three digits?  If
there are no fractions of a second, may the decimal point be omitted?

It is unclear whether there must be a time zone, or can the time zone
be entirely omitted?  The first paragraph does not even mention a time
zone.  Note that the first paragraph second sentence says
"The lexical representation is..." as if it were giving the complete
specification.  Then the second paragraph seems to say that you may
add a suffix for the time zone.  Finally,
under canonical representation, you say that a final Z is mandatory.

If the time zone can be omitted, it is unclear whether that means
that the datum is zoneless or has a default time zone.  To illustrate
the difference, consider the statement "I eat lunch at 12 noon".
The statement is irrespective of time zone, and means noon at whatever
time zone I am in; it does not mean 12 noon UTC when I am in San
People sometimes refer to this as common time.

By the way, the SQL standard has separate types called
to represent common time.

last para, last sentence: "other derived datatypes date, time,
timePeriod and recurringDate use truncated versions of this lexical
representation".
Wait a minute! Since when do derived types get to have a lexical
representation that would be unintelligible in the source type?  How do
I as a user do this with my own user-defined types?  I have not seen any
machinery
for that.  If there is no machinery for the user to define an
representation for a derived type, then I do not think that these are
really derived types.  To me, a derived type means that
I just compile a schema definition and the derived type works right out
of the box.  Later I propose an incomplete solution,
which I call transform types.

Canonical representation (of recurringDuration)
If you allow days, months, hours, minutes or seconds to be single
digits, you must state whether the canonical representation is two digits
with a leading zero if necessary.

lexical representation (of integer)
May an integer be represented with a final period?  Since this type is
a restriction of decimal with scale = 0, it
would seem to allow both "1" and "1." to represent the same value.  Yet
the text seems to disallow the final period.  If that is the intent, then
the integer type also has a pattern constraint to prohibit a final
period.

3.3.25 time
It is unclear what are your intentions for ordering this type.
The first sentence says "time represents an instant of time that recurs
every day".  Since your lexical representation does not specify a
concrete origin on the time line (ie, what day the time instant is in),
it seems that you intend a value of
type time to represent the set of time instants, spaced periodically
every 24 hours, both forwards and backwards in time.
In that case there is no inherent reason for calling one time before or
after another.  For example, is 23:00 before or after 01:00?  If you
go by the digits, you might say 23:00 > 01:00.  But from another
perspective, 23:00 is just 2 hours short of 01:00, and from that
perspective you would think 23:00 < 01:00.  As for 00:00 vs 12:00, these
are equally spaced on the dial and I don't see a reason to call either one
before or after the other.

lexical representation (of time)
There appears to be some magic going on here because the lexical
representations that are proposed here are not valid representations
of recurringDuration.  This violates the presumption that built-in
derived types are defined with precisely the same machinery that is
available to user-defined derived types.  When I look at the definition
of the type in appendix A "Schema for Datatype Definitions (normative)"
there are no attributes or elements that would explain the change
in lexical representation.  Thus it is not true that an implementation
can define the primitive built-in types and simply compile the
definitions of the derived built-in types.

On the positive side, this may indicate the way to a useful extension
of XML Schema.  What appears to be needed is a way to define a derived
type by means of a transform of the content to the lexical
representation of another type.  In the case of time, the
transformation is to prefix the year month and day (say prefix
More generally, I think you should allow transformations that are
expressed as a concatenation of literals and substrings of the content,
using a regular expression to analyze the content into substrings.

Here is a straw man proposal: A transform type involves three facets,
the base type, a pattern, and a transform.

The pattern is analyzed
as a regular expression.  The syntax for regular expressions will need
to be enhanced with one more metacharacter that serves as a separator to
divide the pattern into substrings, for example, ! .  This metacharacter
will not have any meaning aside from its use in the pattern facet of
transform types.

The transform facet is somewhat like a regular expression, in that it
has one metacharacter.  I will use & as the metacharacter, but the
choice is arbitrary.  The sequence &1& represents the first parameter, &2&
represents the second, etc., and && represents a single literal &.

Before giving an algorithm I will give an example.  Suppose you want a
type that allows the user to represent dates in the US fashion
(month slash day slash year).  The definition might be

<simpleType name="USDate">
  <transformation base="date"

In this example, the pattern facet says that the content must consist of
a string of digits, a slash, another string of digits, another slash,
and a final string of digits.  These five components are called
respectively &1& for the first string of digits, &2& for the first slash,
&3& for the second string of digits, &4& for the second slash, and
&5& for the final string of digits.  For example, given
<USDate>10/22/1999</USDate>, &1& = 10, &2& = /, &3& = 22, &4& = /, &5& =
1999.
The transform facet says to assemble a new string from these parameters.

Thus <USDate>10/22/1999</USDate> is reassembled as

The algorithm to transform the content is as follows:
0. Let p be the pattern facet, t the transform facet and c the content
to be transformed.
1. The pattern facet is split into n substrings at each occurrence of
the separator metacharacter (!), forming subpatterns p1, p2, ..., pn.
2. Each subpattern must be a valid regular expression.
3. Assemble a new regular expression q = (p1)(p2)...(pn).  Note that
q does not have any instance of the separator metacharacter and
so it is a conventional regular expression.  The role of the parentheses
is to respect the boundaries of the subpatterns so that the
concatenation does not produce a regular expression that was not intended.
4. The content c must match q treated as a regular expression.
5. Let c1 be the shortest initial substring of the content c such that
c1 matches p1 and the remainder d1 matches (p2)...(pn).
By remainder I mean the portion of c that is left after deleting the
initial substring c1.
6. Let c2 be the shortest initial substring of d1 such that c2 matches
p2 and the remainder d2 matches (p3)...(pn).
7. In general, repeat step 6 iteratively until the content c has been
analyzed into substrings c1, c2, ..., cn such that c = c1c2c3...cn,
c1 matches p1, c2 matches p2, ... cn matches pn.
(By the way, you can not simply take step 7 as the definition of
substrings c1, c2, ... cn, because in general there can be more than one
way to parse a string into substrings to match a regular expression.
The algorithm presented works left to right within the string, taking
the first acceptable substring to match each subpattern.)
8. In the transform facet, each occurrence of &1& is replaced by c1,
each occurrence of &2& by c2, etc.
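
The steps above can be sketched in Python as follows.  Several things here
are assumptions made for illustration and are not part of the proposal:
Python's re syntax stands in for XML Schema regular expressions, a literal
! cannot appear inside a subpattern, and the pattern and transform strings
in the example are hypothetical values chosen to fit the description of the
US date example (digits, slash, digits, slash, digits).

```python
import re

SEPARATOR = "!"  # separator metacharacter suggested in the text


def split_content(pattern, content):
    """Split content into c1..cn, one piece per subpattern, shortest first."""
    subpatterns = pattern.split(SEPARATOR)
    pieces = []
    rest = content
    for i, sub in enumerate(subpatterns):
        # Regular expression for everything after the current subpattern.
        tail = "".join(f"(?:{p})" for p in subpatterns[i + 1:])
        for end in range(len(rest) + 1):
            head, remainder = rest[:end], rest[end:]
            if re.fullmatch(sub, head) and re.fullmatch(tail, remainder):
                pieces.append(head)
                rest = remainder
                break
        else:
            raise ValueError("content does not match the pattern facet")
    return pieces


def apply_transform(transform, pieces):
    """Replace &k& by the k-th piece and && by a literal ampersand."""
    def substitute(match):
        digits = match.group(1)
        return "&" if digits == "" else pieces[int(digits) - 1]
    return re.sub(r"&([0-9]*)&", substitute, transform)


def transform_content(pattern, transform, content):
    return apply_transform(transform, split_content(pattern, content))


# Hypothetical facet values for a US-style date.
us_pattern = "[0-9]+!/![0-9]+!/![0-9]+"
us_transform = "&5&-&1&-&3&"
print(transform_content(us_pattern, us_transform, "10/22/1999"))
```

Trying candidate prefixes from shortest to longest is what makes the split
deterministic when a pattern such as a*!a* could divide the content in more
than one way.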

4.1 Datatype definition
second paragraph under the box, last sentence: "If {variety} is union
then the value space of the datatype defined will be the union of the
value spaces of each datatype in {base type definition}."  More
precisely, it will be some subset of the union of the value spaces of the
participating datatypes.  The reason is that the lexical representation
of a value in one datatype may be the same as the only lexical
representation of a value in another datatype.  For example, consider a
union of integer and string in that order.  Then "1" will be the lexical
representation of an integer and not a string.  Consequently there is no
lexical representation for the string "1" in the union value space.  By
the definition in 2.2, every element of a value space must have a lexical
representation.  Since the string "1" has no lexical representation, it
cannot be regarded as part of the value space.
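
A minimal Python sketch of first-match validation shows the occlusion.
The integer check is a simplified stand-in (optional sign followed by
digits) and the names classify and members are hypothetical; only the
"try the member types in order until one matches" rule is taken from the
text under discussion.

```python
import re

# Simplified lexical check for integer: optional sign, then digits.
INTEGER = re.compile(r"[+-]?[0-9]+")

# Member types of the union, in declaration order.
members = [
    ("integer", lambda s: INTEGER.fullmatch(s) is not None),
    ("string", lambda s: True),
]


def classify(lexical):
    """Return the name of the first member type that accepts the text."""
    for name, accepts in members:
        if accepts(lexical):
            return name
    return None


print(classify("1"), classify("one"))
```

Because "1" is always claimed by integer, no input text can ever produce
the string value "1" in this union.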

sixth paragraph: "If {variety} is atomic then bounded is true and
cardinality is finite if ...".  This is a confusing sentence
because it is not immediately clear what is the precedence or scope of
each logical operation.  Using parentheses, some of the choices are:
a) "(If {variety} is atomic then bounded is true) and (cardinality is
   finite if...)"
b) "If ({variety} is atomic) then ((bounded is true) and (cardinality
   is finite if...))"
c) "If ({variety} is atomic) then ((bounded is true and cardinality is
   finite) if ...)"
As a start on avoiding this problem, never use "trailing if" as in
"it is warm if the sun shines", always use "leading if" and explicit
"then" as in "If the sun shines, then it is warm".

Anyway, I think what you are trying to say in the sixth paragraph is
"If {variety} is atomic then: if one of minInclusive or minExclusive
and one of maxInclusive or maxExclusive are among {facets}, then bounded
is true and cardinality is finite.  Otherwise, if {variety} is atomic,
then bounded is false and cardinality is countably infinite.
If {variety} is atomic, numeric and ordered are inherited from
{base type definition}."

I raise the following counterexamples to these rules:
1. boolean is finite but evidently unbounded since no ordering has been
proposed for boolean.
2. for float and double types, what about using INF, -INF or NaN as
bounds?
3. float and double have finite cardinality, even when there are no
bounds.  Thus these types and their derivatives are all bounded.
4. decimal is bounded and has finite cardinality if a precision is
specified, even if no bounds are specified.
5. some, but not all, patterns will restrict any type to finite
cardinality.
6. string and binary types restricted to a specified length or maxLength
have finite cardinality though not bounded since there is no ordering.

seventh para, "If {variety} is list then...".  This paragraph could also
stand to be rearranged to bring all the if's to the front.  I think
the paragraph is trying to say the following:
"If {variety} is list
then if length or both minLength and maxLength are among {facets}
then bounded is true and cardinality is finite.  Otherwise if {variety}
is list then bounded is false and cardinality is countably infinite.
If {variety} is list, then numeric and ordered are false."

Assuming that is what you are trying to say, here are some comments:
1. You don't need both minLength and maxLength in order to conclude that
bounded is true: minLength="0" can be regarded as implicit.
2. Even if bounded is true, you can still have countable infinite
value space because the bound is only counting the number of items in
the string, it is not bounding the values of those list items.  For
example, there are countably infinite lists of string with list length
3. It is possible to define a type so restrictive that it is empty.
In that case there are no lists of that type, so that even if the list
length is unbounded, it does not matter, the value space is still empty
and not countably infinite.  However you are basically right that if
a base type is nonempty then unbounded strings over that type will be
countably infinite.
4. But don't forget that list types can have the enumeration facet, which
I believe will force the value space to be finite.

E. Regular expressions
First table, under definition of regular expression.  Is it really
the case that the phrase "(empty string)" is the pattern for the empty
string?  In that case the phrase "empty string" must be recognized as a
metacharacter and an escape sequence must be provided for it.  But I hope
what you mean is that () is the regular expression for an empty string.

The definition of metacharacter missed |.
Received on Tuesday, 14 November 2000 15:39:26 UTC
