Discrepancies in the W3C Schema docs? from Dan Maharry on 2007-06-08 (xmlschema-dev@w3.org from June 2007)

From: Dan Maharry <dan@mcd.coop>
Date: Fri, 8 Jun 2007 16:23:31 +0100
To: <xmlschema-dev@w3.org>
Cc: "Dan Maharry" <dan@mcd.coop>
Message-ID: <557337CEDE6F294DBF74610FD7FF7BDF543483@nbhex1.osgcs.local>

All I did was try to write a small set of extension methods to check
whether a given string is valid according to the built-in schema string
types, and the editor in me came out and started nitpicking. The W3C
Schema docs are very good, but they are sometimes annoyingly ambiguous
unless you have a degree in lateral thinking.

Problem #1 : Is "" valid?

Section 3.2.1 says

The *value space* of string is the set of finite-length sequences of
characters (as defined in [XML 1.0 (Second Edition)]) that *match* the
Char production from [XML 1.0 (Second Edition)].

So, is the empty string valid then? Taking this definition at face
value, the answer seems to depend on what 'finite-length' means.
According to the dictionary, finite means

1. having bounds or limits; not infinite; measurable.
2. Mathematics.
   a. (of a set of elements) capable of being completely counted.
   b. not infinite or infinitesimal.
   c. not zero.

So maybe an empty string isn't valid then? The dictionary implies it.
Alas, no. The XML Schema spec at the top of section 4 also states

Any property identified as a having a set, subset or *list* value may
have an empty value unless this is explicitly ruled out: this is not the
same as absent.

OK, so the empty string is valid as a string, but could the W3C please
link to this last note about sets containing the empty value from the
many uses of the word 'set' around the document? Either that or define
the phrase 'finite-length' in situ as 'of length zero or greater'.
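To make the conclusion concrete, here is a minimal sketch of a check for the string type, assuming the Char production of XML 1.0 is the only constraint. The function names `is_xml_char` and `is_schema_string` are made up for illustration and do not come from any library.

```python
def is_xml_char(ch: str) -> bool:
    """True if ch matches the Char production of XML 1.0:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_schema_string(value: str) -> bool:
    """True if every character of value matches Char.

    all() over an empty sequence is True, so "" is accepted: a sequence
    of length zero is still a finite-length sequence.
    """
    return all(is_xml_char(ch) for ch in value)
```

With this reading, `is_schema_string("")` is true, while a string containing a control character such as `"\x00"` is rejected.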

Problem #2 : In which string data types is "" invalid?

The problem with the note about sets is that it says a type must
explicitly rule out the empty string before the empty string really is
invalid. But what about types where that is only implied elsewhere,
rather than stated in black and white as it is in, say, the value space
of the NMTOKENS type?

NMTOKENS represents the NMTOKENS attribute type from [XML 1.0 (Second
Edition)]. The *value space* of NMTOKENS is the set of finite,
non-zero-length sequences of *NMTOKEN*s

Let's go one step back up the type hierarchy to the NMTOKEN type.

NMTOKEN represents the NMTOKEN attribute type from [XML 1.0 (Second
Edition)]. The *value space* of NMTOKEN is the set of tokens that
*match* the Nmtoken production in [XML 1.0 (Second Edition)].

No explicit mention of non-zero-length anything here. But the definition
of Nmtoken in XML 1.0 says that it must consist of one or more
characters.

NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar |
Extender
Nmtoken ::= (NameChar)+

By those rules, a valid NMTOKEN cannot be empty even if the writer of
the schema sets minLength to 0. The same logic applies to the language
and Name string types in the schema definition as well, so if none of
them can be empty, neither can NCName, ID, IDREF, IDREFS, ENTITY or
ENTITIES, despite the fact that IDREFS and ENTITIES are the only ones of
these whose definitions explicitly require non-zero-length values.
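The effect of the `+` in the Nmtoken production can be sketched with a regular expression. This is an ASCII-only approximation: the real NameChar production also admits the non-ASCII Letter, Digit, CombiningChar and Extender classes, which are left out here for brevity. The names `NAME_CHAR`, `NMTOKEN_RE` and `is_nmtoken` are hypothetical.

```python
import re

# ASCII-only stand-in for NameChar (an assumption made for brevity).
# The real production also covers non-ASCII letters, digits,
# combining characters and extenders.
NAME_CHAR = r"[A-Za-z0-9._:\-]"

# Nmtoken ::= (NameChar)+
# The '+' means one or more, so the empty string can never match.
NMTOKEN_RE = re.compile(NAME_CHAR + r"+")


def is_nmtoken(value: str) -> bool:
    """True if the whole of value matches the (approximated) Nmtoken production."""
    return NMTOKEN_RE.fullmatch(value) is not None
```

Under this sketch the empty string is rejected by the grammar itself, whatever a minLength facet might say.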

So then, what qualifier is missing from "must explicitly rule the empty
string as invalid"? Because as stated it's definitely not the whole
story.

Problem #3 : Colons or not?

The next issue spans three W3C recommendations and it's a question of
colons. In the XML Schema document,

[the Name type is] the set of all strings which *match* the Name
production of [XML 1.0 (Second Edition)].

From the XML spec, the Name production looks like this

NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar |
Extender
Name ::= (Letter | '_' | ':') (NameChar)*

The Name type has several derived types (ID, IDREF and ENTITY), all of
which are defined similarly and have the same ambiguity. Let's use
IDREF as the example.

IDREF represents the IDREF attribute type from [XML 1.0 (Second
Edition)]. The *value space* of IDREF is the set of all strings that
*match* the NCName production in [Namespaces in XML]. The *lexical
space* of IDREF is the set of strings that *match* the NCName production
in [Namespaces in XML].

From the [Namespaces in XML] spec then, the basic gist of the NCName
production is that it's the same as the Name production in [XML 1.0
(Second Edition)] but without the colons.

OK? Name with colons. NCName without. Now the XML spec defines the IDREF
attribute type as follows

Values of type IDREF must match the Name production....

So then, values of the schema type IDREF which cannot have colons must
be able to represent XML IDREF attributes which can have colons. Is it
me or is there potential for a problem with that? I realise that
'represent' doesn't mean 'be the same as' but still.
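The mismatch is easy to demonstrate with two patterns. Again this is only a sketch over ASCII: the real Name and NCName productions include further Unicode character classes that are omitted here, and `is_name` and `is_ncname` are illustrative names rather than functions from any library.

```python
import re

# ASCII-only approximations (an assumption made for brevity).
# Name   ::= (Letter | '_' | ':') (NameChar)*
# NCName ::= (Letter | '_') (NCNameChar)*   i.e. Name without colons
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")
NCNAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._\-]*")


def is_name(value: str) -> bool:
    """True if value matches the (approximated) XML 1.0 Name production."""
    return NAME_RE.fullmatch(value) is not None


def is_ncname(value: str) -> bool:
    """True if value matches the (approximated) NCName production."""
    return NCNAME_RE.fullmatch(value) is not None
```

A value such as `"sec:intro"` passes `is_name`, so it is acceptable as an XML 1.0 IDREF attribute value, but it fails `is_ncname`, so it falls outside the schema IDREF type.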

Problem #4 : Single spaces or more?

Last issue is another ambiguity which could be easily sorted if the W3C
ever revised the Schema docs. At the bottom of the string type
derivation tree are two 'plural' types, IDREFS and ENTITIES. Both are
defined in the same way, so let's use IDREFS.

IDREFS represents the IDREFS attribute type from [XML 1.0 (Second
Edition)]. The *value space* of IDREFS is the set of finite,
non-zero-length sequences of IDREFs. The *lexical space* of IDREFS is
the set of space-separated lists of tokens, of which each token is in
the *lexical space* of IDREF.

For me at least, the ambiguity is in the word "space-separated". How
many spaces? Whitespace in general or literally just the space
character, \x20? Again, we have to consult the XML specification to get
the answer where we're told

values of type IDREFS must match [the] Names [production]

and [the] Names [production] reveals that each IDREF must be separated
from the next by exactly one \x20 character, or else the string isn't a
valid IDREFS string.

Names ::= Name (#x20 Name)*

So why can't the schema spec just say something like

The *lexical space* of IDREFS is the set of lists of tokens each
separated by a single \x20 character,....

and take the ambiguity out of the statement?
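The two possible readings of "space-separated" can be put side by side. This sketch reuses an ASCII-only approximation of Name, and the names `is_names_strict` and `is_names_loose` are hypothetical. It only illustrates the difference between the readings and makes no claim about which one any particular schema processor implements.

```python
import re

# ASCII-only approximation of the Name production (an assumption).
NAME = r"[A-Za-z_:][A-Za-z0-9._:\-]*"

# Names ::= Name (#x20 Name)*
# Strict reading: exactly one #x20 between consecutive names.
NAMES_RE = re.compile(NAME + r"(?: " + NAME + r")*")


def is_names_strict(value: str) -> bool:
    """True if value matches Names with single-space separators only."""
    return NAMES_RE.fullmatch(value) is not None


def is_names_loose(value: str) -> bool:
    """Loose reading: any run of whitespace separates the tokens.

    str.split() with no arguments splits on any whitespace Python knows
    about, which is broader than the four XML whitespace characters.
    """
    tokens = value.split()
    return bool(tokens) and all(re.fullmatch(NAME, t) for t in tokens)
```

`"a b c"` satisfies both readings, whereas `"a  b"` (two spaces) and `"a\tb"` satisfy only the loose one. Both readings reject the empty string.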

Thanks,

Dan Maharry

P.S. This is formatted slightly better online at
http://blogs.ipona.com/dan/archive/2007/05/17/8381.aspx

Received on Saturday, 9 June 2007 01:20:29 UTC