- From: Dan Maharry <dan@mcd.coop>
- Date: Fri, 8 Jun 2007 16:23:31 +0100
- To: <xmlschema-dev@w3.org>
- Cc: "Dan Maharry" <dan@mcd.coop>
All I did was try to write a small set of extension methods to check whether a given string is valid according to the built-in schema string types, and the editor in me came out and started nit-picking. The W3C Schema docs are very good but sometimes annoyingly ambiguous unless you have a degree in lateral thinking.

Problem #1: Is "" valid?

Section 3.2.1 says

    The *value space* of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that *match* the Char production from [XML 1.0 (Second Edition)].

So, is the empty string valid then? Taking this definition on spec, the answer seems to depend on what 'finite-length' means. According to the dictionary, finite means

    1. having bounds or limits; not infinite; measurable.
    2. Mathematics. (of a set of elements) capable of being completely counted. not infinite or infinitesimal. not zero.

So maybe an empty string isn't valid then? The dictionary implies it. Alas, no. The XML Schema spec at the top of section 4 also states

    Any property identified as a having a set, subset or *list* value may have an empty value unless this is explicitly ruled out: this is not the same as absent.

OK, so the empty string is valid as a string, but could the W3C please link to this last note about sets containing the empty value from the many uses of the word 'set' around the document? Either that or define the phrase 'finite-length' in situ as 'zero or greater'.

Problem #2: In which string data types is "" invalid?

The problem with the note about sets is that it states a type must explicitly rule the empty string as invalid before it really is invalid. But what if that is only implied elsewhere rather than stated in black and white? Take, say, the value space of the NMTOKENS type.

    NMTOKENS represents the NMTOKENS attribute type from [XML 1.0 (Second Edition)].
    The *value space* of NMTOKENS is the set of finite, non-zero-length sequences of *NMTOKEN*s

Let's go one step back up the type hierarchy to the NMTOKEN type.

    NMTOKEN represents the NMTOKEN attribute type from [XML 1.0 (Second Edition)]. The *value space* of NMTOKEN is the set of tokens that *match* the Nmtoken production in [XML 1.0 (Second Edition)].

No explicit mention of non-zero-length anything here. But the definition of Nmtoken in XML 1.0 says that it should consist of one or more characters.

    NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
    Nmtoken ::= (NameChar)+

By those rules, a valid NMTOKEN cannot be empty even if the writer of the schema sets minLength to 0. The same logic applies to the Language and Name string types in the schema definition as well, so if none of them can be empty, neither can NCName, ID, IDREF, IDREFS, ENTITY or ENTITIES, even though IDREFS and ENTITIES are the only ones of these whose definitions explicitly require non-zero length. So then, what phrase is missing from "must explicitly rule the empty string as invalid"? Because it's definitely not all there.

Problem #3: Colons or not?

The next issue spans three W3C recommendations and it's a question of colons. In the XML Schema document,

    [the Name type is] the set of all strings which *match* the Name production of [XML 1.0 (Second Edition)].

From the XML spec, the Name production looks like this

    NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
    Name ::= (Letter | '_' | ':') (NameChar)*

The Name type has several derived types (ID, IDREF and ENTITY), all of which are defined similarly and have the same ambiguity. Let's use IDREF.

    IDREF represents the IDREF attribute type from [XML 1.0 (Second Edition)]. The *value space* of IDREF is the set of all strings that *match* the NCName production in [Namespaces in XML].
    The *lexical space* of IDREF is the set of strings that *match* the NCName production in [Namespaces in XML].

From the [Namespaces in XML] spec then, the basic gist of the NCName production is that it's the same as the Name production in [XML 1.0 (Second Edition)] but without the colons

    NCNameChar ::= Letter | Digit | '.' | '-' | '_' | CombiningChar | Extender
    NCName ::= (Letter | '_') (NCNameChar)*

OK? Name with colons. NCName without. Now the XML spec defines the IDREF attribute type as follows

    Values of type IDREF must match the Name production....

So then, values of the schema type IDREF, which cannot have colons, must be able to represent XML IDREF attributes, which can have colons. Is it me or is there potential for a problem with that? I realise that 'represent' doesn't mean 'be the same as' but still.

Problem #4: Single spaces or more?

The last issue is another ambiguity which could be easily sorted if the W3C ever revised the Schema docs. At the bottom of the string type derivation tree are two 'plural' types, IDREFS and ENTITIES. Both are defined in the same way, so let's use IDREFS.

    IDREFS represents the IDREFS attribute type from [XML 1.0 (Second Edition)]. The *value space* of IDREFS is the set of finite, non-zero-length sequences of IDREFs. The *lexical space* of IDREFS is the set of space-separated lists of tokens, of which each token is in the *lexical space* of IDREF.

For me at least, the ambiguity is in the word "space-separated". How many spaces? Whitespace in general or literally just the space character, \x20? Again, we have to consult the XML specification to get the answer, where we're told

    values of type IDREFS must match [the] Names [production]

and [the] Names [production] reveals that each IDREF must be separated by a single \x20 character only, else the string isn't a valid IDREFS type string.
    Names ::= Name (#x20 Name)*

So why can't the schema spec just say something like

    The *lexical space* of IDREFS is the set of lists of tokens each separated by a single \x20 character,....

and take the ambiguity out of the statement?

Thanks,
Dan Maharry

P.S. This is formatted slightly better online at http://blogs.ipona.com/dan/archive/2007/05/17/8381.aspx
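
For what it's worth, the four productions quoted above can be sketched as regular expressions. This is only a rough Python sketch under stated assumptions, not the extension methods themselves: `LETTER` and `DIGIT` are ASCII-only stand-ins for the much larger Letter, Digit, CombiningChar and Extender productions in XML 1.0, and the names `NMTOKEN`, `NAME`, `NCNAME`, `NAMES` and `matches` are made up for illustration.

```python
import re

# ASCII-only stand-ins for the XML 1.0 (Second Edition) character classes.
# Assumption: the real Letter, Digit, CombiningChar and Extender productions
# cover far more of Unicode than this.
LETTER = "A-Za-z"
DIGIT = "0-9"

NAME_CHAR = f"[{LETTER}{DIGIT}._:-]"    # NameChar: colon allowed
NCNAME_CHAR = f"[{LETTER}{DIGIT}._-]"   # NCNameChar: no colon

# Nmtoken ::= (NameChar)+
NMTOKEN = re.compile(f"{NAME_CHAR}+")
# Name ::= (Letter | '_' | ':') (NameChar)*
NAME = re.compile(f"[{LETTER}_:]{NAME_CHAR}*")
# NCName ::= (Letter | '_') (NCNameChar)*
NCNAME = re.compile(f"[{LETTER}_]{NCNAME_CHAR}*")
# Names ::= Name (#x20 Name)*
NAMES = re.compile(f"{NAME.pattern}(?: {NAME.pattern})*")


def matches(production, text):
    """Return True if the whole of text matches the given production."""
    return production.fullmatch(text) is not None
```

Under these assumptions, `matches(NMTOKEN, "")` is false, `matches(NAME, "a:b")` is true while `matches(NCNAME, "a:b")` is false, and `matches(NAMES, "a  b")` (two spaces) is false even though `matches(NAMES, "a b")` is true.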
Received on Saturday, 9 June 2007 01:20:29 UTC