RE: Discrepancies in the W3C Schema docs? from Michael Kay on 2007-06-09 (xmlschema-dev@w3.org from June 2007)

From: Michael Kay <mike@saxonica.com>
Date: Sat, 9 Jun 2007 22:20:58 +0100
To: "'Dan Maharry'" <dan@mcd.coop>, <xmlschema-dev@w3.org>
Message-ID: <011601c7aadc$12f67fe0$6401a8c0@turtle>
Personal response.

> All I did was try to write a small set of extension methods 
> to validate whether a given string was valid according to the 
> built-in schema string types and the editor in me comes out 
> and starts nit picking. The W3C Schema docs are very good but 
> sometimes annoyingly ambiguous without a degree in lateral thinking. 

You are right. In particular, there is a tendency in the schema
specifications to use language that looks formal and precise and technical,
but actually cannot be understood without reading the mind of the editor.
The use of the adjective "finite-length" is a case in point. 

I think this adjective is vacuous. It's probably there because the author
was struggling to define "character string" in some way other than saying
it's a string of characters. I'm surprised to see that there are
dictionaries that define "finite" to exclude zero, because in my experience
mathematicians have always used "finite" to mean "not infinite", and zero is
definitely not infinite. (I pointed out some while ago that it would be hard
to write a test case to demonstrate that a processor rejects an
infinite-length string.)
> 
> Problem #2 : In which string data types is "" invalid?
> 
> The problem with the note about sets is that it states a type 
> must explicitly rule the empty string as invalid before it 
> really is invalid.
> But what about it being implied elsewhere but not in black 
> and white as, say the value space of the NMTOKENS type?
> 
> NMTOKENS represents the NMTOKENS attribute type from [XML 1.0 
> (Second Edition)]. The *value space* of NMTOKENS is the set 
> of finite, non-zero-length sequences of *NMTOKEN*s

I can't see your problem here. An NMTOKEN cannot be a zero-length string
because the XML 1.0 grammar rules it out, quite explicitly. And an NMTOKENS
cannot be a zero-length sequence of NMTOKEN values because the adjective
"non-zero-length" rules it out, again quite explicitly.
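
To make the two rules concrete, here is a minimal sketch in Python. It is
an illustration only, not anything taken from the specs: the names
is_nmtoken and is_nmtokens are hypothetical, and the character class is an
ASCII-only approximation of the XML 1.0 NameChar production, which in
reality also admits many non-ASCII letters, digits, combining characters
and extenders.

```python
import re

# ASSUMPTION: ASCII-only stand-in for the XML 1.0 NameChar production.
# The real production accepts a much larger set of Unicode characters.
_NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:-]+")

# XML whitespace: space, tab, carriage return, line feed.
_XML_WS_RE = re.compile(r"[ \t\r\n]+")


def is_nmtoken(s: str) -> bool:
    """True if s is one or more (ASCII) name characters.

    The '+' quantifier is what excludes the zero-length string,
    mirroring Nmtoken ::= (NameChar)+ in the XML 1.0 grammar.
    """
    return _NMTOKEN_RE.fullmatch(s) is not None


def is_nmtokens(s: str) -> bool:
    """True if s is a whitespace-separated list of at least one NMTOKEN.

    The explicit length check mirrors the "non-zero-length sequence"
    wording in the quoted definition of NMTOKENS.
    """
    tokens = [t for t in _XML_WS_RE.split(s) if t]
    return len(tokens) > 0 and all(is_nmtoken(t) for t in tokens)
```

Under these assumptions is_nmtoken("") and is_nmtokens("") are both false,
each for its own reason: the first because the grammar requires at least
one character, the second because the list must contain at least one token.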
> 
> Problem #3 : Colons or not?
> 
...
> 
> IDREF represents the IDREF attribute type from [XML 1.0 
> (Second Edition)]. The *value space* of IDREF is the set of 
> all strings that
> *match* the NCName production in [Namespaces in XML]. The *lexical
> space* of IDREF is the set of strings that *match* the NCName 
> production in [Namespaces in XML].

I think the first sentence is just trying to be a helpful introduction. It
doesn't say anything normative. It's qualified by the more precise
statements in the second and third sentences.
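
As a rough illustration of what the second and third sentences require,
here is a minimal sketch in Python. The function name is_ncname is
hypothetical, and the pattern is an ASCII-only approximation of the NCName
production in Namespaces in XML, which also admits many non-ASCII
characters. The point it shows is simply that an NCName has no colon,
whereas the Name production in XML 1.0 permits one.

```python
import re

# ASSUMPTION: ASCII-only stand-in for the NCName production.
# First character: a letter or underscore; the rest: letters, digits,
# '.', '-' or '_'. The colon is deliberately absent from both classes.
_NCNAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_ncname(s: str) -> bool:
    """True if s matches the (ASCII-approximated) NCName production."""
    return _NCNAME_RE.fullmatch(s) is not None
```

With this sketch, is_ncname("chapter1") is true, while
is_ncname("doc:chapter1") and is_ncname("") are both false.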

I agree this isn't good spec writing. It's often useful to explain the
background or to give a summary of the purpose of the construct but ideally
one should distinguish carefully between that kind of expository material
and the formal definition. Many specs fail to achieve this balance between
helpfulness and precision, and it's a tough one to get right: editors will
get flak on this whatever they do.

Probably one of the particular difficulties with moving the schema specs
forward is that there are much bigger problems than these demanding the
attention of the WG, and the WG has very limited resources: for a spec that
is so widely used and implemented, and of such critical importance to the
industry, the actual number of people working on the project is tiny. I
recently joined the group because I came to the realization that they simply
didn't have the resources to deal with the bugs that I was submitting, and
that the only way to get a better spec would be to join in the effort.

However, for minor comments like these, the best approach is to enter a bug
report - one per problem - in the Bugzilla database.

Michael Kay
http://www.saxonica.com/
Received on Saturday, 9 June 2007 21:21:06 UTC