Re: Express length constraints in a regex or use maxLength and minLength? from C. M. Sperberg-McQueen on 2011-01-09 (xmlschema-dev@w3.org from January 2011)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Sun, 9 Jan 2011 12:16:55 -0700
To: xmlschema-dev@w3.org
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, "Costello, Roger L." <costello@mitre.org>, Liam Quin <liam@w3.org>
Message-Id: <03E10024-1BA0-446C-A7CF-042224A37089@blackmesatech.com>

On Jan 7, 2011, at 6:38 PM, Liam R E Quin wrote:
> 
> 
> That way leads to very fragile systems, where the slightest change to
> input data might lead to unknown consequences.
> 
> We saw this recently with XML 5e:

Strictly speaking, I think what XML 1.1 and 1.0 5e showed (among other
things) was that people are aware that their systems are sometimes
(often?) fragile in this way.  It did not, however, produce any example,
as far as I have been able to learn, of a system where the slightest
change to the rules actually breaks any data.  I've been asking for
several years to hear of a system that was stable for 1.0 names but
actually did break, or would break, when confronted with 1.1 names.
I'll ask again:  anyone who can tell me of such a system (constructed
for purposes other than demonstrating fragility in the face of 1.1
names) will get a beer from me when we are next in the same city.

> there were people with data binding
> libraries whose API assumed that an XML element name had the same
> lexical rules as a variable in the programming languages being used.
>  <boy><socks>black</socks></boy>
> could then be turned into
>  xmlObject.boy.socks
> 
> This failed when XML's character set for names changed.  Of course, the
> assumption was already wrong -- XML names can have "." and "-" in them.

Not to mention Tibetan and a lot of other scripts, most of which use
characters which are not legal in C identifiers.
>> 
>> 
>> Is this an accurate statement of your position?
> 
> It's an overstatement for me at least.
> 
> Sometimes it's better to mark data as questionable rather than invalid:
> accept it, but flag it for a double-check later.

When does "invalid" mean something other than "questionable"?

There is no law of nature that says invalid data must be rejected or
not processed.

> 
> If you go on to generate code that relies on your schema definition,
> and later you change the definition, you can have problems.
> 
> For my part I'd find a length greater than _any_ known last name,
> e.g. 10,000 characters, and I'd have data entry software give a
> warning for any unusual pattern - length greater than 20, contains
> accented characters (if in the US; elsewhere that's not so unusual),
> does not start with an upper case letter (some don't, of course, ask
> the Marquis de Sade), contains a character not identified by Unicode as
> a letter, apostrophe, space, "." or "-" (your pattern also rejects ’ of
> course).

It may be worth pointing out that XSD union types are designed
to make this kind of thing relatively easy:  you can define several
simple types, for example one for the simplest, most regular values,
another for values which are less likely (and thus more likely to
require special handling), and at the bottom one which is (as
Roger Costello has suggested) essentially a renamed version
of xsd:string (or xsd:string itself).
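
A minimal sketch of such a union might look like the following.  The
type names (CommonLastName, UnusualLastName, AnyLastName, LastName),
the element name, and the patterns are all hypothetical; they are only
loosely based on the criteria Liam lists above and are not taken from
anyone's actual schema.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical: the simplest, most regular values
       (one upper-case letter followed by letters, at most 20 characters) -->
  <xs:simpleType name="CommonLastName">
    <xs:restriction base="xs:string">
      <xs:pattern value="\p{Lu}\p{L}*"/>
      <xs:maxLength value="20"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical: less likely values that may need special handling
       (letters, apostrophe, space, "." or "-", up to 10,000 characters) -->
  <xs:simpleType name="UnusualLastName">
    <xs:restriction base="xs:string">
      <xs:pattern value="[\p{L}' .\-]+"/>
      <xs:maxLength value="10000"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Hypothetical: the catch-all, essentially a renamed xs:string -->
  <xs:simpleType name="AnyLastName">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>

  <!-- Member types are tried in the order listed; the first one that
       accepts the value becomes its member type -->
  <xs:simpleType name="LastName">
    <xs:union memberTypes="CommonLastName UnusualLastName AnyLastName"/>
  </xs:simpleType>

  <xs:element name="lastName" type="LastName"/>

</xs:schema>
```

Under these assumed definitions, "Smith" would validate against
CommonLastName, "de Sade" would fall through to UnusualLastName, and a
value containing digits would be accepted only by AnyLastName.  Nothing
is rejected, but each value carries a record of which class it fell into.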

The application can then (if the XSD validator provides access to
the appropriate information in the PSVI) see which member type
validated the value, and dispatch the value to a routine or workflow
suited to that particular class of input.

Michael Sperberg-McQueen

-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com 
* http://cmsmcq.com/mib                 
* http://balisage.net
****************************************************************

Received on Sunday, 9 January 2011 19:17:26 UTC