
RE: Express length constraints in a regex or use maxLength and minLength?

From: Liam R E Quin <liam@w3.org>
Date: Fri, 07 Jan 2011 20:38:21 -0500
To: "Costello, Roger L." <costello@mitre.org>
Cc: "xmlschema-dev@w3.org" <xmlschema-dev@w3.org>
Message-ID: <1294450701.22336.299.camel@desktop.barefootcomputing.com>

On Tue, 2011-01-04 at 05:25 -0500, Costello, Roger L. wrote:
[...]
>    Thus, I determined that for my particular system, family
>    name values must consist of these characters:
>     - a-z and A-Z
>     - space
>     - period
>     - apostrophe
>     - dash
>    And I determined that for my particular system, family
>    name values are between 1 and 100 characters. These are
>    my operational constraints and I created a simpleType
>    that exactly expresses these constraints, no more and 
>    no less. If my operational constraints change (e.g.,
>    I get users with family names longer than 100 characters)
>    then I will update my simpleType.

I think Mike and I were both pushing back on this idea, probably because
we've both seen it misapplied all too often.

It's a common idea - you define some constraints on your data and check
it on entry to the system, and then you don't need internal checks.

That way leads to very fragile systems, where the slightest change to
input data might lead to unknown consequences.

We saw this recently with XML 1.0 Fifth Edition (5e): there were people
with data binding libraries whose API assumed that an XML element name
had the same lexical rules as a variable in the programming language
being used.
  <boy><socks>black</socks></boy>
could then be turned into
  xmlObject.boy.socks

This failed when XML's character set for names changed.  Of course, the
assumption was already wrong -- XML names can have "." and "-" in them.
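
A minimal Python sketch of that mismatch, using only the standard
library. The element names `socks.left` and `shoe-size` are made up for
illustration; they are legal XML names but not legal Python identifiers,
so a binding that exposes children as attributes cannot spell them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: both child names are legal XML names,
# but neither is a legal Python identifier.
doc = ET.fromstring(
    "<boy><socks.left>black</socks.left><shoe-size>9</shoe-size></boy>"
)

names = [child.tag for child in doc]
not_identifiers = [name for name in names if not name.isidentifier()]

# A binding layer that maps element names straight onto attributes
# (as in xmlObject.boy.socks) has no way to spell these two.
print(not_identifiers)
```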

>    This is the simpleType that precisely meets my operational
>    constraints:
> 
>     <simpleType name="English-language-family-name">
>         <restriction base="string">
>             <minLength value="1" />
>             <maxLength value="100" />
>             <pattern value="[a-zA-Z' \.-]+" />
>         </restriction>
>     </simpleType>

Yes.


> Here is what I "think" is your position (sorry, I don't mean to put words in your mouth; this may not be your position):
> 
>    Operational constraints are constantly changing, so
>    any simpleType you create to express operational
>    constraints will be out-of-date almost as soon as
>    you've finished creating the simpleType. So, don't
>    constrain the data.  
> 
>    Instead of the above simpleType, use this:
> 
>     <simpleType name="English-language-family-name">
>         <restriction base="string" />
>     </simpleType>
> 
>    Essentially, this creates a synonym for the unconstrained
>    string data type.
> 
> Is this an accurate statement of your position?

It's an overstatement for me at least.

Sometimes it's better to mark data as questionable rather than invalid:
accept it, but flag it for a double-check later.

If you go on to generate code that relies on your schema definition,
and later you change the definition, you can have problems.

For my part I'd find a length greater than _any_ known last name,
e.g. 10,000 characters, and use that as the hard limit. I'd then have
data entry software give a warning for any unusual pattern: length
greater than 20; contains accented characters (if in the US; elsewhere
that's not so unusual); does not start with an upper case letter (some
don't, of course, ask the Marquis de Sade); contains a character not
identified by Unicode as a letter, apostrophe, space, "." or "-" (your
pattern also rejects ’ of course).
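
The accept-but-flag approach could be sketched roughly as follows. This
is an illustration only, not code from any real data entry system: the
function name `review_family_name`, the warning texts, and the way
"accented" is approximated (a letter with a Unicode decomposition, or a
combining mark) are all assumptions. The two limits, 10,000 and 20, are
the figures mentioned above, and U+2019 is allowed as an apostrophe.

```python
import unicodedata

MAX_LENGTH = 10_000    # hard limit: reject beyond this
USUAL_LENGTH = 20      # soft limit: warn beyond this
ALLOWED_PUNCTUATION = {"'", "\u2019", " ", ".", "-"}


def review_family_name(name):
    """Return (accepted, warnings); warnings mark data as questionable."""
    if not 1 <= len(name) <= MAX_LENGTH:
        return False, ["length outside 1..%d" % MAX_LENGTH]

    accented = False
    unexpected = False
    for ch in name:
        category = unicodedata.category(ch)
        is_letter = category.startswith("L")
        if category.startswith("M") or (is_letter and unicodedata.decomposition(ch)):
            # Approximation (assumption): precomposed letters and
            # combining marks are treated as "accented".
            accented = True
        elif not is_letter and ch not in ALLOWED_PUNCTUATION:
            unexpected = True

    warnings = []
    if len(name) > USUAL_LENGTH:
        warnings.append("longer than %d characters" % USUAL_LENGTH)
    if not name[0].isupper():
        warnings.append("does not start with an upper case letter")
    if accented:
        warnings.append("contains accented characters")
    if unexpected:
        warnings.append("contains unexpected characters")
    return True, warnings


for candidate in ["Smith", "de Sade", "O\u2019Brien", "N\u00fa\u00f1ez", "Sm1th"]:
    print(candidate, review_family_name(candidate))
```

Only the length check rejects anything; every other rule leaves the
name accepted and merely flags it for a later double-check, so a change
in what counts as "unusual" does not invalidate stored data.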

So it's not a question of whether the schema is right, it's a question
of how it's used, and how it will affect system design, and what changes
are expected, if not immediately, maybe within a decade or two...

Liam


-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org
Received on Saturday, 8 January 2011 01:38:24 GMT
