RE: Express length constraints in a regex or use maxLength and minLength?

Michael Kay wrote:

> I would seriously question why you want to 
> impose a limit of 100 characters on a string.

I investigated the set of characters used in English family names and the length of those names. I found that 99.999% of all English family names are no longer than 100 characters. If that is the case, why would I not impose a limit of 100 characters? If I don't impose a limit, the risk of receiving unwanted or malicious values increases, and bad data is not caught until further downstream, which raises the cost of fixing the problem. By identifying the operational restrictions and incorporating them into my XML Schema, I reduce risk and lower costs. Do you agree?

/Roger 



-----Original Message-----
From: Michael Kay [mailto:mike@saxonica.com] 
Sent: Monday, January 03, 2011 3:10 PM
To: Costello, Roger L.
Cc: xmlschema-dev@w3.org
Subject: Re: Express length constraints in a regex or use maxLength and minLength?

I can't add to your list of advantages/disadvantages, but I would
seriously question why you want to impose a limit of 100 characters on a
string.

Some people seem to do this as an ingrained habit - they haven't got rid
of the punched-card mentality where strings were always fixed length.

There may be good reasons for doing it - for example, the data is going
to be processed by an ancient COBOL application with limits that you
can't afford to change; or you want to protect against certain kinds of
DoS attack - but most of the time I see this kind of thing, the
constraints are spurious. For example, people will put a limit of 10
characters on a phone number because they've never travelled widely
enough to realize that's not a hard limit at all.

Michael Kay
Saxonica

On 03/01/2011 19:44, Costello, Roger L. wrote:
> Hi Folks,
>
> I am interested in hearing your thoughts on the advantages and disadvantages of the following two approaches to restricting the length of a string value.
>
> Approach #1: In this simpleType the regex does not restrict the length; instead, the minLength and maxLength facets are used to restrict the length:
>
>      <simpleType name="English-language-family-name">
>          <restriction base="string">
>              <minLength value="1" />
>              <maxLength value="100" />
>              <pattern value="[a-zA-Z' \.-]+" />
>          </restriction>
>      </simpleType>
>
>
> Approach #2: Here is the same simpleType except the length restriction is implemented in the regex:
>
>      <simpleType name="English-language-family-name">
>          <restriction base="string">
>              <pattern value="[a-zA-Z' \.-]{1,100}" />
>          </restriction>
>      </simpleType>
>
>
> The disadvantage of the first approach is that maxLength and minLength are non-transferable length restriction mechanisms. They are not something that could be used directly by Schematron or HTML5.
>
> The disadvantage of the second approach is that an application would require sophistication to parse the regex to understand its length constraints.
>
>
> The advantage of the second approach is that the constraints are completely contained within the regex. Thus, the regex could, with little or no modification, be lifted and dropped into an XSLT regex expression or a Schematron regex expression or an HTML5 regex expression.
>
> The advantage of the first approach is that it is easier for a machine to determine the simpleType's length restrictions.
>
>
> What other advantages and disadvantages does each approach have? Which approach do you recommend? Why?
>
> /Roger
>
>
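
As a rough illustration of the two approaches in the quoted question, here is a minimal sketch in Python rather than XML Schema. The function names are hypothetical and not part of the thread; only the character class and the bounds of 1 and 100 are taken from the schema fragments above. It assumes Python's re module treats this particular pattern the same way an XML Schema processor would, which holds for this simple ASCII character class but not for XSD regexes in general.

```python
import re

# Character class copied from the pattern facets in the quoted schema.
NAME_CHARS = r"[a-zA-Z' \.-]"

# Approach #1: the regex says nothing about length; the bounds are
# checked separately, mirroring the minLength and maxLength facets.
UNBOUNDED = re.compile(NAME_CHARS + r"+")

# Approach #2: the length bounds live inside the regex quantifier.
BOUNDED = re.compile(NAME_CHARS + r"{1,100}")


def valid_with_facets(value, min_len=1, max_len=100):
    """Hypothetical stand-in for Approach #1 (facets plus pattern)."""
    if not (min_len <= len(value) <= max_len):
        return False
    # fullmatch is used because XML Schema patterns must match the
    # whole value, not just a substring of it.
    return UNBOUNDED.fullmatch(value) is not None


def valid_with_regex(value):
    """Hypothetical stand-in for Approach #2 (length inside the regex)."""
    return BOUNDED.fullmatch(value) is not None


if __name__ == "__main__":
    samples = ["O'Brien", "Smith-Jones Jr.", "", "a" * 100, "a" * 101, "Smith3"]
    for s in samples:
        print(repr(s[:20]), valid_with_facets(s), valid_with_regex(s))
```

For these inputs the two functions agree, which is the point of the comparison: the approaches accept the same values and differ only in where the length bound is recorded. One caveat when lifting the Approach #2 regex elsewhere: XML Schema patterns are implicitly anchored to the whole value, whereas XPath's matches() function looks for a matching substring, so explicit ^ and $ anchors would likely be needed there.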

Received on Monday, 3 January 2011 20:26:05 UTC