RE: limits of regular expressions

Discussion itself does not lead us down a slippery slope; only incorrect
decisions made as a result of that discussion could.
I have found the discussion illuminating already (see your response below).

Of course there will be a limit as to what a schema language can achieve
sensibly.

Of course users would like the schema language to allow as much "early data
checking" as possible (I remember checking the values of HTML form fields
using server side CGI before we had decent client side JavaScript).

Surely there is nothing wrong with debating the issue as it only leads to
further understanding, and possibly the odd good idea for the next version
of schema?

FYI - even a stolen credit card can have a valid number; it is just not
legal to use it. The last time I did any credit/debit card processing (a
number of years ago), most systems performed only the checksum (on the till
or client) during a purchase and did the "server side" check for stolen
cards/credit limits off-line in batches.
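
For anyone unfamiliar with the checksum in question, here is a rough sketch
of the kind of check a till or client might run. It assumes the check digit
scheme is the Luhn (mod 10) algorithm, which is the usual one for payment
card numbers; the function name luhn_valid is only illustrative and not
taken from any particular system.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the card number passes the Luhn (mod 10) check.

    Assumption: spaces and hyphens are allowed as separators; any other
    non-digit character makes the number invalid.
    """
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk from the rightmost digit (the check digit) to the left,
    # doubling every second digit along the way.
    for position, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                # Same as adding the two digits of the doubled value.
                digit -= 9
        total += digit

    # The number is well formed if the weighted sum is a multiple of 10.
    return total % 10 == 0
```

Note that this needs arithmetic over the digits, which is exactly what a
pattern facet cannot express, and that passing it says nothing about
whether the card is stolen or over its limit.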

Colin

-----Original Message-----
From: noah_mendelsohn@us.ibm.com [mailto:noah_mendelsohn@us.ibm.com]
Sent: 23 August 2002 18:21
To: Colin Mackenzie
Cc: Rainer Becker; xmlschema-dev@w3.org
Subject: RE: limits of regular expressions


I think this discussion is leading us down a slippery slope.  The schema
recommendation is clear that no language, other than a Turing-complete
programming language, can provide all the validation one might reasonably
want for one application or another.  From section "1.1 Purpose" [1]:

"Any application that consumes well-formed XML can use the XML Schema:
Structures formalism to express syntactic, structural and value
constraints applicable to its document instances. The XML Schema:
Structures formalism allows a useful level of constraint checking to be
described and implemented for a wide spectrum of XML applications.
However, the language defined by this specification does not attempt to
provide all the facilities that might be needed by any application. Some
applications may require constraint capabilities not expressible in this
language, and so may need to perform their own additional validations."

The proposed requirement in this case seems to be to have enough
computational capability to derive some sort of check digit in a credit
card number or similar code.  Well, there will always be things we cannot
validate.  For example, we can make sure that a credit card looks like a
credit card number, to some degree, but we cannot hope to prove that the
card isn't stolen.  That's presumably what it really means for a credit
card number to be valid.

Consider the requirements of a mathematician.  Would it not be reasonable
for him or her to request the ability to derive a subtype of integer to
be known as "PrimeNumber"?  Are we supposed to validate that -- make sure
the number is prime?

My point is that systems like schema can embody a reasonable level of
checking, but cannot in general meet the validation needs of particular
applications.  Schemas can give you a pre-filter, and some very useful
constraints that aid in mapping to data structures and databases, and that
greatly simplify the validation remaining to be done by applications. Even
our mathematician will be glad that we check for positive integer, which
significantly facilitates the work that he or she then has to do to prove
primeness.
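
As a rough illustration of that split (a sketch, not something from the
recommendation): the regular expression below stands in for the pre-filter
a schema could provide, and the primality test is the remaining work left
to the application. The pattern is a simplified assumption, not the exact
lexical space of the schema positiveInteger type, and the names
POSITIVE_INTEGER, is_prime and validate_prime_field are hypothetical.

```python
import re

# Pre-filter: the part a pattern can express. Simplified assumption:
# no sign and no leading zeros.
POSITIVE_INTEGER = re.compile(r"[1-9][0-9]*")


def is_prime(n: int) -> bool:
    """Trial division; the part that needs real computation."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    factor = 3
    while factor * factor <= n:
        if n % factor == 0:
            return False
        factor += 2
    return True


def validate_prime_field(text: str) -> bool:
    """Apply the cheap pattern check first, then the application check."""
    if POSITIVE_INTEGER.fullmatch(text) is None:
        return False  # rejected by the pre-filter; no arithmetic needed
    return is_prime(int(text))
```

With this arrangement, input such as "-7" or "abc" never reaches the
primality test, while "91" passes the pattern and is rejected only by the
application-level check.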

Bottom line: I think that regexes represent a very reasonable 80/20 point
in the design space.  They provide a quite powerful and generally useful
level of checking, without requiring that we invent a portable programming
language in which to capture additional logic.  Thank you very much.

[1] http://www.w3.org/TR/xmlschema-1/#intro-purpose

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------

Received on Saturday, 24 August 2002 05:58:32 UTC