RE: limits of regular expressions from Colin Mackenzie on 2002-08-27 (xmlschema-dev@w3.org from August 2002)

From: Colin Mackenzie <colin@elecmc.com>
Date: Tue, 27 Aug 2002 09:10:10 +0100
To: <noah_mendelsohn@us.ibm.com>
Cc: <r.becker@Nitro-Software.com>, <xmlschema-dev@w3.org>
Message-ID: <PJEOKNCGPHICMMDPMNAJKEHNCEAA.colin@elecmc.com>
No problem re discussion.

It is also not obvious to me how to extend the schema for these features but
perhaps it would be worth a short debate.

Why would the following be difficult to implement/a bad idea? I do not
develop parsers so I am sure there may be good technical reasons.

a) have a new element type to implement new validation, say
<xs:test expr="expression.."/>
The only feature of this element is to cause a validation error if the
expression resolves to false.

b) the expression supports XPath syntax so,

<xs:test expr="/document/otherelement"/>
resolves to true if the other element exists (a co-occurrence constraint)

<xs:test expr="/document/element[1] = /document/otherelement[1]"/>
resolves to true if the content of the two element instances are the same

and so on, including the ability to add, subtract, multiply and divide, BUT,
say, no functions, no if/else logic, and no changing the elements allowed at
certain points based on values of elements; otherwise the whole thing would
get too complicated.
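
A minimal sketch, in Python rather than in schema syntax, of what the two
tests in (b) would evaluate to. The helper names exists and same_content and
the sample document are illustrative assumptions, not part of any proposal.
The paths are written relative to the <document> root because ElementTree's
find() does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET


def exists(root, path):
    """Co-occurrence check: true if at least one element matches the path."""
    return root.find(path) is not None


def same_content(root, path_a, path_b):
    """True if both paths match an element and their text content is equal."""
    a = root.find(path_a)
    b = root.find(path_b)
    if a is None or b is None:
        return False
    return (a.text or "") == (b.text or "")


# Hypothetical instance document used only for this illustration.
doc = ET.fromstring(
    "<document><element>42</element><otherelement>42</otherelement></document>"
)

print(exists(doc, "otherelement"))
print(same_content(doc, "element[1]", "otherelement[1]"))
```

A validator supporting such a test would report an error whenever one of
these checks came back false.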

I know this would mean potentially horrendous expressions (when calculating
a check digit using 10 other values) but at least something as core as this
could be done within the schema itself.

I guess another problem occurs due to the use of XPath, as the final element
names may not be known (say, within a schema file containing useful types to
be used within other schemas), but this is understood by users of xs:key and
accepted as a limitation (perhaps xs:test could only occur at the same place
xs:key does; perhaps the XPath does not support relative paths). This would
make the whole thing less useful, of course.

c) to solve the original checksum issue we would have to do more, as the
digits used in the checksum were contained within element content and
checked with a RegExp rather than being in directly addressable XPath nodes.
So, would it be a good idea to extend XPath to allow the selection of a
piece of node content using a RegExp?
XPath already supports string functions for breaking down content, so surely
the idea of RegExp support is not too far-fetched?
In Perl and other RegExp implementations you can create a RegExp and put any
piece of the RegExp in parentheses to identify it as something that should
be passed back to the user as $1, $2 etc (you know what I mean).
So using the example above we could do something like

<xs:test expr="/document/otherelement[1]/[0-9]{2}([0-9]) =
/document/otherelement[1]/[0-9]{5}([0-9])"/>

would test that the value of the third digit is the same as that of the
sixth digit.
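
A minimal sketch, in Python rather than in any schema or XPath syntax, of the
comparison this hypothetical xs:test would perform. The function name
third_equals_sixth and the sample values are illustrative assumptions; the
two patterns are the ones used in the example above.

```python
import re


def third_equals_sixth(value):
    """Compare the digit captured as $1 by each of the two patterns."""
    third = re.match(r"[0-9]{2}([0-9])", value)   # skips 2 digits, captures the 3rd
    sixth = re.match(r"[0-9]{5}([0-9])", value)   # skips 5 digits, captures the 6th
    if third is None or sixth is None:
        return False  # content too short or not made of digits
    return third.group(1) == sixth.group(1)


print(third_equals_sixth("123453"))
print(third_equals_sixth("123456"))
```

The first call compares "3" with "3" and the second compares "3" with "6".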

OK, this is probably the direction that you don't want to go, and there may
be numerous gaping holes, but as someone who has had to write several complex
schemas I would really appreciate co-occurrence constraints and some
arithmetic checking within the body of the schema (rather than as a
Schematron post-process).


Colin




-----Original Message-----
From: xmlschema-dev-request@w3.org
[mailto:xmlschema-dev-request@w3.org]On Behalf Of
noah_mendelsohn@us.ibm.com
Sent: 25 August 2002 04:05
To: Colin Mackenzie
Cc: r.becker@Nitro-Software.com; xmlschema-dev@w3.org
Subject: RE: limits of regular expressions



Sorry, I did not write as carefully as perhaps I should have.  Obviously,
there is no value in discouraging reasonable discussion.  What I meant to
say was:  the design direction signalled by the discussion suggests the
risk of a slippery slope.  I still think that's true.  Though I'd be glad
to hear reasonable suggestions that would prove me wrong, it's not
immediately obvious to me how to add features to the schema language that
would do reasonably generalized check-digit calculations, and that would be
the sort of more widely useful features that would represent a good 80/20
compromise in terms of power, general utility, simplicity and portability.
  I certainly never meant to discourage discussion, and if I appeared to I
apologize.

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------







"Colin Mackenzie" <colin@elecmc.com>
Sent by: xmlschema-dev-request@w3.org
08/24/02 05:57 AM


        To:     <noah_mendelsohn@us.ibm.com>
        cc:     "Rainer Becker" <r.becker@Nitro-Software.com>,
<xmlschema-dev@w3.org>
        Subject:        RE: limits of regular expressions



Discussion does not lead us down a slippery slope, only perhaps incorrect
decisions made as a result of discussion.
I have found the discussion illuminating already (see your response below).

Of course there will be a limit as to what a schema language can achieve
sensibly.

Of course users would like the schema language to allow as much "early data
checking" as possible (I remember checking the values of HTML form fields
using server side CGI before we had decent client side JavaScript).

Surely there is nothing wrong with debating the issue as it only leads to
further understanding, and possibly the odd good idea for the next version
of schema?

FYI - even a stolen credit card can have a valid number; it is just not
legal to use it. The last time I did any credit/debit card processing (a
number of years ago), most systems only performed the checksum (on the till
or client) during a purchase and only did a "server side" check for stolen
cards/credit limits off-line in batches.

Colin

-----Original Message-----
From: noah_mendelsohn@us.ibm.com [mailto:noah_mendelsohn@us.ibm.com]
Sent: 23 August 2002 18:21
To: Colin Mackenzie
Cc: Rainer Becker; xmlschema-dev@w3.org
Subject: RE: limits of regular expressions


I think this discussion is leading us down a slippery slope.  The schema
recommendation is clear that no language, other than a Turing-complete
programming language, can provide all the validation one might reasonably
want for one application or another.  From section "1.1 Purpose" [1]:

"Any application that consumes well-formed XML can use the XML Schema:
Structures formalism to express syntactic, structural and value
constraints applicable to its document instances. The XML Schema:
Structures formalism allows a useful level of constraint checking to be
described and implemented for a wide spectrum of XML applications.
However, the language defined by this specification does not attempt to
provide all the facilities that might be needed by any application. Some
applications may require constraint capabilities not expressible in this
language, and so may need to perform their own additional validations."

The proposed requirement in this case seems to be to have enough
computational capability to derive some sort of check digit in a credit
card number or similar code.  Well, there will always be things we cannot
validate.  For example, we can make sure that a credit card looks like a
credit card number, to some degree, but we cannot hope to prove that the
card isn't stolen.  That's presumably what it really means for a credit
card number to be valid.
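
For concreteness, a check-digit calculation of the kind under discussion can
be sketched as below. The thread does not name an algorithm; this assumes the
Luhn (mod 10) check commonly used for payment card numbers, and luhn_valid is
a hypothetical helper name. It needs a loop, position-dependent arithmetic
and a running sum, none of which a regular expression can express.

```python
import re


def luhn_valid(number):
    """Return True if a string of ASCII digits passes the Luhn (mod 10) check."""
    if not re.fullmatch(r"[0-9]{2,}", number):
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("79927398713"))
print(luhn_valid("79927398714"))
```

Passing this check says only that the number is well formed; it says nothing
about whether the card exists or has been stolen.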

Consider the requirements of a mathematician.  Would it not be reasonable
for him or her to request the ability to derive a sub type of integer to
be known as "PrimeNumber"?  Are we supposed to validate that -- make sure
the number is prime?

My point is that systems like schema can embody a reasonable level of
checking, but cannot in general meet the validation needs of particular
applications.  Schemas can give you a pre-filter, and some very useful
constraints that aid in mapping to data structures and databases, and that
greatly simplify the validation remaining to be done by applications. Even
our mathematician will be glad that we check for positive integer, which
significantly facilitates the work that he or she then has to do to prove
primeness.
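
The division of labour described above might look like the following sketch.
The pattern is a simplified stand-in for a schema-level positive-integer
check (the real xs:positiveInteger lexical space is slightly wider), and the
names POSITIVE_INTEGER, is_prime and validate_prime are assumptions made for
illustration.

```python
import re

# Simplified stand-in for the pre-filter a schema could apply.
POSITIVE_INTEGER = re.compile(r"[1-9][0-9]*")


def is_prime(n):
    """Application-level check by trial division; fine for small values."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3
    if n % 2 == 0:
        return False
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def validate_prime(text):
    """Schema-style pre-filter first, then the application's own logic."""
    if POSITIVE_INTEGER.fullmatch(text) is None:
        return False  # rejected before any arithmetic is attempted
    return is_prime(int(text))


print(validate_prime("97"))
print(validate_prime("91"))
print(validate_prime("-7"))
```

Here "-7" is rejected by the pre-filter alone, while "91" passes the
pre-filter and is rejected only by the application's primality test.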

Bottom line: I think that regex's represent a very reasonable 80/20 point
in the design space.  They provide a quite powerful and generally useful
level of checking, without requiring that we invent a portable programming
language in which to capture additional logic.  Thank you very much.

[1] http://www.w3.org/TR/xmlschema-1/#intro-purpose

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Tuesday, 27 August 2002 04:16:37 UTC