
Re: Specifying Datatype Atoms in Regular Expressions

From: Kent Johnson <kentj@rsn.hp.com>
Date: Fri, 28 Sep 2001 19:12:29 -0500 (CDT)
To: www-xml-schema-comments@w3.org
Cc: k-kawa@bigfoot.com
Message-ID: <Pine.HPX.4.21.0109281702170.29020-100000@rangebal.rsn.hp.com>
i searched google to find if anyone had brought this up before and found
this post... the webpage that linked to the post that started this
discussion was found at:

http://lists.w3.org/Archives/Public/www-xml-schema-comments/2001JanMar/0425.html

i would like to know if anyone is considering this...  i REALLY think it
should be added to xml schema.. my comments are mixed in below:

> Date: Thu, 29 Mar 2001 13:18:09 -0800
> From: Kohsuke KAWAGUCHI <k-kawa@bigfoot.com>
> To: www-xml-schema-comments@w3.org
> Message-Id: <20010329130101.6747.K-KAWA@bigfoot.com>
> Subject: Re: Specifying Datatype Atoms in Regular Expressions
> 
> 
> I'm not a WG member, so the following is just my personal opinion.

nor am i

> Your proposal might be useful, but it has several flaws.
> 
> 
> > <!-- declare the datatype using the proposed \x{} syntax -->
> > <xsd:simpleType name="Percentage">
> >     <xsd:restriction base="xsd:string">
> >         <xsd:pattern value="\x{xsd:float}%" />
> >     </xsd:restriction>
> > </xsd:simpleType>
> > 
> > <!-- declare an element schema using the Percentage datatype -->
> > <xsd:element name="AVCommand">
> >   <xsd:complexType>
> >     <xsd:attribute name="volume">
> >        <xsd:simpleType>
> >           <xsd:restriction base="Percentage">
> >             <xsd:minInclusive value="12%" />
> >             <xsd:maxInclusive value="45%" />
> >           </xsd:restriction>
> >        </xsd:simpleType>
> >     </xsd:attribute>
> >   </xsd:complexType>
> > </xsd:element>
> 
> First of all, since your "Percentage" type is based on "string" type,
> rather than "float" type, you can't apply minInclusive/maxInclusive
> facets to it. I understand what you want to do, but you can't expect the
> validating processors to understand it.
> 
> So probably your example should be
> 
> <simpleType name="float12-45">
>   <restriction base="float">
>     <minInclusive value="12" />
>     <maxInclusive value="45" />
>   </restriction>
> </simpleType>
> 
> <simpleType>
>   <restriction base="string">
>     <pattern value="\x{float12-45}%" />
>   </restriction>
> </simpleType>

the above is a perfect example.

> Even so, you can't expect the validating processors to validate things
> 
> like (\x{float12-45})+

yes you can.  regular expression parsers have to deal with a similar
problem all the time.  if it had to match a+ and it was given "aaa" where
would it match?  well, the xml schema recommendation says its regexps are
based on the Perl regexps (with a slight tweak).  the Programming Perl
book (the camel book) published by O'Reilly states Rule 1 of regular
expression matching in perl as "The Engine tries to match as far left in
the string as it can..." (my page 60).  so in the "aaa" case it matches on
the first "a", and doesn't care what is left.

however, we aren't trying to match merely part of a line like in perl, we
need to match the whole thing.. so the a+ would be like ^a+$ ... so the
regexp engine would see that "aaa" matches a+ and continue.
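
The difference between matching part of a line and matching the whole
value can be sketched like this. Python's re module stands in for Perl
here (an assumption, the two engines are not identical), and the sample
strings are made up for illustration:

```python
import re

# Unanchored search: the leftmost match wins and the text around it
# is simply ignored.
found = re.search(r"a+", "xaaay")

# Whole-value matching, which is what a schema pattern needs:
# fullmatch behaves like ^a+$.
whole_ok = re.fullmatch(r"a+", "aaa") is not None
whole_bad = re.fullmatch(r"a+", "aaab") is not None

print(found.group(0), found.start(), whole_ok, whole_bad)
```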

now if we wanted to match 2 float12-45's in a row like
\x{float12-45}\x{float12-45} and we were given the string to match as
"1190", in perl we would get a match on "19" even though "11" and
"90" aren't float 12-45's.  but since we need to match from the
beginning, the regexp engine would try to match "1", then "11", then 
"119", then "1190" and then fail, since it hit the end.  standard
business.

but what if we had a float0-99 that was any integer 0 through 99, and we
wanted to match two in a row like \x{float0-99}\x{float0-99} ... on the
string "1190" we would match the first float as "1" and the second as the
second "1" and then fail, since we had "90" left over...

this is the fault of the designer.  you can't string two numbers together
without any punctuation and expect to be able to tell what goes
where.. that's just the way things go, even outside of the
computer realm.  notice when we write dates we say 4-23-1981 or 4/23/1981
not 4231981.. and we have time as 5:17:16 not 51716...
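
Here is a rough sketch of whole-string matching over a sequence of
datatype atoms. Unlike the shortest-first walk described above, it
backtracks over every possible cut and reports all of them, so the
ambiguity of unpunctuated numbers shows up directly. Everything in it is
hypothetical: splits, int_in_range and literal are names invented for
this illustration, and an integer range stands in for a facet-restricted
type like float12-45.

```python
def splits(text, atoms):
    """Yield every way to cut all of `text` into consecutive pieces,
    one piece per atom, where each atom accepts its piece."""
    if not atoms:
        if text == "":
            yield []
        return
    first, rest = atoms[0], atoms[1:]
    # Try the longest piece first, then back off to shorter ones.
    for end in range(len(text), 0, -1):
        piece = text[:end]
        if first(piece):
            for tail in splits(text[end:], rest):
                yield [piece] + tail


def int_in_range(low, high):
    """Hypothetical stand-in for a facet-restricted numeric datatype."""
    def accepts(piece):
        return piece.isascii() and piece.isdigit() and low <= int(piece) <= high
    return accepts


def literal(expected):
    """Atom that accepts exactly one fixed string, e.g. punctuation."""
    return lambda piece: piece == expected


num = int_in_range(0, 99)
print(list(splits("123", [num, num])))                 # two possible cuts
print(list(splits("12.3", [num, literal("."), num])))  # exactly one cut
```

With punctuation between the atoms there is only one way to cut the
value, which is the case the proposal is meant for.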

so if you have punctuation in between, this can be very useful (see
example stated later below).

> > I would also expect that any parser worthy of handling regular
> > expressions as they are currently defined should be able to extend
> > itself to handling this new syntax with a minimum of effort.
> 
> This is definitely no. Because it is very difficult to create regular
> expression of facet-restricted datatype. I would rather say it's
> impossible.

it's definitely yes.  it's just as difficult as all the rest of regular
expression matching.

> I'm sorry to say that, but your proposal is unable to implement.

I think it is quite easily implemented, and in fact I think it should
be.  let me give an example of where it would be especially useful.

say we wanted an attribute in an xml file to be the ip address of a
system.  nowadays what do we use, xsd:string?  that's really not gunna cut
it.  this leaves validation to the person that does the parsing.. or
worse, no validation is implemented, and errors start flying after someone
forgets and puts 256 instead of 255.. so how about regular
expressions?  here is an xml schema simpleType for an ip address using
regexps...

<xsd:simpleType name="IPAddress">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])"/>
    </xsd:restriction>
</xsd:simpleType>

i've never seen something so unmaintainable :)  now let's see an example
with the proposed \x{data_type} addition to the regexps...

<xsd:simpleType name="IPAddress">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="\x{xsd:unsignedByte}\.\x{xsd:unsignedByte}\.\x{xsd:unsignedByte}\.\x{xsd:unsignedByte}"/>
    </xsd:restriction>
</xsd:simpleType>
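
One way a processor might handle such atoms, at least for simple
built-in types, is plain macro expansion: swap each \x{type} for a regex
fragment describing that type and hand the result to the ordinary regexp
engine. The sketch below is only an illustration under assumptions:
ATOM_PATTERNS, expand_atoms and is_valid are invented names, the table
holds a single entry keyed as xsd:unsignedByte (the built-in type whose
value range is 0 to 255), and its fragment is the 0-255 alternation from
the hand-written pattern above rather than the type's full lexical space.

```python
import re

# Hypothetical lookup table: type name -> regex fragment for its values.
ATOM_PATTERNS = {
    "xsd:unsignedByte": r"(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])",
}

# Matches one proposed atom such as \x{xsd:unsignedByte}.
ATOM = re.compile(r"\\x\{([^}]+)\}")


def expand_atoms(pattern):
    """Replace every \\x{type} atom with the fragment for that type."""
    def substitute(match):
        return ATOM_PATTERNS[match.group(1)]
    return ATOM.sub(substitute, pattern)


def is_valid(pattern, text):
    # Schema patterns must match the whole value, hence fullmatch.
    return re.fullmatch(expand_atoms(pattern), text) is not None


IP_PATTERN = (r"\x{xsd:unsignedByte}\.\x{xsd:unsignedByte}"
              r"\.\x{xsd:unsignedByte}\.\x{xsd:unsignedByte}")

print(is_valid(IP_PATTERN, "192.168.0.1"))
print(is_valid(IP_PATTERN, "192.168.0.256"))
```

The schema author writes the short form and the long alternation is
generated once, in one place. Types restricted by facets would need more
than a lookup table, so this covers only the easy half of the proposal.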

..ahhh, that's better... and since we don't have numbers all strung
together, there is no ambiguity between what numbers start and end
where.  can anyone think of why not to add such a feature?  i figured it
would already exist, as it is so obviously needed in my mind... thanks for
the consideration

regards,
kent

> 
> regards,
> ----------------------
> K.Kawaguchi
> E-Mail: k-kawa@bigfoot.com
Received on Friday, 28 September 2001 20:10:59 GMT
