Specifying Datatype Atoms in Regular Expressions

I recently raised, on the xml.org xml-dev list, the issue of deriving an
attribute datatype that combines a float value with a string
units-of-measurement designation.  After some discussion, the best
solution currently available with the proposed recommendation appears to
be declaring a string attribute and then applying an appropriate pattern
restriction to limit the attribute to the allowable digits, decimal
points and units designations in the proper order.
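
For reference, a minimal sketch of that workaround for a percentage
value might look like the following.  The type name PercentageString and
the exact pattern are my own illustrative assumptions, and the pattern
is deliberately simplified: it ignores the signs and exponents that
xsd:float would otherwise accept.

```xml
<!-- workaround sketch: a plain string restricted by a pattern -->
<xsd:simpleType name="PercentageString">
    <xsd:restriction base="xsd:string">
        <!-- one or more digits, an optional fraction, then a required '%' -->
        <xsd:pattern value="[0-9]+(\.[0-9]+)?%" />
    </xsd:restriction>
</xsd:simpleType>
```

A value such as "25%" or "12.5%" matches this pattern, but the processor
still treats the whole value as a string.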

The point was then made that doing this eliminates the possibility of
type checking the attribute values against any inclusivity and
exclusivity facets in a type-specific manner, i.e. one cannot specify an
upper or lower limit on the float portion of the attribute.

Without getting into a discussion of all the possible potholes one can
fall into, especially with regard to unit conversions before applying
inclusivity/exclusivity facets, my question to the W3C Working Group is:
"Is there merit to allowing a datatype to be specified as an atom in a
regular expression?"

I suggest something similar to the Unicode category escapes, e.g.
\p{Lu}, which matches all uppercase letters.  For discussion purposes,
let's assume \x{datatype_name} is the adopted syntax.  A very simple
example follows.  It declares a datatype representing a percentage value
with the '%' required in the actual XML, uses that datatype to define an
element's attribute, and then shows an XML element that would validate
against the schema:

<!-- declare the datatype using the proposed \x{} syntax -->
<xsd:simpleType name="Percentage">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="\x{xsd:float}%" />
    </xsd:restriction>
</xsd:simpleType>

<!-- declare an element schema using the Percentage datatype -->
<xsd:element name="AVCommand">
  <xsd:complexType>
    <xsd:attribute name="volume">
       <xsd:simpleType>
          <xsd:restriction base="Percentage">
            <xsd:minInclusive value="12%" />
            <xsd:maxInclusive value="45%" />
          </xsd:restriction>
       </xsd:simpleType>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>

<!-- An actual element as defined in an XML file  -->
<AVCommand volume="25%"/>


The clarity, definitiveness, and simple readability of the above syntax
for someone working with actual XML files allow XML programmers to more
easily self-document the use of the applicable schemas in their specific
application.  In addition, the attribute can be fully validated, both
that the float value lies within the inclusive range and that the
required units designation is in place.

Obviously, this example could easily be expanded to include a number of
different units of measure or even go beyond the units of measure
paradigm.  Restricted string attributes could be built up using
different enumerated string simpleTypes in different positions to ensure
a particular attribute order in a single string -- something I believe
is also a point of discussion regarding the XML Schema specification.  
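
As a rough sketch of both ideas, still assuming the hypothetical \x{}
syntax (which is not part of the proposed recommendation and which no
schema processor accepts today), a length value could pair a float atom
with an enumerated units atom.  The names LengthUnit and Length and the
list of units are illustrative assumptions only:

```xml
<!-- hypothetical: an enumerated string type listing the allowed units -->
<xsd:simpleType name="LengthUnit">
    <xsd:restriction base="xsd:string">
        <xsd:enumeration value="mm" />
        <xsd:enumeration value="cm" />
        <xsd:enumeration value="in" />
    </xsd:restriction>
</xsd:simpleType>

<!-- hypothetical: a float atom followed by a LengthUnit atom,
     written with the proposed \x{} syntax -->
<xsd:simpleType name="Length">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="\x{xsd:float}\x{LengthUnit}" />
    </xsd:restriction>
</xsd:simpleType>
```

Under this sketch a value such as "2.5cm" would be expected to validate,
while "2.5" or "cm2.5" would not.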

I would also expect that any parser capable of handling regular
expressions as they are currently defined could be extended to handle
this new syntax with a minimum of effort.

Surely the two uses described above are not the only cases that can
benefit from a regular expression syntax that would allow separate atoms
to be validated against specific datatypes.

Is there merit to allowing a datatype to be specified as an atom in a
regular expression?

-- 
Steve Rosenberry
Sr. Partner

Electronic Solutions Company -- For the Home of Integration
http://ElectronicSolutionsCo.com

http://BetterGoBids.com -- The Premier GoTo Bid Management Tool

(610) 670-1710

Received on Thursday, 29 March 2001 14:09:04 UTC