Re: regular expression in XML Schema from Hans Teijgeler on 2003-09-25 (xmlschema-dev@w3.org from September 2003)

From: Hans Teijgeler <hans.teijgeler@quicknet.nl>
Date: Thu, 25 Sep 2003 22:43:37 +0200
To: Jeni Tennison <jeni@jenitennison.com>
Cc: xmlschema-dev@w3.org, "weitz, edi" <edi@agharta.de>, "paap, onno" <onno.paap@ezzysurf.com>
Message-id: <3F735379.E6CC3866@quicknet.nl>
Dear Jeni,

Thank you so much for your extensive and thorough reply!

You asked for more information regarding the behaviour of Spy, and therefore I made a
very simple XML schema called middle-dot-test.xsd:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified"
attributeFormDefault="unqualified">
   <xs:simpleType name="xyz">
      <xs:restriction base="xs:Name">
         <xs:pattern
value="([a-zA-Z][a-zA-Z0-9-]*__)*[a-zA-Z0-9\.\-]+(&#x00B7;[a-zA-Z0-9\.\-]+)?"/>
      </xs:restriction>
   </xs:simpleType>
   <xs:element name="test">
      <xs:complexType>
         <xs:attribute name="abc" type="xyz" use="required"/>
      </xs:complexType>
   </xs:element>
</xs:schema>

(I deliberately used phony names to keep it generic)

and then derived an XML document from that. In that document I entered an identifier
with the Trebuchet MS middle dot:

<?xml version="1.0" encoding="UTF-8"?>
<test xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="D:\middle-dot-test.xsd" abc="ERDL__1234·a8"/>

The error message after validation is then:

This file is not valid
Invalid value for datatype Name in attribute 'identifier'

Question is: where are things going wrong? I hope you can help me out.
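
One way to narrow this down, as a minimal sketch outside Spy, is to try the pattern against the attribute value in Python. This assumes that Python's re module treats this particular expression the same way an XML Schema processor would (it uses only character classes and quantifiers common to both), and that re.fullmatch stands in for the implicit anchoring of xs:pattern. The name PATTERN is only illustrative.

```python
import re

# The xs:pattern value from middle-dot-test.xsd, with the character
# reference &#x00B7; already resolved to the MIDDLE DOT character,
# as an XML parser would hand it to the schema validator.
PATTERN = re.compile(
    r"([a-zA-Z][a-zA-Z0-9-]*__)*[a-zA-Z0-9\.\-]+(\u00B7[a-zA-Z0-9\.\-]+)?"
)

value = "ERDL__1234\u00B7a8"  # the attribute value from the test document

# xs:pattern has to match the whole value, hence fullmatch.
print(bool(PATTERN.fullmatch(value)))                # the test value
print(bool(PATTERN.fullmatch("ERDL__1234\u00B7")))   # nothing after the dot
```

If the first line prints True, the pattern itself accepts the value, which would fit with the error message naming the datatype Name rather than the pattern.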

Regards,
Hans

======================================

Jeni Tennison wrote:

> Hi Hans,
>
> >   1. I still need a document that explains the whole subject of regular
> >      expressions in XML Schema. I read through the concept book of
> >      Eric van der Vlist
> >      (http://books.xmlschemata.org/relaxng/RngBookWxsRegExp.html ) but that book
> >      assumes that I know much more than I do. I need something that starts at
> >      zero, for dummies, with MANY examples. Any suggestions?
>
> Perhaps you should start off with something that addresses regular
> expressions more generally? A search for "regular expression tutorial"
> in Google comes up with a bunch of promising leads; many of them are
> written for Perl or Python, but don't let that put you off: the
> regular expression syntax in XML Schema is *fairly* standard, at least
> for the simple things.
>
> >   2. What is a "CombiningChar" and what is an "Extender"? XML mentions both as
> >      allowable parts of NameChar, but nowhere can I find what they really ARE
> >      and what they are used for. You guys/gals must have read
> >      something that I haven't, so apparently you know it (if not, why didn't you
> >      ask or complain?)
>
> I assume that you've been looking at the XML Recommendation and found
> these. In XML terms, a "CombiningChar" is defined as one of the
> characters listed at:
>
>   http://www.w3.org/TR/REC-xml#NT-CombiningChar
>
> and an "Extender" is one of the characters listed at:
>
>   http://www.w3.org/TR/REC-xml#NT-Extender
>
> In more abstract terms, combining characters and extenders are
> particular kinds of character as defined in Unicode. They are both
> kinds of characters that combine with preceding characters, creating
> different glyphs when you view a string.
>
> Combining characters are characters that add things like accents to
> preceding characters; for example, the character COMBINING RING ABOVE
> #x030A is a combining character; when you combine it with the
> character 'a' you see 'å'.
>
> Extenders are characters that extend the shape of preceding
> characters; for example, the character MIDDLE DOT #x00B7 is an
> extender; when you combine it with the character 'L' you see 'Ŀ'
> (which if it doesn't show up in your font is an L with a dot in the
> middle of the glyph).
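
The two examples above can be checked with Python's standard unicodedata module. This is only an illustrative sketch: the choice of Python, and of U+013F (LATIN CAPITAL LETTER L WITH MIDDLE DOT) as the precomposed counterpart of 'L' plus MIDDLE DOT, are assumptions and not part of the original message.

```python
import unicodedata

ring = "\u030A"        # COMBINING RING ABOVE
middle_dot = "\u00B7"  # MIDDLE DOT

# Print the Unicode name and general category of each character.
print(unicodedata.name(ring), unicodedata.category(ring))
print(unicodedata.name(middle_dot), unicodedata.category(middle_dot))

# Under NFC, 'a' followed by the combining ring composes to the
# single character U+00E5 (a with ring above).
composed = unicodedata.normalize("NFC", "a" + ring)
print(composed == "\u00E5")

# U+013F has a compatibility decomposition to 'L' + MIDDLE DOT,
# so NFKD splits it into those two characters.
decomposed = unicodedata.normalize("NFKD", "\u013F")
print(decomposed == "L" + middle_dot)
```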
>
> If you really want to know more, immerse yourself in www.unicode.org.
> Personally, I found the most valuable information there concerning
> combining characters and extenders was the explanation of Unicode
> normalization, which you can find at:
>
>   http://www.unicode.org/reports/tr15
>
> >   3. I want to separate the first part of the identifier
> >      ([a-zA-Z][a-zA-Z0-9-]*__)*[a-zA-Z0-9.-]+  from the second (optional) part
> >      ([a-zA-Z0-9.-]+)? by means of a character that normally isn't used in
> >      system identifiers. So I chose the "middle dot" (#x00B7). I have three
> >      questions:
> >        1. Is the way it has now been introduced in the above RegEx correct?
>
> Yes, that's fine, since you're using it in an XML document. You're
> using an XML character reference (&#x00B7;). This is interpreted when
> the XML Schema document is parsed; as far as the application (the
> schema validator) is concerned, the regular expression actually
> includes the MIDDLE DOT character itself.
>
> You will probably run into problems if you use that syntax in a
> regular expression that *isn't* held in an XML document, however. So
> if you're using the Regex Coach, for example, you need to use a
> different kind of escaping to include the character. I think that
> \u00B7 might work...
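
To illustrate the difference between the two kinds of escaping, here is a small sketch in Python (an assumption; the Regex Coach is not involved, and the pattern is a shortened, made-up one). The XML parser resolves the character reference before the pattern is ever seen, whereas a regex engine outside XML treats `&#x00B7;` as eight literal characters and needs its own escape. Python's re module accepts \u00B7; other engines may use a different syntax.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# A cut-down xs:pattern element holding a shortened, hypothetical pattern.
doc = f'<xs:pattern xmlns:xs="{XS}" value="[a-z]+&#x00B7;[a-z]+"/>'
pattern_from_xml = ET.fromstring(doc).get("value")

# After parsing, the reference has become the MIDDLE DOT character itself.
print("\u00B7" in pattern_from_xml)
print("&#x" in pattern_from_xml)

# Outside XML, the regex engine's own escape is needed; the XML
# reference is just a run of literal characters to it.
print(bool(re.fullmatch(r"[a-z]+\u00B7[a-z]+", "ab\u00B7cd")))
print(bool(re.fullmatch(r"[a-z]+&#x00B7;[a-z]+", "ab\u00B7cd")))
```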
>
> >        2. If I make an XML document based on an XML Schema (e.g. in Spy), how
> >           can I fill in such a middle dot as part of a Name? I have tried
> >           everything I could think of, but with no success
>
> Where does this Name appear? If it's in the value of an attribute or
> in text within an element, then you can use the character reference
> &#x00B7;. Again, this character reference will be interpreted when the
> document is parsed and as far as the application can tell it's
> precisely the same as inserting the MIDDLE DOT character literally in
> the attribute or text.
>
> If you're using the identifier as the name of an element or attribute,
> then you can't use the character reference and have to insert the
> character literally in the XML document. If you're using Windows, you
> can do this using the Character Map utility or by typing Alt+0183 (on
> the numeric keypad).
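
A short sketch of both cases, using Python's xml.etree.ElementTree as a stand-in for any XML parser (an assumption; XML Spy may report errors differently). The last line also shows that Alt+0183 corresponds to the same code point, assuming the Windows-1252 code page.

```python
import xml.etree.ElementTree as ET

# In an attribute value, the reference and the literal character
# are equivalent once the document has been parsed.
with_reference = ET.fromstring('<test abc="ERDL__1234&#x00B7;a8"/>')
with_literal = ET.fromstring('<test abc="ERDL__1234\u00B7a8"/>')
print(with_reference.get("abc") == with_literal.get("abc"))

# In an element name, a character reference is not allowed at all,
# so the parser rejects the document as not well-formed.
try:
    ET.fromstring("<a&#x00B7;b/>")
    reference_in_name_ok = True
except ET.ParseError:
    reference_in_name_ok = False
print(reference_in_name_ok)

# 183 decimal is B7 hexadecimal; in Windows-1252 that byte is MIDDLE DOT.
print(bytes([183]).decode("cp1252") == "\u00B7")
```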
>
> >        3. To what extent does the font play a role? I found a middle dot in the
> >           Windows Character Map under Trebuchet MS (called U+00B7 Middle Dot),
> >           but Spy didn't accept that
>
> The font determines whether a glyph is available for a particular
> character or not: if a font doesn't have a glyph for a character, you
> might see a question mark or an empty box or something instead of the
> actual character. (You should beware of the fact that some fonts use
> glyphs for particular characters that are completely unrelated to what
> the character actually is: that's most obviously the case with the
> various Wingdings fonts, for example.)
>
> When you use the Character Map to select a character, it shouldn't
> make any difference what font you use when selecting the character; if
> the character isn't available in the font that you're using where you
> *paste* the character, you'll see the question mark or empty box
> instead.
>
> I'm not sure what XML Spy did when you tried to use that character --
> what "didn't accept that" actually means. If you provide more
> information about what you tried and what error XML Spy gave you, we
> might be able to help.
>
> FWIW, I found Mike Brown's "XML Tutorial", which focuses on issues of
> character encoding and so on, really helpful in getting me to
> understand how characters work in XML. You can find it at:
>
>   http://skew.org/xml/tutorial/
>
> Cheers,
>
> Jeni
>
> ---
> Jeni Tennison
> http://www.jenitennison.com/
Received on Thursday, 25 September 2003 16:40:06 UTC