Re: regular expression in XML Schema from Jeni Tennison on 2003-09-25 (xmlschema-dev@w3.org from September 2003)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Thu, 25 Sep 2003 17:28:21 +0100
To: Hans Teijgeler <hans.teijgeler@quicknet.nl>
Cc: xmlschema-dev@w3.org
Message-ID: <92509044206.20030925172821@jenitennison.com>

Hi Hans,

>   1. I still need some document in which the whole subject of the Regular
>      Expressions in XML Schema is explained. I read through the concept book of
>      Eric van der Vlist
>      (http://books.xmlschemata.org/relaxng/RngBookWxsRegExp.html ) but that book
>      assumes that I know much more than I do. I need something that starts at
>      zero, for dummies, with MANY examples. Any suggestions?

Perhaps you should start off with something that addresses regular
expressions more generally? A search for "regular expression tutorial"
in Google comes up with a bunch of promising leads; many of them are
written for Perl or Python, but don't let that put you off: the
regular expression syntax in XML Schema is *fairly* standard, at least
for the simple things.

>   2. What is a "combiningchar" and what an "extender"? It is being talked about
>      in XML as being an allowable part of Namechar, but nowhere I can find what
>      it really IS and what it is used for. You guys/gals must have read
>      something that I haven't, so apparently you know it (if not, why didn't you
>      ask or complain?)

I assume that you've been looking at the XML Recommendation and found
these. In XML terms, a "CombiningChar" is defined as one of the
characters listed at:

  http://www.w3.org/TR/REC-xml#NT-CombiningChar

and an "Extender" is one of the characters listed at:

  http://www.w3.org/TR/REC-xml#NT-Extender

In more abstract terms, combining characters and extenders are
particular kinds of character as defined in Unicode. They are both
kinds of characters that combine with preceding characters, creating
different glyphs when you view a string.

Combining characters are characters that add things like accents to
preceding characters; for example, the character COMBINING RING ABOVE
#x030A is a combining character; when you combine it with the
character 'a' you see 'å'.
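As a rough illustration (Python's standard unicodedata module is my
choice here, not something the XML Recommendation refers to), you can
inspect the properties of COMBINING RING ABOVE and see that 'a'
followed by it is still two characters, even though it displays as one
glyph:

```python
import unicodedata

base = "a"
ring = "\u030A"          # COMBINING RING ABOVE
sequence = base + ring   # two code points, normally drawn as one glyph

ring_name = unicodedata.name(ring)
ring_category = unicodedata.category(ring)   # "Mn" means Mark, nonspacing
ring_class = unicodedata.combining(ring)     # nonzero for combining marks

print(ring_name, ring_category, ring_class)
print(sequence, len(sequence))
```

Whether the printed sequence actually looks like a single accented
letter depends on your terminal and font.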

Extenders are characters that extend the shape of preceding
characters; for example, the character MIDDLE DOT #x00B7 is an
extender; when you combine it with the character 'L' you see 'Ŀ'
(which, if it doesn't show up in your font, is an L with a dot in the
middle of the glyph).

If you really want to know more, immerse yourself in www.unicode.org.
Personally, I found the most valuable information there concerning
combining characters and extenders was the explanation of Unicode
normalization, which you can find at:

  http://www.unicode.org/reports/tr15
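To give a flavour of what normalization does, here is a small sketch,
again assuming Python's unicodedata module purely for illustration. It
compares the precomposed 'å' with the two-character sequence, and
shows that LATIN CAPITAL LETTER L WITH MIDDLE DOT (#x013F) only breaks
down into 'L' plus MIDDLE DOT under the compatibility forms:

```python
import unicodedata

precomposed = "\u00E5"    # LATIN SMALL LETTER A WITH RING ABOVE
sequence = "a\u030A"      # 'a' followed by COMBINING RING ABOVE

# The raw strings differ, but they agree once both are normalized.
same_raw = precomposed == sequence
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", sequence))
same_nfd = (unicodedata.normalize("NFD", precomposed)
            == unicodedata.normalize("NFD", sequence))

l_with_dot = "\u013F"     # LATIN CAPITAL LETTER L WITH MIDDLE DOT
canonical = unicodedata.normalize("NFD", l_with_dot)        # left as is
compatibility = unicodedata.normalize("NFKD", l_with_dot)   # 'L' + MIDDLE DOT

print(same_raw, same_nfc, same_nfd)
print([hex(ord(c)) for c in compatibility])
```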

>   3. I want to separate the first part of the identifier
>      ([a-zA-Z][a-zA-Z0-9-]*__)*[a-zA-Z0-9.-]+  from the second (optional) part
>      ([a-zA-Z0-9.-]+)? by means of a character that normally isn't used in
>      system identifiers. So I chose the "middle dot" (#x00B7). I have three
>      questions:
>        1. Is the way it has now been introduced in the above RegEx correct?

Yes, that's fine, since you're using it in an XML document. You're
using an XML character reference (&#x00B7;). This is interpreted when
the XML Schema document is parsed; as far as the application (the
schema validator) is concerned, the regular expression actually
includes the MIDDLE DOT character itself.
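You can check this for yourself with any XML parser. The sketch below
uses Python's xml.etree.ElementTree; the xs:pattern fragment and its
value are made up for the example, not taken from your schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a pattern facet whose value uses a character reference.
schema_fragment = (
    '<xs:pattern xmlns:xs="http://www.w3.org/2001/XMLSchema" '
    'value="[a-z]+&#x00B7;[a-z]+"/>'
)

pattern_element = ET.fromstring(schema_fragment)
pattern_value = pattern_element.get("value")

# The parser has already replaced the reference with the character itself.
print(repr(pattern_value))
print("\u00B7" in pattern_value, "&#x00B7;" in pattern_value)
```

By the time anything downstream of the parser looks at the value, the
reference is gone and only the MIDDLE DOT remains.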

You will probably run into problems if you use that syntax in a
regular expression that *isn't* held in an XML document, however. So
if you're using the Regex Coach, for example, you need to use a
different kind of escaping to include the character. I think that
\u00B7 might work...
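For instance, Python's re module does understand \u00B7 in a pattern.
The sketch below is only a guess at how the two parts of your
identifier fit together (I'm assuming the middle dot and the second
part are jointly optional). Note that XML Schema patterns always have
to match the whole value, so I've used fullmatch() to mimic that:

```python
import re

# Assumed combination of the two parts quoted above, joined by MIDDLE DOT.
identifier = re.compile(
    r"([a-zA-Z][a-zA-Z0-9-]*__)*[a-zA-Z0-9.-]+(\u00B7[a-zA-Z0-9.-]+)?"
)

def is_valid(candidate: str) -> bool:
    """Return True if the whole candidate string matches the pattern."""
    return identifier.fullmatch(candidate) is not None

for candidate in ["abc__item.1\u00B7part-2", "item", "item\u00B7",
                  "item&#x00B7;part"]:
    print(repr(candidate), is_valid(candidate))
```

The last candidate fails because, outside an XML document, "&#x00B7;"
is just eight ordinary characters.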

>        2. If I make an XML document based on an XML Schema (e.g. in Spy), how
>           can I fill in such a middle dot as part of a Name? I have tried
>           everything I could think of, but with no success

Where does this Name appear? If it's in the value of an attribute or
in text within an element, then you can use the character reference
&#x00B7;. Again, this character reference will be interpreted when the
document is parsed and as far as the application can tell it's
precisely the same as inserting the MIDDLE DOT character literally in
the attribute or text.

If you're using the identifier as the name of an element or attribute,
then you can't use the character reference and have to insert the
character literally in the XML document. If you're using Windows, you
can do this using the Character Map utility or by typing Alt+0183 (on
the numeric keypad).
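Here's a quick way to see both cases, sketched with Python's
xml.etree.ElementTree (the element and attribute names are invented
for the example):

```python
import xml.etree.ElementTree as ET

# In an attribute value, the reference and the literal character are equivalent.
with_reference = ET.fromstring('<item id="a&#x00B7;b"/>').get("id")
with_literal = ET.fromstring('<item id="a\u00B7b"/>').get("id")
values_match = with_reference == with_literal

# In an element name, only the literal character is allowed.
literal_name = ET.fromstring("<part\u00B7one/>").tag

try:
    ET.fromstring("<part&#x00B7;one/>")
    reference_in_name_ok = True
except ET.ParseError:
    # A character reference inside a name is not well-formed XML.
    reference_in_name_ok = False

print(values_match, repr(literal_name), reference_in_name_ok)
```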

>        3. In how far does the font type play a role? I found a middle dot in the
>           Windows Character Map under Trebuchet MS (called U+00B7 Middle Dot),
>           but Spy didn't accept that

The font determines whether a glyph is available for a particular
character or not: if a font doesn't have a glyph for a character, you
might see a question mark or an empty box or something instead of the
actual character. (You should beware of the fact that some fonts use
glyphs for particular characters that are completely unrelated to what
the character actually is: that's most obviously the case with the
various Wingdings fonts, for example.)

When you use the Character Map to select a character, it shouldn't
make any difference what font you use when selecting the character; if
the character isn't available in the font that you're using where you
*paste* the character, you'll see the question mark or empty box
instead.

I'm not sure what XML Spy did when you tried to use that character --
what "didn't accept that" actually means. If you provide more
information about what you tried and what error XML Spy gave you, we
might be able to help.

FWIW, I found Mike Brown's "XML Tutorial", which focuses on issues of
character encoding and so on, really helpful in getting me to
understand how characters work in XML. You can find it at:

  http://skew.org/xml/tutorial/

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/
Received on Thursday, 25 September 2003 12:28:55 UTC