Re: [xml-dev] Internationalising Regular Expressions from John Cowan on 2002-10-30 (xmlschema-dev@w3.org from October 2002)

From: John Cowan <jcowan@reutershealth.com>
Date: Wed, 30 Oct 2002 06:58:11 -0500 (EST)
To: AndrewWatt2000@aol.com
Cc: xml-dev@lists.xml.org, xmlschema-dev@w3.org, www-forms@w3.org
Message-Id: <200210301211.HAA10571@mail2.reutershealth.com>

AndrewWatt2000@aol.com scripsit:

> So, <xsd:pattern value="\w" /> would match many (unwanted) characters that
> <xsd:pattern value="[A-Za-z0-9_]" /> would reject as non-matching. Correct?

Definitely.
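
A rough illustration of the difference, sketched in Python rather than
in a schema processor. This assumes Python's \w is a fair stand-in for
XML Schema's \w; the two differ in detail (XML Schema's \w excludes all
punctuation, which includes the underscore), but both accept letters
and digits well outside ASCII. The names unicode_word and ascii_word
are made up for this example.

```python
import re

# Assumption: Python's Unicode-aware \w approximates XML Schema's \w.
unicode_word = re.compile(r"\w")
# The explicit ASCII "word character" class from the question.
ascii_word = re.compile(r"[A-Za-z0-9_]")

# a, Z, 7, then e-acute, sharp s, Greek capital omega, and a CJK ideograph.
samples = ["a", "Z", "7", "\u00e9", "\u00df", "\u03a9", "\u65e5"]

for ch in samples:
    # Print whether each class accepts the character on its own.
    print(ch, bool(unicode_word.fullmatch(ch)), bool(ascii_word.fullmatch(ch)))
```

The non-ASCII samples are accepted by the first pattern and rejected
by the second.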

> In W3C XML Schema, and therefore in XForms, is it correct that the only way 
> to express the notion of an English language / ASCII "word character" in a 
> regular expression is using [A-Za-z0-9_]? 

Correct.

> Is there any facility to express the notion of, for example, a French word 
> character? Or German?

You'd have to concoct a similar character class, and there is always
a measure of controversy about these things.  The standard English spellings of
"naïve" and "façade" require letters outside [A-Za-z], and so does
one spelling of "coöperate".
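
As a sketch of what "concocting" such a class might look like, here is
one possible choice in Python: ASCII letters plus the letters of the
Latin-1 Supplement block. The class and the names ascii_letters and
latin1_letters are hypothetical, not anything a standard defines, and
the last sample word shows why the choice is open to argument.

```python
import re

ascii_letters = re.compile(r"[A-Za-z]+")
# Hypothetical class: ASCII letters plus U+00C0-U+00FF, skipping the
# multiplication sign (U+00D7) and the division sign (U+00F7).
latin1_letters = re.compile(
    r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+"
)

# "naive" with i-diaeresis, "facade" with c-cedilla, "cooperate" with
# o-diaeresis, and French "coeur" written with the oe ligature (U+0153).
words = ["na\u00efve", "fa\u00e7ade", "co\u00f6perate", "c\u0153ur"]

for w in words:
    print(w, bool(ascii_letters.fullmatch(w)), bool(latin1_letters.fullmatch(w)))
```

The oe ligature lies outside Latin-1, so this particular class rejects
a common French word; a class meant for French would have to add it,
and opinions on what else to add will differ.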

> Or is the \p{Basic_Latin} the smallest / most precise 
> "chunk" of characters that can be used in such a setting?

That certainly doesn't do what you want: it matches any ASCII character,
punctuation and control characters included, and rejects all the
non-ASCII ones.
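
A small sketch of why the block is the wrong tool. In XML Schema's
syntax the block escape is written \p{IsBasicLatin}; Python's re module
has no \p{...} escapes, so the range [\x00-\x7F] is used here as an
assumed equivalent for the Basic Latin block (U+0000 through U+007F).
The name basic_latin is made up for this example.

```python
import re

# Stand-in for the Basic Latin block, U+0000 through U+007F.
basic_latin = re.compile(r"[\x00-\x7F]")

# A letter, an underscore, punctuation, a space, a tab, and e-acute.
samples = ["a", "_", "!", " ", "\t", "\u00e9"]

for ch in samples:
    # repr() keeps the whitespace samples visible in the output.
    print(repr(ch), bool(basic_latin.fullmatch(ch)))
```

Everything but the last sample is accepted, so the block is far
broader than a "word character" class on the ASCII side while still
excluding every accented letter.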

-- 
We call nothing profound                        jcowan@reutershealth.com
that is not wittily expressed.                  John Cowan
        --Northrop Frye (improved)              http://www.reutershealth.com

Received on Wednesday, 30 October 2002 07:00:17 UTC