
Re: [AndrewWatt2000@aol.com] Internationalising Regular Expressions

From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
Date: 04 Nov 2002 10:32:45 +0000
To: www-xml-schema-comments@w3.org
Message-ID: <f5br8e1vd4i.fsf@erasmus.inf.ed.ac.uk>

ht@cogsci.ed.ac.uk (Henry S. Thompson) writes:

> From: AndrewWatt2000@aol.com
> Subject: Internationalising Regular Expressions
> To: xml-dev@lists.xml.org, xmlschema-dev@w3.org, www-forms@w3.org
> Date: Wed, 30 Oct 2002 05:30:05 EST
> Gnus-Warning: This is a duplicate of message <ad.258e4e54.2af10ead@aol.com>
> Resent-From: xmlschema-dev@w3.org
> 
> 
> This question arises, in part, out of thinking how XForms may handle derived 
> datatypes. As the likely audience of English language based Web sites widens, 
> capturing and validating text data entered in forms, including XForms, 
> becomes a little more complex.
> 
> XForms uses the W3C XML Schema <xsd:pattern> element to provide derivation by 
> restriction.
> 
> Suppose, in English, we want to limit characters to "word characters". We 
> might write something like:
> <xsd:pattern value="[A-Za-z0-9_]" />
> 
> In JavaScript, for example, I might write [A-Za-z0-9_]  more succinctly as 
> \w.
> 
> However, if I understand Appendix F.1.1 of W3C XML Schema Part 2 correctly, 
> the \w metacharacter now covers Unicode [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}].
> 
> So, <xsd:pattern value="\w" /> would match many (unwanted) characters that
> <xsd:pattern value="[A-Za-z0-9_]" /> would reject as non-matching. Correct?

Unwanted by whom?  XML uses Unicode, so the W3C XML Schema pattern \w
matches Unicode word characters, as near as possible.
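
To make the quoted definition concrete, here is a rough sketch in Python
(not a schema processor) that tests a single character against
[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}] using the general category from the
standard unicodedata module.  The function names are made up for
illustration, and the Unicode tables shipped with Python are not
necessarily the version a given schema processor uses, so treat edge
cases with care.

```python
import unicodedata

ASCII_WORD_CHARS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789_"
)


def is_ascii_word_char(ch: str) -> bool:
    # Equivalent to the character class [A-Za-z0-9_].
    return ch in ASCII_WORD_CHARS


def is_xsd_word_char(ch: str) -> bool:
    # Approximation of the schema definition of the word-character escape:
    # every character except those whose general category starts with
    # P (punctuation), Z (separators) or C (control, unassigned, etc.).
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return unicodedata.category(ch)[0] not in "PZC"


if __name__ == "__main__":
    for ch in ["a", "9", "\u00e9", "_", "-", " ", "+"]:
        print(repr(ch), unicodedata.category(ch),
              is_ascii_word_char(ch), is_xsd_word_char(ch))
```

One consequence worth noticing: under this reading the underscore
(category Pc, connector punctuation) is not a word character, while a
symbol such as "+" (category Sm) is, so the two patterns differ in both
directions.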

> A couple of other questions arising from this tentative understanding.
> 
> In W3C XML Schema, and therefore in XForms, is it correct that the only way 
> to express the notion of an English language / ASCII "word character" in a 
> regular expression is using [A-Za-z0-9_]? Or, in other words, is there a 
> metacharacter which corresponds to the JavaScript use of \w?

No.

> Is there any facility to express the notion of, for example, a French word 
> character? Or German? Or is the \p{Basic_Latin} the smallest / most precise 
> "chunk" of characters that can be used in such a setting?

I presume so.  Patterns are fundamentally about characters, not
languages -- you can approximate one with the other, but it's
dangerous to try -- do you really want to rule out, for example,
coöperation and being naïve, to say nothing of going to Montréal or
Bogotá?
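
As an illustration of that danger, the following Python sketch checks
the words above against the ASCII class and against a Unicode-aware
word class.  Note the assumptions: Python's own \w is used here only as
a stand-in, and it is not identical to the schema \w (it accepts the
underscore and rejects symbols); classify() is a hypothetical helper,
and the input is normalised to NFC so that accented letters are single
code points.

```python
import re
import unicodedata

# The ASCII-only class from the original question.
ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")

# Python 3 str patterns treat \w as Unicode-aware by default.
UNICODE_WORD = re.compile(r"\w+")


def classify(word: str) -> tuple[bool, bool]:
    # Returns (matches ASCII class, matches Unicode-aware class)
    # for the whole word after NFC normalisation.
    w = unicodedata.normalize("NFC", word)
    return (
        ASCII_WORD.fullmatch(w) is not None,
        UNICODE_WORD.fullmatch(w) is not None,
    )


if __name__ == "__main__":
    for word in ["cooperation", "co\u00f6peration", "na\u00efve",
                 "Montr\u00e9al", "Bogot\u00e1"]:
        print(word, classify(word))
```

Every accented spelling fails the ASCII class but passes the
Unicode-aware one, which is the trade-off in question.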

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2002, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
		     URL: http://www.ltg.ed.ac.uk/~ht/
 [mail really from me _always_ has this .sig -- mail without it is forged spam]
Received on Monday, 4 November 2002 05:32:47 GMT
