- From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
- Date: 04 Nov 2002 10:32:45 +0000
- To: www-xml-schema-comments@w3.org
ht@cogsci.ed.ac.uk (Henry S. Thompson) writes:
> From: AndrewWatt2000@aol.com
> Subject: Internationalising Regular Expressions
> To: xml-dev@lists.xml.org, xmlschema-dev@w3.org, www-forms@w3.org
> Date: Wed, 30 Oct 2002 05:30:05 EST
> Resent-From: xmlschema-dev@w3.org
>
>
> This question arises, in part, out of thinking how XForms may handle derived
> datatypes. As the likely audience of English-language Web sites widens,
> capturing and validating text data entered in forms, including XForms,
> becomes a little more complex.
>
> XForms uses the W3C XML Schema <xsd:pattern> element to provide derivation by
> restriction.
>
> Suppose, in English, we want to limit characters to "word characters"; we
> might write something like
> <xsd:pattern value="[A-Za-z0-9_]" />
>
> In JavaScript, for example, I might write [A-Za-z0-9_] more succinctly as
> \w.
>
> However, if I understand Appendix F.1.1 of W3C XML Schema Part 2 the \w
> metacharacter now covers Unicode [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}].
>
> So, <xsd:pattern value="\w" /> would match many (unwanted) characters that
> <xsd:pattern value="[A-Za-z0-9_]" /> would reject as non-matching. Correct?
Unwanted by whom? XML uses Unicode, and the W3C XML Schema pattern \w
matches Unicode word characters, as nearly as possible.
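To make the difference concrete, here is a rough sketch in Python (not part of the original exchange). It assumes the definition quoted above, [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}], and approximates it with the general categories reported by the standard unicodedata module. The function names are made up for illustration, and results for individual characters depend on the Unicode version shipped with your Python.

```python
import string
import unicodedata

# The ASCII-only class from the question: [A-Za-z0-9_]
ASCII_WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_ascii_word_char(ch: str) -> bool:
    """True if ch is in [A-Za-z0-9_]."""
    return ch in ASCII_WORD_CHARS


def is_xsd_word_char(ch: str) -> bool:
    """Approximation of XSD \\w: any character whose general category
    is not punctuation (P*), separator (Z*) or other (C*)."""
    return unicodedata.category(ch)[0] not in "PZC"


if __name__ == "__main__":
    # Print category and both verdicts for a few sample characters.
    for ch in ["a", "7", "_", "\u00e9", "+", " ", "-"]:
        print(repr(ch), unicodedata.category(ch),
              is_ascii_word_char(ch), is_xsd_word_char(ch))
```

Under this reading the two sets differ in both directions: accented letters and symbols such as "+" fall inside the \w definition, while "_" falls outside it because its general category is Pc, a punctuation category.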
> A couple of other questions arising from this tentative understanding.
>
> In W3C XML Schema, and therefore in XForms, is it correct that the only way
> to express the notion of an English language / ASCII "word character" in a
> regular expression is using [A-Za-z0-9_]? Or, in other words, is there a
> metacharacter which corresponds to the JavaScript use of \w?
No.
> Is there any facility to express the notion of, for example, a French word
> character? Or German? Or is \p{IsBasicLatin} the smallest / most precise
> "chunk" of characters that can be used in such a setting?
I presume so. Patterns are fundamentally about characters, not
languages -- you can approximate one with the other, but it's
dangerous to try -- do you really want to rule out, for example,
coöperation and being naïve, to say nothing of going to Montréal or
Bogotá?
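A quick way to see the danger is to run those words through an ASCII-only class. The sketch below uses Python's re module purely for illustration: Python's \w is Unicode-aware for str patterns but is not the same set as the XML Schema \w, and the helper name check is hypothetical.

```python
import re
import unicodedata

ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")  # the ASCII-only class
UNICODE_WORD = re.compile(r"\w+")          # Python's \w, not XSD's


def check(word: str) -> tuple:
    """Return (matches ASCII class, matches Python \\w) for the whole word."""
    # Normalise to NFC so accented letters are single precomposed characters.
    w = unicodedata.normalize("NFC", word)
    return (ASCII_WORD.fullmatch(w) is not None,
            UNICODE_WORD.fullmatch(w) is not None)


if __name__ == "__main__":
    for word in ["cooperation", "co\u00f6peration", "na\u00efve",
                 "Montr\u00e9al", "Bogot\u00e1"]:
        print(word, check(word))
```

Only the unaccented spelling passes the ASCII class; all of the accented forms are rejected by it, although they are perfectly ordinary in English text.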
ht
--
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2002, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
Received on Monday, 4 November 2002 05:32:47 UTC