- From: Henry S. Thompson <ht@cogsci.ed.ac.uk>
- Date: 04 Nov 2002 10:32:45 +0000
- To: www-xml-schema-comments@w3.org
ht@cogsci.ed.ac.uk (Henry S. Thompson) writes:

> From: AndrewWatt2000@aol.com
> Subject: Internationalising Regular Expressions
> To: xml-dev@lists.xml.org, xmlschema-dev@w3.org, www-forms@w3.org
> Date: Wed, 30 Oct 2002 05:30:05 EST
> Gnus-Warning: This is a duplicate of message <ad.258e4e54.2af10ead@aol.com>
> Resent-From: xmlschema-dev@w3.org
>
> This question arises, in part, out of thinking how XForms may handle derived
> datatypes. As the likely audience of English language based Web sites widens,
> capturing and validating text data entered in forms, including XForms,
> becomes a little more complex.
>
> XForms uses the W3C XML Schema <xsd:pattern> element to provide derivation by
> restriction.
>
> Suppose, in English, we want to limit characters to "word characters" we
> might write something like,
> <xsd:pattern value="[A-Za-z0-9_]" />
>
> In JavaScript, for example, I might write [A-Za-z0-9_] more succinctly as
> \w.
>
> However, if I understand Appendix F.1.1 of W3C XML Schema Part 2 the \w
> metacharacter now covers Unicode [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}].
>
> So, <xsd:pattern value="\w" /> would match many (unwanted) characters that
> <xsd:pattern value="[A-Za-z0-9_]" /> would reject as non-matching. Correct?

Unwanted by whom? XML uses Unicode; the W3C XML Schema pattern \w matches
Unicode word characters, as near as possible.

> A couple of other questions arising from this tentative understanding.
>
> In W3C XML Schema, and therefore in XForms, is it correct that the only way
> to express the notion of an English language / ASCII "word character" in a
> regular expression is using [A-Za-z0-9_]? Or, in other words, is there a
> metacharacter which corresponds to the JavaScript use of \w?

No.

> Is there any facility to express the notion of, for example, a French word
> character? Or German? Or is the \p{Basic_Latin} the smallest / most precise
> "chunk" of characters that can be used in such a setting?

I presume so.

Patterns are fundamentally about characters, not languages -- you can
approximate one with the other, but it's dangerous to try -- do you really
want to rule out, for example, coöperation and being naïve, to say nothing of
going to Montréal or Bogotá?

ht
--
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2002, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
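
To make the difference between the two patterns concrete, here is a minimal
Python sketch. It is not an XML Schema regex engine: it only approximates the
Appendix F.1.1 definition of \w quoted above (every character except those in
general categories P, Z and C) using the standard unicodedata module, and
compares it with the ASCII class [A-Za-z0-9_]. The function names are
hypothetical, and results for rarely used characters may vary with the Unicode
version bundled with your Python.

```python
import re
import unicodedata

# The ASCII-only "word character" class from the question, one or more times.
ASCII_WORD = re.compile(r"[A-Za-z0-9_]+")


def is_xsd_word_char(ch: str) -> bool:
    # Approximation of XSD's \w (an assumption, not a real XSD engine):
    # keep everything whose general category is not P* (punctuation),
    # Z* (separators) or C* (control, format, unassigned, ...).
    return unicodedata.category(ch)[0] not in "PZC"


def matches_xsd_word(text: str) -> bool:
    # Rough analogue of the pattern \w+ under the approximation above.
    return len(text) > 0 and all(is_xsd_word_char(ch) for ch in text)


def matches_ascii_word(text: str) -> bool:
    # Analogue of the pattern [A-Za-z0-9_]+.
    return ASCII_WORD.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["naive", "naïve", "Montréal", "Bogotá", "hello_1", "a b"]:
        print(sample, matches_xsd_word(sample), matches_ascii_word(sample))
```

Under this approximation the accented words are accepted by the Unicode-style
test and rejected by the ASCII class, which is the point made above. One side
effect worth noticing: the underscore has general category Pc (connector
punctuation), so it falls outside this definition of \w even though it is part
of [A-Za-z0-9_].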
Received on Monday, 4 November 2002 05:32:47 UTC