- From: Biron,Paul V <Paul.V.Biron@kp.org>
- Date: Thu, 8 Jun 2000 15:25:08 -0700
- To: "'Michael Fitzgerald'" <mike@wyeast.net>
- Cc: www-xml-schema-comments@w3.org
> -----Original Message-----
> From: Michael Fitzgerald [SMTP:mike@wyeast.net]
> Sent: Wednesday, June 07, 2000 4:14 PM
> To: xmlschema-dev@w3.org
> Subject: [Moderator Action] regular expressions
>
> I have been looking at the schema specs for information on reg ex support,
> i.e., app. c in primer, and app. e in datatypes, and it is not clear to me
> to what extent you will support Mark Davis' document on Unicode reg exps.
> Specifically, I was looking for some assurance that you will support \u. I
> assume you will. What will help is a list of what you will NOT support or a
> pointer to something I missed in the specs.

The regular expression language described in Appendix E of the datatypes spec supports all of Level 1 and some of Level 2 in Mark's document (officially, the Unicode Regular Expression Tech Report #18 [1]), although support for section 2.6 still needs to be spelled out explicitly. Level 1 is described in section 2 of TR #18, while Level 2 is described in section 3 (all section numbers in this message refer to TR #18, not to sections of the datatypes spec).

The syntax used in TR #18 is not necessarily intended to be adopted by regex languages; it is used only for examples within the TR. In particular, the \uhhhh syntax (section 2.1) for including an arbitrary Unicode codepoint in a regex is one place where we felt there was a "better" syntax available to us. Since schema documents are XML documents, arbitrary Unicode codepoints can be identified with character references (i.e., &#xhhhh;). Note also that this mechanism gets us support for Surrogates (section 3.1) "for free".

Some of us on the WG had wanted to provide explicit support for Canonical Equivalents (section 3.2), but in the end the WG decided that requiring such support at this stage would be too complex for implementors.
Note, however, that if both your schema and instance documents are normalized according to the W3C CharModel WD [3], then you also get Canonical Equivalents "for free".

Support for locale-independent graphemes (section 3.3) and locale-independent loose matches (section 3.5) was felt to be too complex/expensive to require for schema V1. If there is a "ground swell" of requests for it, we might consider supporting it in V2.

I personally think we should add support for locale-independent words (section 3.4), given that it is very straightforward to implement, but we left it out at this time to keep the spec as simple as possible. I would like to see a "ground swell" of requests for this for V1 (I have a background in text indexing and retrieval, where word tokenization is very important and our current \w and \W multi-character escapes are not enough), but I'm not going out on a limb for this one.

We purposefully chose NOT to support any of the Locale-Dependent features (section 4), and I believe that is the correct decision.

I hope that helps.

pvb

[1] http://www.unicode.org/unicode/reports/tr18/
[2] http://www.unicode.org/unicode/reports/tr15
[3] http://www.w3.org/TR/charmod/#Normalization
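
P.S. For anyone who wants to see the character-reference and normalization points in action, here is a minimal sketch in Python. It rests on several assumptions that are mine and not from the spec: the namespace-free `<restriction>`/`<pattern>` fragment and its `id` attributes are made up for illustration, `load_patterns` and `matches` are hypothetical helper names, and Python's `re` module stands in for a schema regex engine (it is not an implementation of Appendix E). The only thing it is meant to show is that the XML parser resolves `&#x...;` references before any regex engine sees the pattern, including a codepoint outside the BMP, and that normalizing both sides to NFC makes canonically equivalent strings compare equal.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Illustrative, namespace-free fragment (an assumption for this sketch);
# a real schema would declare the XML Schema namespace.
FRAGMENT = """<restriction>
  <pattern id="latin" value="caf&#xE9;"/>
  <pattern id="gothic" value="&#x10330;+"/>
</restriction>"""


def load_patterns(xml_text):
    """Return {id: pattern string} as the strings look after XML parsing."""
    root = ET.fromstring(xml_text)
    return {p.get("id"): p.get("value") for p in root.iter("pattern")}


def matches(pattern, text, normalize=False):
    """Whole-string match; optionally NFC-normalize both sides first.

    Python's re is only a stand-in for a schema regex engine here.
    Normalizing the pattern string is safe for these simple patterns,
    but is not a general recipe for arbitrary regexes.
    """
    if normalize:
        pattern = unicodedata.normalize("NFC", pattern)
        text = unicodedata.normalize("NFC", text)
    return re.fullmatch(pattern, text) is not None


patterns = load_patterns(FRAGMENT)

# The parser has already turned &#xE9; into the single character U+00E9.
latin_is_resolved = patterns["latin"] == "caf\u00e9"

# U+10330 lies outside the BMP; it arrives as one codepoint followed by "+".
gothic_ok = matches(patterns["gothic"], "\U00010330\U00010330")

# "e" + combining acute (U+0301) is canonically equivalent to U+00E9,
# but only matches once both sides are normalized.
decomposed = "cafe\u0301"
raw_match = matches(patterns["latin"], decomposed)
nfc_match = matches(patterns["latin"], decomposed, normalize=True)
```

In this sketch `raw_match` comes out false and `nfc_match` true, which is the "for free" behaviour described above: nothing in the regex language itself has to know about canonical equivalence as long as both documents are normalized beforehand.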
Received on Thursday, 8 June 2000 18:48:05 UTC