RE: [Michael Fitzgerald <mike@wyeast.net>] [Moderator Action] regular expressions from Biron,Paul V on 2000-06-08 (www-xml-schema-comments@w3.org from April to June 2000)

From: Biron,Paul V <Paul.V.Biron@kp.org>
Date: Thu, 8 Jun 2000 15:25:08 -0700
To: "'Michael Fitzgerald'" <mike@wyeast.net>
Cc: www-xml-schema-comments@w3.org
Message-Id: <376E771642C1D2118DC300805FEAAF4386DC04@pars-exch-1.ca.kp.org>

> -----Original Message-----
> From:	Michael Fitzgerald [SMTP:mike@wyeast.net]
> Sent:	Wednesday, June 07, 2000 4:14 PM
> To:	xmlschema-dev@w3.org
> Subject:	[Moderator Action] regular expressions
> 
> I have been looking at the schema specs for information on reg ex support,
> i.e., app. c in primer, and app. e in datatypes, and it is not clear to me
> to what extent you will support Mark Davis' document on Unicode reg exps.
> Specifically, I was looking for some assurance that you will support \u. I
> assume you will. What will help is a list of what you will NOT support or a
> pointer to something I missed in the specs.
> 
The regular expression language described in Appendix E of the datatypes
spec supports all of Level 1 and some of Level 2 in Mark's document
(officially, the Unicode Regular Expression Tech Report #18 [1]), although
support for section 2.6 needs to be spelled out explicitly.  Level 1 is
described in section 2 of TR #18, while Level 2 is described in section 3
(all section numbers in this message refer to TR #18 and not to sections of
the datatypes spec).

The syntax used in TR #18 is not necessarily intended to be adopted by
regex languages and is used only for examples within the TR.  In particular,
the \uhhhh syntax (section 2.1) for including an arbitrary Unicode codepoint
in a regex is one place where we felt there was a "better" syntax available
to us.  Since schema documents are XML documents, arbitrary Unicode
codepoints can be identified with character references (e.g., &#xhhhh;).
Note also that this mechanism gets us support for Surrogates (section 3.1)
"for free".

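As a rough illustration of the point above (a sketch only, not part of the
spec): an XML parser resolves character references before the regex engine
ever sees the pattern, so no \u escape is needed.  The element and attribute
names below are hypothetical, and Python's "re" module merely stands in for a
schema regex engine.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; "pattern" and "value" are illustrative names.
# &#xE9; is U+00E9 and &#x10400; is a codepoint outside the BMP.
SCHEMA_FRAGMENT = '<pattern value="caf&#xE9; &#x10400;+"/>'

def pattern_from_xml(xml_text):
    """Return the pattern string after the XML parser has resolved
    all character references."""
    return ET.fromstring(xml_text).get("value")

pattern = pattern_from_xml(SCHEMA_FRAGMENT)

# The regex engine receives real characters, including the supplementary
# one, which is why surrogates need no special regex syntax.
match = re.fullmatch(pattern, "caf\u00e9 \U00010400\U00010400")
```
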
Some of us on the WG had wanted to provide explicit support for Canonical
Equivalents (section 3.2) but in the end the WG decided that to require such
support at this stage would be too complex for implementors.  Note, however,
that if both your schema and instance documents are normalized according to
the W3C CharModel WD [3], then you also get Canonical Equivs "for free".
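
A small sketch of why normalization helps, assuming Normalization Form C is
the form applied to both documents (that choice, and the use of Python's "re"
module as a stand-in regex engine, are assumptions for illustration):

```python
import re
import unicodedata

def nfc(text):
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# The same word written two canonically equivalent ways.
pattern_precomposed = "caf\u00e9"     # e-acute as a single codepoint
instance_decomposed = "cafe\u0301"    # "e" followed by a combining acute

# Without normalization the engine compares codepoints and finds no match.
raw = re.fullmatch(pattern_precomposed, instance_decomposed)

# With both sides normalized, the codepoint sequences are identical.
normalized = re.fullmatch(nfc(pattern_precomposed), nfc(instance_decomposed))
```

Here "raw" is None while "normalized" is a match, without the regex engine
knowing anything about canonical equivalence.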

Support for locale-independent graphemes (section 3.3) and
locale-independent loose matches (section 3.5) was felt to be too
complex/expensive to require for schema V1...if there is a "ground swell" of
requests for it, we might consider supporting it in V2.

I personally think we should add support for locale-independent words
(section 3.4) given that it is very straightforward to implement, but we
left it out at this time to keep the spec as simple as possible.  I would
like to see a "ground swell" of requests for this for V1 (I have a
background in text indexing and retrieval, where word tokenization is very
important and our current \w and \W multi-character escapes are not enough),
but I'm not going out on a limb for this one.
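
To make the tokenization concern concrete, here is a minimal sketch.  It
assumes \w means "any character except those in the Unicode punctuation,
separator, and other categories (P, Z, C)"; that definition and the helper
names are assumptions for illustration, not a quote from the spec.

```python
import unicodedata

def is_schema_word_char(ch):
    """Approximate \\w under the assumed definition: any character whose
    general category is not punctuation (P), separator (Z), or other (C)."""
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")

def tokenize(text):
    """Split text into maximal runs of word characters."""
    tokens, current = [], []
    for ch in text:
        if is_schema_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

tokens = tokenize("don't re-index")
# tokens == ['don', 't', 're', 'index']
```

The apostrophe and hyphen are punctuation, so "don't" and "re-index" are each
split in two, which is rarely what an indexing application wants.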

We purposefully chose NOT to support any of the Locale-Dependent features
(section 4), and I believe that is the correct decision.

I hope that helps.

pvb

[1] http://www.unicode.org/unicode/reports/tr18/
[2] http://www.unicode.org/unicode/reports/tr15
[3] http://www.w3.org/TR/charmod/#Normalization

Received on Thursday, 8 June 2000 18:48:05 UTC