RE: errata/clarification for regex language from Biron,Paul V on 2001-07-26 (www-xml-schema-comments@w3.org from July to September 2001)

From: Biron,Paul V <Paul.V.Biron@kp.org>
Date: Thu, 26 Jul 2001 15:08:04 -0700
To: "'www-xml-schema-comments'" <www-xml-schema-comments@w3.org>
Message-Id: <25BE5EAE9FAED4119A1100805FD42A84BC8BB6@gren-exch-2.ca.kp.org>

> -----Original Message-----
> From:	Biron,Paul V [SMTP:Paul.V.Biron@kp.org]
> Sent:	Wednesday, July 25, 2001 4:41 PM
> To:	'www-xml-schema-comments'
> Subject:	errata/clarification for regex language
> 
> It appears that we were not explicit enough in our description of the regex
> language in Appendix F [1].
> 
> Our intention was to follow exactly two aspects of Perl's matching
> algorithm:
> 
> 1. the "earliest" match wins...that is, since the string is scanned
> left-to-right, the match that begins closest to the start of the string
> "wins"
> 2. the "greediest" match wins...that is, the longest substring that can
> possibly match (given #1 above) wins.
> 
> I think this can be considered a clarification rather than a change, but
> will leave it up to the WG (I'm especially interested in hearing from
> implementors to see if they have implemented something different).
> 
I also left out one other item.

It should be clear (famous last words :-) that, given the way we have defined
regexes, there is an implicit "head and tail" anchoring added to every
pattern.

That is, every pattern p in our language is equivalent to the pattern ^p$ in
Perl or other similar regex languages.  We (the task force that designed the
regex language) made this decision very consciously, since we felt that in
ALMOST EVERY conceivable case, someone using pattern to restrict the lexical
space of a type would want the implicit anchoring...and we felt that it
would be burdensome for them to add the extra metacharacters.

However, it probably wouldn't hurt to add a note to this effect, possibly
with an example of how to get the "substring" matching behavior that is the
default in Perl (i.e., instead of p, one would write .*p.*).
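
As a rough illustration of the difference, here is a minimal sketch in
Python.  It is only an approximation: Python's re module is not the regex
language of Appendix F, so the sample pattern sticks to syntax the two
share, and the helper names schema_style_match and perl_style_match are
hypothetical, made up for this example.  re.fullmatch stands in for the
implicit ^p$ anchoring, and re.search stands in for Perl's default
substring matching.

```python
import re


def schema_style_match(pattern: str, value: str) -> bool:
    """Approximate the implicit anchoring: the WHOLE value must match.

    Assumption: re.fullmatch is used as a stand-in for treating the
    pattern p as ^p$; it is not an implementation of the Schema regex
    language itself.
    """
    return re.fullmatch(pattern, value) is not None


def perl_style_match(pattern: str, value: str) -> bool:
    """Approximate Perl's default: the pattern may match any substring."""
    return re.search(pattern, value) is not None


# A pattern meaning "exactly three digits" under implicit anchoring.
THREE_DIGITS = r"[0-9]{3}"

# The same pattern wrapped as .*p.* to recover substring matching.
THREE_DIGITS_ANYWHERE = r".*" + THREE_DIGITS + r".*"
```

Under these assumptions, schema_style_match(THREE_DIGITS, "a123b") is
false while perl_style_match(THREE_DIGITS, "a123b") is true, and
schema_style_match(THREE_DIGITS_ANYWHERE, "a123b") is true again.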

pvb

Received on Thursday, 26 July 2001 18:30:40 UTC