
[Bug 1850] [F&O] how do ranges work in case-insensitive mode?

From: <bugzilla@wiggum.w3.org>
Date: Wed, 31 Aug 2005 20:58:21 +0000
To: public-qt-comments@w3.org
Cc:
Message-Id: <E1EAZf3-0002LS-34@wiggum.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1850





------- Additional Comments From liam@w3.org  2005-08-31 20:58 -------
The examples from Mike Kay's comment,
    matches('G','[A-Z-[f-h]]','i')
and matches('G','[A-Z-[F-H]]','i')
are not well-formed in Perl: each operand of "-" must
be a single character, not a range.  Perl does not support
range subtraction directly (see below).

So, [A-Z-[f-h]] ends up matching the literal [f-h]
and nothing else as far as I can tell.

The example
    matches('G','[A-Z-[F-Hf-h]]','i')
behaves the same way, matching the literal string [F-Hf-h].
(I don't think this behaviour is specified anywhere, so I would
call it a bug that Perl doesn't trap this case.)

The example
    matches('G','[^F-H]','i')
does not match in Perl, either with or without /i.
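
As a rough cross-check of the negated-class case, here is a minimal
sketch using Python's re module rather than Perl.  It assumes that
re.IGNORECASE plays the role of the 'i' flag for this pattern; the
helper name negated_class_matches is made up for illustration.

```python
import re

# The negated class from the example above.
PATTERN = r"[^F-H]"

def negated_class_matches(text, ignore_case):
    """Return True if PATTERN matches anywhere in text."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(PATTERN, text, flags) is not None

# Try "G" both without and with case-insensitive matching.
for ignore_case in (False, True):
    print(ignore_case, negated_class_matches("G", ignore_case))
```

In this sketch "G" is rejected in both modes, while "g" is rejected
only once case-insensitive matching is switched on.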

Note that the pattern [A-Z] might or might not match both
a and z: a common collation order on Linux, at least for
case-insensitive matching, is aAbBcCdD...zZ, so A-Z excludes "a".
This doesn't affect Perl by default, as it compares Unicode codepoints
unless you put
    use locale;
in your Perl script (see the perlre and perllocale man pages,
or run "perldoc perlre" to read them).
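
To make the collation point concrete, here is a small Python sketch
that expands the range A-Z under two orderings.  The COLLATION string
is a hand-built stand-in for the aAbB...zZ order mentioned above, not
something read from a real locale, and both helper names are
hypothetical.

```python
from string import ascii_lowercase, ascii_uppercase

# Hand-built stand-in for the collation order aAbBcC...zZ (an assumption,
# not taken from any real locale database).
COLLATION = "".join(lo + up for lo, up in zip(ascii_lowercase, ascii_uppercase))

def range_by_collation(first, last):
    """Characters from first to last, inclusive, in COLLATION order."""
    start, end = COLLATION.index(first), COLLATION.index(last)
    return set(COLLATION[start:end + 1])

def range_by_codepoint(first, last):
    """Characters from first to last, inclusive, in codepoint order."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}

collated = range_by_collation("A", "Z")
print("a" in collated, "b" in collated, "z" in collated)
print(sorted(range_by_codepoint("A", "Z")) == list(ascii_uppercase))
```

Under the collation ordering the range starts after "a" but takes in
every other lowercase letter; under codepoint ordering it is just the
26 uppercase letters.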
"G" does not match /[^G]/i in Perl

Perl's nearest equivalent for range subtraction is the
zero-width negative lookahead assertion, (?!e), which matches
only if it is not immediately followed by something that
matches the contained expression e.  Hence,
/(?![f-h])[A-Z]/i
matches b and w but not g or G.
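
The same lookahead construct exists in Python's re module, so the
claim can be sketched there as well.  This is an analogue, not Perl
itself, and it assumes re.IGNORECASE corresponds to /i for this
pattern; in_subtracted_class is a name invented for the example.

```python
import re

# Negative lookahead emulating "A-Z minus f-h", case-insensitively.
SUBTRACTED = re.compile(r"(?![f-h])[A-Z]", re.IGNORECASE)

def in_subtracted_class(ch):
    """Return True if the single character ch matches the emulated class."""
    return SUBTRACTED.fullmatch(ch) is not None

for ch in ("b", "w", "g", "G"):
    print(ch, in_subtracted_class(ch))
```

The lookahead is tested at the same position as the class that
follows it, which is why it acts as a filter on that one character
rather than consuming anything.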
 
I think the real question here is whether a range can introduce
or exclude unexpected characters in case-insensitive mode.  I experimented,
but the version of Perl I'm using doesn't like ranges in character classes
when they extend above codepoint 127 (decimal), for some reason, although it's
otherwise 8-bit clean and can match explicit characters in classes.
Received on Wednesday, 31 August 2005 20:58:25 UTC
