Re: Comments on regex-opt from Bjoern Hoehrmann on 2006-01-09 (www-archive@w3.org from January 2006)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Mon, 09 Jan 2006 14:59:25 +0100
To: Joel Yliluoma <bisqwit@iki.fi>
Cc: www-archive@w3.org
Message-ID: <1bn4s1dmcceeejlfr17b8duqmalhu5ppmd@hive.bjoern.hoehrmann.de>
* Joel Yliluoma wrote:
>Your feedback was very useful, but I fear I lack the expertise
>required to make regexps work with unicode. I've created some
>character encoding -related software, but I don't have expertise
>on locales and perl specifically.
>
>I would appreciate it, if you can provide a crash-course on
>how unicode works _with regexps_, and I can then look at it.
>Most importantly, what is the proper way to implement \w and its cousins.
>
>I already know how UTF-8 works and what kind of characters the unicode
>consists of (http://bisqwit.iki.fi/japtools/unicodemap.php), but I realize
>that regexps aren't necessarily always UTF-8 -encoded. I've written plenty
>of ISO-8859-* -encoded regexps, which would fail parsing as UTF-8.
>
>Also, I'm interested of your unicode bitset. I could easily use
>std::bitset<0x110000> instead of std::bitset<0x100>, but then it
>would use 139264 bytes of memory per instance instead of 32, which
>wouldn't be so nice...

Well, the basic idea here is to have a class that stores characters and
character ranges as a list of (min, max) pairs. If you have [A-Z0-9], that
would become a list with two pairs ('A' .. 'Z', '0' .. '9'), and if it's
a single character 'X', you'd have a list with one pair ('X' .. 'X').

The class would then have all the relevant methods the current bitset
has, like flip(). If you have 'X' as above and call flip(), you'd replace
the ('X' .. 'X') list with (0 .. 'W', 'Y' .. 0x10FFFF), just as you'd
expect.
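
To make the idea concrete, here is a minimal sketch of such a class. The
names CharSet, set_range(), contains() and kMaxCodePoint are made up for
illustration and are not taken from regex-opt; the sketch assumes the
ranges are kept sorted and merged after every insertion.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Highest Unicode code point (assumption: the set covers 0 .. 0x10FFFF).
constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;

// Hypothetical range-based replacement for std::bitset<0x100>.
class CharSet {
public:
    using Range = std::pair<std::uint32_t, std::uint32_t>;  // inclusive (min, max)

    void set(std::uint32_t cp) { set_range(cp, cp); }

    void set_range(std::uint32_t lo, std::uint32_t hi) {
        if (lo > hi) std::swap(lo, hi);
        if (lo > kMaxCodePoint) return;      // nothing inside the valid range
        hi = std::min(hi, kMaxCodePoint);
        ranges_.emplace_back(lo, hi);
        normalize();
    }

    bool contains(std::uint32_t cp) const {
        for (const Range& r : ranges_)
            if (r.first <= cp && cp <= r.second) return true;
        return false;
    }

    // Replace the set by its complement within 0 .. 0x10FFFF.
    void flip() {
        std::vector<Range> out;
        std::uint32_t next = 0;              // first code point not yet handled
        bool reached_end = false;
        for (const Range& r : ranges_) {
            if (r.first > next) out.emplace_back(next, r.first - 1);
            if (r.second >= kMaxCodePoint) { reached_end = true; break; }
            next = r.second + 1;
        }
        if (!reached_end) out.emplace_back(next, kMaxCodePoint);
        ranges_ = std::move(out);
    }

    const std::vector<Range>& ranges() const { return ranges_; }

private:
    // Sort the pairs and merge overlapping or directly adjacent ones.
    void normalize() {
        std::sort(ranges_.begin(), ranges_.end());
        std::vector<Range> merged;
        for (const Range& r : ranges_) {
            if (!merged.empty() && r.first <= merged.back().second + 1)
                merged.back().second = std::max(merged.back().second, r.second);
            else
                merged.push_back(r);
        }
        ranges_ = std::move(merged);
    }

    std::vector<Range> ranges_;
};
```

With this sketch, calling set('X') followed by flip() leaves exactly the
two pairs (0 .. 'W') and ('Y' .. 0x10FFFF), and the memory used depends on
the number of ranges rather than on the size of the code space.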

As another example, dumping a character range the way the DumpKey()
function does would then become something like

  for each pair(min,max) in tmp.list
  {
    n += max - min + 1;
    sets += EscapeChar(min);

    if (min != max)
    {
      need_set = true;
      sets += '-';
      sets += EscapeChar(max);
    }
  }
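
The loop above could look roughly like this in C++. This is only a sketch:
DumpRanges() and DumpResult are hypothetical names, the escaping function
is passed in as a parameter because the real EscapeChar() is not shown
here, and the list of pairs is assumed to be a plain std::vector.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Range = std::pair<std::uint32_t, std::uint32_t>;  // inclusive (min, max)

// Hypothetical result type: the text for the inside of a bracket
// expression, the number of code points covered, and whether a range
// was emitted (so the caller knows brackets are required).
struct DumpResult {
    std::string   sets;
    std::uint64_t n;
    bool          need_set;
};

// Walks the list of pairs once instead of testing every possible code point.
DumpResult DumpRanges(const std::vector<Range>& ranges,
                      const std::function<std::string(std::uint32_t)>& escape) {
    DumpResult result{"", 0, false};
    for (const Range& r : ranges) {
        // Widen before adding 1 so that a full 32-bit range cannot overflow.
        result.n += static_cast<std::uint64_t>(r.second) - r.first + 1;
        result.sets += escape(r.first);
        if (r.first != r.second) {
            result.need_set = true;
            result.sets += '-';
            result.sets += escape(r.second);
        }
    }
    return result;
}
```

For the pairs ('A' .. 'Z', '0' .. '9') and an escape function that returns
the character unchanged, this yields the text A-Z0-9 and a count of 36.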

If min and max can store 32-bit values, you are pretty close to good
Unicode support. The remaining problem would be getting the Unicode code
points in and out of the tool. A good first step is to allow \x{....}
escapes, e.g. \x{20AC} would refer to the Euro currency symbol. I simply
made a parser for this notation, and with the charset class as described
above you'd store that simply by

  charset c;
  c.set(0x20AC);

Now that charset can hold values > 255 the EscapeChar() function needs
to be changed to take care of that. The easiest way to do that is to
simply generate the \x{....} sequence again for anything > 255. You can
then roundtrip regular expressions like [\x{20AC}-\x{3000}], which is
most of what I need.
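
A rough sketch of both directions follows. ParseHexEscape() and
EscapeCodePoint() are hypothetical names, not functions from regex-opt,
and the handling of values up to 255 is only a placeholder for whatever
the existing EscapeChar() already does.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Tries to read "\x{HEX}" starting at s[pos]. On success stores the code
// point in cp, moves pos past the closing brace and returns true. On
// failure pos and cp are left untouched.
bool ParseHexEscape(const std::string& s, std::size_t& pos, std::uint32_t& cp) {
    if (pos > s.size() || s.compare(pos, 3, "\\x{") != 0) return false;
    std::size_t i = pos + 3;
    std::size_t digits = 0;
    std::uint32_t value = 0;
    for (; i < s.size() && s[i] != '}'; ++i, ++digits) {
        const char c = s[i];
        std::uint32_t d;
        if (c >= '0' && c <= '9')      d = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') d = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') d = static_cast<std::uint32_t>(c - 'A' + 10);
        else return false;             // not a hexadecimal digit
        value = value * 16 + d;
        if (value > 0x10FFFF) return false;   // outside the Unicode range
    }
    if (digits == 0 || i >= s.size()) return false;  // "\x{}" or no closing brace
    cp = value;
    pos = i + 1;
    return true;
}

// Emits \x{....} for anything above 255. Smaller values are returned as a
// raw byte here; a real tool would apply its existing escaping rules.
std::string EscapeCodePoint(std::uint32_t cp) {
    if (cp <= 255) return std::string(1, static_cast<char>(cp));
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\x{%X}", static_cast<unsigned>(cp));
    return buf;
}
```

Parsing \x{20AC} with this sketch gives 0x20AC, and passing 0x20AC back to
EscapeCodePoint() produces \x{20AC} again, which is the roundtrip described
above.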

You'd probably change some parts though for performance reasons, e.g.
the dump function should be changed as outlined above so it does not
spend all the time looking for set bits (code points really), and you'd
use a new function like set_range(min,max) instead of set() in a loop.

Of course, \x{....} is not the only way to specify characters, you might
want to use the tool like

  % regex-opt Bj[ö]rn

i.e., specify the characters directly without any escaping. If the
string is in ISO-8859-1, say, the 'ö' would be only a single byte and the
code would continue to work as expected. The problem with ISO-8859-1
and other single-byte encodings is that you can't use more than 256
characters. If the above is in UTF-8, the 'ö' would be two bytes, 0xC3 0xB6:

  % regex-opt Bj[< first byte 0xC3 >< second byte 0xB6 >]rn

From the perspective of the regular expression this would be the same as

  % regex-opt Bj[< first byte 0xB6 >< second byte 0xC3 >]rn

That is very bad, since it would no longer match my name. To prevent
this you would need to know the character encoding (here: UTF-8) and
turn the bytes in the input into 'characters'. With the class above, this
would mean turning the two bytes 0xC3 0xB6 into a single code point,
U+00F6, and storing that as a list of one pair (0x00F6 .. 0x00F6) in the
class as described above.
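
For illustration, decoding could be sketched as below, assuming the input
is already known to be UTF-8. DecodeUtf8() is a hypothetical helper, not
part of regex-opt; it throws on malformed input instead of guessing.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Turns a UTF-8 byte string into a list of Unicode code points.
std::vector<std::uint32_t> DecodeUtf8(const std::string& bytes) {
    // Smallest code point that legitimately needs a sequence of this length.
    static const std::uint32_t kMinForLength[] = {0, 0, 0x80, 0x800, 0x10000};

    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (cp < kMinForLength[len] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Decoding the bytes 0xC3 0xB6 this way gives the single code point 0x00F6,
whereas the swapped order 0xB6 0xC3 is rejected as malformed rather than
silently treated as the same set of bytes.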

Knowing the encoding and decoding the bytes isn't so easy, and I'm not
sure adding this would be worth the effort. Converting the relevant
characters in a regular expression to the \x{....} format beforehand
would be as easy as

  s/([\xFF-\x{10FFFF}])/"\\x{".sprintf('%x',ord"$1")."}"/eg;

in Perl. So my recommendation would be to make the class and support the
\x{....} syntax and see what other people think about it.

Regarding the special escapes, an expression like ^\w+$ may or may not
match "Björn". In Perl this depends on the current locale: with the
German locale it would match, with the default locale it would not. In
addition to the locale, Perl might be in Unicode mode, in which case \w
is defined in terms of Unicode character classes. That is, if Unicode
considers a certain code point a letter, \w will match it (which means
thousands of different characters).

For Perl, http://perldoc.perl.org/perlunicode.html has many of the
details. The problem with mapping a character range to such an escape
is that you'd change the meaning of the regular expression. If the input
is [0-9], I really mean exactly that; if the tool turns that into \d, the
regular expression would, under certain circumstances, also match e.g.
EXTENDED ARABIC-INDIC DIGIT THREE in addition to [0-9]. Worse, what \d
matches in this case depends on the version of Unicode the regex engine
supports, as there may be new digits in the next version.

As I said, I avoid those escapes in the input and it's easy to disable
generation of such sequences in regex-opt (a switch would be nice
though!), so this isn't really an issue for me. I think the current
mapping is fine for most uses. If you add the switch, that'd be cool;
other than that, I would recommend simply seeing what other users think.

What's important to me is mostly the new charset class and support for
\x{....} escapes in in- and output.

HTH,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Monday, 9 January 2006 13:58:58 UTC