Re: Comments on regex-opt from Joel Yliluoma on 2006-01-09 (www-archive@w3.org from January 2006)

From: Joel Yliluoma <bisqwit@iki.fi>
Date: Mon, 9 Jan 2006 14:46:42 +0200 (EET)
To: Bjoern Hoehrmann <derhoermi@gmx.net>
cc: www-archive@w3.org
Message-ID: <Pine.LNX.4.62.0601091437250.26430@winnie>

On Mon, 9 Jan 2006, Bjoern Hoehrmann wrote:
> I tried http://bisqwit.iki.fi/source/regexopt.html and so far I like
> it! Thanks for doing this. I noticed some issues though: in GetDecMask
> it would probably be better to call the set() method rather than using
> the operator[] reference.

Thank you for your feedback!

Your feedback was very useful, but I fear I lack the expertise
required to make regexps work with Unicode. I've written some
character-encoding-related software, but I don't have expertise
in locales or in Perl specifically.

I would appreciate it if you could provide a crash course on
how Unicode works _with regexps_, and I can then look into it.
Most importantly, what is the proper way to implement \w and its cousins?

I already know how UTF-8 works and what kinds of characters Unicode
consists of (http://bisqwit.iki.fi/japtools/unicodemap.php), but I realize
that regexps aren't necessarily always UTF-8-encoded. I've written plenty
of ISO-8859-*-encoded regexps, which would fail to parse as UTF-8.
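
To illustrate the parsing failure, here is a minimal sketch of a UTF-8
well-formedness check in C++. The function name IsValidUtf8 is hypothetical
and is not part of regex-opt. It only shows why a pattern containing a lone
ISO-8859-1 byte such as 0xE9 cannot be read as UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from regex-opt): returns true if `s` is
// well-formed UTF-8. Rejects stray or missing continuation bytes,
// overlong forms, surrogates, and values above U+10FFFF.
bool IsValidUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80) { ++i; continue; }  // plain ASCII byte

        std::size_t len;        // total bytes in this sequence
        unsigned long cp;       // code point being assembled
        unsigned long min_cp;   // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cont & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}
```

For example, IsValidUtf8("caf\xE9" "[a-z]+") is false, because 0xE9 looks
like the lead byte of a three-byte sequence but is followed by '[', while
IsValidUtf8("caf\xC3\xA9" "[a-z]+") is true. A check like this cannot tell
which ISO-8859-* variant was intended; it only detects that the input is
not UTF-8.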

Also, I'm interested in your Unicode bitset. I could easily use
std::bitset<0x110000> instead of std::bitset<0x100>, but then it
would use 139264 bytes of memory per instance instead of 32, which
wouldn't be so nice...
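
The arithmetic behind those two figures can be checked at compile time.
This is only a sketch: the names kByteValues, kCodePoints and PackedBytes
are made up for illustration, and the C++ standard does not fix
sizeof(std::bitset<N>), so the last check is only a lower bound.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical names, for illustration only.
constexpr std::size_t kByteValues = 0x100;     // one bit per 8-bit character
constexpr std::size_t kCodePoints = 0x110000;  // one bit per Unicode code point

// Bytes needed to store `bits` bits when packed 8 per byte.
constexpr std::size_t PackedBytes(std::size_t bits) {
    return (bits + 7) / 8;
}

static_assert(PackedBytes(kByteValues) == 32,
              "256 bits pack into 32 bytes");
static_assert(PackedBytes(kCodePoints) == 139264,
              "0x110000 bits pack into 139264 bytes");

// An implementation may pad, but it cannot store the bits in less space.
static_assert(sizeof(std::bitset<kCodePoints>) >= PackedBytes(kCodePoints),
              "a full-range bitset needs at least 139264 bytes");
```

That is a factor of 4352 per character set, which adds up quickly if every
bracket expression in a regexp carries its own set.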

Creating a lib is a possibility and a good idea. I'll probably do it
in the next version.

-- 
Joel Yliluoma
http://iki.fi/bisqwit/

Received on Tuesday, 10 January 2006 09:45:49 UTC