Comments on regex-opt from Bjoern Hoehrmann on 2006-01-09 (www-archive@w3.org from January 2006)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Mon, 09 Jan 2006 04:10:27 +0100
To: bisqwit@iki.fi
Cc: www-archive@w3.org
Message-ID: <a2j3s1plschnjs9jh8ch6aphnccivi557b@hive.bjoern.hoehrmann.de>
Hi Joel,

  I tried http://bisqwit.iki.fi/source/regexopt.html and so far I like
it! Thanks for doing this. I noticed some issues though: in GetDecMask
it would probably be better to call the set() method rather than using
the operator[] reference.

I am not sure about its handling of incomplete escape sequences, e.g.,
Perl does not accept "\c" while regex-opt turns that into \c@; it does
not seem to read out of bounds though...
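
A minimal sketch of stricter handling, assuming the parser sees the pattern as a std::string and wants to reject a "\c" that has no following character instead of completing it. The function name parse_control_escape and its signature are hypothetical and not taken from regex-opt; the mapping used (uppercase the character, then XOR with 64) is the usual \cX convention, under which \c@ is NUL.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of regex-opt.
// Decodes a "\cX" escape that starts at pattern[pos].
// On success: stores the control character in `out`, advances `pos`
// past the three-character escape, and returns true.
// On failure (not a \c escape, or nothing follows the 'c'): leaves
// `pos` and `out` untouched and returns false, so the caller can
// report an error rather than inventing a character.
bool parse_control_escape(const std::string& pattern,
                          std::size_t& pos,
                          unsigned char& out)
{
    assert(pos <= pattern.size());  // pos must lie inside the pattern

    // Need three characters: '\\', 'c', and the control name.
    if (pattern.size() - pos < 3) return false;
    if (pattern[pos] != '\\' || pattern[pos + 1] != 'c') return false;

    const unsigned char name = static_cast<unsigned char>(pattern[pos + 2]);
    out = static_cast<unsigned char>(std::toupper(name) ^ 64);
    pos += 3;
    return true;
}
```

With this shape, a trailing "\c" makes the function return false, which the caller could turn into a diagnostic.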

My regular expressions are very long, and Windows does not really allow
passing parameters that long; it would be good to have an option to
parse files instead.
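
One way such an option could work is sketched below, assuming the whole file holds a single expression and that trailing line breaks are not part of it. The names read_pattern and read_pattern_file are hypothetical, and how regex-opt would expose this on its command line is left open.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of regex-opt.
// Reads an entire stream as one regular expression and strips
// trailing CR/LF characters, which editors tend to append.
std::string read_pattern(std::istream& in)
{
    std::ostringstream buffer;
    buffer << in.rdbuf();
    std::string pattern = buffer.str();
    while (!pattern.empty() &&
           (pattern[pattern.size() - 1] == '\n' ||
            pattern[pattern.size() - 1] == '\r'))
        pattern.erase(pattern.size() - 1);
    return pattern;
}

// Opens `path` in binary mode so that no bytes are translated,
// then delegates to read_pattern(). Throws if the file cannot be opened.
std::string read_pattern_file(const std::string& path)
{
    assert(!path.empty());  // caller must supply a file name
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    return read_pattern(in);
}
```

Taking a std::istream in the first function means the same code could also read the expression from standard input.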

MSVC++'s version of the C++ Standard Library does not support the SGI
extension _Find_first() for std::bitsets; there does not seem to be a
good replacement though... It might make sense to mention somewhere in
the source that MSVC++ users should use the SGI STL or roll their own
version of _Find_first().
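
A hand-rolled replacement can be as small as the sketch below. It uses only standard std::bitset members and follows the convention of the extension (spelled _Find_first() and _Find_next() in libstdc++) of returning the bitset size when no bit is found. The free-function names find_first and find_next are hypothetical. It scans bit by bit, so it is slower than a word-wise search, which probably does not matter for a 256-bit set.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Portable stand-in for std::bitset::_Find_first().
// Returns the index of the lowest set bit, or N if no bit is set.
template <std::size_t N>
std::size_t find_first(const std::bitset<N>& bits)
{
    for (std::size_t i = 0; i < N; ++i)
        if (bits.test(i)) return i;
    return N;
}

// Portable stand-in for std::bitset::_Find_next(prev).
// Returns the lowest set bit strictly after `prev`, or N if none.
template <std::size_t N>
std::size_t find_next(const std::bitset<N>& bits, std::size_t prev)
{
    assert(prev < N);  // prev must be a valid bit index
    for (std::size_t i = prev + 1; i < N; ++i)
        if (bits.test(i)) return i;
    return N;
}
```

A loop over all set bits then reads: for (std::size_t i = find_first(b); i < b.size(); i = find_next(b, i)) { ... }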

Unicode! I really need this. I played a bit with the code and it seems
easy to add. I replaced the input std::string with a 32 bit string as
std::basic_string<unsigned, ...>, made a charset class that behaves like
a bitset but copes with U+0000-U+10FFFF by storing the ranges rather
than the bits; and then changed the EscapeChar function to sprintf to a
new format like \\x{%04X} along with the loops that call it to consider
the whole unicode range; my code is a mess though, so I'm afraid I can't
contribute a patch for this.
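
The approach described above might look roughly like the following sketch. It is not the code referred to in this message and not regex-opt's own code: the names Range, RangeSet, escape_code_point and to_class are hypothetical, code points are held in a plain unsigned (assumed to be at least 32 bits wide), and the set keeps a sorted list of inclusive ranges instead of one bit per character.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

struct Range { unsigned lo, hi; };  // inclusive code point range

// Bitset-like character set for U+0000..U+10FFFF that stores ranges.
class RangeSet {
public:
    // Adds [lo, hi], merging with any overlapping or adjacent ranges.
    void set(unsigned lo, unsigned hi)
    {
        assert(lo <= hi && hi <= 0x10FFFF);
        std::vector<Range> merged;
        bool placed = false;
        for (const Range& r : ranges_) {
            if (r.hi + 1 < lo) {            // r lies wholly before the new range
                merged.push_back(r);
            } else if (hi + 1 < r.lo) {     // r lies wholly after the new range
                if (!placed) { merged.push_back(Range{lo, hi}); placed = true; }
                merged.push_back(r);
            } else {                        // r overlaps or touches: absorb it
                lo = std::min(lo, r.lo);
                hi = std::max(hi, r.hi);
            }
        }
        if (!placed) merged.push_back(Range{lo, hi});
        ranges_.swap(merged);
    }
    void set(unsigned cp) { set(cp, cp); }

    bool test(unsigned cp) const
    {
        for (const Range& r : ranges_)
            if (r.lo <= cp && cp <= r.hi) return true;
        return false;
    }

    const std::vector<Range>& ranges() const { return ranges_; }

private:
    std::vector<Range> ranges_;  // sorted, non-overlapping, non-adjacent
};

// Formats one code point as \x{HHHH}, with at least four hex digits.
std::string escape_code_point(unsigned cp)
{
    char buf[16];  // "\x{10FFFF}" needs 10 characters plus the terminator
    std::snprintf(buf, sizeof buf, "\\x{%04X}", cp);
    return buf;
}

// Renders the set as a bracketed character class using \x{...} escapes.
std::string to_class(const RangeSet& cs)
{
    std::string out = "[";
    for (const Range& r : cs.ranges()) {
        out += escape_code_point(r.lo);
        if (r.hi != r.lo) { out += '-'; out += escape_code_point(r.hi); }
    }
    return out + "]";
}
```

Storing ranges keeps a set such as U+10000-U+10FFFF down to a single entry, where a bitset would need more than a million bits.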

You probably know that the various escape sequences like \d and \w at
least in Perl no longer correspond to a specific range of characters
but are based on Unicode character classes; so decoding and generating
expressions with these in requires some care so that things won't break.
I don't use these in the input though and changing regex-opt such that
it won't generate them is easy enough, so this is not a big issue for
me. I am also not sure how the code could be improved in this regard,
other than providing a runtime option to disable generation of them.

Having a standalone library for this would be very nice, it's easy
enough to derive one from the code but not so easy when writing a
wrapper for the library (e.g., for use in Perl! :-) as you'd have to
include the code with the wrapper... Not a big issue either though.

Any chance you could look into some of these things? It would be really
good if at least the internal architecture of it was based on Unicode
code points rather than bytes and if it had UTF-8 support or at least
supported \x{...} style hex escapes, otherwise it would always break my
regular expressions (which are Unicode-heavy) rendering the tool useless
for me. I'd be glad to help, but your C++ is a bit better than mine...

(Note, I cc'd www-archive which is a publicly archived mailing list; if
you don't want your message to appear at http://lists.w3.org/Archives/
Public/www-archive/ you better remove it from the cc-line :-)

Thanks,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Monday, 9 January 2006 03:10:03 UTC