Re: Comments on regex-opt

* Joel Yliluoma wrote:
>> The remaining problem would be getting the Unicode code points
>> in and out of the tool. A good first step is to allow \x{....} escapes,
>
>And, \unnnn escapes as well.

Perl does not support those, and they are a bit tricky: they are
typically supported in environments that use so-called 16-bit
Unicode, so to refer to characters > U+FFFF you have to use
surrogate code points (that is, \uXXXX\uXXXX maps to a single
character). Iconv might support this as the "java" encoding
though, and the conversion is trivial.
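
To illustrate the conversion, here is a rough Python sketch that
rewrites \uXXXX escapes as \x{....} and merges surrogate pairs into
a single code point. The function names are made up for this
example and are not part of regex-opt; it also assumes the
backslash in front of the `u` is not itself escaped.

```python
import re

# Matches either a high/low surrogate pair or a single \uXXXX escape.
_U_ESCAPE = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})"
    r"|\\u([0-9a-fA-F]{4})"
)


def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the code point it represents."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %04X" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %04X" % low)
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def u_escapes_to_x(text):
    """Rewrite \\uXXXX escapes as \\x{...}, merging surrogate pairs."""
    def replace(match):
        if match.group(1) is not None:
            # Two escapes forming one character above U+FFFF.
            cp = combine_surrogates(int(match.group(1), 16),
                                    int(match.group(2), 16))
        else:
            # BMP character (a lone surrogate is passed through as-is).
            cp = int(match.group(3), 16)
        return "\\x{%X}" % cp
    return _U_ESCAPE.sub(replace, text)
```

For example, `u_escapes_to_x(r"\uD83D\uDE00")` yields `\x{1F600}`.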

>I believe I have figured this out now. But it needs an extra
>commandline option that will tell which character set the input
>string is assumed to be in. (For starters, utf-8 and iso-8859-1
>would be good options, but full support of iconv would be better
>and not much harder to implement.)

Indeed. Supporting encodings like ISO-8859-1 is a bit tricky for
the output though, as they don't encode all characters; you'd have
to use \x{....} for characters that cannot be encoded. Last time
I checked, it wasn't so easy to ask iconv whether a specific
character can be encoded. I personally would just support UTF-8
and ask the user to transcode if needed;
`iconv -f ... -t utf-8 | regex-opt -` isn't so hard to type :-)
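
For what it's worth, the \x{....} fallback could look roughly like
the Python sketch below. It uses Python's own codecs rather than
iconv, the function name is invented for the example, and it
assumes a stateless, ASCII-compatible target encoding so that the
escape itself can be written out as ASCII bytes.

```python
def encode_with_escapes(text, encoding):
    """Encode text, writing \\x{...} for characters the encoding lacks."""
    out = bytearray()
    for ch in text:
        try:
            # Try each character on its own so a failure is easy to localize.
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Not representable in the target encoding: emit an escape.
            out += ("\\x{%X}" % ord(ch)).encode("ascii")
    return bytes(out)
```

With "iso-8859-1" as the target, U+00E9 comes out as the single
byte 0xE9, while U+20AC is written as the escape \x{20AC}.
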
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 

Received on Tuesday, 10 January 2006 08:33:28 UTC