Re: Comments on regex-opt from Bjoern Hoehrmann on 2006-01-10 (www-archive@w3.org from January 2006)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 10 Jan 2006 09:27:21 +0100
To: Joel Yliluoma <bisqwit@iki.fi>
Cc: www-archive@w3.org
Message-ID: <bgr6s1hgtn4lokhvjuaiav9l34a10nnivs@hive.bjoern.hoehrmann.de>

* Joel Yliluoma wrote:
>> The remaining problem would be getting the Unicode code points
>> in and out of the tool. A good first step is to allow \x{....} escapes,
>
>And, \unnnn escapes as well.

Perl does not support those, and they are a bit tricky. They are
typically supported in environments that use so-called 16-bit
Unicode; if you want to refer to characters > U+FFFF you have to
use surrogate code points to do it (that is, \uXXXX\uXXXX maps
to a single character). Iconv might support this as the "java"
encoding though, and the conversion is trivial.
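
To illustrate the surrogate conversion, here is a minimal sketch in
Python (not taken from regex-opt; the function name u_escapes_to_x is
made up for this example). It rewrites \uXXXX escapes as \x{....}
escapes and folds a high/low surrogate pair into one code point. It
assumes the backslashes in the input are not themselves escaped, and
it passes a lone surrogate through unchanged as \x{D8xx}.

```python
import re

# A high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF),
# or else any single \uXXXX escape.
_U_ESCAPE = re.compile(
    r'\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})'
    r'|\\u([0-9a-fA-F]{4})'
)

def u_escapes_to_x(text):
    """Rewrite \\uXXXX escapes as \\x{...}, merging surrogate pairs.

    Hypothetical helper for illustration only.
    """
    def repl(match):
        if match.group(1):
            hi = int(match.group(1), 16)
            lo = int(match.group(2), 16)
            # Standard UTF-16 decoding of a surrogate pair.
            cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
        else:
            cp = int(match.group(3), 16)
        return r'\x{%X}' % cp
    return _U_ESCAPE.sub(repl, text)

print(u_escapes_to_x(r'a\u00E9b'))      # BMP character
print(u_escapes_to_x(r'\uD835\uDD04'))  # pair for U+1D504
```

The pair alternative is listed first in the pattern so that two
adjacent escapes are only merged when they really form a surrogate
pair; two ordinary escapes in a row are converted one by one.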

>I believe I have figured this out now. But it needs an extra
>commandline option that will tell which character set the input
>string is assumed to be in. (For starters, utf-8 and iso-8859-1
>would be good options, but full support of iconv would be better
>and not much harder to implement.)

Indeed. Supporting encodings like ISO-8859-1 is a bit tricky for
the output though, as they don't encode all characters; you'd have
to use \x{....} for characters that cannot be encoded. Last time
I checked, it wasn't so easy to ask iconv whether a specific
character can be encoded. I personally would just support UTF-8
and ask the user to transcode if needed;
`iconv -f ... -t utf-8 | regex-opt -` isn't so hard to type :-)
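
For what it's worth, the \x{....} fallback for the output could look
roughly like the following Python sketch. It does not use iconv; it
relies on Python's own codecs, and the function name
encode_with_escapes is invented for this example. It encodes one
character at a time, so it is only meant for simple stateless,
ASCII-compatible encodings such as ISO-8859-1, not for encodings that
emit a byte order mark or keep shift state.

```python
def encode_with_escapes(text, encoding):
    """Encode text, writing \\x{...} for characters the target
    encoding cannot represent.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Not representable: fall back to a code point escape.
            out += (r'\x{%X}' % ord(ch)).encode('ascii')
    return bytes(out)

# U+00E9 exists in ISO-8859-1, U+20AC (euro sign) does not.
print(encode_with_escapes('\u00e9\u20ac', 'iso-8859-1'))
```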
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Tuesday, 10 January 2006 08:33:28 UTC