- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 10 Jan 2006 09:27:21 +0100
- To: Joel Yliluoma <bisqwit@iki.fi>
- Cc: www-archive@w3.org
* Joel Yliluoma wrote:
>> The remaining problem would be getting the Unicode code points
>> in and out of the tool. A good first step is to allow \x{....} escapes,
>
>And, \unnnn escapes as well.

Perl does not support those and they are a bit tricky; they are typically
supported in environments that use so-called 16 Bit Unicode; if you want
to refer to characters > U+FFFF you have to use surrogate code points to
do it (that is, \uXXXX\uXXXX maps to a single character). Iconv might
support this as "java" encoding though, and the conversion is trivial.

>I believe I have figured this out now. But it needs an extra
>commandline option that will tell which character set the input
>string is assumed to be in. (For starters, utf-8 and iso-8859-1
>would be good options, but full support of iconv would be better
>and not much harder to implement.)

Indeed. Supporting encodings like ISO-8859-1 is a bit tricky for the
output though, as they don't encode all characters; you'd have to use
\x{....} for characters that cannot be encoded. Last time I checked it
wasn't so easy to ask iconv whether a specific character can be encoded.
I personally would just support UTF-8 and ask the user to transcode if
needed; `iconv -f ... -t utf-8 | regex-opt -` isn't so hard to type :-)
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
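
For illustration, the two conversions discussed in the message can be sketched roughly as follows. This is only a sketch in Python, not code from regex-opt: the function names `decode_u_escapes` and `escape_unencodable` are made up for this example, and it assumes that \uXXXX always carries exactly four hex digits and that the output encoding is ASCII-compatible.

```python
import re

# Matches one \uXXXX escape: a backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text: str) -> str:
    # Replace \uXXXX escapes by characters. A high surrogate immediately
    # followed by a low surrogate escape is combined into one code point.
    out = []
    pos = 0
    while True:
        m = _U_ESCAPE.search(text, pos)
        if m is None:
            out.append(text[pos:])
            break
        out.append(text[pos:m.start()])
        unit = int(m.group(1), 16)
        end = m.end()
        if 0xD800 <= unit <= 0xDBFF:
            m2 = _U_ESCAPE.match(text, end)
            if m2 is not None:
                low = int(m2.group(1), 16)
                if 0xDC00 <= low <= 0xDFFF:
                    # Standard UTF-16 surrogate pair arithmetic.
                    unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                    end = m2.end()
        # Unpaired surrogates are passed through unchanged (an assumption;
        # a real tool might prefer to report an error here).
        out.append(chr(unit))
        pos = end
    return "".join(out)


def escape_unencodable(text: str, encoding: str) -> str:
    # Keep characters the target encoding can represent; write the rest
    # as \x{....} escapes. Testing one character at a time sidesteps the
    # question of asking the converter what it supports.
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("\\x{%X}" % ord(ch))
    return "".join(out)
```

With these, `decode_u_escapes(r"\uD835\uDD04")` yields the single character U+1D504, and `escape_unencodable("a\u20acb", "iso-8859-1")` yields `a\x{20AC}b`, since the euro sign has no ISO-8859-1 encoding. Testing per character may be inaccurate for stateful encodings, which is one more argument for sticking to UTF-8.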
Received on Tuesday, 10 January 2006 08:33:28 UTC