- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Mon, 09 Jan 2006 14:59:25 +0100
- To: Joel Yliluoma <bisqwit@iki.fi>
- Cc: www-archive@w3.org
* Joel Yliluoma wrote:
>Your feedback was very useful, but I fear I lack the expertise
>required to make regexps work with unicode. I've created some
>character encoding -related software, but I don't have expertise
>on locales and perl specifically.
>
>I would appreciate it, if you can provide a crash-course on
>how unicode works _with regexps_, and I can then look at it.
>Most importantly, what is the proper way to implement \w and its cousins.
>
>I already know how UTF-8 works and what kind of characters the unicode
>consists of (http://bisqwit.iki.fi/japtools/unicodemap.php), but I realize
>that regexps aren't necessarily always UTF-8 -encoded. I've written plenty
>of ISO-8859-* -encoded regexps, which would fail parsing as UTF-8.
>
>Also, I'm interested of your unicode bitset. I could easily use
>std::bitset<0x110000> instead of std::bitset<0x100>, but then it
>would use 139264 bytes of memory per instance instead of 32, which
>wouldn't be so nice...

Well, the basic idea here is to have a class that stores characters and
character ranges as list of min,max pairs; so if you have [A-Z0-9] that
would become a list with two pairs ('A' .. 'Z', '0' .. '9'), and if it's
a single character 'X' you'd have a list with one pair ('X' .. 'X'). The
class would then have all the relevant methods the current bitset has
like flip(); if you have 'X' like above and flip(), you'd replace the
('X' .. 'X') list by (0 .. 'W', 'Y' .. 0x10FFFF). Just as you'd expect.
Another example, dumping a character range like the DumpKey() function
would do would then become something like

  for each pair(min,max) in tmp.list
  {
    n += max - min + 1;
    sets += EscapeChar(min);

    if (min != max)
    {
      need_set = true;
      sets += '-';
      sets += EscapeChar(max);
    }
  }

If min,max can store 32 bit values you are pretty close to good Unicode
support. The remaining problem would be getting the Unicode code points
in and out of the tool. A good first step is to allow \x{....} escapes,
e.g.
\x{20AC} would refer to the Euro currency symbol. I simply made a parser
for this notation, and with the charset class as described above you'd
store that simply by

  charset c; c.set(0x20AC);

Now that charset can hold values > 255 the EscapeChar() function needs
to be changed to take care of that. The easiest way to do that is to
simply generate the \x{....} sequence again for anything > 255. You can
then roundtrip regular expressions like [\x{20AC}-\x{3000}], which is
most of what I need. You'd probably change some parts though for
performance reasons, e.g. the dump function should be changed as
outlined above so it does not spend all the time looking for set bits
(code points really), and you'd use a new function like
set_range(min,max) instead of set() in a loop.

Of course, \x{....} is not the only way to specify characters, you might
want to use the tool like

  % regex-opt Bj[ö]rn

i.e., specify the characters directly without any escaping. If the
string is e.g. in ISO-8859-1 the 'ö' would be only a single byte and the
code would continue to work as expected. The problem with ISO-8859-1 and
other encodings is that you can't use more than 256 characters. If the
above is in UTF-8 the 'ö' would be two bytes, 0xc3 0xb6,

  % regex-opt Bj[< first byte 0xC3 >< second byte 0xB6 >]rn

From the perspective of the regular expression this would be the same as

  % regex-opt Bj[< first byte 0xB6 >< second byte 0xC3 >]rn

Which is very bad since that would no longer match my name. To prevent
this you would need to know the character encoding (here: UTF-8) and
turn the bytes in the input into 'characters'. With the class above this
would mean to turn the two bytes 0xc3 0xb6 into a single code point
U+00F6 and store that as a list of one pair (0x00F6 .. 0x00F6) in the
class as described above.

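The range-list idea and the UTF-8 step above can be sketched roughly as
follows. This is a minimal C++ sketch under stated assumptions, not
regex-opt's actual code: the names charset, set(), set_range() and flip()
follow this message, while test(), dump(), escape() and decode_utf8() are
hypothetical helpers. escape() only stands in for EscapeChar(): it writes
ASCII letters and digits literally and everything else as \x{....}.
decode_utf8() assumes well-formed input and does no validation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;  // highest Unicode code point

// Hypothetical range-based replacement for std::bitset<0x100>:
// a sorted list of disjoint (min, max) code point pairs.
class charset {
public:
    using cp = std::uint32_t;
    using range = std::pair<cp, cp>;

    void set(cp c) { set_range(c, c); }

    // Add lo..hi, then merge overlapping or adjacent pairs.
    void set_range(cp lo, cp hi) {
        assert(lo <= hi && hi <= kMaxCodePoint);
        ranges_.emplace_back(lo, hi);
        std::sort(ranges_.begin(), ranges_.end());
        std::vector<range> merged;
        for (const range& r : ranges_) {
            if (!merged.empty() && r.first <= merged.back().second + 1)
                merged.back().second = std::max(merged.back().second, r.second);
            else
                merged.push_back(r);
        }
        ranges_.swap(merged);
    }

    // Complement within 0 .. 0x10FFFF by collecting the gaps between pairs.
    void flip() {
        std::vector<range> out;
        cp next = 0;  // first code point not yet accounted for
        for (const range& r : ranges_) {
            if (r.first > next) out.emplace_back(next, r.first - 1);
            next = r.second + 1;
        }
        if (next <= kMaxCodePoint) out.emplace_back(next, kMaxCodePoint);
        ranges_.swap(out);
    }

    bool test(cp c) const {
        for (const range& r : ranges_)
            if (r.first <= c && c <= r.second) return true;
        return false;
    }

    // Contents of a bracket expression, e.g. "0-9A-Z" (without the [ ]).
    std::string dump() const {
        std::string sets;
        for (const range& r : ranges_) {
            sets += escape(r.first);
            if (r.first != r.second) { sets += '-'; sets += escape(r.second); }
        }
        return sets;
    }

private:
    // Simplified stand-in for EscapeChar(), see the assumptions above.
    static std::string escape(cp c) {
        bool plain = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                     (c >= 'a' && c <= 'z');
        if (plain) return std::string(1, static_cast<char>(c));
        char buf[16];
        std::snprintf(buf, sizeof buf, "\\x{%X}", static_cast<unsigned>(c));
        return buf;
    }

    std::vector<range> ranges_;
};

// Turn well-formed UTF-8 bytes into code points (no error handling).
std::vector<charset::cp> decode_utf8(const std::string& s) {
    std::vector<charset::cp> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        charset::cp c = extra == 0 ? b : b & (0x3F >> extra);
        for (int k = 1; k <= extra && i + k < s.size(); ++k)
            c = (c << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(c);
        i += extra + 1;
    }
    return out;
}
```

With this sketch, set_range('A','Z') followed by set_range('0','9')
dumps as 0-9A-Z, flipping a set that holds only 'X' dumps as
\x{0}-WY-\x{10FFFF}, and decoding the two bytes 0xC3 0xB6 yields the
single code point 0xF6, which can then be stored with set(). Keeping the
pairs sorted and merged costs a sort per insertion; that seemed
acceptable for a sketch, but a real tool might insert in place instead.
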
Knowing the encoding and decoding the bytes isn't so easy, and I'm not
sure adding this would be worth the effort; converting the relevant
characters in a regular expression to the \x{....} format would be as
easy as

  s/([\xFF-\x{10FFFF}])/"\\x{".sprintf('%x',ord"$1")."}"/eg;

in Perl. So my recommendation would be to make the class and support the
\x{....} syntax and see what other people think about it.

Regarding the special escapes, an expression like ^\w+$ may or may not
match "Björn"; in Perl this depends on the current locale; if I use the
German locale, it would match, with the default locale it would not. In
addition to the locale, Perl might be in Unicode mode; in this case \w
is defined in terms of Unicode character classes; that is, if Unicode
considers a certain code point a letter, \w will match that (which means
thousands of different characters). For Perl,
http://perldoc.perl.org/perlunicode.html has many of the details.

The problem with mapping a character range to such an escape is that
you'd change the regular expression: if the input is [0-9] I really mean
exactly that; if the tool turns that into \d the regular expression
would also match e.g. EXTENDED ARABIC-INDIC DIGIT THREE in addition to
[0-9] under certain circumstances. Worse, what \d matches in this case
depends on the version of Unicode the regex engine supports, as there
may be new digits in the next version.

As I said, I avoid those escapes in the input and it's easy to disable
generation of such sequences in regex-opt (a switch would be nice
though!), so this isn't really an issue for me. I think for most uses
the current mapping is fine; if you add the switch that'd be cool, other
than that I would recommend simply seeing what other users think. What's
important to me is mostly the new charset class and support for \x{....}
escapes in input and output.

HTH,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Monday, 9 January 2006 13:58:58 UTC