- From: McDonald, Ira <imcdonald@sharplabs.com>
- Date: Tue, 15 Feb 2000 10:52:07 -0800
- To: "'Stuart Woodward'" <stuart@gol.com>, nelocsig@egroups.com, www <www-international@w3.org>
Hi Stuart et al,

Note that the Unicode Standard v2.0 (and higher) in section 3.6 'Decomposition' defines 'Compatibility Decomposition' (as opposed to 'Canonical Decomposition').

Once both strings have been normalized using the rules for 'Compatibility Decomposition', a *binary* comparison between half-width and full-width Japanese characters *will* yield a match.

Also see the excellent UTR-15 (Unicode Technical Report) 'Unicode Normalization Forms' (11 November 1999) by Mark Davis and Martin Duerst at:

http://www.unicode.org/unicode/reports/tr15

Cheers,
- Ira McDonald (consulting architect at Sharp Labs America)
  High North Inc

-----Original Message-----
From: Stuart Woodward [mailto:stuart@gol.com]
Sent: Monday, February 14, 2000 11:50 PM
To: nelocsig@egroups.com; www
Subject: Re: [nelocsig] Re: International Search Engine Submission

> Could you please explain the difference between "hankaku" and "zenkaku".

In Shift JIS (and "Uni"code!) there are two *different* character codes which represent the same character for katakana (phonetic) characters (& alphanumerics). E.g. the word te-re-bi (television) can be written in either hankaku (han=half width, single byte) or zenkaku (zen=full width, double byte) katakana.

This is a holdover from the hardware word processor world which could only print in two sizes.

So, if you search for "terebi" in half width characters you may not get any hits for pages which wrote it in full width characters, even though to the reader they are the same word. It's a bit like a search engine being case sensitive.

Some search engines do the conversion for you, some don't.

See also:
http://cns-web.bu.edu/pub/djohnson/web_files/i18n/japanese.html
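
As a rough illustration of the compatibility-decomposition point made in the reply above, here is a minimal Python sketch. It assumes Python's standard `unicodedata` module and picks normalization form NFKC (compatibility decomposition followed by canonical composition) from UTR-15. The function name `fold_width` and the sample strings are hypothetical, chosen only for this example. This is one way a search engine might fold the two widths together, not a description of how any particular engine does it.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Apply NFKC so half-width and full-width forms share one code sequence."""
    return unicodedata.normalize("NFKC", text)


# "terebi" in half-width katakana: TE, RE, HI + half-width voiced sound mark.
half = "\uFF83\uFF9A\uFF8B\uFF9E"
# "terebi" in full-width katakana: TE, RE, BI.
full = "\u30C6\u30EC\u30D3"

# Without normalization the code sequences differ, so a binary compare fails.
raw_match = half == full

# After compatibility normalization the same binary compare succeeds.
folded_match = fold_width(half) == fold_width(full)

# Full-width alphanumerics fold to their ASCII counterparts as well.
fullwidth_alnum = "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13"
folded_alnum = fold_width(fullwidth_alnum)

print(raw_match, folded_match, folded_alnum)
```

Applying the same folding to both the indexed page text and the query would make the half-width and full-width spellings hit the same index entries. Note that NFKC also folds other compatibility characters (ligatures, superscripts and so on), which may or may not be wanted in a given search application.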
Received on Tuesday, 15 February 2000 14:04:01 UTC