
RE: [nelocsig] Re: International Search Engine Submission

From: McDonald, Ira <imcdonald@sharplabs.com>
Date: Tue, 15 Feb 2000 10:52:07 -0800
Message-ID: <1115A7CFAC25D311BC4000508B2CA53730FF4B@MAILSRVNT02>
To: "'Stuart Woodward'" <stuart@gol.com>, nelocsig@egroups.com, www <www-international@w3.org>
Hi Stuart et al,

Note that the Unicode Standard v2.0 (and higher) in section
3.6 'Decomposition' defines 'Compatibility Decomposition'
(as opposed to 'Canonical Decomposition').  Once both strings
have been decomposed using the rules for 'Compatibility
Decomposition', a *binary* comparison between half-width and
full-width Japanese characters *will* yield a match.
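
As a rough illustration of the point above, here is a minimal Python sketch.
It assumes that Python's standard unicodedata module, whose "NFKD" form
applies compatibility decomposition using the Unicode data shipped with the
interpreter (not the v2.0 tables), is an acceptable stand-in. The helper
names are made up for this example.

```python
import unicodedata

# "terebi" (television) written in half-width and full-width katakana.
HANKAKU = "\uff83\uff9a\uff8b\uff9e"  # ﾃﾚﾋﾞ (the voicing mark is its own code point)
ZENKAKU = "\u30c6\u30ec\u30d3"        # テレビ


def compat_decompose(text):
    """Return the compatibility decomposition (form NFKD) of text."""
    return unicodedata.normalize("NFKD", text)


def binary_equal_after_decomposition(a, b):
    """Hypothetical helper: decompose both strings, then compare raw bytes."""
    return compat_decompose(a).encode("utf-8") == compat_decompose(b).encode("utf-8")


if __name__ == "__main__":
    # The raw strings differ, but their decomposed forms are byte-identical.
    print(HANKAKU == ZENKAKU)
    print(binary_equal_after_decomposition(HANKAKU, ZENKAKU))
```

Both strings must be decomposed before comparing; decomposing only one side
still leaves different code point sequences.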

Also see the excellent UTR-15 (Unicode Technical Report)
'Unicode Normalization Forms' (11 November 1999) by 
Mark Davis and Martin Duerst at:

http://www.unicode.org/unicode/reports/tr15

Cheers,
- Ira McDonald (consulting architect at Sharp Labs America)
  High North Inc

-----Original Message-----
From: Stuart Woodward [mailto:stuart@gol.com]
Sent: Monday, February 14, 2000 11:50 PM
To: nelocsig@egroups.com; www
Subject: Re: [nelocsig] Re: International Search Engine Submission


> Could you please explain the difference between "hankaku" and "zenkaku".

In Shift JIS (and "Uni"code!) for katakana (phonetic) characters (&
alphanumerics) there are two *different* character codes which represent the
same character. E.g. the word te-re-bi (television) can be written in
either hankaku (han=half width, single byte) or zenkaku (zen=full width,
double byte) katakana. This is a holdover from the hardware word processor
world which could only print in two sizes.

So, if you search for "terebi" in half width characters you may not get any
hits for pages which wrote it in full width characters even though to the
reader they are the same word. It's a bit like a search engine being case
sensitive.
Some search engines do the conversion for you, some don't.
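
To make the search problem concrete, here is a small Python sketch of how an
engine might "do the conversion". It is only an assumption about one possible
approach, not a description of any real search engine: both the query and the
page text are folded with the standard unicodedata module (form NFKC) plus
case folding before matching. The function names and the sample pages are
invented for illustration.

```python
import unicodedata


def search_key(text):
    """Hypothetical folding key: compatibility-normalize, then case-fold."""
    return unicodedata.normalize("NFKC", text).casefold()


def find_pages_naive(query, pages):
    """Match on the raw code points, as a width-sensitive engine would."""
    return sorted(name for name, body in pages.items() if query in body)


def find_pages_folded(query, pages):
    """Match after folding both the query and the page text."""
    key = search_key(query)
    return sorted(name for name, body in pages.items() if key in search_key(body))


# Invented sample data: the same word in zenkaku and in hankaku.
PAGES = {
    "full.html": "\u30c6\u30ec\u30d3 guide",        # テレビ
    "half.html": "\uff83\uff9a\uff8b\uff9e guide",  # ﾃﾚﾋﾞ
}

if __name__ == "__main__":
    query = "\uff83\uff9a\uff8b\uff9e"  # half-width query
    print(find_pages_naive(query, PAGES))   # finds only the half-width page
    print(find_pages_folded(query, PAGES))  # finds both pages
```

The same folding also maps full-width alphanumerics onto their ordinary
forms, which is why it is paired with case folding here.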

See also:

http://cns-web.bu.edu/pub/djohnson/web_files/i18n/japanese.html
Received on Tuesday, 15 February 2000 14:04:01 GMT
