Re: What does "kanji" mean? from Martin Duerst on 2005-03-01 (www-forms-editor@w3.org from March 2005)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 01 Mar 2005 19:58:05 +0900
To: MURATA Makoto <EB2M-MRT@asahi-net.or.jp>, www-forms-editor@w3.org
Cc: Masayasu Ishikawa <mimasa@w3.mag.keio.ac.jp>
Message-Id: <6.0.0.20.2.20050301193756.0773dd80@localhost>

Hello Makoto,

Masayasu pointed me to this mail, and I decided that it's
easiest to reply directly.

At 21:00 05/02/28, MURATA Makoto wrote:
 > Dear colleagues,
 > I am writing this mail on behalf of a group trying to translate
 > the XForms rec to Japanese and publish the translation as a JIS TS.
 > In E.3.1 Script Tokens of the XForms recommendation, we
 > find a script name "kanji".  It is defined as
 > 	Subset of 'han' used in writing Japanese
 > However, we do not understand what is meant by this definition.

Do you mean you had problems understanding the text? Or do you mean
that there is no operational definition that unambiguously decides,
for each Han character, whether it's in this subset or not? I'm
assuming the latter.

 > We examined relevant documents (shown below) but could not find any
 > definitions.
 > 	Unicode Character Database
 > 	Unicode Standard Annex #24 Script Names
 > 	ISO15924
 > 	java.lang  Class Character.UnicodeBlock
 > If this definition cannot be clarified, we propose to drop this
 > script name.

The XForms spec does not require an unambiguous definition
of a script token. Section E.3.1 Script Tokens
(http://www.w3.org/TR/xforms/sliceE.html#mode-scripts) says:

 >>>>>>>>
However, this neither means that an input mode has to allow input for all 
the characters in the script or block, nor that an input mode is limited to 
only characters from that specific script. As an example, a "latin" 
keyboard doesn't cover all the characters in the Latin script, and includes 
punctuation which is not assigned to the Latin script.
 >>>>>>>>

So even if the definition of 'kanji' is very fuzzy, the specification will
still work. Indeed, it is important to realize that characters get added to
scripts, and different keyboards and input methods support different sets
of characters. For example, a mobile phone may not allow the input of the
same number of characters as a PC, for the same script.
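
To make this concrete, here is a minimal Python sketch. It is only an
illustration and not part of the XForms spec: the two repertoires and the
names PHONE_KANJI, PC_KANJI, is_han and accepts are hypothetical, and the
Han test is a rough approximation based on character names rather than
the Unicode Script property. It shows two devices that both treat 'kanji'
as an input mode while offering different character sets, one of which
includes punctuation outside the Han script.

```python
import unicodedata


def is_han(ch: str) -> bool:
    """Rough Han test via the character name (an approximation only)."""
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")


# Hypothetical repertoire of a mobile phone in 'kanji' mode:
# U+65E5, U+672C, U+8A9E, U+6F22, U+5B57 (a few common characters).
PHONE_KANJI = set("\u65e5\u672c\u8a9e\u6f22\u5b57")

# Hypothetical repertoire of a PC in 'kanji' mode: the phone's set,
# plus rarer characters (U+9B31, U+8594, U+8587) and the ideographic
# comma and full stop (U+3001, U+3002), which are not Han characters.
PC_KANJI = PHONE_KANJI | set("\u9b31\u8594\u8587") | set("\u3001\u3002")


def accepts(repertoire: set, text: str) -> bool:
    """True if every character of text can be entered with this repertoire."""
    return all(ch in repertoire for ch in text)


if __name__ == "__main__":
    # The phone does not cover all of Han; the PC is not limited to Han.
    print(accepts(PHONE_KANJI, "\u9b31"))       # rare character on the phone
    print(all(is_han(ch) for ch in PC_KANJI))   # PC set is not purely Han
```

Under the wording quoted above, both repertoires are acceptable for the
same script token, because the token selects an input mode and does not
define a set of characters.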

I don't think this needs clarification in the spec, but in case a
clarification is desired, I propose to change the first sentence
above as follows:

 >>>>
However, this neither means that an input mode has to allow input for all 
the characters in the script or block, nor that an input mode is limited to 
only characters from that specific script, nor that all of the script tokens
refer to an exactly defined set of characters.
 >>>>

If there is one thing to criticize about the script token 'kanji',
it is that, because Japanese input is mostly done via (hira)kana,
this script token will be used very rarely. I think we included
it mainly to be ready in case a different input technology for
Japanese, e.g. handwriting input, becomes more popular on some
devices and would benefit from distinguishing between
kana and kanji input methods. But such a thing may or may not happen.
So for the moment, 'kanji' is indeed not very useful, but it
also doesn't hurt.

Regards,    Martin.

Received on Tuesday, 1 March 2005 10:58:49 UTC